Diagnostics
Evaluate model fit, compare competing models, and validate predictive performance using INLA's built-in criteria.
Quick Reference
Enable diagnostics via the control argument. All criteria except marginal likelihood are off by default.
import numpy as np
import pandas as pd
from pyinla import pyinla
# Synthetic data: y = 1 + 0.5 x + noise
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(0, 0.3, size=n)
df = pd.DataFrame({"y": y, "x": x})
res = pyinla(
model={"response": "y", "fixed": ["1", "x"]},
family='gaussian',
data=df,
control={
'compute': {
'mlik': True, # Marginal likelihood (default: True)
'dic': True, # Deviance Information Criterion
'cpo': True, # CPO and PIT values
'waic': True, # WAIC (also enables 'po' internally)
}
}
)
Marginal Likelihood
The marginal likelihood \(\pi(\mathbf{y})\) is the probability of observing the data under a given model, integrating over all parameters. It serves as a natural criterion for Bayesian model comparison.
What it measures
The marginal likelihood balances model fit against complexity. Models that fit the data well but use fewer effective parameters will generally have higher marginal likelihoods. This built-in Occam's razor penalises overfitting.
Interpretation
- Higher values are better (on log scale, less negative is better).
- Compare models fit to the same data only.
- Differences of 3-5 on the log scale indicate substantial evidence for one model.
- Bayes factors: \(\text{BF}_{12} = \exp(\log\pi_1(\mathbf{y}) - \log\pi_2(\mathbf{y}))\).
import numpy as np
import pandas as pd
from pyinla import pyinla
# Synthetic data
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(0, 0.3, size=n)
df = pd.DataFrame({"y": y, "x": x})
# Two competing models: full vs intercept-only
res1 = pyinla(model={"response": "y", "fixed": ["1", "x"]},
family="gaussian", data=df,
control={"compute": {"mlik": True}})
res2 = pyinla(model={"response": "y", "fixed": ["1"]},
family="gaussian", data=df,
control={"compute": {"mlik": True}})
# res.mlik is a pandas Series with two entries:
# "log marginal-likelihood (integration)" (integration-based, preferred)
# "log marginal-likelihood (Gaussian)" (Gaussian-approximation fallback)
print(res1.mlik)
log_mlik_1 = res1.mlik["log marginal-likelihood (integration)"]
log_mlik_2 = res2.mlik["log marginal-likelihood (integration)"]
# Bayes factor: BF_12 = pi_1(y) / pi_2(y)
bf = np.exp(log_mlik_1 - log_mlik_2)
print(f"Bayes Factor (model1 / model2): {bf:.2f}")
Conditional Predictive Ordinates (CPO)
CPO values measure leave-one-out predictive performance. For each observation \(y_i\), the CPO is the posterior predictive density evaluated at the observed value, computed using all other observations:
What it measures
CPO quantifies how well the model predicts each observation when that observation is held out. Low CPO values flag observations that are surprising or poorly predicted by the model, potential outliers or regions where the model struggles.
Interpretation
- Higher CPO values are better (observation is well predicted).
- Very low CPO values flag potential outliers or influential points.
- Sum of log-CPO gives the log pseudo-marginal likelihood: \(\text{LPML} = \sum_i \log(\text{CPO}_i)\).
- LPML can itself be used for model comparison (higher is better).
import numpy as np
import pandas as pd
from pyinla import pyinla
# Synthetic data
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(0, 0.3, size=n)
df = pd.DataFrame({"y": y, "x": x})
res = pyinla(
model={"response": "y", "fixed": ["1", "x"]},
family="gaussian", data=df,
control={"compute": {"cpo": True}},
)
# res.cpo is a dict with three vectors aligned to observations
cpo_values = np.asarray(res.cpo['cpo']) # leave-one-out predictive density per obs
pit_values = np.asarray(res.cpo['pit']) # posterior predictive CDF at each obs
failure = np.asarray(res.cpo['failure']) # per-obs numerical failure indicator
# Log pseudo-marginal likelihood (LPML)
lpml = np.sum(np.log(cpo_values[cpo_values > 0]))
print(f"LPML: {lpml:.2f}")
# Flag potential outliers (bottom 5% CPO)
outliers = np.where(cpo_values < np.percentile(cpo_values, 5))[0]
print(f"Potential outliers at indices: {outliers}")
Probability Integral Transform (PIT)
PIT values assess calibration of the predictive distributions. For each observation, PIT is the posterior predictive CDF evaluated at the observed value:
What it measures
If the model is well calibrated, PIT values should be approximately uniform on \([0, 1]\). Deviations from uniformity reveal systematic issues: U-shaped histograms suggest overdispersion, inverse-U shapes suggest underdispersion.
Diagnostic plots
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from pyinla import pyinla
# Synthetic data
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(0, 0.3, size=n)
df = pd.DataFrame({"y": y, "x": x})
res = pyinla(
model={"response": "y", "fixed": ["1", "x"]},
family="gaussian", data=df,
control={"compute": {"cpo": True}},
)
pit_values = np.asarray(res.cpo['pit'])
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# Histogram should be approximately uniform
axes[0].hist(pit_values, bins=20, density=True, edgecolor='black', alpha=0.7)
axes[0].axhline(y=1, color='red', linestyle='--', label='Uniform')
axes[0].set_xlabel('PIT')
axes[0].set_ylabel('Density')
axes[0].set_title('PIT histogram')
axes[0].legend()
# Q-Q plot against uniform
stats.probplot(pit_values, dist="uniform", plot=axes[1])
axes[1].set_title('PIT Q-Q plot')
plt.tight_layout()
plt.show()
Deviance Information Criterion (DIC)
DIC extends classical information criteria to hierarchical models. It balances goodness of fit (measured by deviance) against model complexity (measured by the effective number of parameters):
where \(\bar{D}\) is the posterior mean deviance and \(p_D\) is the effective number of parameters.
What it measures
DIC estimates out-of-sample predictive accuracy using an approximation based on the deviance. The \(p_D\) term penalises complexity, preventing overfitting. Models with more parameters than effectively supported by the data will have inflated \(p_D\).
Interpretation
- Lower DIC values are better.
- Differences smaller than 2-3 are not meaningful.
- Differences of 5-10 indicate substantial evidence.
- The effective number of parameters \(p_D\) should be sensible for the model.
- Negative \(p_D\) can occur and may indicate model misspecification.
import numpy as np
import pandas as pd
from pyinla import pyinla
# Synthetic data
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(0, 0.3, size=n)
df = pd.DataFrame({"y": y, "x": x})
# Two competing models: full vs intercept-only
res1 = pyinla(model={"response": "y", "fixed": ["1", "x"]},
family="gaussian", data=df,
control={"compute": {"dic": True}})
res2 = pyinla(model={"response": "y", "fixed": ["1"]},
family="gaussian", data=df,
control={"compute": {"dic": True}})
# res.dic is a dict; the headline keys are:
# 'dic' : the DIC value
# 'p.eff' : effective number of parameters (p_D)
# 'mean.deviance' : posterior mean deviance (D-bar)
# 'deviance.mean' : deviance at the posterior mean (D(theta-bar))
# Optional extras when present: 'local.dic', 'local.p.eff',
# 'family.p.eff', and '.sat' variants.
print(res1.dic['dic'], res1.dic['p.eff'], res1.dic['mean.deviance'])
# Compare two models
print(f"Model 1 DIC: {res1.dic['dic']:.1f} (p.eff = {res1.dic['p.eff']:.1f})")
print(f"Model 2 DIC: {res2.dic['dic']:.1f} (p.eff = {res2.dic['p.eff']:.1f})")
delta = res1.dic['dic'] - res2.dic['dic']
print(f"Delta DIC: {delta:+.1f} (negative favors Model 1)")
Widely Applicable Information Criterion (WAIC)
WAIC is a more robust alternative to DIC that uses the full posterior predictive distribution rather than point estimates:
where lppd is the log pointwise predictive density and \(p_{\text{WAIC}}\) measures effective complexity.
What it measures
WAIC asymptotically equals leave-one-out cross-validation. Unlike DIC, it fully accounts for posterior uncertainty and is invariant to parameterisation, which makes it more reliable for hierarchical and mixture models.
Interpretation
- Lower WAIC values are better.
- Interpretation is similar to DIC and AIC.
- More stable than DIC for complex hierarchical models.
- Look at differences alongside the per-observation contributions in
local.waicfor context.
import numpy as np
import pandas as pd
from pyinla import pyinla
# Synthetic data
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(0, 0.3, size=n)
df = pd.DataFrame({"y": y, "x": x})
# Two competing models: full vs intercept-only
res1 = pyinla(model={"response": "y", "fixed": ["1", "x"]},
family="gaussian", data=df,
control={"compute": {"waic": True}})
res2 = pyinla(model={"response": "y", "fixed": ["1"]},
family="gaussian", data=df,
control={"compute": {"waic": True}})
# res.waic is a dict with:
# 'waic' : the WAIC value
# 'p.eff' : effective number of parameters (p_WAIC)
# 'local.waic' : per-observation WAIC contributions (vector)
# 'local.p.eff' : per-observation effective parameters (vector)
print(res1.waic['waic'], res1.waic['p.eff'])
waic1 = res1.waic['waic']
waic2 = res2.waic['waic']
print(f"Model 1 WAIC: {waic1:.1f}")
print(f"Model 2 WAIC: {waic2:.1f}")
print(f"Difference: {waic1 - waic2:+.1f} (negative favors Model 1)")
Which Criterion to Use?
General recommendations
- Start with the marginal likelihood for initial model screening.
- Use CPO / PIT to check model assumptions and identify problematic observations.
- Report WAIC for final model comparisons (more robust than DIC).
- Use multiple criteria: when they disagree, investigate why.