Model checks

Diagnostics

Evaluate model fit, compare competing models, and validate predictive performance using INLA's built-in criteria.

Quick Reference

Enable diagnostics via the control argument. All criteria except marginal likelihood are off by default.

import numpy as np
import pandas as pd
from pyinla import pyinla

# Synthetic data: y = 1 + 0.5 x + noise
rng = np.random.default_rng(0)
n  = 100
x  = rng.normal(size=n)
y  = 1.0 + 0.5 * x + rng.normal(0, 0.3, size=n)
df = pd.DataFrame({"y": y, "x": x})

res = pyinla(
    model={"response": "y", "fixed": ["1", "x"]},
    family='gaussian',
    data=df,
    control={
        'compute': {
            'mlik': True,    # Marginal likelihood (default: True)
            'dic':  True,    # Deviance Information Criterion
            'cpo':  True,    # CPO and PIT values
            'waic': True,    # WAIC (also enables 'po' internally)
        }
    }
)

Criterion	Access	Enable with	Default
Marginal Likelihood	`res.mlik`	`'mlik': True`	on
CPO	`res.cpo['cpo']`	`'cpo': True`	off
PIT	`res.cpo['pit']`	`'cpo': True`	off
DIC	`res.dic`	`'dic': True`	off
WAIC	`res.waic`	`'waic': True`	off

Marginal Likelihood

The marginal likelihood \(\pi(\mathbf{y})\) is the probability of observing the data under a given model, integrating over all parameters. It serves as a natural criterion for Bayesian model comparison.

What it measures

The marginal likelihood balances model fit against complexity. Models that fit the data well but use fewer effective parameters will generally have higher marginal likelihoods. This built-in Occam's razor penalises overfitting.

Interpretation

Higher values are better (on log scale, less negative is better).
Compare models fit to the same data only.
Differences of 3-5 on the log scale indicate substantial evidence for one model.
Bayes factors: \(\text{BF}_{12} = \exp(\log\pi_1(\mathbf{y}) - \log\pi_2(\mathbf{y}))\).

import numpy as np
import pandas as pd
from pyinla import pyinla

# Synthetic data
rng = np.random.default_rng(0)
n  = 100
x  = rng.normal(size=n)
y  = 1.0 + 0.5 * x + rng.normal(0, 0.3, size=n)
df = pd.DataFrame({"y": y, "x": x})

# Two competing models: full vs intercept-only
res1 = pyinla(model={"response": "y", "fixed": ["1", "x"]},
              family="gaussian", data=df,
              control={"compute": {"mlik": True}})
res2 = pyinla(model={"response": "y", "fixed": ["1"]},
              family="gaussian", data=df,
              control={"compute": {"mlik": True}})

# res.mlik is a pandas Series with two entries:
#   "log marginal-likelihood (integration)"  (integration-based, preferred)
#   "log marginal-likelihood (Gaussian)"     (Gaussian-approximation fallback)
print(res1.mlik)

log_mlik_1 = res1.mlik["log marginal-likelihood (integration)"]
log_mlik_2 = res2.mlik["log marginal-likelihood (integration)"]

# Bayes factor: BF_12 = pi_1(y) / pi_2(y)
bf = np.exp(log_mlik_1 - log_mlik_2)
print(f"Bayes Factor (model1 / model2): {bf:.2f}")

Note: pyINLA exposes both estimates so you can fall back to the Gaussian approximation if the integration-based value is unavailable. Prefer the integration-based estimate when both are present.

Conditional Predictive Ordinates (CPO)

CPO values measure leave-one-out predictive performance. For each observation \(y_i\), the CPO is the posterior predictive density evaluated at the observed value, computed using all other observations:

\text{CPO}_i = \pi(y_i \mid \mathbf{y}_{-i})

What it measures

CPO quantifies how well the model predicts each observation when that observation is held out. Low CPO values flag observations that are surprising or poorly predicted by the model, potential outliers or regions where the model struggles.

Interpretation

Higher CPO values are better (observation is well predicted).
Very low CPO values flag potential outliers or influential points.
Sum of log-CPO gives the log pseudo-marginal likelihood: \(\text{LPML} = \sum_i \log(\text{CPO}_i)\).
LPML can itself be used for model comparison (higher is better).

import numpy as np
import pandas as pd
from pyinla import pyinla

# Synthetic data
rng = np.random.default_rng(0)
n  = 100
x  = rng.normal(size=n)
y  = 1.0 + 0.5 * x + rng.normal(0, 0.3, size=n)
df = pd.DataFrame({"y": y, "x": x})

res = pyinla(
    model={"response": "y", "fixed": ["1", "x"]},
    family="gaussian", data=df,
    control={"compute": {"cpo": True}},
)

# res.cpo is a dict with three vectors aligned to observations
cpo_values = np.asarray(res.cpo['cpo'])      # leave-one-out predictive density per obs
pit_values = np.asarray(res.cpo['pit'])      # posterior predictive CDF at each obs
failure    = np.asarray(res.cpo['failure'])  # per-obs numerical failure indicator

# Log pseudo-marginal likelihood (LPML)
lpml = np.sum(np.log(cpo_values[cpo_values > 0]))
print(f"LPML: {lpml:.2f}")

# Flag potential outliers (bottom 5% CPO)
outliers = np.where(cpo_values < np.percentile(cpo_values, 5))[0]
print(f"Potential outliers at indices: {outliers}")

Probability Integral Transform (PIT)

PIT values assess calibration of the predictive distributions. For each observation, PIT is the posterior predictive CDF evaluated at the observed value:

\text{PIT}_i = \Pr(Y_i \leq y_i \mid \mathbf{y}_{-i})

What it measures

If the model is well calibrated, PIT values should be approximately uniform on \([0, 1]\). Deviations from uniformity reveal systematic issues: U-shaped histograms suggest overdispersion, inverse-U shapes suggest underdispersion.

Diagnostic plots

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from pyinla import pyinla

# Synthetic data
rng = np.random.default_rng(0)
n  = 100
x  = rng.normal(size=n)
y  = 1.0 + 0.5 * x + rng.normal(0, 0.3, size=n)
df = pd.DataFrame({"y": y, "x": x})

res = pyinla(
    model={"response": "y", "fixed": ["1", "x"]},
    family="gaussian", data=df,
    control={"compute": {"cpo": True}},
)

pit_values = np.asarray(res.cpo['pit'])

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Histogram should be approximately uniform
axes[0].hist(pit_values, bins=20, density=True, edgecolor='black', alpha=0.7)
axes[0].axhline(y=1, color='red', linestyle='--', label='Uniform')
axes[0].set_xlabel('PIT')
axes[0].set_ylabel('Density')
axes[0].set_title('PIT histogram')
axes[0].legend()

# Q-Q plot against uniform
stats.probplot(pit_values, dist="uniform", plot=axes[1])
axes[1].set_title('PIT Q-Q plot')

plt.tight_layout()
plt.show()

Deviance Information Criterion (DIC)

DIC extends classical information criteria to hierarchical models. It balances goodness of fit (measured by deviance) against model complexity (measured by the effective number of parameters):

\text{DIC} = \bar{D} + p_D = D(\bar{\theta}) + 2p_D

where \(\bar{D}\) is the posterior mean deviance and \(p_D\) is the effective number of parameters.

What it measures

DIC estimates out-of-sample predictive accuracy using an approximation based on the deviance. The \(p_D\) term penalises complexity, preventing overfitting. Models with more parameters than effectively supported by the data will have inflated \(p_D\).

Interpretation

Lower DIC values are better.
Differences smaller than 2-3 are not meaningful.
Differences of 5-10 indicate substantial evidence.
The effective number of parameters \(p_D\) should be sensible for the model.
Negative \(p_D\) can occur and may indicate model misspecification.

import numpy as np
import pandas as pd
from pyinla import pyinla

# Synthetic data
rng = np.random.default_rng(0)
n  = 100
x  = rng.normal(size=n)
y  = 1.0 + 0.5 * x + rng.normal(0, 0.3, size=n)
df = pd.DataFrame({"y": y, "x": x})

# Two competing models: full vs intercept-only
res1 = pyinla(model={"response": "y", "fixed": ["1", "x"]},
              family="gaussian", data=df,
              control={"compute": {"dic": True}})
res2 = pyinla(model={"response": "y", "fixed": ["1"]},
              family="gaussian", data=df,
              control={"compute": {"dic": True}})

# res.dic is a dict; the headline keys are:
#   'dic'             : the DIC value
#   'p.eff'           : effective number of parameters (p_D)
#   'mean.deviance'   : posterior mean deviance (D-bar)
#   'deviance.mean'   : deviance at the posterior mean (D(theta-bar))
# Optional extras when present: 'local.dic', 'local.p.eff',
# 'family.p.eff', and '.sat' variants.
print(res1.dic['dic'], res1.dic['p.eff'], res1.dic['mean.deviance'])

# Compare two models
print(f"Model 1 DIC: {res1.dic['dic']:.1f} (p.eff = {res1.dic['p.eff']:.1f})")
print(f"Model 2 DIC: {res2.dic['dic']:.1f} (p.eff = {res2.dic['p.eff']:.1f})")
delta = res1.dic['dic'] - res2.dic['dic']
print(f"Delta DIC: {delta:+.1f} (negative favors Model 1)")

Caution: DIC has known limitations for models with many random effects or when the posterior is far from Gaussian. Consider WAIC or marginal likelihood for such cases.

Widely Applicable Information Criterion (WAIC)

WAIC is a more robust alternative to DIC that uses the full posterior predictive distribution rather than point estimates:

\text{WAIC} = -2 \cdot \text{lppd} + 2 \cdot p_{\text{WAIC}}

where lppd is the log pointwise predictive density and \(p_{\text{WAIC}}\) measures effective complexity.

What it measures

WAIC asymptotically equals leave-one-out cross-validation. Unlike DIC, it fully accounts for posterior uncertainty and is invariant to parameterisation, which makes it more reliable for hierarchical and mixture models.

Interpretation

Lower WAIC values are better.
Interpretation is similar to DIC and AIC.
More stable than DIC for complex hierarchical models.
Look at differences alongside the per-observation contributions in local.waic for context.

import numpy as np
import pandas as pd
from pyinla import pyinla

# Synthetic data
rng = np.random.default_rng(0)
n  = 100
x  = rng.normal(size=n)
y  = 1.0 + 0.5 * x + rng.normal(0, 0.3, size=n)
df = pd.DataFrame({"y": y, "x": x})

# Two competing models: full vs intercept-only
res1 = pyinla(model={"response": "y", "fixed": ["1", "x"]},
              family="gaussian", data=df,
              control={"compute": {"waic": True}})
res2 = pyinla(model={"response": "y", "fixed": ["1"]},
              family="gaussian", data=df,
              control={"compute": {"waic": True}})

# res.waic is a dict with:
#   'waic'        : the WAIC value
#   'p.eff'       : effective number of parameters (p_WAIC)
#   'local.waic'  : per-observation WAIC contributions (vector)
#   'local.p.eff' : per-observation effective parameters (vector)
print(res1.waic['waic'], res1.waic['p.eff'])

waic1 = res1.waic['waic']
waic2 = res2.waic['waic']
print(f"Model 1 WAIC: {waic1:.1f}")
print(f"Model 2 WAIC: {waic2:.1f}")
print(f"Difference:   {waic1 - waic2:+.1f} (negative favors Model 1)")

Which Criterion to Use?

Criterion	Best for	Limitations
Marginal Likelihood	Formal Bayesian model comparison; nested and non-nested models	Sensitive to prior specification
CPO / LPML	Predictive model assessment; identifying outliers	Computationally heavier; can have numerical issues
DIC	Quick comparison of similar models; familiar to frequentists	Can be unstable for complex random effects
WAIC	Robust comparison; hierarchical models	Requires careful interpretation of effective parameters

General recommendations

Start with the marginal likelihood for initial model screening.
Use CPO / PIT to check model assumptions and identify problematic observations.
Report WAIC for final model comparisons (more robust than DIC).
Use multiple criteria: when they disagree, investigate why.