Model checks

Diagnostics

Evaluate model fit, compare competing models, and validate predictive performance using INLA's built-in criteria.

Quick Reference

Enable diagnostics via the control argument. All criteria except marginal likelihood are off by default.

import numpy as np
import pandas as pd
from pyinla import pyinla

# Synthetic data: y = 1 + 0.5 x + noise
rng = np.random.default_rng(0)
n  = 100
x  = rng.normal(size=n)
y  = 1.0 + 0.5 * x + rng.normal(0, 0.3, size=n)
df = pd.DataFrame({"y": y, "x": x})

res = pyinla(
    model={"response": "y", "fixed": ["1", "x"]},
    family='gaussian',
    data=df,
    control={
        'compute': {
            'mlik': True,    # Marginal likelihood (default: True)
            'dic':  True,    # Deviance Information Criterion
            'cpo':  True,    # CPO and PIT values
            'waic': True,    # WAIC (also enables 'po' internally)
        }
    }
)
Criterion Access Enable with Default
Marginal Likelihood res.mlik 'mlik': True on
CPO res.cpo['cpo'] 'cpo': True off
PIT res.cpo['pit'] 'cpo': True off
DIC res.dic 'dic': True off
WAIC res.waic 'waic': True off

Marginal Likelihood

The marginal likelihood \(\pi(\mathbf{y})\) is the probability of observing the data under a given model, integrating over all parameters. It serves as a natural criterion for Bayesian model comparison.

What it measures

The marginal likelihood balances model fit against complexity. Models that fit the data well but use fewer effective parameters will generally have higher marginal likelihoods. This built-in Occam's razor penalises overfitting.

Interpretation

import numpy as np
import pandas as pd
from pyinla import pyinla

# Synthetic data
rng = np.random.default_rng(0)
n  = 100
x  = rng.normal(size=n)
y  = 1.0 + 0.5 * x + rng.normal(0, 0.3, size=n)
df = pd.DataFrame({"y": y, "x": x})

# Two competing models: full vs intercept-only
res1 = pyinla(model={"response": "y", "fixed": ["1", "x"]},
              family="gaussian", data=df,
              control={"compute": {"mlik": True}})
res2 = pyinla(model={"response": "y", "fixed": ["1"]},
              family="gaussian", data=df,
              control={"compute": {"mlik": True}})

# res.mlik is a pandas Series with two entries:
#   "log marginal-likelihood (integration)"  (integration-based, preferred)
#   "log marginal-likelihood (Gaussian)"     (Gaussian-approximation fallback)
print(res1.mlik)

log_mlik_1 = res1.mlik["log marginal-likelihood (integration)"]
log_mlik_2 = res2.mlik["log marginal-likelihood (integration)"]

# Bayes factor: BF_12 = pi_1(y) / pi_2(y)
bf = np.exp(log_mlik_1 - log_mlik_2)
print(f"Bayes Factor (model1 / model2): {bf:.2f}")
Note: pyINLA exposes both estimates so you can fall back to the Gaussian approximation if the integration-based value is unavailable. Prefer the integration-based estimate when both are present.

Conditional Predictive Ordinates (CPO)

CPO values measure leave-one-out predictive performance. For each observation \(y_i\), the CPO is the posterior predictive density evaluated at the observed value, computed using all other observations:

$$\text{CPO}_i = \pi(y_i \mid \mathbf{y}_{-i})$$

What it measures

CPO quantifies how well the model predicts each observation when that observation is held out. Low CPO values flag observations that are surprising or poorly predicted by the model, potential outliers or regions where the model struggles.

Interpretation

import numpy as np
import pandas as pd
from pyinla import pyinla

# Synthetic data
rng = np.random.default_rng(0)
n  = 100
x  = rng.normal(size=n)
y  = 1.0 + 0.5 * x + rng.normal(0, 0.3, size=n)
df = pd.DataFrame({"y": y, "x": x})

res = pyinla(
    model={"response": "y", "fixed": ["1", "x"]},
    family="gaussian", data=df,
    control={"compute": {"cpo": True}},
)

# res.cpo is a dict with three vectors aligned to observations
cpo_values = np.asarray(res.cpo['cpo'])      # leave-one-out predictive density per obs
pit_values = np.asarray(res.cpo['pit'])      # posterior predictive CDF at each obs
failure    = np.asarray(res.cpo['failure'])  # per-obs numerical failure indicator

# Log pseudo-marginal likelihood (LPML)
lpml = np.sum(np.log(cpo_values[cpo_values > 0]))
print(f"LPML: {lpml:.2f}")

# Flag potential outliers (bottom 5% CPO)
outliers = np.where(cpo_values < np.percentile(cpo_values, 5))[0]
print(f"Potential outliers at indices: {outliers}")

Probability Integral Transform (PIT)

PIT values assess calibration of the predictive distributions. For each observation, PIT is the posterior predictive CDF evaluated at the observed value:

$$\text{PIT}_i = \Pr(Y_i \leq y_i \mid \mathbf{y}_{-i})$$

What it measures

If the model is well calibrated, PIT values should be approximately uniform on \([0, 1]\). Deviations from uniformity reveal systematic issues: U-shaped histograms suggest overdispersion, inverse-U shapes suggest underdispersion.

Diagnostic plots

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from pyinla import pyinla

# Synthetic data
rng = np.random.default_rng(0)
n  = 100
x  = rng.normal(size=n)
y  = 1.0 + 0.5 * x + rng.normal(0, 0.3, size=n)
df = pd.DataFrame({"y": y, "x": x})

res = pyinla(
    model={"response": "y", "fixed": ["1", "x"]},
    family="gaussian", data=df,
    control={"compute": {"cpo": True}},
)

pit_values = np.asarray(res.cpo['pit'])

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Histogram should be approximately uniform
axes[0].hist(pit_values, bins=20, density=True, edgecolor='black', alpha=0.7)
axes[0].axhline(y=1, color='red', linestyle='--', label='Uniform')
axes[0].set_xlabel('PIT')
axes[0].set_ylabel('Density')
axes[0].set_title('PIT histogram')
axes[0].legend()

# Q-Q plot against uniform
stats.probplot(pit_values, dist="uniform", plot=axes[1])
axes[1].set_title('PIT Q-Q plot')

plt.tight_layout()
plt.show()

Deviance Information Criterion (DIC)

DIC extends classical information criteria to hierarchical models. It balances goodness of fit (measured by deviance) against model complexity (measured by the effective number of parameters):

$$\text{DIC} = \bar{D} + p_D = D(\bar{\theta}) + 2p_D$$

where \(\bar{D}\) is the posterior mean deviance and \(p_D\) is the effective number of parameters.

What it measures

DIC estimates out-of-sample predictive accuracy using an approximation based on the deviance. The \(p_D\) term penalises complexity, preventing overfitting. Models with more parameters than effectively supported by the data will have inflated \(p_D\).

Interpretation

import numpy as np
import pandas as pd
from pyinla import pyinla

# Synthetic data
rng = np.random.default_rng(0)
n  = 100
x  = rng.normal(size=n)
y  = 1.0 + 0.5 * x + rng.normal(0, 0.3, size=n)
df = pd.DataFrame({"y": y, "x": x})

# Two competing models: full vs intercept-only
res1 = pyinla(model={"response": "y", "fixed": ["1", "x"]},
              family="gaussian", data=df,
              control={"compute": {"dic": True}})
res2 = pyinla(model={"response": "y", "fixed": ["1"]},
              family="gaussian", data=df,
              control={"compute": {"dic": True}})

# res.dic is a dict; the headline keys are:
#   'dic'             : the DIC value
#   'p.eff'           : effective number of parameters (p_D)
#   'mean.deviance'   : posterior mean deviance (D-bar)
#   'deviance.mean'   : deviance at the posterior mean (D(theta-bar))
# Optional extras when present: 'local.dic', 'local.p.eff',
# 'family.p.eff', and '.sat' variants.
print(res1.dic['dic'], res1.dic['p.eff'], res1.dic['mean.deviance'])

# Compare two models
print(f"Model 1 DIC: {res1.dic['dic']:.1f} (p.eff = {res1.dic['p.eff']:.1f})")
print(f"Model 2 DIC: {res2.dic['dic']:.1f} (p.eff = {res2.dic['p.eff']:.1f})")
delta = res1.dic['dic'] - res2.dic['dic']
print(f"Delta DIC: {delta:+.1f} (negative favors Model 1)")
Caution: DIC has known limitations for models with many random effects or when the posterior is far from Gaussian. Consider WAIC or marginal likelihood for such cases.

Widely Applicable Information Criterion (WAIC)

WAIC is a more robust alternative to DIC that uses the full posterior predictive distribution rather than point estimates:

$$\text{WAIC} = -2 \cdot \text{lppd} + 2 \cdot p_{\text{WAIC}}$$

where lppd is the log pointwise predictive density and \(p_{\text{WAIC}}\) measures effective complexity.

What it measures

WAIC asymptotically equals leave-one-out cross-validation. Unlike DIC, it fully accounts for posterior uncertainty and is invariant to parameterisation, which makes it more reliable for hierarchical and mixture models.

Interpretation

import numpy as np
import pandas as pd
from pyinla import pyinla

# Synthetic data
rng = np.random.default_rng(0)
n  = 100
x  = rng.normal(size=n)
y  = 1.0 + 0.5 * x + rng.normal(0, 0.3, size=n)
df = pd.DataFrame({"y": y, "x": x})

# Two competing models: full vs intercept-only
res1 = pyinla(model={"response": "y", "fixed": ["1", "x"]},
              family="gaussian", data=df,
              control={"compute": {"waic": True}})
res2 = pyinla(model={"response": "y", "fixed": ["1"]},
              family="gaussian", data=df,
              control={"compute": {"waic": True}})

# res.waic is a dict with:
#   'waic'        : the WAIC value
#   'p.eff'       : effective number of parameters (p_WAIC)
#   'local.waic'  : per-observation WAIC contributions (vector)
#   'local.p.eff' : per-observation effective parameters (vector)
print(res1.waic['waic'], res1.waic['p.eff'])

waic1 = res1.waic['waic']
waic2 = res2.waic['waic']
print(f"Model 1 WAIC: {waic1:.1f}")
print(f"Model 2 WAIC: {waic2:.1f}")
print(f"Difference:   {waic1 - waic2:+.1f} (negative favors Model 1)")

Which Criterion to Use?

Criterion Best for Limitations
Marginal Likelihood Formal Bayesian model comparison; nested and non-nested models Sensitive to prior specification
CPO / LPML Predictive model assessment; identifying outliers Computationally heavier; can have numerical issues
DIC Quick comparison of similar models; familiar to frequentists Can be unstable for complex random effects
WAIC Robust comparison; hierarchical models Requires careful interpretation of effective parameters

General recommendations

  1. Start with the marginal likelihood for initial model screening.
  2. Use CPO / PIT to check model assumptions and identify problematic observations.
  3. Report WAIC for final model comparisons (more robust than DIC).
  4. Use multiple criteria: when they disagree, investigate why.