Learn PyINLA

Predictions & Missing Values

Learn how pyINLA handles missing values (NAs) in responses and covariates, and how to generate predictions for unobserved data.

Overview

Key Concepts

In Bayesian inference with INLA, missing values are handled naturally through the posterior predictive distribution. When the response \(y_i\) is missing (NA), INLA computes the predictive distribution \(\pi(y_i \mid \mathbf{y}_{obs})\) for that observation.

pyINLA supports missing values in two contexts:

Missing In Python Syntax How INLA Handles It
Response (y) np.nan Computes predictive distribution for missing values
Covariates (x) np.nan Accepted without error, but the missing entries are mechanically propagated and contaminate every coefficient. Pre-impute or drop incomplete rows upstream — see warning below.

Missing Values in the Response

When the response variable contains missing values (NA/NaN), INLA treats them as unobserved data and computes their posterior predictive distribution:

$$\pi(y_i^{miss} \mid \mathbf{y}^{obs}) = \int \pi(y_i^{miss} \mid \boldsymbol{\eta}_i, \boldsymbol{\theta}) \cdot \pi(\boldsymbol{\eta}, \boldsymbol{\theta} \mid \mathbf{y}^{obs}) \, d\boldsymbol{\eta} \, d\boldsymbol{\theta}$$

This is the foundation for:

Basic Usage

Set missing response values to np.nan and read the fitted-value summary:

import numpy as np
import pandas as pd
from pyinla import pyinla

# Create data with missing response values
df = pd.DataFrame({
    'y': [1.2, np.nan, 3.4, np.nan, 5.6, 6.7, np.nan, 8.9],  # NaN = missing
    'x': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
})

result = pyinla(
    model={'response': 'y', 'fixed': ['1', 'x']},
    family='gaussian',
    data=df,
)

# summary_fitted_values is populated for ALL observations, including the
# missing rows; no extra control flag is needed in the current pyinla.
print(result.summary_fitted_values)

Accessing Predictions

The fitted values for missing observations contain the predictive distribution summary:

# Find indices of missing values
missing_idx = np.where(df['y'].isna())[0]
print(f"Missing indices: {missing_idx}")

# Get predictions for missing values
predictions = result.summary_fitted_values.iloc[missing_idx]
print(predictions[['mean', 'sd', '0.025quant', '0.975quant']])

Output Columns

meanPosterior mean of the prediction
sdPosterior standard deviation
0.025quant2.5% quantile (lower bound of 95% CI)
0.5quantPosterior median
0.975quant97.5% quantile (upper bound of 95% CI)
modePosterior mode

Missing Values in Covariates

pyINLA accepts np.nan in covariate columns without raising an error, but the engine does not perform principled multiple-imputation or partial-information regression on its own. The handling is mechanical: any row whose covariates contain NaN contributes to the fit with that NaN propagated into the design matrix as zero, which contaminates every coefficient (intercept and other slopes included), not just the slope on the affected covariate.

What actually happens (empirical)

Comparing two fits on the eight-row example below:

  • All eight rows kept (with two NaNs in x2): (Intercept) sd ≈ 1.6, x1 sd ≈ 2.1.
  • Drop incomplete rows first (six rows): (Intercept) sd ≈ 0.85, x1 sd ≈ 1.5.

The all-rows fit has larger posterior uncertainty for every parameter and meaningfully different posterior means — the opposite of what "partial information improves on complete-case analysis" would imply. The takeaway: don't rely on the engine to do the right thing here. Either impute the missing covariates upstream (e.g. with scikit-learn's IterativeImputer) or drop incomplete rows explicitly.

Example: Covariate with Missing Values

import numpy as np
import pandas as pd
from pyinla import pyinla

# Data with missing covariate values
df = pd.DataFrame({
    'y': [5.0, 3.0, 4.0, 2.0, 6.0, 4.5, 3.5, 5.5],
    'x1': [0.5, 0.8, 0.3, 0.4, 0.9, 0.6, 0.7, 0.2],
    'x2': [1.0, np.nan, 3.0, 4.0, np.nan, 2.5, 3.5, 1.5]  # Missing values
})

# pyinla accepts the NaNs and runs without error -- but see the warning
# box above about how the missing entries contaminate the fit.
result = pyinla(
    model={'response': 'y', 'fixed': ['1', 'x1', 'x2']},
    family='gaussian',
    data=df,
)
print("Fixed Effects (all 8 rows, includes contaminated estimates):")
print(result.summary_fixed[['mean', 'sd']])

# Recommended pattern: drop incomplete rows explicitly and re-fit.
result_complete = pyinla(
    model={'response': 'y', 'fixed': ['1', 'x1', 'x2']},
    family='gaussian',
    data=df.dropna(),
)
print("\nFixed Effects (complete-case fit, 6 rows):")
print(result_complete.summary_fixed[['mean', 'sd']])

Specification in pyINLA

Allowed Parameters of control['predictor']

The predictor block accepts exactly two keys. The source of truth is safety/control.py; other keys are rejected with pyinla safety check: unsupported keys in control['predictor']: ....

ParameterRequiredTypeDescription
compute No bool Hint to the engine to compute the fitted-value summary. In the current pyinla, result.summary_fitted_values is populated whether this is set or not, so the flag is effectively a no-op for that attribute.
link No None or 1 Selects the output scale of summary_fitted_values. Omit / None returns the linear predictor; 1 applies the likelihood's inverse link to return the response scale.

Per-Parameter Runnable Examples

Click any parameter (or scenario) to expand a minimal runnable snippet. All snippets share the shared-setup block below.

# shared setup for every snippet below
import numpy as np
import pandas as pd
from pyinla import pyinla
compute = True   (accepted spelling; no-op in pyinla)
# The fitted-value summary is populated regardless of this kwarg;
# this snippet exists to document the spelling, not a runtime effect.
df = pd.DataFrame({
    'y': [1.2, np.nan, 3.4, np.nan, 5.6, 6.7, np.nan, 8.9],
    'x': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
})
result = pyinla(
    model={'response': 'y', 'fixed': ['1', 'x']},
    family='gaussian', data=df,
    control={'predictor': {'compute': True}},
)
print(result.summary_fitted_values.shape)
print(list(result.summary_fitted_values.columns))
link = 1   return predictions on the response scale
# Poisson with NaN responses. link=None returns log-rate (eta);
# link=1 applies the inverse link (exp) so means are expected counts.
df = pd.DataFrame({
    'y': [3, np.nan, 5, 2, np.nan, 8, 4, 1],
    'x': [0.5, 0.6, 0.7, 0.3, 0.9, 1.0, 0.4, 0.2],
})
res_eta = pyinla(model={'response': 'y', 'fixed': ['1', 'x']},
                  family='poisson', data=df,
                  control={'predictor': {'compute': True}})
res_mu  = pyinla(model={'response': 'y', 'fixed': ['1', 'x']},
                  family='poisson', data=df,
                  control={'predictor': {'compute': True, 'link': 1}})

missing = np.where(df['y'].isna())[0]
print("linear-predictor mean (eta) :",
      res_eta.summary_fitted_values.iloc[missing]['mean'].values.round(3))
print("response-scale mean (lambda):",
      res_mu.summary_fitted_values.iloc[missing]['mean'].values.round(3))

Common Scenarios

NaN in the response column   per-row predictive distribution
# Missing y entries (np.nan) are treated as unobserved; the engine
# returns the full posterior predictive summary at those rows.
df = pd.DataFrame({
    'y': [1.2, np.nan, 3.4, np.nan, 5.6, 6.7, np.nan, 8.9],
    'x': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
})
result = pyinla(model={'response': 'y', 'fixed': ['1', 'x']},
                 family='gaussian', data=df,
                 control={'predictor': {'compute': True}})

missing = np.where(df['y'].isna())[0]
print(f"missing indices: {missing}")
print(result.summary_fitted_values.iloc[missing][['mean', 'sd',
                                                     '0.025quant', '0.975quant']])
NaN in a covariate column   partial-information estimation
# Rows with NaN in x2 still contribute to the (Intercept) and x1
# estimates; only the x2 coefficient is informed solely by complete rows.
df = pd.DataFrame({
    'y':  [5.0, 3.0, 4.0, 2.0, 6.0, 4.5, 3.5, 5.5],
    'x1': [0.5, 0.8, 0.3, 0.4, 0.9, 0.6, 0.7, 0.2],
    'x2': [1.0, np.nan, 3.0, 4.0, np.nan, 2.5, 3.5, 1.5],
})
result = pyinla(model={'response': 'y', 'fixed': ['1', 'x1', 'x2']},
                 family='gaussian', data=df,
                 control={'predictor': {'compute': True}})
print(result.summary_fixed[['mean', 'sd']])