Learn PyINLA

Predictions & Missing Values

Learn how pyINLA handles missing values (NAs) in responses and covariates, and how to generate predictions for unobserved data.

Overview

Key Concepts

In Bayesian inference with INLA, missing values are handled naturally through the posterior predictive distribution. When the response \(y_i\) is missing (NA), INLA computes the predictive distribution \(\pi(y_i \mid \mathbf{y}_{obs})\) for that observation.

pyINLA supports missing values in two contexts:

Missing response values - INLA automatically computes predictive distributions
Missing covariate values - Observations contribute partially to estimation

Missing In	Python Syntax	How INLA Handles It
Response (y)	`np.nan`	Computes predictive distribution for missing values
Covariates (x)	`np.nan`	Accepted without error, but the missing entries are mechanically propagated and contaminate every coefficient. Pre-impute or drop incomplete rows upstream — see warning below.

Missing Values in the Response

When the response variable contains missing values (NA/NaN), INLA treats them as unobserved data and computes their posterior predictive distribution:

\pi(y_i^{miss} \mid \mathbf{y}^{obs}) = \int \pi(y_i^{miss} \mid \boldsymbol{\eta}_i, \boldsymbol{\theta}) \cdot \pi(\boldsymbol{\eta}, \boldsymbol{\theta} \mid \mathbf{y}^{obs}) \, d\boldsymbol{\eta} \, d\boldsymbol{\theta}

This is the foundation for:

Prediction - Forecasting new observations
Imputation - Filling in missing data
Cross-validation - Leave-one-out predictions

Basic Usage

Set missing response values to np.nan and read the fitted-value summary:

import numpy as np
import pandas as pd
from pyinla import pyinla

# Create data with missing response values
df = pd.DataFrame({
    'y': [1.2, np.nan, 3.4, np.nan, 5.6, 6.7, np.nan, 8.9],  # NaN = missing
    'x': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
})

result = pyinla(
    model={'response': 'y', 'fixed': ['1', 'x']},
    family='gaussian',
    data=df,
)

# summary_fitted_values is populated for ALL observations, including the
# missing rows; no extra control flag is needed in the current pyinla.
print(result.summary_fitted_values)

Accessing Predictions

The fitted values for missing observations contain the predictive distribution summary:

# Find indices of missing values
missing_idx = np.where(df['y'].isna())[0]
print(f"Missing indices: {missing_idx}")

# Get predictions for missing values
predictions = result.summary_fitted_values.iloc[missing_idx]
print(predictions[['mean', 'sd', '0.025quant', '0.975quant']])

Output Columns

`mean`	Posterior mean of the prediction
`sd`	Posterior standard deviation
`0.025quant`	2.5% quantile (lower bound of 95% CI)
`0.5quant`	Posterior median
`0.975quant`	97.5% quantile (upper bound of 95% CI)
`mode`	Posterior mode

The Link Parameter

For non-Gaussian likelihoods (Poisson, Binomial, etc.), fitted values can be returned on two scales via the link parameter:

$$\mu_i = g^{-1}(\eta_i)$$ where \(g\) is the likelihood's link function and \(\eta_i\) is the linear predictor

Value	Returned scale	Meaning
omit / `None`	Linear predictor \(\eta\)	Raw linear-predictor scale: \(\eta = \alpha + X\beta + \sum f(\cdot)\). For Poisson this is the log-rate; for Binomial it's the logit (or whichever link is configured).
`1`	Response \(\mu = g^{-1}(\eta)\)	Applies the likelihood's inverse link: \(\exp(\eta)\) for Poisson, \(1/(1+e^{-\eta})\) for Binomial(logit), etc.

Only None (or omit) and 1 are accepted by the engine; other integer values cause the C engine to abort. To change which inverse link is applied (e.g., probit vs logit for Binomial), set family's link via control['family']['control.link'], not via link here.

Example: Poisson Predictions on Response Scale

import numpy as np
import pandas as pd
from pyinla import pyinla

# Poisson data with missing values
df = pd.DataFrame({
    'y': [3, np.nan, 5, 2, np.nan, 8, 4, 1],
    'x': [0.5, 0.6, 0.7, 0.3, 0.9, 1.0, 0.4, 0.2]
})

# link=1 gives predictions on the response scale (lambda for Poisson)
result = pyinla(
    model={'response': 'y', 'fixed': ['1', 'x']},
    family='poisson',
    data=df,
    control={
        'predictor': {
            'compute': True,
            'link': 1  # Response scale: exp(eta)
        }
    }
)

# Predictions are now on the count scale (expected counts)
missing_idx = np.where(df['y'].isna())[0]
print("Predicted counts for missing observations:")
print(result.summary_fitted_values.iloc[missing_idx][['mean', 'sd']])

Tip: Use link=1 when you want predictions that are directly interpretable (counts for Poisson, probabilities for Binomial). Use the default (identity) when working with the linear predictor.

Missing Values in Covariates

pyINLA accepts np.nan in covariate columns without raising an error, but the engine does not perform principled multiple-imputation or partial-information regression on its own. The handling is mechanical: any row whose covariates contain NaN contributes to the fit with that NaN propagated into the design matrix as zero, which contaminates every coefficient (intercept and other slopes included), not just the slope on the affected covariate.

What actually happens (empirical)

Comparing two fits on the eight-row example below:

All eight rows kept (with two NaNs in x2): (Intercept) sd ≈ 1.6, x1 sd ≈ 2.1.
Drop incomplete rows first (six rows): (Intercept) sd ≈ 0.85, x1 sd ≈ 1.5.

The all-rows fit has larger posterior uncertainty for every parameter and meaningfully different posterior means — the opposite of what "partial information improves on complete-case analysis" would imply. The takeaway: don't rely on the engine to do the right thing here. Either impute the missing covariates upstream (e.g. with scikit-learn's IterativeImputer) or drop incomplete rows explicitly.

Example: Covariate with Missing Values

import numpy as np
import pandas as pd
from pyinla import pyinla

# Data with missing covariate values
df = pd.DataFrame({
    'y': [5.0, 3.0, 4.0, 2.0, 6.0, 4.5, 3.5, 5.5],
    'x1': [0.5, 0.8, 0.3, 0.4, 0.9, 0.6, 0.7, 0.2],
    'x2': [1.0, np.nan, 3.0, 4.0, np.nan, 2.5, 3.5, 1.5]  # Missing values
})

# pyinla accepts the NaNs and runs without error -- but see the warning
# box above about how the missing entries contaminate the fit.
result = pyinla(
    model={'response': 'y', 'fixed': ['1', 'x1', 'x2']},
    family='gaussian',
    data=df,
)
print("Fixed Effects (all 8 rows, includes contaminated estimates):")
print(result.summary_fixed[['mean', 'sd']])

# Recommended pattern: drop incomplete rows explicitly and re-fit.
result_complete = pyinla(
    model={'response': 'y', 'fixed': ['1', 'x1', 'x2']},
    family='gaussian',
    data=df.dropna(),
)
print("\nFixed Effects (complete-case fit, 6 rows):")
print(result_complete.summary_fixed[['mean', 'sd']])

Specification in pyINLA

Allowed Parameters of `control['predictor']`

The predictor block accepts exactly two keys. The source of truth is safety/control.py; other keys are rejected with pyinla safety check: unsupported keys in control['predictor']: ....

Parameter	Required	Type	Description
`compute`	No	bool	Hint to the engine to compute the fitted-value summary. In the current pyinla, `result.summary_fitted_values` is populated whether this is set or not, so the flag is effectively a no-op for that attribute.
`link`	No	`None` or `1`	Selects the output scale of `summary_fitted_values`. Omit / `None` returns the linear predictor; `1` applies the likelihood's inverse link to return the response scale.

Per-Parameter Runnable Examples

Click any parameter (or scenario) to expand a minimal runnable snippet. All snippets share the shared-setup block below.

# shared setup for every snippet below
import numpy as np
import pandas as pd
from pyinla import pyinla

compute = True (accepted spelling; no-op in pyinla)

# The fitted-value summary is populated regardless of this kwarg;
# this snippet exists to document the spelling, not a runtime effect.
df = pd.DataFrame({
    'y': [1.2, np.nan, 3.4, np.nan, 5.6, 6.7, np.nan, 8.9],
    'x': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
})
result = pyinla(
    model={'response': 'y', 'fixed': ['1', 'x']},
    family='gaussian', data=df,
    control={'predictor': {'compute': True}},
)
print(result.summary_fitted_values.shape)
print(list(result.summary_fitted_values.columns))

link = 1 return predictions on the response scale

# Poisson with NaN responses. link=None returns log-rate (eta);
# link=1 applies the inverse link (exp) so means are expected counts.
df = pd.DataFrame({
    'y': [3, np.nan, 5, 2, np.nan, 8, 4, 1],
    'x': [0.5, 0.6, 0.7, 0.3, 0.9, 1.0, 0.4, 0.2],
})
res_eta = pyinla(model={'response': 'y', 'fixed': ['1', 'x']},
                  family='poisson', data=df,
                  control={'predictor': {'compute': True}})
res_mu  = pyinla(model={'response': 'y', 'fixed': ['1', 'x']},
                  family='poisson', data=df,
                  control={'predictor': {'compute': True, 'link': 1}})

missing = np.where(df['y'].isna())[0]
print("linear-predictor mean (eta) :",
      res_eta.summary_fitted_values.iloc[missing]['mean'].values.round(3))
print("response-scale mean (lambda):",
      res_mu.summary_fitted_values.iloc[missing]['mean'].values.round(3))

Common Scenarios

NaN in the response column per-row predictive distribution

# Missing y entries (np.nan) are treated as unobserved; the engine
# returns the full posterior predictive summary at those rows.
df = pd.DataFrame({
    'y': [1.2, np.nan, 3.4, np.nan, 5.6, 6.7, np.nan, 8.9],
    'x': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
})
result = pyinla(model={'response': 'y', 'fixed': ['1', 'x']},
                 family='gaussian', data=df,
                 control={'predictor': {'compute': True}})

missing = np.where(df['y'].isna())[0]
print(f"missing indices: {missing}")
print(result.summary_fitted_values.iloc[missing][['mean', 'sd',
                                                     '0.025quant', '0.975quant']])

NaN in a covariate column partial-information estimation

# Rows with NaN in x2 still contribute to the (Intercept) and x1
# estimates; only the x2 coefficient is informed solely by complete rows.
df = pd.DataFrame({
    'y':  [5.0, 3.0, 4.0, 2.0, 6.0, 4.5, 3.5, 5.5],
    'x1': [0.5, 0.8, 0.3, 0.4, 0.9, 0.6, 0.7, 0.2],
    'x2': [1.0, np.nan, 3.0, 4.0, np.nan, 2.5, 3.5, 1.5],
})
result = pyinla(model={'response': 'y', 'fixed': ['1', 'x1', 'x2']},
                 family='gaussian', data=df,
                 control={'predictor': {'compute': True}})
print(result.summary_fixed[['mean', 'sd']])

Overview

Key Concepts

Missing Values in the Response

Basic Usage

Accessing Predictions

Output Columns

The Link Parameter

Example: Poisson Predictions on Response Scale

Missing Values in Covariates

What actually happens (empirical)

Example: Covariate with Missing Values

Specification in pyINLA

Allowed Parameters of control['predictor']

Per-Parameter Runnable Examples

Common Scenarios

Allowed Parameters of `control['predictor']`