Predictions & Missing Values
Learn how pyINLA handles missing values (NAs) in responses and covariates, and how to generate predictions for unobserved data.
Overview
Key Concepts
In Bayesian inference with INLA, missing values are handled naturally through the posterior predictive distribution. When the response \(y_i\) is missing (NA), INLA computes the predictive distribution \(\pi(y_i \mid \mathbf{y}_{obs})\) for that observation.
pyINLA supports missing values in two contexts:
- Missing response values - INLA automatically computes predictive distributions
- Missing covariate values - Observations contribute partially to estimation
| Missing In | Python Syntax | How INLA Handles It |
|---|---|---|
| Response (y) | np.nan |
Computes predictive distribution for missing values |
| Covariates (x) | np.nan |
Accepted without error, but the missing entries are mechanically propagated and contaminate every coefficient. Pre-impute or drop incomplete rows upstream — see warning below. |
Missing Values in the Response
When the response variable contains missing values (NA/NaN), INLA treats them as unobserved data and computes their posterior predictive distribution:
This is the foundation for:
- Prediction - Forecasting new observations
- Imputation - Filling in missing data
- Cross-validation - Leave-one-out predictions
Basic Usage
Set missing response values to np.nan and read the fitted-value summary:
import numpy as np
import pandas as pd
from pyinla import pyinla
# Create data with missing response values
df = pd.DataFrame({
'y': [1.2, np.nan, 3.4, np.nan, 5.6, 6.7, np.nan, 8.9], # NaN = missing
'x': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
})
result = pyinla(
model={'response': 'y', 'fixed': ['1', 'x']},
family='gaussian',
data=df,
)
# summary_fitted_values is populated for ALL observations, including the
# missing rows; no extra control flag is needed in the current pyinla.
print(result.summary_fitted_values)
Accessing Predictions
The fitted values for missing observations contain the predictive distribution summary:
# Find indices of missing values
missing_idx = np.where(df['y'].isna())[0]
print(f"Missing indices: {missing_idx}")
# Get predictions for missing values
predictions = result.summary_fitted_values.iloc[missing_idx]
print(predictions[['mean', 'sd', '0.025quant', '0.975quant']])
Output Columns
mean | Posterior mean of the prediction |
sd | Posterior standard deviation |
0.025quant | 2.5% quantile (lower bound of 95% CI) |
0.5quant | Posterior median |
0.975quant | 97.5% quantile (upper bound of 95% CI) |
mode | Posterior mode |
The Link Parameter
For non-Gaussian likelihoods (Poisson, Binomial, etc.), fitted values can be returned on two scales via the link parameter:
where \(g\) is the likelihood's link function and \(\eta_i\) is the linear predictor
| Value | Returned scale | Meaning |
|---|---|---|
omit / None |
Linear predictor \(\eta\) | Raw linear-predictor scale: \(\eta = \alpha + X\beta + \sum f(\cdot)\). For Poisson this is the log-rate; for Binomial it's the logit (or whichever link is configured). |
1 |
Response \(\mu = g^{-1}(\eta)\) | Applies the likelihood's inverse link: \(\exp(\eta)\) for Poisson, \(1/(1+e^{-\eta})\) for Binomial(logit), etc. |
Only None (or omit) and 1 are accepted by the engine; other integer values cause the C engine to abort. To change which inverse link is applied (e.g., probit vs logit for Binomial), set family's link via control['family']['control.link'], not via link here.
Example: Poisson Predictions on Response Scale
import numpy as np
import pandas as pd
from pyinla import pyinla
# Poisson data with missing values
df = pd.DataFrame({
'y': [3, np.nan, 5, 2, np.nan, 8, 4, 1],
'x': [0.5, 0.6, 0.7, 0.3, 0.9, 1.0, 0.4, 0.2]
})
# link=1 gives predictions on the response scale (lambda for Poisson)
result = pyinla(
model={'response': 'y', 'fixed': ['1', 'x']},
family='poisson',
data=df,
control={
'predictor': {
'compute': True,
'link': 1 # Response scale: exp(eta)
}
}
)
# Predictions are now on the count scale (expected counts)
missing_idx = np.where(df['y'].isna())[0]
print("Predicted counts for missing observations:")
print(result.summary_fitted_values.iloc[missing_idx][['mean', 'sd']])
link=1 when you want predictions that are directly interpretable (counts for Poisson, probabilities for Binomial). Use the default (identity) when working with the linear predictor.
Missing Values in Covariates
pyINLA accepts np.nan in covariate columns without raising an error, but the engine does not perform principled multiple-imputation or partial-information regression on its own. The handling is mechanical: any row whose covariates contain NaN contributes to the fit with that NaN propagated into the design matrix as zero, which contaminates every coefficient (intercept and other slopes included), not just the slope on the affected covariate.
What actually happens (empirical)
Comparing two fits on the eight-row example below:
- All eight rows kept (with two
NaNs inx2):(Intercept)sd ≈ 1.6,x1sd ≈ 2.1. - Drop incomplete rows first (six rows):
(Intercept)sd ≈ 0.85,x1sd ≈ 1.5.
The all-rows fit has larger posterior uncertainty for every parameter and meaningfully different posterior means — the opposite of what "partial information improves on complete-case analysis" would imply. The takeaway: don't rely on the engine to do the right thing here. Either impute the missing covariates upstream (e.g. with scikit-learn's IterativeImputer) or drop incomplete rows explicitly.
Example: Covariate with Missing Values
import numpy as np
import pandas as pd
from pyinla import pyinla
# Data with missing covariate values
df = pd.DataFrame({
'y': [5.0, 3.0, 4.0, 2.0, 6.0, 4.5, 3.5, 5.5],
'x1': [0.5, 0.8, 0.3, 0.4, 0.9, 0.6, 0.7, 0.2],
'x2': [1.0, np.nan, 3.0, 4.0, np.nan, 2.5, 3.5, 1.5] # Missing values
})
# pyinla accepts the NaNs and runs without error -- but see the warning
# box above about how the missing entries contaminate the fit.
result = pyinla(
model={'response': 'y', 'fixed': ['1', 'x1', 'x2']},
family='gaussian',
data=df,
)
print("Fixed Effects (all 8 rows, includes contaminated estimates):")
print(result.summary_fixed[['mean', 'sd']])
# Recommended pattern: drop incomplete rows explicitly and re-fit.
result_complete = pyinla(
model={'response': 'y', 'fixed': ['1', 'x1', 'x2']},
family='gaussian',
data=df.dropna(),
)
print("\nFixed Effects (complete-case fit, 6 rows):")
print(result_complete.summary_fixed[['mean', 'sd']])
Specification in pyINLA
Allowed Parameters of control['predictor']
The predictor block accepts exactly two keys. The source of truth is safety/control.py; other keys are rejected with pyinla safety check: unsupported keys in control['predictor']: ....
| Parameter | Required | Type | Description |
|---|---|---|---|
compute |
No | bool | Hint to the engine to compute the fitted-value summary. In the current pyinla, result.summary_fitted_values is populated whether this is set or not, so the flag is effectively a no-op for that attribute. |
link |
No | None or 1 |
Selects the output scale of summary_fitted_values. Omit / None returns the linear predictor; 1 applies the likelihood's inverse link to return the response scale. |
Per-Parameter Runnable Examples
Click any parameter (or scenario) to expand a minimal runnable snippet. All snippets share the shared-setup block below.
# shared setup for every snippet below
import numpy as np
import pandas as pd
from pyinla import pyinla
compute = True (accepted spelling; no-op in pyinla)
# The fitted-value summary is populated regardless of this kwarg;
# this snippet exists to document the spelling, not a runtime effect.
df = pd.DataFrame({
'y': [1.2, np.nan, 3.4, np.nan, 5.6, 6.7, np.nan, 8.9],
'x': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
})
result = pyinla(
model={'response': 'y', 'fixed': ['1', 'x']},
family='gaussian', data=df,
control={'predictor': {'compute': True}},
)
print(result.summary_fitted_values.shape)
print(list(result.summary_fitted_values.columns))
link = 1 return predictions on the response scale
# Poisson with NaN responses. link=None returns log-rate (eta);
# link=1 applies the inverse link (exp) so means are expected counts.
df = pd.DataFrame({
'y': [3, np.nan, 5, 2, np.nan, 8, 4, 1],
'x': [0.5, 0.6, 0.7, 0.3, 0.9, 1.0, 0.4, 0.2],
})
res_eta = pyinla(model={'response': 'y', 'fixed': ['1', 'x']},
family='poisson', data=df,
control={'predictor': {'compute': True}})
res_mu = pyinla(model={'response': 'y', 'fixed': ['1', 'x']},
family='poisson', data=df,
control={'predictor': {'compute': True, 'link': 1}})
missing = np.where(df['y'].isna())[0]
print("linear-predictor mean (eta) :",
res_eta.summary_fitted_values.iloc[missing]['mean'].values.round(3))
print("response-scale mean (lambda):",
res_mu.summary_fitted_values.iloc[missing]['mean'].values.round(3))
Common Scenarios
NaN in the response column per-row predictive distribution
# Missing y entries (np.nan) are treated as unobserved; the engine
# returns the full posterior predictive summary at those rows.
df = pd.DataFrame({
'y': [1.2, np.nan, 3.4, np.nan, 5.6, 6.7, np.nan, 8.9],
'x': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
})
result = pyinla(model={'response': 'y', 'fixed': ['1', 'x']},
family='gaussian', data=df,
control={'predictor': {'compute': True}})
missing = np.where(df['y'].isna())[0]
print(f"missing indices: {missing}")
print(result.summary_fitted_values.iloc[missing][['mean', 'sd',
'0.025quant', '0.975quant']])
NaN in a covariate column partial-information estimation
# Rows with NaN in x2 still contribute to the (Intercept) and x1
# estimates; only the x2 coefficient is informed solely by complete rows.
df = pd.DataFrame({
'y': [5.0, 3.0, 4.0, 2.0, 6.0, 4.5, 3.5, 5.5],
'x1': [0.5, 0.8, 0.3, 0.4, 0.9, 0.6, 0.7, 0.2],
'x2': [1.0, np.nan, 3.0, 4.0, np.nan, 2.5, 3.5, 1.5],
})
result = pyinla(model={'response': 'y', 'fixed': ['1', 'x1', 'x2']},
family='gaussian', data=df,
control={'predictor': {'compute': True}})
print(result.summary_fixed[['mean', 'sd']])