Student's t-Distribution Regression

A tutorial on robust Bayesian regression using the Student's t-distribution, which provides resistance to outliers through heavier tails than the Gaussian.

Introduction

The Student's t-distribution extends the normal distribution with heavier tails, providing robustness against outliers. When data may contain extreme values or when modeling errors with heavier tails than Gaussian, the t-distribution is a natural choice.

As the degrees of freedom $\nu$ increases, the t-distribution approaches the normal distribution. With smaller $\nu$, the distribution has heavier tails.

The Model

For each observation $i = 1, \ldots, n$, the response $y_i$ follows a reparameterized Student's t-distribution:

$$ \sqrt{s_i \tau}(y_i - \eta_i) \sim T_{\nu} $$

where:

Hyperparameters

The t-distribution has two hyperparameters:

The constraint $\nu > 2$ ensures the variance exists. The PC prior penalizes deviation from the Gaussian ($\nu = \infty$).

Dataset

The dataset contains $n = 300$ observations:

ColumnDescriptionType
yContinuous response (t-distributed errors)float
xCovariatefloat

Implementation in pyINLA

import pandas as pd
from pyinla import pyinla

# Load data
df = pd.read_csv('dataset_t_regression.csv')

# Define model: y ~ 1 + x
model = {
    'response': 'y',
    'fixed': ['1', 'x']
}

# Fit with Student's t family
result = pyinla(
    model=model,
    family='T',  # uppercase T
    data=df
)

# View results
print(result.summary_fixed)
print(result.summary_hyperpar)

Custom Priors

You can customize priors on both hyperparameters:

# Custom priors on precision and degrees of freedom
control = {
    'family': {
        'hyper': [
            {
                'id': 'prec',
                'prior': 'loggamma',
                'param': [1, 0.01],
            },
            {
                'id': 'dof',
                'prior': 'pc.dof',
                'param': [10, 0.5],   # P(nu > 10) = 0.5
            },
        ]
    }
}

result = pyinla(
    model=model,
    family='T',
    data=df,
    control=control
)

Results and Diagnostics

The posterior summaries recover the true parameters. Outlier observations visible in the residuals are naturally handled by the heavy-tailed t-distribution.

Observed values versus fitted values with identity line
Observed values $y_i$ versus fitted values $\hat{\eta}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$.
Residuals versus fitted values
Residuals $y_i - \hat{\eta}_i$ versus fitted values.

To reproduce these figures locally, download the render_t_plots.py script and run it alongside the CSV dataset.

Data Generation (Reference)

The dataset was simulated with:

$$ \beta_0 = 1, \quad \beta_1 = 2, \quad \nu = 5, \quad \tau = 1 $$
import numpy as np
import pandas as pd

rng = np.random.default_rng(2026)
n = 300
nu = 5  # degrees of freedom

x = rng.normal(0, 1, size=n)
eta = 1 + 2 * x

# t-distributed errors scaled for unit variance
t_errors = rng.standard_t(df=nu, size=n)
y = eta + t_errors / np.sqrt(nu / (nu - 2))

df = pd.DataFrame({'y': y, 'x': x})
# df.to_csv('dataset_t_regression.csv', index=False)