Student's t-Distribution: Observation Scale

A tutorial on heteroscedastic robust regression using the Student's t-distribution with observation-specific scale parameters.

Introduction

The Student's t-distribution provides robustness against outliers through its heavier tails. When combined with observation-specific scale parameters, it can also handle heteroscedasticity - situations where different observations have different variances.

This tutorial demonstrates how to fit a t-distribution regression model with observation-specific scale factors in pyINLA, providing both outlier robustness and proper variance modeling.

The Model

For each observation $i = 1, \ldots, n$, the response follows a scaled Student's t-distribution:

$$ \sqrt{s_i \tau}(y_i - \eta_i) \sim T_{\nu} $$

where $T_{\nu}$ is a reparameterized Student's t-distribution with unit variance for all values of $\nu > 2$.

The parameters are:

The effective variance for each observation is:

$$ \text{Var}(y_i) = \frac{1}{s_i \tau} $$

Observations with larger $s_i$ have smaller variance (more precise), while observations with smaller $s_i$ have larger variance (less precise).

When to Use Scale Parameters

Combining the t-distribution with scale parameters is useful when you have:

The scale parameter acts as a weight: larger $s_i$ gives more weight to observation $i$, while the t-distribution automatically downweights extreme residuals.

Dataset

The dataset contains $n = 400$ observations with three columns:

ColumnDescriptionType
yContinuous response variablefloat
xCovariate (predictor variable)float
scaleObservation-specific scale factor $s_i$float (positive)

Implementation in pyINLA

To incorporate observation-specific scales with the t-distribution, pass the scale vector to the scale argument:

import pandas as pd
from pyinla import pyinla

# Load data with scale column
df = pd.read_csv('dataset_t_scale.csv')

# Define model: y ~ 1 + x
model = {
    'response': 'y',
    'fixed': ['1', 'x']
}

# Specify control options
control = {
    'compute': {
        'dic': True,
        'cpo': True,
        'mlik': True
    }
}

# Fit the model with observation-specific scales
result = pyinla(
    model=model,
    family='T',  # uppercase T
    data=df,
    scale=df['scale'].to_numpy(),  # Pass scale vector
    control=control
)

# View posterior summaries
print(result.summary_fixed)
print(result.summary_hyperpar)

Important: The scale argument must be a numpy array (or array-like) with the same length as the number of observations. Each element $s_i$ should be positive.

Expected Output

Running the model produces posterior summaries that account for both the heteroscedastic structure and heavy tails:

Fixed Effects

ParameterMeanSD2.5%97.5%
(Intercept)0.4430.0370.3700.516
x1.4670.0371.3951.539

Hyperparameters

ParameterMeanSD2.5%97.5%
Precision for t1.1330.1340.8921.418
Degrees of freedom4.9011.1173.2937.619

The true parameters used for data generation were $\beta_0 = 0.5$, $\beta_1 = 1.5$, $\tau = 1$, and $\nu = 4$. The model successfully recovers these values.

Custom Priors

You can customize priors on the precision and degrees of freedom hyperparameters:

# Custom priors with scale
control = {
    'family': {
        'hyper': [
            {
                'id': 'prec',
                'prior': 'loggamma',
                'param': [1, 0.01],
            },
            {
                'id': 'dof',
                'prior': 'pc.dof',
                'param': [10, 0.5],   # P(nu > 10) = 0.5
            },
        ]
    },
    'compute': {
        'dic': True
    }
}

result = pyinla(
    model=model,
    family='T',
    data=df,
    scale=df['scale'].to_numpy(),
    control=control
)

Data Generation (Reference)

The dataset was simulated with heteroscedastic t-distributed errors:

$$ \beta_0 = 0.5, \quad \beta_1 = 1.5, \quad \tau = 1, \quad \nu = 4 $$
import numpy as np
import pandas as pd

rng = np.random.default_rng(2028)
n = 400

# Generate covariate
x = rng.normal(0, 1, size=n)

# Linear predictor
eta = 0.5 + 1.5 * x

# Generate heteroscedastic scales
scale = rng.uniform(0.5, 2.0, size=n)

# Degrees of freedom
nu = 4

# Generate t-distributed response with scaling
# With scale s, effective precision is s*tau
# The t is reparameterized for unit variance
t_errors = rng.standard_t(df=nu, size=n)
y = eta + t_errors / (np.sqrt(scale) * np.sqrt(nu / (nu - 2)))

df = pd.DataFrame({
    'y': y,
    'x': np.round(x, 6),
    'scale': np.round(scale, 6)
})
# df.to_csv('dataset_t_scale.csv', index=False)

Interpretation Notes

Results and Diagnostics

The posterior summaries recover the true parameters: $\beta_0 = 0.5$, $\beta_1 = 1.5$, $\tau = 1$, and $\nu = 4$. The diagnostic plots below compare observed values with fitted values and show the residual pattern.

Observed values versus fitted values with identity line
Observed values $y_i$ versus fitted values $\hat{\eta}_i$.
Residuals versus fitted values
Residuals $y_i - \hat{\eta}_i$ versus fitted values.

To reproduce these figures locally, download the render_t_scale_plots.py script and run it alongside the CSV dataset.