Logistic Distribution Regression

A complete tutorial on Bayesian linear regression with Logistic likelihood, providing robust modeling of continuous data with heavier tails than the Gaussian distribution.

Introduction

In this tutorial, we fit a Bayesian linear regression model where the response variable follows a Logistic distribution. The Logistic distribution is a continuous probability distribution with a symmetric, bell-shaped curve but heavier tails compared to the Gaussian distribution.

This makes the Logistic distribution particularly useful when modeling data that may contain outliers, as it provides a more robust alternative to normal errors while maintaining computational tractability.

The Model

We consider a linear regression model with $n$ observations. For each observation $i = 1, \ldots, n$, the response $y_i$ is modeled as:

$$ y_i = \beta_0 + \beta_1 z_i + \varepsilon_i, \quad \varepsilon_i \stackrel{\text{iid}}{\sim} \text{Logistic}(0, \kappa^{-1}) $$

where:

Linear Predictor

The linear predictor $\eta_i$ links the mean of the response to the covariates through a linear combination:

$$ \eta_i = \beta_0 + \beta_1 z_i $$

For the Logistic likelihood with identity link, the location parameter equals the linear predictor:

$$ \mathbb{E}[y_i \mid \eta_i] = \mu_i = \eta_i $$

In matrix notation, we can write $\boldsymbol{\eta} = \mathbf{X}\boldsymbol{\beta}$, where $\mathbf{X}$ is the design matrix with a column of ones (for the intercept) and a column containing the covariate values.

Likelihood Function

The probability density function for the Logistic distribution is:

$$ f(y_i \mid \mu_i, \kappa) = \frac{\kappa \exp(-\kappa (y_i - \mu_i))}{(1 + \exp(-\kappa (y_i - \mu_i)))^{2}} $$

where $\kappa = \tau s \pi / \sqrt{3}$. The variance of each observation is:

$$ \text{Var}(y_i) = \frac{1}{s \tau} $$

The likelihood for the full dataset $\mathbf{y} = (y_1, \ldots, y_n)^\top$ is the product of individual densities:

$$ p(\mathbf{y} \mid \boldsymbol{\beta}, \tau) = \prod_{i=1}^{n} \frac{\kappa \exp(-\kappa (y_i - \eta_i))}{(1 + \exp(-\kappa (y_i - \eta_i)))^{2}} $$

Prior on the Precision

The precision parameter $\tau$ controls the spread of the distribution. In pyINLA, we work with the log-transformed precision $\theta = \log(\tau)$ to ensure positivity:

$$ \theta = \log(\tau) \in \mathbb{R} $$

The default prior is a log-gamma distribution with parameters $a = 1$ and $b = 5 \times 10^{-5}$:

$$ p(\theta) \propto \exp\bigl(a \cdot \theta - b \cdot e^{\theta}\bigr) $$

This corresponds to a Gamma$(a, b)$ prior on $\tau$ itself, which is vague enough to let the data inform the posterior while ensuring proper behavior.

Dataset

The dataset contains $n = 1000$ observations with two columns:

ColumnDescriptionType
yContinuous response variable (Logistic distributed)float
zCovariate (predictor variable)float

Download the CSV file and place it in your working directory.

Implementation in pyINLA

We now fit the model using pyINLA. The key steps are:

  1. Load the data into a pandas DataFrame
  2. Define the model structure (response and fixed effects)
  3. Specify the likelihood family as 'logistic'
  4. Call the pyinla() function
import pandas as pd
from pyinla import pyinla

# Load data
df = pd.read_csv('dataset_logistic.csv')

# Define model: y ~ 1 + z (intercept + slope)
model = {
    'response': 'y',
    'fixed': ['1', 'z']
}

# Specify control options
control = {
    'compute': {
        'dic': True,
        'cpo': True,
        'mlik': True,
        'return_marginals': True
    }
}

# Fit the model with logistic family
result = pyinla(
    model=model,
    family='logistic',
    data=df,
    control=control
)

# View posterior summaries
print(result.summary_fixed)
print(result.summary_hyperpar)

Extracting Results

The result object contains all posterior inference. Here are the key attributes:

# Fixed effects: posterior summaries for beta_0 and beta_1
print(result.summary_fixed)

# Extract posterior means
intercept = result.summary_fixed.loc['(Intercept)', 'mean']
slope = result.summary_fixed.loc['z', 'mean']

# Hyperparameters: posterior summary for precision tau
print(result.summary_hyperpar)

# Access the marginal likelihood for model comparison
print(result.mlik)

The summary_fixed DataFrame includes columns for mean, sd, 0.025quant, 0.5quant, 0.975quant, and mode.

Results and Diagnostics

The posterior summaries provide estimates for $\beta_0$, $\beta_1$, and the precision $\tau$. The true parameters used for data generation were $\beta_0 = 1$, $\beta_1 = 1$, and $\tau = 1$.

Scatter plot of y versus z with fitted regression line
Observed data $(z_i, y_i)$ with the posterior mean regression line.
Residuals versus fitted values
Residuals $y_i - \hat{\eta}_i$ versus fitted values.
Posterior density for the estimated precision
Posterior density for the estimated precision $\tau$.

To reproduce these figures locally, download the render_logistic_plots.py script and run it alongside the CSV dataset.

Data Generation (Reference)

For reproducibility, the dataset was simulated from the following data-generating process:

$$ z_i \stackrel{\text{iid}}{\sim} \mathcal{N}(0, 0.1), \quad \eta_i = \beta_0 + \beta_1 z_i, \quad y_i \sim \text{Logistic}(\eta_i, 1) $$

with true parameter values: $\beta_0 = 1$, $\beta_1 = 1$, $\tau = 1$ (i.e., $\sigma = 1$).

# Data generation script (for reference only)
import numpy as np
import pandas as pd

def rlogistic(n, mean=0, sd=1, rng=None):
    """Generate random samples from a logistic distribution."""
    if rng is None:
        rng = np.random.default_rng()
    p = rng.uniform(size=n)
    A = np.pi / np.sqrt(3)
    tauA = A / sd**2
    return (tauA * mean - np.log((1 - p) / p)) / tauA

rng = np.random.default_rng(2026)
n = 1000

# Generate covariate
z = rng.normal(0, 0.1, size=n)

# Linear predictor
eta = 1 + z

# Generate logistic response
y = rlogistic(n, mean=eta, sd=1, rng=rng)

df = pd.DataFrame({'y': y, 'z': np.round(z, 6)})
# df.to_csv('dataset_logistic.csv', index=False)