Advanced

Multiple Likelihoods

Fit a single joint model where different observations follow different likelihoods (e.g. Gaussian + Poisson) but share latent structure such as regression coefficients or random effects.

Why Joint Likelihoods?

Some datasets contain multiple response types that share underlying structure: a continuous outcome alongside a count, a normally-distributed measurement alongside a binary indicator, or several biomarkers measured on different scales. Fitting them jointly (rather than as independent models) lets pyINLA pool information through shared covariates, intercepts, or random effects.

Key idea

pyINLA spreads the responses across separate columns (Y1, Y2, ...) and uses np.nan to mark which family each row belongs to: rows that belong to family 1 have a value in Y1 and NaN in Y2, and vice versa. The engine reads the non-NaN column to decide which likelihood is active for each observation.

Mathematical Formulation

For two likelihoods \(\mathcal{L}_1\) and \(\mathcal{L}_2\) (e.g. Gaussian and Poisson):

$$y_i \;\sim\; \begin{cases} \mathcal{L}_1\!\left(g_1^{-1}(\eta_i),\; \boldsymbol{\theta}_1\right), & i \in \mathcal{I}_1 \\ \mathcal{L}_2\!\left(g_2^{-1}(\eta_i),\; \boldsymbol{\theta}_2\right), & i \in \mathcal{I}_2 \end{cases}$$

The linear predictor \(\eta_i\) is shared across families:

$$\eta_i \;=\; \alpha_{f(i)} + \mathbf{x}_i^\top \boldsymbol{\beta} + \sum_k f_k(\cdot)$$

Each family has its own intercept \(\alpha_{f(i)}\) (selected by the indicator column) and its own hyperparameters \(\boldsymbol{\theta}_k\); fixed and random effects can be shared, family-specific, or any mix.

Data Layout: How to Build the DataFrame

For \(K\) likelihoods, build a DataFrame with these columns:

ColumnPurpose
Y1, Y2, ..., YK Response columns, one per likelihood. Row i uses likelihood k iff Yk[i] is non-NaN and every other Yj[i] is NaN.
I1, I2, ..., IK Family-indicator columns: Ik[i] = 1 if row i belongs to likelihood k, NaN otherwise. These are referenced in fixed to give each family its own intercept.
Shared covariates (e.g. x) Defined for all rows. The coefficient is shared across families.
Per-family kwargs (e.g. Ntrials for Binomial) If the kwarg is only meaningful for one family, set the other family's rows to a safe value (1 for Ntrials) since the engine reads them regardless.

Worked example: what each column does in fixed=['-1', 'I1', 'I2', 'x']

For 30 Gaussian rows followed by 20 Poisson rows the DataFrame looks like this (one row per observation, five columns):

rowfamilyY1Y2I1I2x
0Gaussian1.23NaN1NaN0.50
1Gaussian-0.40NaN1NaN-0.20
.....................
29Gaussian0.80NaN1NaN1.10
30PoissonNaN12NaN10.10
31PoissonNaN8NaN1-0.30
.....................
49PoissonNaN11NaN10.70

Notice that I1 and I2 are single vectors of length 50 (not 2-column matrices). Each one is "1 where its family is active, NaN otherwise". x is fully populated on every row.

What each item of fixed=['-1', 'I1', 'I2', 'x'] does:

ItemRoleAffects which rows
'-1' Not a coefficient -- a flag that tells pyINLA to not add an implicit global (Intercept) column. Required whenever you use family-specific intercepts (otherwise the global intercept is collinear with I1/I2 and the fit is unidentifiable). none directly
'I1' Coefficient on the I1 column. pyINLA replaces NaN entries with 0 in the design matrix, so I1 becomes [1,...,1, 0,...,0]. The coefficient βI1 only contributes to the linear predictor on rows where it's 1: the Gaussian family-specific intercept. Gaussian rows only
'I2' Coefficient on I2 (which is [0,...,0, 1,...,1] after NaN-as-0). βI2 is the Poisson family-specific intercept. Poisson rows only
'x' Coefficient on the shared covariate x, defined on all 50 rows. βx is a single slope estimated from all observations -- the shared effect that justifies fitting jointly rather than fitting each likelihood alone. Both families

The full linear predictor before NaN-substitution is

\[\eta_i \;=\; \beta_{I_1}\, I_{1,i} \;+\; \beta_{I_2}\, I_{2,i} \;+\; \beta_x\, x_i.\]

After pyINLA replaces NaN with 0 in the design columns, each family ends up seeing a different reduced form:

\[\begin{aligned} \text{Gaussian rows: }& \quad \eta_i \;=\; \beta_{I_1} \;+\; \beta_x\, x_i \\ \text{Poisson rows: } & \quad \eta_i \;=\; \beta_{I_2} \;+\; \beta_x\, x_i \end{aligned}\]

Each likelihood gets its own intercept (one I-coefficient per family), and they share the slope on x -- exactly as intended.

Reading the dataset summary

NaN in different columns means different things:

  • NaN in a response column (Y1, Y2): "this row's outcome belongs to the other likelihood". pyINLA does not try to predict what Y1 would have been on a Poisson row -- that row is firmly a Poisson observation.
  • NaN in an indicator column (I1, I2): silently substituted to 0 in the design matrix -- pure routing, no imputation.
  • NaN in a shared covariate (x): avoid -- the NaN gets substituted to 0 and contaminates every coefficient (see the predictions & NAs page for the empirical demonstration).

Specification in pyINLA

Signature changes

Compared with a single-likelihood call, two arguments become lists:

ArgumentSingle likelihoodMultiple likelihoods
model['response'] 'y' ['Y1', 'Y2'] -- one column name per likelihood
family 'gaussian' ['gaussian', 'poisson'] -- same length as response
control['family'] {...} [{...}, {...}] -- per-family dict; pass {} for defaults
model['fixed'] ['1', 'x'] ['-1', 'I1', 'I2', 'x'] -- use '-1' to drop the global intercept, then add per-family indicators

Per-scenario runnable examples

Each scenario below is a complete, runnable snippet. The first two share a synthetic Gaussian + Poisson dataset; the third uses a Gaussian + Binomial dataset with per-row Ntrials.

Scenario 1   Basic Gaussian + Poisson with a shared covariate
import numpy as np
import pandas as pd
from pyinla import pyinla

rng = np.random.default_rng(314)
d1  = rng.normal(size=30)
d2  = rng.poisson(10, size=20)
Y1  = np.concatenate([d1, np.full(20, np.nan)])
Y2  = np.concatenate([np.full(30, np.nan), d2.astype(float)])
I1  = np.concatenate([np.ones(30),       np.full(20, np.nan)])
I2  = np.concatenate([np.full(30, np.nan), np.ones(20)])
x   = rng.normal(size=50)
df  = pd.DataFrame({'Y1': Y1, 'Y2': Y2,
                    'I1': I1, 'I2': I2, 'x': x})

res = pyinla(
    model={'response': ['Y1', 'Y2'],
           'fixed':    ['-1', 'I1', 'I2', 'x']},
    family=['gaussian', 'poisson'],
    data=df,
)
print(res.summary_fixed)
print(res.summary_hyperpar)
Scenario 2   Per-family control['family']: Gaussian precision fixed, Poisson default
import numpy as np
import pandas as pd
from pyinla import pyinla

# Same Gaussian + Poisson dataset as Scenario 1.
rng = np.random.default_rng(314)
d1  = rng.normal(size=30)
d2  = rng.poisson(10, size=20)
df  = pd.DataFrame({
    'Y1': np.concatenate([d1, np.full(20, np.nan)]),
    'Y2': np.concatenate([np.full(30, np.nan), d2.astype(float)]),
    'I1': np.concatenate([np.ones(30),       np.full(20, np.nan)]),
    'I2': np.concatenate([np.full(30, np.nan), np.ones(20)]),
    'x':  rng.normal(size=50),
})

# control['family'] is a list of dicts aligned with the family list.
# {} = use defaults for that family.
res = pyinla(
    model={'response': ['Y1', 'Y2'],
           'fixed':    ['-1', 'I1', 'I2', 'x']},
    family=['gaussian', 'poisson'],
    data=df,
    control={
        'family': [
            {'hyper': [{'initial': 10, 'fixed': True}]},  # Gaussian: prec = exp(10), held fixed
            {},                                              # Poisson: defaults
        ],
    },
)
print(res.summary_hyperpar)  # Gaussian prec absent (held fixed); Poisson empty
Scenario 3   Gaussian + Binomial with per-row Ntrials
import numpy as np
import pandas as pd
from pyinla import pyinla

rng      = np.random.default_rng(314)
n_g, n_b = 40, 30
y_g      = rng.normal(loc=2.0, scale=0.5, size=n_g)
y_b      = rng.binomial(n=10, p=0.3, size=n_b)
df = pd.DataFrame({
    'Y1': np.concatenate([y_g, np.full(n_b, np.nan)]),
    'Y2': np.concatenate([np.full(n_g, np.nan), y_b.astype(float)]),
    'I1': np.concatenate([np.ones(n_g),       np.full(n_b, np.nan)]),
    'I2': np.concatenate([np.full(n_g, np.nan), np.ones(n_b)]),
    'x':  rng.normal(size=n_g + n_b),
})

# Ntrials must be a positive integer EVERYWHERE the engine reads it,
# even on Gaussian rows. Use 1 as a safe filler.
Nt = np.concatenate([np.ones(n_g), np.full(n_b, 10.0)])

res = pyinla(
    model={'response': ['Y1', 'Y2'],
           'fixed':    ['-1', 'I1', 'I2', 'x']},
    family=['gaussian', 'binomial'],
    data=df,
    Ntrials=Nt,
)
print(res.summary_fixed)

Common Pitfalls