Introduction to Generalized Linear Models for Psychology

Generalized Linear Models Workshop

Last modified: 29-01-2026

Linear Regression

Linear Regression

Estimate the expected (average) outcome given predictors.

The expected value of a random variable with a finite number of outcomes is a weighted average of all possible outcomes.

Linear Regression

A constant change in x leads to a constant change in the expected outcome.

Normal linear regression: lm(y~x)

For a fixed x_i, the model describes the distribution of possible outcomes around the mean \mu_i.

The expected value of a random variable with a finite number of outcomes is a weighted average of all possible outcomes.

The Normal distribution

Normal distribution

A Normal distribution has parameters \mu (mean) and \sigma^2 (variance):

f(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( - \frac{(y - \mu)^2}{2\sigma^2} \right)




Normal linear regression


y_i \mid x_i \sim \mathcal{N}(\mu_i, \sigma^2), \qquad \mu_i = \beta_0 + \beta_1 x_i
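
A minimal R sketch of this model, using simulated data with made-up coefficient values (not the workshop data):

set.seed(123)
n <- 100
x <- runif(n, 0, 10)                       # predictor
y <- rnorm(n, mean = 2 + 0.5 * x, sd = 1)  # y_i | x_i ~ N(beta0 + beta1*x_i, sigma^2)

fit <- lm(y ~ x)   # normal linear regression
coef(fit)          # estimates of beta0 and beta1 (close to 2 and 0.5)
sigma(fit)         # estimate of sigma (close to 1)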

Normal distribution

  • Support: (-\infty, +\infty)
  • Mean (expected value): \mu
  • Variance: \sigma^2
  • Independence: changing the mean does not change the variance!

Why do these matter for regression?

When building a model, we need to know:

  • Support: what range of values is possible
  • Mean (expected value): what value to predict on average
  • Variance: how much variation around the mean
  • Mean–variance relationship: does variance change with the mean?

Where the Normal model breaks down

  • Reaction Times
  • Exam pass/fail
  • Number of errors in a task

Probability distributions

Understanding distributions

A probability distribution of a random variable Y describes the probabilities assigned to each possible value y, given certain parameter values.


f(y \mid \boldsymbol{\theta})


  • y is a specific value
  • \boldsymbol{\theta} is a vector of parameters
  • f(\cdot) is the function itself

Understanding distributions

  • Discrete random variables: counts, binary, etc.

\rightarrow f(Y \mid \boldsymbol{\theta}) becomes f(Y = y \mid \boldsymbol{\theta}) because we are assigning a certain probability to each discrete value. The function is called the probability mass function (PMF).

  • Continuous random variables: time, weight, etc.

\rightarrow f(Y \mid \boldsymbol{\theta}) keeps its form, and the function is called the probability density function (PDF).


The PMF returns the probability associated with a certain value of y, while the PDF returns the probability density of a certain value of y.
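
A quick R illustration of this distinction (values chosen only for the example):

# PMF: probability of exactly y = 7 successes in 10 trials with p = 0.7
dbinom(7, size = 10, prob = 0.7)   # about 0.27, a probability

# PDF: density of y = 0.5 under a standard Normal; a density, not a probability
dnorm(0.5, mean = 0, sd = 1)       # about 0.35

# For a continuous variable, probabilities come from integrating the density,
# e.g. P(-1 < Y < 1) under a standard Normal
pnorm(1) - pnorm(-1)               # about 0.68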

Moments of the distributions

When we use the distributions, we are usually interested in some properties describing the distribution:

  • First moment: the expected value (i.e., the mean) of the distribution
  • Second moment: the variance of the distribution
  • Third moment: the skewness
  • Fourth moment: the kurtosis

Why is it important?

The moments have the same meaning across distributions, but the way we actually compute them changes according to the distribution.

In normal linear and generalized linear models we generally include predictors on the mean of the distribution.

In the case of the Normal distribution, the mean and the variance coincide with the parameters defining the distribution.

Why is it important?

Distribution                        Support                     Mean E[Y]       Variance Var(Y)
Normal: f(y \mid \mu, \sigma^2)     y \in \mathbb{R}            \mu             \sigma^2
Gamma: f(y \mid \alpha, \beta)      y \in (0, \infty)           \alpha/\beta    \alpha/\beta^2
Binomial: f(y \mid n, p)            y \in \{0, 1, \dots, n\}    np              np(1-p)
Poisson: f(y \mid \lambda)          y \in \{0, 1, 2, \dots\}    \lambda         \lambda
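
The mean and variance columns can be checked by simulation; a small R sketch with arbitrary parameter values:

set.seed(1)
# Gamma(alpha = 2, beta = 0.5): E[Y] = alpha/beta = 4, Var(Y) = alpha/beta^2 = 8
y_gamma <- rgamma(1e5, shape = 2, rate = 0.5)
c(mean(y_gamma), var(y_gamma))

# Poisson(lambda = 3): E[Y] = Var(Y) = 3
y_pois <- rpois(1e5, lambda = 3)
c(mean(y_pois), var(y_pois))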

Example: Passing the exam

Probability of passing the exam as a function of how many hours students study:

   id studyh passed
1   1     29      0
2   2     79      1
3   3     41      1
4   4     88      1
5   5     94      1
6   6      5      0
7   7     53      0
8   8     89      1
9   9     55      1
10 10     46      0
11 11     96      1
12 12     45      0
13 13     68      1
14 14     57      0
15 15     10      0
16 16     90      1
17 17     25      0
18 18      4      0
19 19     33      1
20 20     95      1
21 21     89      1
22 22     69      0
23 23     64      0
24 24     99      1
25 25     66      1
26 26     71      1
27 27     54      0
28 28     59      1
29 29     29      0
30 30     15      0
31 31     96      1
32 32     90      1
33 33     69      1
34 34     80      0
35 35      2      0
36 36     48      0
37 37     76      0
38 38     22      1
39 39     32      1
40 40     23      0
41 41     14      0
42 42     41      0
43 43     41      0
44 44     37      0
45 45     15      0
46 46     14      0
47 47     23      0
48 48     47      0
49 49     27      0
50 50     86      1
# number of students that have passed the exam
sum(dat_exam$passed) 
#> [1] 22
# proportion of students that have passed the exam
mean(dat_exam$passed) 
#> [1] 0.44

# study hours and passing the exam
tapply(dat_exam$studyh, dat_exam$passed, mean)
#>        0        1 
#> 35.46429 73.04545
#> 
table(dat_exam$passed, cut(dat_exam$studyh, breaks = 4))
#>    (1.9,26.2] (26.2,50.5] (50.5,74.8] (74.8,99.1]
#>  0         11          10           5           2
#>  1          1           3           6          12
#>  
tapply(dat_exam$passed, cut(dat_exam$studyh, breaks = 4), mean)
#>   (1.9,26.2] (26.2,50.5] (50.5,74.8] (74.8,99.1] 
#>   0.08333333  0.23076923  0.54545455  0.85714286

Example: Passing the exam

Fitting a Normal Linear Model

Do you see something strange?

Fitting a Generalized Linear Model

The model should consider both the support of the y variable and the non-linear pattern!
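
As a sketch of the two approaches (assuming the dat_exam data frame shown earlier is loaded): the normal linear model can predict values outside [0, 1], while a binomial GLM keeps predictions on the probability scale.

fit_lm  <- lm(passed ~ studyh, data = dat_exam)                      # normal linear model
fit_glm <- glm(passed ~ studyh, data = dat_exam, family = binomial)  # binomial GLM, logit link

newd <- data.frame(studyh = c(0, 50, 100, 150))
cbind(newd,
      lm  = predict(fit_lm, newd),                     # can fall below 0 or above 1
      glm = predict(fit_glm, newd, type = "response")) # always between 0 and 1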

Generalized Linear Models (GLM)

General idea

  • using distributions beyond the Gaussian
  • modeling non-linear functions on the response scale
  • taking into account mean-variance relationships

The three ingredients of a GLM

  1. Random Component: Choose a distribution

  2. Systematic Component: Linear predictor \eta_i = \beta_0 + \beta_1 x_i

  3. Link Function: Transform the mean, g(\mu_i) = \eta_i (see the glm() sketch below)
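
A minimal sketch of how these three ingredients map onto R's glm() call, using the exam example (assuming dat_exam as before):

fit <- glm(passed ~ studyh,                    # 2. systematic component: eta_i
           family = binomial(link = "logit"),  # 1. random component + 3. link function
           data   = dat_exam)
summary(fit)                                   # coefficients are on the link (log-odds) scale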

1. Random component

The random component specifies a probability model for yiy_i:


y_i \mid x_i \sim \text{Distribution}(\text{parameters}).


What support?

  • Binary: \{0, 1\} \rightarrow \text{Bernoulli}(p) (or \text{Binomial}(1, p))
  • Counts: \{0, 1, 2, \ldots\} \rightarrow \text{Poisson}(\lambda)
  • Any real number: (-\infty, \infty) \rightarrow \text{Normal}(\mu, \sigma^2)

1. Random component (mean–variance)

Choosing a distribution specifies not only the mean, but also how the variance depends on the mean:


\mathrm{Var}(Y \mid X) = \phi \, V(\mu), \qquad \mu = E(Y \mid X).


where V(\mu) is the variance function (determined by the family) and \phi is a constant scale factor.

  • Normal: V(\mu) = 1 (constant variance, \phi = \sigma^2)
  • Bernoulli: V(\mu) = \mu(1-\mu) (typically \phi = 1)
  • Poisson: V(\mu) = \mu (typically \phi = 1)
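
A small simulation illustrating these variance functions (arbitrary means, not from the workshop data):

set.seed(42)
# Normal: variance stays at sigma^2 whatever the mean
sapply(c(-10, 0, 10), function(m) var(rnorm(1e5, mean = m, sd = 2)))

# Bernoulli: variance is mu * (1 - mu), largest at mu = 0.5
sapply(c(0.1, 0.5, 0.9), function(m) var(rbinom(1e5, size = 1, prob = m)))

# Poisson: variance grows with the mean, V(mu) = mu
sapply(c(1, 5, 20), function(m) var(rpois(1e5, lambda = m)))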

2. Systematic Component

The systematic component is exactly the same as in normal linear regression: we predict a linear combination of predictors.


\eta_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}.

Basically, it describes how the expected value (i.e., the mean, the first moment) of the chosen distribution (the random component) varies according to the predictors.

Normal + identity

For example, a Generalized Linear Model with the Normal family and identity link can be written as:


\begin{aligned}
y_i &\sim \mathcal{N}(\mu_i, \sigma^2) && \text{Random component} \\
\mu_i &= \eta_i && \text{Link (identity)} \\
\eta_i &= \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} && \text{Systematic component}
\end{aligned}

Binomial + logit

For example, a Generalized Linear Model with the Binomial family and logit link can be written as:


\begin{aligned}
y_i \mid x_i &\sim \text{Binomial}(n_i, p_i) && \text{Random component} \\
\text{logit}(p_i) &= \log\!\left(\frac{p_i}{1-p_i}\right) = \eta_i && \text{Link (logit)} \\
\eta_i &= \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} && \text{Systematic component}
\end{aligned}
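
In R, the logit link and its inverse are available as qlogis() and plogis(); a small check:

p <- 0.8
qlogis(p)            # logit: log(p / (1 - p)), about 1.39
log(p / (1 - p))     # same value, written out
plogis(qlogis(p))    # inverse logit maps the linear predictor back to p = 0.8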

Binomial Logistic Regression

Binary outcomes and counts

The Bernoulli or the Binomial distribution can be used as the random component when we have a binary dependent variable or the number of successes over a total number of trials.


  • Binary (pass/fail), one trial per person; or
  • Counts of successes out of n trials (e.g., items correct out of n).

Bernoulli (one student)

Let k \in \{0, 1\} indicate whether a student passes the exam:


f(k; p) = \begin{cases} p & \text{if } k = 1, \\ q = 1 - p & \text{if } k = 0. \end{cases}


The probability mass function f of the Bernoulli distribution, over possible outcomes k, is f(k, p) = p^k (1-p)^{1-k}


Where p is the probability of success and k is one of the two possible results, 0 or 1. The mean is p and the variance is p(1-p).

One student takes the exam

rbinom(n = 1, size = 1, prob = 0.7)
[1] 0

Over 100,000 students?

many = rbinom(n = 100000, size = 1, prob = 0.7); head(many)
[1] 1 1 1 0 1 1
# Mean
(p = mean(many))
[1] 0.70085
# Variance
(p*(1-p))
[1] 0.2096593
(sd(many)^2)
[1] 0.2096614

Binomial (a class of students)

Now let k be the number of students who pass out of n students who take the same exam.


The probability of having k successes (e.g., 0, 1, 2, etc.) out of n trials, each with probability of success p (and of failure q = 1 - p), is:


f(n, k, p) = \binom{n}{k}\, p^{k}\, (1-p)^{\,n-k}


Here np is the mean of the binomial distribution and np(1-p) is the variance. The binomial distribution is just the repetition of n independent Bernoulli trials.
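
A quick R check that the built-in PMF matches the formula above (arbitrary values):

n <- 10; k <- 7; p <- 0.7
choose(n, k) * p^k * (1 - p)^(n - k)   # formula written out, about 0.267
dbinom(k, size = n, prob = p)          # built-in binomial PMF, same value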

Ten students take the exam

How many pass?

n = 10; p = 0.7
rbinom(n = 1, size = n, prob = 0.7)
[1] 7

Over 100,000 repetitions …

many = rbinom(n = 100000, size = n, prob = p); head(many)
[1] 7 8 8 8 8 9
n*p # Mean count: E(Y) = np 
[1] 7
n*p*(1-p) # Variance count: Var(y) = np(1-p)
[1] 2.1
mean(many/n) # Mean proportion = p
[1] 0.700083

Binomial distribution: n = 20, p = 0.9

Binomial distribution: n = 20, p = 0.5

Mean–variance relationship

E[Y] = np \quad \text{and} \quad \mathrm{Var}(Y) = np(1-p)

Variance is not constant!

Odds

Because probabilities must stay between 0 and 1, we work with odds.

Let p = P(\text{Pass} = 1); then the odds of passing compare “pass” to “fail”:


\text{odds} = \frac{p}{1-p}, \qquad p = \frac{\text{odds}}{\text{odds} + 1}


If p = 0.80, then \text{odds} = \frac{0.80}{0.20} = 4.


So passing is 4-to-1 relative to failing (4 expected passes for 1 fail, on average).
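
These conversions are easy to write as two small R helpers (hypothetical names, just for illustration):

odds_from_p <- function(p) p / (1 - p)          # probability -> odds
p_from_odds <- function(odds) odds / (odds + 1) # odds -> probability

odds_from_p(0.80)   # 4: passing is 4-to-1
p_from_odds(4)      # back to 0.80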

Why?

The odds have an interesting property when taking the logarithm: we can express a probability p using a scale ranging over (-\infty, +\infty).

Odds

With p_i = P(\text{Pass}_i = 1 \mid x_i) and x_i the hours studied, the model assumes:


\begin{aligned}
\log\left(\frac{p_i}{1-p_i}\right) &= \eta_i \\
\eta_i &= \beta_0 + \beta_1 x_i
\end{aligned}

Then the odds:


\frac{p_i}{1-p_i} = \exp(\beta_0 + \beta_1 x_i).

Odds ratios

Then the odds:
\frac{p_i}{1-p_i} = \exp(\beta_0 + \beta_1 x_i).


When x = 0:

\frac{p_i}{1-p_i} = \exp(\beta_0).


When x = 1:

\frac{p_i}{1-p_i} = \exp(\beta_0 + \beta_1) = \exp(\beta_0)\exp(\beta_1).

Odds Ratios

Then the ratio of these two odds is:


\frac{\exp(\beta_0)\exp(\beta_1)}{\exp(\beta_0)} = \exp(\beta_1)


This means that the odds of passing after increasing study hours by 1 are \exp(\beta_1) times the odds at the baseline (i.e., when x = 0).
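
In practice, exponentiating the fitted coefficients gives these quantities directly; a sketch assuming the dat_exam fit from before:

fit <- glm(passed ~ studyh, family = binomial, data = dat_exam)
coef(fit)        # beta0 and beta1, on the log-odds scale
exp(coef(fit))   # exp(beta0): odds when studyh = 0; exp(beta1): odds ratio per extra hour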

Linear on log-odds

A +1 increase in x changes \eta by a constant amount \beta_1. So equal increases in x correspond to equal increases in log-odds.

Not linear in probability

Equal increases in x generally do not correspond to equal increases in p, because p = \text{logit}^{-1}(\eta) is nonlinear.

Numerical example (\beta_1 = 0.8)

Here \exp(0.8) \approx 2.23, so each +1 hour multiplies the odds by about 2.23.

Baseline p   Baseline odds p/(1-p)   New odds = 2.23 \times odds   New p   \Delta p
0.10         0.11                    0.25                          0.20    +0.10
0.20         0.25                    0.55                          0.36    +0.16
0.36         0.55                    1.22                          0.55    +0.20
0.55         1.22                    2.73                          0.73    +0.18
0.73         2.73                    6.07                          0.86    +0.13
0.86         6.07                    13.51                         0.93    +0.07
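
The table can be reproduced in R (\beta_1 = 0.8 as above):

beta1 <- 0.8
p0    <- c(0.10, 0.20, 0.36, 0.55, 0.73, 0.86)  # baseline probabilities
odds0 <- p0 / (1 - p0)                          # baseline odds
odds1 <- odds0 * exp(beta1)                     # odds after a +1 increase in x
p1    <- odds1 / (odds1 + 1)                    # back to probabilities
round(data.frame(p0, odds0, odds1, p1, delta_p = p1 - p0), 2)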


References