Appendix B — Mathematics of logistic regression

Earlier in the book, we introduced logistic regression models for binary response variables. Here we will show some of the mathematics underlying these models, making use of the matrix notation for regression introduced earlier.

B.1 Matrix representation of logistic regression

Given a binary response variable $Y$ and predictors $X_1, X_2, \ldots, X_p$, the logistic regression model is

$$
\text{logit}(\pi) = \log\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p \tag{B.1}
$$

where $\pi = \Pr(Y = 1)$.

Similar to linear regression, we can write a matrix representation of Equation B.1.

$$
\text{logit}(\boldsymbol{\pi}) = \log\left(\frac{\boldsymbol{\pi}}{1 - \boldsymbol{\pi}}\right) = \mathbf{X}\boldsymbol{\beta} \tag{B.2}
$$

We have the following components in Equation B.2:

  • $\boldsymbol{\pi}$ is the $n \times 1$ vector of probabilities, such that $\pi_i = \Pr(y_i = 1)$
  • $\mathbf{X}$ is the $n \times (p + 1)$ design matrix. Similar to linear regression, the first column is $\mathbf{1}$, a column of 1’s corresponding to the intercept.
  • $\boldsymbol{\beta}$ is a $(p + 1) \times 1$ vector of model coefficients.

Though not directly in Equation B.1 or Equation B.2, the underlying data also includes $\mathbf{y}$, an $n \times 1$ vector of the observed binary responses.

We are often interested in the probabilities computed from the logistic regression model. The probabilities computed from Equation B.2 are

$$
\boldsymbol{\pi} = \frac{e^{\mathbf{X}\boldsymbol{\beta}}}{1 + e^{\mathbf{X}\boldsymbol{\beta}}} \tag{B.3}
$$

See the earlier discussion of logistic regression for more detail about the relationship between the logit, odds, and probability.
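To make Equation B.3 concrete, here is a minimal numeric sketch in Python with NumPy; this is our illustration, and the design matrix and coefficient values are made up.

```python
import numpy as np

# Hypothetical data: n = 4 observations, p = 2 predictors (made-up values)
X = np.array([[1.0,  0.5,  1.2],   # first column is the column of 1's for the intercept
              [1.0, -0.3,  0.7],
              [1.0,  1.1, -0.4],
              [1.0,  0.0,  2.0]])
beta = np.array([-0.5, 1.0, 0.8])  # hypothetical coefficients beta_0, beta_1, beta_2

eta = X @ beta                        # X beta, the vector of log-odds (Equation B.2)
pi = np.exp(eta) / (1 + np.exp(eta))  # Equation B.3: vector of probabilities
print(pi)                             # pi[i] is Pr(y_i = 1)
```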

B.2 Estimation

We want to find estimates $\hat{\boldsymbol{\beta}}$ that are the best fit for the data based on the model in Equation B.2. Earlier, we outlined how we use maximum likelihood estimation to find $\hat{\boldsymbol{\beta}}$. Here we will show more of the mathematical details behind the model estimation.

Let $Z$ be a random variable that takes values 0 or 1. Then $Z$ follows a Bernoulli distribution such that

$$
\Pr(Z = z) = p^z(1 - p)^{1 - z}
$$

where $p = \Pr(Z = 1)$.


$E(Z) = p$ and $\text{Var}(Z) = p(1 - p)$.

The response variable follows a Bernoulli distribution, such that $P(Y_i = y_i) = \pi_i^{y_i}(1 - \pi_i)^{1 - y_i}$. Let $\mathbf{x}_i^T$ be the $i$th row of the design matrix $\mathbf{X}$. Then, using Equation B.3, we have

$$
P(Y_i = y_i) = \left(\frac{e^{\mathbf{x}_i^T\boldsymbol{\beta}}}{1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}}\right)^{y_i}\left(1 - \frac{e^{\mathbf{x}_i^T\boldsymbol{\beta}}}{1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}}\right)^{1 - y_i} \tag{B.4}
$$

Recall that the likelihood function is a measure of how likely we are to observe the data given particular values of the model parameters $\boldsymbol{\beta}$.

Let $Z_1, Z_2, \ldots, Z_n$ be independent Bernoulli random variables. The joint distribution of $Z_1, Z_2, \ldots, Z_n$ (the probability of observing these values) is

$$
f(z_1, z_2, \ldots, z_n) = \prod_{i=1}^{n} p_i^{z_i}(1 - p_i)^{1 - z_i}
$$

where $p_i = \Pr(Z_i = 1)$.

Using Equation B.4, the likelihood function for logistic regression is

$$
L(\boldsymbol{\beta} \mid \mathbf{X}, \mathbf{y}) = \prod_{i=1}^{n}\left(\frac{e^{\mathbf{x}_i^T\boldsymbol{\beta}}}{1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}}\right)^{y_i}\left(1 - \frac{e^{\mathbf{x}_i^T\boldsymbol{\beta}}}{1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}}\right)^{1 - y_i} \tag{B.5}
$$

To make the math more manageable, we will maximize the log likelihood shown in Equation B.6. Maximizing Equation B.6 is equivalent to maximizing Equation B.5, because the logarithm is a monotonically increasing function.

$$
\begin{aligned}
\log L(\boldsymbol{\beta} \mid \mathbf{X}, \mathbf{y}) &= \sum_{i=1}^{n} y_i \log\left(\frac{e^{\mathbf{x}_i^T\boldsymbol{\beta}}}{1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}}\right) + \sum_{i=1}^{n}(1 - y_i)\log\left(1 - \frac{e^{\mathbf{x}_i^T\boldsymbol{\beta}}}{1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}}\right) \\
&= \sum_{i=1}^{n} y_i \log\left(e^{\mathbf{x}_i^T\boldsymbol{\beta}}\right) - \sum_{i=1}^{n} y_i \log\left(1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}\right) - \sum_{i=1}^{n}\log\left(1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}\right) + \sum_{i=1}^{n} y_i \log\left(1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}\right) && \left[\text{Given } 1 - \tfrac{e^{\mathbf{x}_i^T\boldsymbol{\beta}}}{1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}} = \tfrac{1}{1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}}\right] \\
&= \sum_{i=1}^{n} y_i \mathbf{x}_i^T\boldsymbol{\beta} - \sum_{i=1}^{n}\log\left(1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}\right)
\end{aligned} \tag{B.6}
$$
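As a sanity check on the algebra in Equation B.6 (our own illustration with simulated inputs, not from the text), the sketch below evaluates both the first line and the simplified last line of Equation B.6 and confirms that they agree.

```python
import numpy as np

def log_lik_simplified(beta, X, y):
    """Last line of Equation B.6: sum(y_i x_i^T beta) - sum(log(1 + e^{x_i^T beta}))."""
    eta = X @ beta
    return np.sum(y * eta) - np.sum(np.log(1 + np.exp(eta)))

def log_lik_full(beta, X, y):
    """First line of Equation B.6, written in terms of pi_i."""
    pi = np.exp(X @ beta) / (1 + np.exp(X @ beta))
    return np.sum(y * np.log(pi)) + np.sum((1 - y) * np.log(1 - pi))

# Made-up inputs just to exercise the two forms
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = rng.integers(0, 2, size=50)
beta = np.array([0.1, -0.2, 0.3])
print(np.isclose(log_lik_simplified(beta, X, y), log_lik_full(beta, X, y)))  # True
```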

We take the first derivative of Equation B.6 with respect to $\boldsymbol{\beta}$. An outline of the steps is shown below. The maximum likelihood estimator $\hat{\boldsymbol{\beta}}$ is the vector of coefficients that is the solution to $\frac{\partial \log L}{\partial \boldsymbol{\beta}} = \mathbf{0}$.

$$
\frac{\partial \log L}{\partial \boldsymbol{\beta}} = \frac{\partial}{\partial \boldsymbol{\beta}}\left[\sum_{i=1}^{n} y_i \mathbf{x}_i^T\boldsymbol{\beta} - \sum_{i=1}^{n}\log\left(1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}\right)\right] = \sum_{i=1}^{n} y_i \mathbf{x}_i^T - \sum_{i=1}^{n}\frac{e^{\mathbf{x}_i^T\boldsymbol{\beta}}\,\mathbf{x}_i^T}{1 + e^{\mathbf{x}_i^T\boldsymbol{\beta}}} \tag{B.7}
$$
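A quick numerical check of Equation B.7 (again our own illustration, with made-up inputs): the analytic derivative, collected as $\sum_i (y_i - \pi_i)\mathbf{x}_i^T$ and written in matrix form as $\mathbf{X}^T(\mathbf{y} - \boldsymbol{\pi})$, should match a finite-difference approximation of the derivative of Equation B.6.

```python
import numpy as np

def log_lik(beta, X, y):
    eta = X @ beta
    return np.sum(y * eta) - np.sum(np.log(1 + np.exp(eta)))  # Equation B.6

def score(beta, X, y):
    """Equation B.7 collected as sum((y_i - pi_i) x_i), i.e., X^T (y - pi)."""
    pi = 1 / (1 + np.exp(-(X @ beta)))
    return X.T @ (y - pi)

# Made-up inputs
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = rng.integers(0, 2, size=30)
beta = rng.normal(size=3)

# Central finite differences of the log likelihood, one coordinate at a time
eps = 1e-6
approx = np.array([(log_lik(beta + eps * e, X, y) - log_lik(beta - eps * e, X, y)) / (2 * eps)
                   for e in np.eye(3)])
print(np.allclose(score(beta, X, y), approx, atol=1e-5))  # True
```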

There is no closed-form solution to this equation, i.e., there is no neat formula for $\hat{\boldsymbol{\beta}}$ as we found for linear regression. Therefore, numerical approximation methods are used to find the maximum likelihood estimates $\hat{\boldsymbol{\beta}}$. One popular method is Newton-Raphson, a “root-finding algorithm which produces successively better approximations to the roots (or zeroes) of a real-valued function.” Numerical approximation methods such as Newton-Raphson systematically search the space of possible values of $\hat{\boldsymbol{\beta}}$ until the algorithm converges on the solution (the “root”) to $\frac{\partial \log L}{\partial \boldsymbol{\beta}} = \mathbf{0}$.
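Below is a sketch of how Newton-Raphson could be implemented for logistic regression in Python with NumPy. This is a bare-bones teaching version under our own simplifying assumptions (a fixed starting point and none of the safeguards, such as step-halving, that production routines add); the function name and simulated data are hypothetical.

```python
import numpy as np

def fit_logistic_newton(X, y, tol=1e-8, max_iter=25):
    """Newton-Raphson for the logistic regression MLE (a sketch).

    Each iteration updates
        beta <- beta + (X^T V X)^{-1} X^T (y - pi),
    where X^T (y - pi) is the score from Equation B.7 and
    V = diag(pi_i (1 - pi_i)) comes from the second derivative.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta
        pi = np.exp(eta) / (1 + np.exp(eta))
        score = X.T @ (y - pi)                    # gradient of Equation B.6
        V = np.diag(pi * (1 - pi))
        step = np.linalg.solve(X.T @ V @ X, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:            # converged on the "root"
            break
    return beta

# Simulated example: the estimates should land near the true values (0.5, 1.5)
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.5 * X[:, 1]))))
print(fit_logistic_newton(X, y))
```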

B.3 Inference for logistic regression

Earlier, we introduced inference for a single coefficient $\beta_j$ in the logistic regression model. Because there is no closed-form solution for the maximum likelihood estimator $\hat{\boldsymbol{\beta}}$ found in Section B.2, there is no closed-form solution for the mean and variance of the distribution of $\hat{\boldsymbol{\beta}}$. We rely on theoretical results about the distribution of $\hat{\boldsymbol{\beta}}$ as $n$ gets large (called asymptotic results).

Given $n$ is large,

$$
\hat{\boldsymbol{\beta}} \sim N\left(\boldsymbol{\beta},\ (\mathbf{X}^T\mathbf{V}\mathbf{X})^{-1}\right) \tag{B.8}
$$

where $\mathbf{V}$ is an $n \times n$ diagonal matrix, such that $V_{ii}$ is the estimated variance for the $i$th observation.¹

The standard error used for hypothesis testing and confidence intervals for a single coefficient $\beta_j$ is computed as the square root of the $j$th diagonal element of $\text{Var}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^T\mathbf{V}\mathbf{X})^{-1}$. This is why the hypothesis tests and confidence intervals are only reliable for large $n$: they depend on the asymptotic approximation in Equation B.8. We can use simulation-based methods if the data has a small sample size.
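Continuing the earlier sketch, the standard errors implied by Equation B.8 could be computed as follows, taking $V_{ii} = \hat{\pi}_i(1 - \hat{\pi}_i)$, the estimated Bernoulli variance of observation $i$ (see the footnote). As before, this is an illustration under our own assumptions, not standard software output.

```python
import numpy as np

def coef_standard_errors(X, beta_hat):
    """Asymptotic standard errors from Equation B.8 (a sketch).

    V is the n x n diagonal matrix with V_ii = pi_hat_i * (1 - pi_hat_i),
    and Var(beta_hat) is estimated by (X^T V X)^{-1}.
    """
    eta = X @ beta_hat
    pi_hat = np.exp(eta) / (1 + np.exp(eta))
    V = np.diag(pi_hat * (1 - pi_hat))
    cov = np.linalg.inv(X.T @ V @ X)   # estimated Var(beta_hat)
    return np.sqrt(np.diag(cov))       # SE(beta_hat_j) for tests and intervals
```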


  1. Recall that the variance of the Bernoulli distribution depends on $\pi$, so each observation has a different variance. This is in contrast to linear regression, where all observations have the same variance $\sigma^2_\epsilon$.