How to interpret confidence intervals in multiple regression

Wei YiFollowJun 28, 2021·21 min readSaveHands-on TutorialsWhere do confidence intervals in linear regression come from the case of least squares formulationUnderstand the std err

How to interpret confidence intervals in multiple regression
Wei Yi

Wei YiFollow

Jun 28, 2021·21 min read

Save

Hands-on Tutorials

Where do confidence intervals in linear regression come from  the case of least squares formulation

Understand the std err of the learnt parameter weights in linear regression, and other metrics such as the the t metric, and the P>|t| metic.

Image from Pixabay

I always assumed that I know linear regression well, until recently my brain went blank on the question from a colleague  where do the parameter confidence intervals come from when using Pythons ordinary least squares class OLS? I decided to derive all the relevant formulas for linear regression and compiled those derivations into a set of articles. And this is the first one.

The linear regression model

The linear regression model f(x)=xᵀ · w is the first machine learning model that most people study. In this model:

  • x are features and w are model parameters. Both x and w are vector of length p, with p being the number of features.  · is vector dot product.
  • we need to learn concrete values for w from a training dataset (X, Y), where X is the feature matrix with shape n×p, where n is the number of training data points. Y is the observation matrix with shape n×1.

The following code uses the Ordinary Least Squares class (OLS) to learn the values of the model parameter w:

Line 4 ~ 6 sets up a linear regression model with three features prestige, education and a constant feature (or bias), and with the target variable income.

Line 10 prints the following table. Within the table:

  • the coef column (highlighted with yellow background) shows the learnt parameter values for our features  prestige, education and the added constant feature. This is what we usually focus on.
  • Interestingly, the table also shows some probabilistic metrics (highlighted with pink background): for the learnt coefficients, std err, t, P>|t| and confidence interval [0.025, 0.975]. They seem to be probabilistic measures and what do they mean?

This article explains what do those probabilistic metrics mean and how they are computed in the least squares formulation for a linear regression model.

But before we answer these questions, we first need to understand how to learn the values for our model parameters w from training data (X, Y). And the method we use is called least squares.

The least squares formulation

In the least squares formulation, to find concrete values for the model parameter w, we minimize the following loss function

The loss function L(w) is the square of the distance between the observation Y and model prediction X·w. And the job is to minimize this loss  finding values for w such that L(w)s value is the smallest, hence the name least squares.

Finding w by solving the normal equation

Since L(w) is a quadratic function with unknown w, the obvious way to find values for w that minimize L(w) is:

  1. Find the derivative of L(w) with respect to w. The derivative is another function with w as unknown.
  2. Set the derivative to 0. This gives you the famous normal equation.
  3. Solve the normal equation to find concrete values for w.

Formally:

Line (1) left side is the notation for the derivative of the loss function L(w) with respect to w. The symbol  reflects the fact that w is a vector of variables, not a single variable, so we are in the world of partial derivatives. partial is in the sense that each derivative is with respect to a single variable in the vector w, resulting in p partial derivatives, where p is length of the w vector. We will see this more clearly later in this article.

Line (2) plugs in the definition of L(w).

Line (3) applies the vector differentiation rule to compute the derivative. Setting this derivative to 0 gives us the normal equation.

To solve the normal equation, we do:

Lets use the name w to refer the value of w that make the normal equation to hold, in other words: when w take the value w=(XᵀX)¹XᵀY, the left side of the normal equation evaluates to 0. So when I write w, I refer to the vector of unknown model parameters; and when I write w, I refer to the value that w can take to make the normal equation hold:

Note in the expression for w, you cannot simplify (XᵀX)¹ by using the rule (XᵀX)¹ = X¹(Xᵀ)¹ because X is usually not a a square matrix. Only square matrices have inverses. For example, if there are n=10 data points, and p=2 features, X has shape 10×2; it is not a square matrix.

Vector differentiation

I think all the above steps are straightforward, expect for the step when we apply the vector differentiation rule to derive the normal equation. I can memorize the vector differentiation rules such as (wᵀw) = 2w, but the result still seems mysterious.

It seems mysterious because the use of vector differentiation rules, introduced in linear algebra. These rules provide a level of abstraction. On one hand, they allow you to write down long and verbose expressions in a concise way; but on the other hand, they stop you from immediately seeing how the computation is carried out.

We should understand all those vector notations and linear algebra rules are just for the purpose of book-keeping. The results of vector differentiation should be the same if you expand all the vectors and apply the non-vector version of differentiation rules. If you want see how this is done to solve the normal equation, please go to the Appendix of this article.

Solving the normal equation is one method of performing parameter learning for a linear regression model. Other methods exists. Here is a good read about how to find linear regression parameter values by another method  stochastic gradient descent.

Uncertainty in least squares formulation

Until now, our linear model f(x)=x · w is a deterministic model, not a probabilistic model. To see this:

  • The learnt parameter values w=(XᵀX)¹XᵀY are deterministic because its ingredients X and Y are the given training dataset. w is a concrete vector with real scalars.
  • The predictions from this model Ŷ= X · w is deterministic because both X and w are deterministic.

But the model fitting summary, shown again here, includes probabilistic metrics, such as standard error, confidence interval. Whats going on?

Probabilistic measures exist together with random variables

These probabilistic metrics only exist together with the notion of random variables. So to talk about these metrics, we need to introduce random variables in our model. Specifically, since the standard derivations, the confidence intervals are for the learnt model parameters. We need to turn the learnt model parameter values from a scalar, such as 0.6237 for prestige, into a random variable.

Introducing random variables in linear model

We introduce random variables to represent possible model parameter values following these steps:

Step 1

Introduce a random variable vector y=X ·wᵣᵤ+ ε, where wᵣᵤ represents the single true model parameter value that we will never know. Dont worry about the fact that we dont know the value of wᵣᵤ, later, we will estimate it. ε is a multivariate Gaussian noise with zero mean, and it is independent at each data point, with variance η². y is a random variable vector of shape n×1, with n being the number of training data points:

To further understand the definition of y:

  • Each entry yᵢ in the random variable vector y represents the possible values that the corresponding training data point Yᵢ can take. So observation Y vector in our training dataset can be modelled as a sample from the random variable vector y.
  • In this random variable vector, each entry yᵢ is defined as the sum of two parts: yᵢ=Xᵢ ·wᵣᵤ+ εᵢ. The Xᵢ ·wᵣᵤ part is deterministic, as both Xᵢ and wᵣᵤ are concrete values. The noise part εᵢ is random  is a single dimensional random variable from the Gaussian distribution N(0, η²). Defined this way, yᵢ is a random variable from a single dimensional Gaussian distribution as well.
  • All single random variables from the vector y follow the same definition pattern. We have n data points, so there are n independent noise, all of them have zero mean and variance η².
  • We can use a multivariate Gaussian random variable to represent these n independent noise more concisely (but mathematically equivalently): ε~N(0, I·η²). So ε is a random variable vector of length n, I is the n×n identity matrix. The η² part in I·η² indicates that the variance of each single random variable in ε is η²; the I part indicates that all these single random variables are independent.
  • An alternative way of writing y=X ·wᵣᵤ+ ε is y~N(X · wᵣᵤ, I·η²).

Step 2

Weve already solved the normal equation to get w=(XᵀX)¹XᵀY. Here is another way to understand this formula:

  • it reveals that the concrete value Y is linearly transformed, by the transformation matrix (XᵀX)¹Xᵀ, into w. Note that w is a different from wᵣᵤ we introduced in Step 1. wᵣᵤ is our notation for the true coefficients that we will never know. w our estimate (or learnt parameter values) of wᵣᵤ when using least squares as the estimation method.
  • Since now we model Y as a sample from the random variable vector y, which has distribution N(X · wᵣᵤ, I·η²), we can view w =(XᵀX)¹XᵀY as a sample from a new random variable vector ŵ, defined as ŵ=(XᵀX)¹ Xᵀ y. Thats right, the random variable vector ŵ is a linear transformation from the random variable vector y, with the same transformation matrix (XᵀX)¹Xᵀ.
  • Structurally, ŵ is linearly transformed from y by (XᵀX)¹Xᵀ, just as w is from Y. Even ŵ is a linear transformation from y, please note w is of shape p×1, and y is of shape n×1, with p being the number of features in our linear regression problem, and n being the number of training data points.

Being a random variable vector, ŵ=(XᵀX)¹ Xᵀ y represents all the possible values and their probability densities for the parameters of our linear regression model. We use the probability density function of ŵ, which we will derive later, to talk about those probabilistic concepts such as standard deviation, confidence interval, for the parameters in our linear regression model.

What does ŵ, the distribution of of learnt parameter values mean?

It is important to understand that we introduce ŵ to represents the distribution of learnt parameter values. But what does the phrase the distribution of learnt parameter values mean?

The best way to understand this phrase is to remind ourselves that we assume that there is a single true parameter value wᵣᵤ. And we want to figure out what wᵣᵤ is by looking at the training dataset (X, Y) from the lens of a linear model X· wᵣᵤ. This works when:

  1. The observations Y is indeed generated from X using the formula X· wᵣᵤ.
  2. When our measurements of Y is indeed precise.

Then we can workout the value of wᵣᵤ by solving the normal equation.

As a modeller, we can challenge the above two conditions.

We can challenge the first condition, saying that a linear relationship between X and Y is not correct, instead, it should be an exponential relationship. Thats OK, and it directs us to a model selection problem  should we consider a linear model at all? or we should we consider an exponential model, or neural network, or decision tree? I will write another article for this topic. For now, lets stay linear.

We can challenge the second condition, saying that due to some measuring difficulties, the measured observations Y is not strictly equal to X· wᵣᵤ. Instead it is X· wᵣᵤ corrupted by some Gaussian noise ε with zero mean. An equivalent way of saying of this is that Y is now a sample from the random variable y, which is defined as y~ N(X ·wᵣᵤ, I·η²). This is what we are doing in this article.

By its definition that Y is a sample from the random variable vector y, we have to treat the concrete observations values Y in the training dataset as the outcome of a random process  we have to admit that the next time we sample, we will get a different set of observations Y (I said a different set because Y is a concrete vector with n scalars).

Since the learnt parameter value (XᵀX)¹XᵀY is computed from Y, and Y now has randomness because it comes from y, the learnt parameter value becomes a random variable. We use ŵ to denote this random variable vector and defined it as ŵ=(XᵀX)¹Xᵀy. By definition, a random variable represents a distribution of values (with different probability density), hence the phrase ŵ represents the distribution of learnt parameter values.

Deriving the probability density function for ŵ

To talk about probabilistic such as standard deviation of ŵ, we need to derive its probability density function. The derivation is easy:

  1. Because the random variable vector y is defined as y=X ·wᵣᵤ+ ε. And ε is a multivariate Gaussian random variable ε~N(0, I·η²), y is a multivariate Gaussian random variable as well: y ~ N(X ·wᵣᵤ, I·η²). We just need to move the mean from 0 to X ·wᵣᵤ.
  2. Since ŵ is defined as a linear transformation from y: ŵ=(XᵀX)¹ Xᵀ y, according to the multivariate Gaussian linear transformation rule, ŵ is also a multivariate Gaussian random variable, and the rule tells us what is the probability density function for ŵ.

Multivariate Gaussian linear transformation rule

This rule pops up in a lot of places in machine learning, such as Kalman filter, Gaussian Process, so please remember it by heart. Formally, it says:

In English, it says if a multivariate Gaussian random variable a has a known probability density function, and random variable b is a linear transformation from a by the transformation matrix A, then b is a multivariate Gaussian random variable with the same shape as a. And the probability density function of b has the the above fixed form that mentioning the mean, covariance matrix from the distribution of a, and the transformation matrix A.

For ŵ, we have:

This is a horribly long probability density function for ŵ. Luckily, we can simplify it dramatically:

Line (2)~(3) simplifies he mean component. Line (4)~(7) simplifies the covariance matrix component. One thing to note is the simplification from line (6) to (7) works because (XᵀX)¹ is a symmetric matrix, its transpose it itself.

From line (7) we note that the mean value of ŵ is the true parameter value wᵣᵤ, in other words, ŵ is an unbiased estimator of wᵣᵤ.

Estimating unknowns in the probability density function of ŵ

The simplified probability density function, ŵ~N(wᵣᵤ, (XᵀX)¹η²) has two unknowns, wᵣᵤ and η². We need to give them concrete values before we can use this probability density function to report probabilistic metrics.

  • For wᵣᵤ, we use w=(XᵀX)¹XᵀY.
  • For η², we use the following formula, where Ŷ = X · w being the in-sample predictions from our linear regression model:

Estimate for wᵣᵤ

Lets understand why we can use w=(XᵀX)¹XᵀY to estimate wᵣᵤ. From the derived probability density function of ŵ~N(wᵣᵤ, (XᵀX)¹η²), we know that ŵ is an unbiased estimator for wᵣᵤ. So if we draw infinite number of samples from this distribution, the average of those samples (or expectation) will equal to wᵣᵤ. In other words, we can use the expectation of samples from ŵ to estimate wᵣᵤ.

In reality, we cannot draw infinite samples from this distribution. In fact, we only have one sample, that is w=(XᵀX)¹XᵀY, and the average of it is itself.

First, you may ask why w is a sample from the distribution of ŵ? This is

because:

  • ŵ is defined as a linear transformation, with the transformation matrix (XᵀX)¹Xᵀ, from the random variable y.
  • Y is a sample from y.
  • w is defined by applying the same transformation to the sample Y.

Then you may ask, we should use the average of infinite samples from the distribution of ŵ to estimate wᵣᵤ, but we only have one sample. Is this OK? Well, it is not ideal, but that all weve got.

Estimate for η²

Lets now understand the estimate for η²:

η² is the variance of the observation noise random variable ε~N(0, I·η²). We dont observe the noise directly. Instead we only observed a sample Y from the random variable y=X ·wᵣᵤ+ ε. By its definition, the variance of y is equal to the variance of the noise ε because the only random component in y is from the noise. The X ·wᵣᵤ component in ys definition only shifts the mean of y, not the variance of it.

y is a random variable vector that contains n random variables:

  • The mean of those random variables, that is X ·wᵣᵤ, which we estimated by Ŷ=X · w.
  • We know one sample of y, that is Y. This observation Y is not equal to Ŷ. In other words, the fitted line wont pass through all observations. And we use the introduced noise component to explain these deviations  our model accepts that the actual observations Y is not equal to its mean prediction X · w because in the model, the observation Y is just a sample from the Gaussian distribution N(X ·wᵣᵤ, I·η²), which becomes N(Ŷ, I·η²) after we plugged in our mean estimate. A sample does not need to be equal to the mean of that distribution. And how much difference the sample Y is from the mean is determined by the noise variance I·η².
  • So Y-Ŷ can be seen as observations of the noise. Y-Ŷ is a vector of length n, representing the noise part, one for each observation. We know by definition, all those noise random variables has zero mean and they are independent to each other, and have the same variance η².

Since all those n noises share the same variance value η², we can think this way: we have a single random variable, which has zero mean, we also have n samples Y-Ŷ of this random variable. We want to estimate the variance of this random variable, that is η². Recall the formula to calculate variance of a random variable from its samples is (from wiki):

In our case, we have something similar, with Y-Ŷ being the sample and 0 being the mean. (Y-Ŷ-0)ᵀ(Y-Ŷ-0) is the square part, and it simplifies to (Y-Ŷ)ᵀ(Y-Ŷ), so the formula to estimate η² reads (again):

The normalizer 1/(n-p-1) is needed to turn the result into an unbiased estimator of the noise variance  remember the discussion why we use n-1 instead of n when estimating population variance with a sample? This is the same argument here, and the part -p reflects that fact that we use p features to forecast Ŷ. This results in reduction of p degrees of freedom.

Fully specified probability density function for ŵ

With wᵣᵤ and η² estimated, the final, fully specified probability density function for ŵ is:

After we plug in w=(XᵀX)¹XᵀY, we have an evaluate-able (there is no unknowns) probability density function for learnt model parameters ŵ. We can use it to report the those probabilistic metrics for the learnt parameter values.

With the fully specified probability density function for ŵ derived, we are finally ready to understand how those probabilistic metrics are calculated.

Standard error (or standard deviation) of ŵ

The std err column in the summary table reports the standard deviations of each random variable in ŵ.

From the distribution of ŵ, which is multivariate Gaussian, we get the covariance matrix of the random variable vector ŵ Var(ŵ)=(XᵀX)¹η². It is a p×p matrix. The variance of each random variable is at the main diagonal of this matrix. So the standard deviation of each random variable is the square root of those main diagonal entries.

Confidence interval of ŵ

Now lets talk about the confidence interval of ŵ, which is highlighted below:

The column names [0.025, 0.975] defines the 95% range of the coefficient values  there is 95% possibility that the true value of the coefficients lie within this interval.

Since ŵ is a multivariate Gaussian random variable, the confidence interval for each single random variable, such as prestige, in ŵ is just some standard deviation away (below and above) from its mean. The [0.025, 0.975] range corresponds to 2 standard deviation away from the mean since we are talking about Gaussian distributions.

To report the highlighted confidence interval, we need to find the mean and standard deviation of the learnt parameter values:

  • we use w to estimate the mean of ŵ, and they are reported in the coef column.
  • The standard deviation for each univariate random variable in ŵ are already in the the std err column. This is because of multivariate Gaussian marginalization rule.

Feature importance

Now lets understand the columns related to feature importance, as highlighted below:

There are two related metrics t and P>|t|. The P>|t| metric a hypothesis testing of the t metric.

The P>|t| metric reports the probability that the model parameter value is 0, despite what is reported in the coef column. A high value of the P>|t| metric says that features coefficient is likely to be 0, hence it does not contribute to predicting the target variable Y.

From the above summary table:

  • For the feature prestige, the P>|t| column reports that there is a 0% probability that the coefficient for prestige is 0. In other words, it is very likely that the prestige feature helps to predict the target variable.
  • For the feature education, the P>|t| column reports that there is a 80% probability that the coefficient for education is 0. In other words, it is unlikely that the education feature can help to predict the target variable.

To compute this 0% or 80% probability, we need to introduce the t-statistic.

The t-statistic

For the j-th random variable in ŵ, which we denote as ŵⱼ, we get the mean and standard deviation for this random variable from the probability density function of ŵ:

  • The mean is the j-th entry in the mean vector. Lets denote the mean by μⱼ, which is the reported coef value for a feature  0.6237 for prestige, 0.0323 for education.
  • The standard deviation is the square root of the j-th entry in the main diagonal of the covariance matrix. Lets denote the standard deviation by ηⱼ, which is the reported std err value for a feature  0.125 for prestige, 0.132 for education.

Then the t-statistic for the j-th feature tⱼ is

This statistic measures the normalised distance between the mean μⱼ of the estimated parameter ŵⱼ to the mean 0. This distance is scaled by the standard deviation in the denominator. The t column in the summary table reports this tⱼ quantity. The fact that the t-statistic is a scaled distance (also called scale-free metric) is important  it makes the t-statistics for different features comparable.

The P>|t| metric

Since the tⱼ statistic is scaled, we can study it using the unit Gaussian probability density function. From now on, lets use t instead of tⱼ to stay consistent with the metric name in the summary table.

The P>|t| metric reports the accumulated probability in the two shaded tails from a unit Gaussian distribution. Thats the shaded purple areas from [-, -t] and from [t, +], added together:

The question we ask is: if the coefficient for a feature is indeed zero, what is the probability (viewed from the unit Gaussian distribution) to have the two tails defined by [-, -t] and [t, +].

The larger the t-statistic (or smaller if the t-statistic is negative) is, the smaller the two tails starting from -t and t, hence the less likely that the computed t-statistic came from a unit Gaussian distribution with zero mean.

So, if we see a feature with a large t-statistic, we are more sure that the report coefficient of that feature is indeed not zero. Or equivalently, that feature is more likely to contribute to predicting the target variable, hence it is more important.

You may ask, why we define the P>|t| metric as the accumulated probability of the two tails, instead of just looking at the probability density at the t location? This is because we want to report a probability, which by definition is an area over a range of values under the curve of the probability density function. If we only look at the t location, the probability is 0 since the unit Gaussian is a continuous distribution.

You may also ask, why we measure the areas of two tails instead of one? We use two tails because we are conducting a two-tailed hypothesis testing with the null hypothesis being that the t-statistic comes from a unit Gaussian distribution. We use a two-tailed hypothesis testing because we are interested in knowing whether there is a difference (either positive, which is captured by one tail, or negative, which is captured by the other tail) between the computed t-statistic and zero.

You may further ask, can I use a one-tailed hypothesis testing to report feature importance? Yes, you can, it is just that the OLS class reports a two-tailed hypothesis testing result.

Conclusion

This articles explains how to introduce random variables into a deterministic least squares linear regression model to reason about uncertainty in learnt model parameter values. We used random variables because uncertainty measures such as confidence intervals only exist with random variables.

Besides least squares, other formulations exist for linear regression, such as full Bayesian formulation, Maximum likelihood formulation (MLE), and Maximum a posteriori formulation (MAP) formulation. These method have more natural ways of reasoning about uncertainty. I will cover them in future articles.

Support me

If you like my story, I will be grateful if you consider supporting me by becoming a Medium member via this link: https://jasonweiyi.medium.com/membership.

I will keep writing these stories.

Appendix: Gradient of normal equation

Lets work out the derivation of normal equation with an example where we have three data points and w is a two dimension vector. Formally:

In the above:

  • X=[X, X], X=[X, X], X=[X, X]. They are 2×1 row vectors.
  • X, X, X, X, X, X are scalars.
  • Y, Y, Y are scalars.
  • w, w are scalar variables.

Lets further introduce the vector U:

where U, U, U are scalars according to their definition of Uᵢ=Yᵢ-Xᵢ · w. You can clearly see that U is a function of w. This paves the way for applying the chain rule in computing the derivative of L(w) with respect to w.

Now lets derive the normal equation (shown again):

Line (1) is the notation for the derivative of the loss function L(w) with respect to the model parameter w.

Line (2) plugs in the definition of L(w).

Line (3) uses U to simplify the formula and applies the chain rule in differentiation to split the whole derivative into two parts.

Lets focus on these two parts separately.

The first part first:

Line (1) is the first term.

Line (2) replaced the vector U with its element-wise definition.

Line (3) applies the dot product operation on the numerator.

Line (4) explicitly writes down the definition of partial derivative  computes the derivative for each variable, namely, U, U, U, separately, and follow the convention to organize these three derivatives in a row.

Line (5) applies the rule of derivation.

Line (6) plugs in the definition of U, U and U.

Line (7) uses the vector notation to simplify the formula. This gives us a term that is close to the team in the normal equation.

Now we work on the second term:

Line (1) is the second term.

Line (2) expands the element-wise definition of U and w.

Line (3) applies the vector partial derivative rule, and follow the convention that rows go over U, and columns go over w.

Line (4) expands the definition of U.

Line (5) computed the derivatives.

Lien (6) simplifies the formula using the vector notation.

Now we can multiply these two terms together:

Note that at line (5), we have a row vector  remember that when we do w, we follow the convention to write down the partial derivatives with respect to each element of w in a row vector. To turn L(w)/w into a column vector, we transpose it to get the left hand size of the normal equation -2Xᵀ(Y-X · w), and we can drop -2 too because we have an equation.

Video liên quan