The F-statistic for a test of multiple linear restrictions is a staple of introductory econometrics courses.
In the simplest case, it can be written as
\[F \equiv \frac{(SSR_r - SSR_u)/q}{SSR_u/(n - k - 1)}\]
where \(SSR_r\) is the restricted sum of squared residuals, \(SSR_u\) is the unrestricted sum of squared residuals, \(q\) is the number of restrictions, and \((n - k - 1)\) is the degrees of freedom of the unrestricted model.
In my experience, students encountering this expression for the first time find it bewilderingly arbitrary; it becomes just one more item to add to the list of formulas memorized for the exam and promptly forgotten. My aim in this post is to demystify the \(F\)-statistic. By the end, I hope you'll find the form of this expression intuitive, perhaps even obvious.
This is not a post about asymptotic theory, and it is not a post about heteroskedasticity. I won't prove that \(F\) follows an \(F\)-distribution, and I'll blithely assume that we inhabit the idealized textbook realm in which all errors are homoskedastic. I will also dodge the question of whether you should be carrying out an F-test in the first place. This is a post about understanding what the \(F\)-statistic measures and why it takes the form that it does.
The Simplest Possible Example
The easiest way to understand the \(F\)-statistic is through an example so simple that there's no reason to use an \(F\)-test in the first place. Here's a dataset of students' scores on two introductory statistics midterms that I gave a few years ago:
midterms <- read.csv('https://ditraglia.com/econ103/midterms.csv')
head(midterms)
## Midterm1 Midterm2
## 1 57.14 60.71
## 2 77.14 77.86
## 3 83.57 93.57
## 4 88.00 NA
## 5 69.29 72.14
## 6 80.71 89.29
As you can see, there's at least one missing observation: student #4 scored 88% on the first midterm but missed the second. In fact, 9 students missed the second midterm:
summary(midterms)
## Midterm1 Midterm2
## Min. :56.43 Min. :47.86
## 1st Qu.:70.53 1st Qu.:74.64
## Median :80.36 Median :84.29
## Mean :79.74 Mean :81.39
## 3rd Qu.:87.86 3rd Qu.:90.71
## Max. :97.86 Max. :99.29
## NA's :9
To keep this example as simple as possible, I'll drop the missing observations.
midterms <- na.omit(midterms)
Let's call student #4 Natalie: she scored 88% on the first midterm but missed the second. Suppose we wanted to predict how well Natalie would have performed on the second midterm had she taken it. There are many ways we could try to make this prediction. One possibility would be to ignore Natalie's score on the first midterm and predict that she would have scored 81.4 on the second: the average score among all students who took that exam. Another possibility would be to fit a linear regression to the scores of all students who took both exams and use it to project Natalie's score on midterm two based on her score on midterm one. If scores on the two exams are correlated, option two seems like a better idea: Natalie outperformed the class average on midterm one by 8.4 points, or roughly 0.77 standard deviations. It seems reasonable to account for this when predicting her score on the second exam.
In fact, both of these prediction rules can be viewed as special cases of linear regression. Let \(x_i\) denote student \(i\)'s score on midterm one and \(y_i\) denote her score on midterm two. The sample mean \(\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i\) solves the optimization problem
\[
\min_a \sum_{i=1}^n (y_i - a)^2
\]
which is simply least squares without a predictor variable. In contrast, the usual least-squares regression problem is
\[
\min_{a,b} \sum_{i=1}^n (y_i - a - b x_i)^2
\]
with solutions \(\hat{a} = \bar{y} - \hat{b}\bar{x}\) and \(\hat{b} = s_{xy}/s_x^2\), where \(s_{xy}\) is the sample covariance of scores on the two midterms and \(s_x^2\) is the sample variance of scores on the first midterm. Notice how these two optimization problems are related: the first is a restricted (aka constrained) version of the second with the constraint \(b = 0\). In the discussion below, I'll call the first of these the restricted regression and the second the unrestricted regression.
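If you'd like to verify the equivalence between these optimization problems and `lm()`, here is a minimal sketch using simulated data and made-up variable names (not the midterms dataset), base R only:

```r
# Minimal check: the intercept-only fit recovers the sample mean, and
# the bivariate fit recovers the textbook slope/intercept formulas.
set.seed(1234)
x <- rnorm(50, mean = 80, sd = 10)
y <- 30 + 0.6 * x + rnorm(50, sd = 8)

restricted_fit   <- lm(y ~ 1)   # imposes the constraint b = 0
unrestricted_fit <- lm(y ~ x)

all.equal(unname(coef(restricted_fit)), mean(y))            # TRUE
b_hat <- cov(x, y) / var(x)                                 # s_xy / s_x^2
a_hat <- mean(y) - b_hat * mean(x)                          # ybar - bhat * xbar
all.equal(unname(coef(unrestricted_fit)), c(a_hat, b_hat))  # TRUE
```

The `cov()` and `var()` functions both use the same \((n-1)\) denominator, so the denominators cancel in the ratio and the result matches the least-squares slope exactly.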
It's easy to fit these regressions in R.
We'll start with the restricted:
restricted <- lm(Midterm2 ~ 1, data = midterms)
summary(restricted)
##
## Call:
## lm(formula = Midterm2 ~ 1, data = midterms)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.531 -6.746 2.899 9.319 17.899
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 81.391 1.457 55.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.28 on 70 degrees of freedom
The syntax Midterm2 ~ 1 specifies a regression formula containing no predictor variables, only an intercept.
Notice that our estimate of the intercept agrees with the sample mean score on the second midterm from above, as it should!
Turning our attention to the unrestricted regression, we see that scores on the first midterm are strongly predictive of scores on the second:
unrestricted <- lm(Midterm2 ~ Midterm1, data = midterms)
summary(unrestricted)
##
## Call:
## lm(formula = Midterm2 ~ Midterm1, data = midterms)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.809 -7.127 2.047 8.125 18.549
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.575 9.243 3.524 0.000759 ***
## Midterm1 0.613 0.115 5.329 1.17e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.41 on 69 degrees of freedom
## Multiple R-squared: 0.2916, Adjusted R-squared: 0.2813
## F-statistic: 28.4 on 1 and 69 DF, p-value: 1.174e-06
For a pair of students whose scores on the first midterm differed by one point, we would predict a difference of 0.61 points on the second.
The restricted regression ignores a student's score on the first midterm when predicting her score on the second.
But we've seen from the unrestricted regression that scores on midterm #1 are strongly correlated with scores on midterm #2.
As such, our best bet is to predict Natalie's second midterm score using the unrestricted regression model:
predict(unrestricted, newdata = data.frame(Midterm1 = 88))
## 1
## 86.52169
Because Natalie scored above the mean on the first exam, we predict that she will score above the mean on the second exam.
How much better is the fit of the unrestricted regression?
While I didn't present it this way, the choice between the restricted and unrestricted regressions above could be formulated as a hypothesis test. The restricted regression imposes a zero regression slope, while the unrestricted regression does not. In this case, we can test the restriction that the slope is truly zero using a simple t-test.
Based on the t-statistic of 5.33 from above, we would easily reject the restriction at any conventional significance level.
But there's another way to carry out the same test.
Although it would be overkill in this example, the ideas that underlie it can be used to carry out tests in more complicated situations where a simple t-test wouldn't suffice.
Rather than examining the slope estimate from the unrestricted regression, this alternative approach compares the sum of squared residuals (SSR) of the two regressions to see which does a better job of fitting the observed data.
It's easy to compute the SSR of the two regressions from above using the residuals() function:
SSR_u <- sum(residuals(unrestricted)^2)
SSR_r <- sum(residuals(restricted)^2)
c(Unrestricted = SSR_u, Restricted = SSR_r)
## Unrestricted Restricted
## 7475.741 10552.883
The SSR of the restricted regression is higher than that of the unrestricted regression.
But what exactly should we make of this?
A picture may help make things clearer.
This one has two panels: one for the restricted regression and another for the unrestricted regression.
Each panel plots the observations from the midterms dataset along with the fitted regression line, using dashed vertical lines to indicate the residuals: the distance from a given observation to the regression line.
Notice that the restricted regression line is flat because it doesn't use scores on the first midterm to predict those on the second.
The lower SSR of the unrestricted model reflects the fact that the observations in the midterms dataset are on average closer to a line with slope 0.6 and intercept 32.6 than they are to a line with slope zero and intercept 81.4.
To understand this picture, it helps to consider the following question: is it possible for the unrestricted regression to have a higher SSR than the restricted one?
Recall from above that each of these regressions is the solution to an optimization problem.
The difference between them is that the restricted regression imposes a constraint while the unrestricted regression does not.
If the best slope for predicting second midterm scores using first midterm scores is zero, the unrestricted regression is free to set \(b = 0\).
In this case its estimates would coincide with those of the restricted regression.
On the other hand, if the best slope isn't zero, then by definition some other choice of \(b\) yields a lower SSR: linear regression chooses the line whose slope and intercept minimize the squared vertical deviations between the data and the line.
The restricted regression is forced to have \(b = 0\), so in this case it must do a worse job of fitting the data.
This reasoning shows that the SSR of the restricted model will always be at least as large as that of the unrestricted model.
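You can check this inequality on any dataset you like. Here is a quick sanity check on simulated data (hypothetical, not the midterms) where the true slope is zero, so even in the best case for the restricted model the unconstrained fit still wins in sample:

```r
# Even when y is unrelated to x by construction, the unconstrained
# optimum can never fit worse than the constrained one.
set.seed(42)
x <- rnorm(30)
y <- rnorm(30)   # true slope is zero
SSR_restricted   <- sum(residuals(lm(y ~ 1))^2)
SSR_unrestricted <- sum(residuals(lm(y ~ x))^2)
SSR_restricted >= SSR_unrestricted   # TRUE
```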
But there's still the question of how much better it fits.
Taking differences tells us how much larger the SSR of the restricted model is compared to that of the unrestricted model:
SSR_r - SSR_u
## [1] 3077.142
So is this a big difference or a small one?
The answer depends on the units in which \(y\) is measured.
A residual is a vertical deviation, i.e. a distance along the \(y\)-axis.
This means that it has the same units as the \(y\)-variable.
If \(y\) is measured in inches, so are the residuals; if \(y\) is measured in kilometers, so are the residuals.
Because the SSR is a sum of squared residuals, it has the same units as \(y^2\).
If \(y\) is measured in inches, the SSR is measured in square inches; if \(y\) is measured in kilometers, the SSR is measured in square kilometers.
Changing the units of \(y\) changes the units of the SSR.
For example, an SSR of 1 becomes an SSR of one million if we change the units of \(y\) from kilometers to meters.
Accordingly, a comparison of SSR_r to SSR_u is meaningless unless we account for the units of \(y\).
The easiest way to account for units is to eliminate them from the problem.
This is precisely what we do when we carry out a t-test: \(\bar{x}/\text{SE}(\bar{x})\) is unitless because the standard error of \(\bar{x}\) has the same units as \(\bar{x}\) itself. Any change of units in the numerator is cancelled out in the denominator.
This is a crucial point: test statistics are unitless.
We don't compare \(\bar{x}\) to a table of normal critical values measured in inches for a distribution with standard deviation \(2.4\); we compare \(\bar{x}/2.4\) to a unitless standard normal distribution.
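To convince yourself that the t-statistic really is unitless, try rescaling the outcome variable. A sketch with simulated data (variable names made up for illustration):

```r
# The slope and its standard error both scale with the units of y,
# so their ratio -- the t-statistic -- does not change.
set.seed(7)
x <- rnorm(40)
y_km <- 1 + 0.5 * x + rnorm(40)   # y in "kilometers", say
y_m  <- 1000 * y_km               # the same y in "meters"
t_km <- coef(summary(lm(y_km ~ x)))["x", "t value"]
t_m  <- coef(summary(lm(y_m  ~ x)))["x", "t value"]
all.equal(t_km, t_m)   # TRUE
```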
The t-statistic eliminates units by taking a ratio, so let's try the same idea in our comparison of SSR_r to SSR_u.
There are many possibilities, and any of them would work just as well from the perspective of eliminating units.
The F-test statistic is based on a ratio that asks how much worse the restricted model fits relative to the unrestricted regression.
In other words, we ask: how much larger is SSR_r compared to SSR_u, expressed as a proportion in decimal terms?
(SSR_r - SSR_u) / SSR_u
## [1] 0.4116171
There is nothing subtle going on here.
If we wanted to know how much larger US GDP is in 2021 compared to 1921, we would simply calculate
\[
\frac{\text{GDP}_{2021} - \text{GDP}_{1921}}{\text{GDP}_{1921}}
\]
assuming, of course, that both of these figures are corrected for inflation!
This is precisely the same reasoning we used above: the SSR "grows" when we impose a restriction.
We want to know how much it grows as a proportion.
The answer is 0.41, or equivalently 41%.
Sampling Uncertainty
We've nearly arrived at the F-statistic.
To see what's missing, we'll use a bit of algebra to re-write it as
\[F \equiv \frac{(SSR_r - SSR_u)/q}{SSR_u/(n - k - 1)} = \left(\frac{SSR_r - SSR_u}{SSR_u}\right)\left(\frac{n - k - 1}{q}\right).\]
We obtained the first factor on the RHS, \((SSR_r - SSR_u)/SSR_u\), simply by reasoning about units and the nature of constrained versus unconstrained optimization problems.
Stop for a minute and appreciate how impressive this is: simple intuition has taken us halfway to this rather formidable-looking expression.
To understand the second factor, we need to think about sampling uncertainty.
In the midterms example we found that the restricted regression SSR was 41% larger than the unrestricted one.
Is this a big difference or a small one?
Units don't enter into it, because we have already eliminated them.
But the midterms dataset only contains information on 71 students.
If we merely want to summarize the relationship between test scores for these students, there is no role for statistical inference: summary statistics suffice.
Tests and confidence intervals enter the picture when we hope to generalize from an observed sample to the population from which it was drawn.
Imagine a large population of introductory statistics students who took my two midterms.
Now suppose that we observe a random sample of 71 students from this population.
How much information do the observed exam scores for these students provide about the relationship between midterm scores in the population?
The larger the sample size, the more evidence an observed difference in the sample provides about a potential difference in the population.
We can see this in the expression for the standard error of the sample mean: \(\text{SE}(\bar{X}) = \sigma/\sqrt{n}\).
The larger the sample size, the smaller the standard error, all else equal.
Accordingly, given two datasets with identical summary statistics, the larger sample will have the larger t-statistic.
The same reasoning applies to the F-statistic above.
The numerator \((n - k - 1)\) in the second factor increases with the sample size \(n\).
This magnifies the effect of the first factor.
An \(SSR_r\) that is 41% higher than \(SSR_u\) is "more impressive" evidence when the sample size is \(1000\) than when it is \(10\).
Small samples are intrinsically more variable than large ones, so we should expect them to turn up anomalous results more frequently.
The F-statistic takes this into account.
So why \((n - k - 1)\) rather than \(n\)?
This is a so-called "degrees of freedom correction."
By estimating \(k\) regression slope parameters and \(1\) intercept parameter, we "use up" \((k + 1)\) of the observations, leaving only \((n - k - 1)\) pieces of truly independent information.
This is not particularly intuitive.
In a shocking departure from my usual advice to introductory statistics and econometrics students, I suggest that you simply memorize this part of the F-statistic.
It may help to notice that the same degrees of freedom correction appears in the expression for the standard error of the regression, \(SER \equiv \sqrt{SSR/(n - k - 1)}\),
a measure of the average distance that the observed data fall from the regression line.
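Indeed, this is exactly the quantity R reports as the "residual standard error" in the regression summaries above. A quick self-contained check on simulated data:

```r
# sqrt(SSR / (n - k - 1)) matches what summary.lm() calls sigma.
set.seed(99)
n  <- 60
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 2 + x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)               # k = 2 slopes plus an intercept
SER <- sqrt(sum(residuals(fit)^2) / (n - 2 - 1))
all.equal(SER, summary(fit)$sigma)   # TRUE
```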
The only as-yet-unexplained quantity in the F-test statistic is \(q\).
This denotes the number of restrictions imposed by the restricted model.
Counting restrictions is the same thing as counting equals signs.
In a regression of the form \(Y = \beta_0 + \beta_1 X + U\), a restriction of the form \(\beta_1 = 1\) gives \(q = 1\) because it takes a single equals sign to say that \(\beta_1\) equals one.
More complicated regressions allow for more complicated kinds of restrictions.
For example, in the regression
\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + U
\]
we could consider the restriction \(\beta_1 = \beta_2 = \beta_3 = 0\), yielding \(q = 3\).
Alternatively, we could consider \(\beta_0 = \beta_2 = 7\), yielding \(q = 2\).
We could even consider \(\beta_1 + \beta_2 = 1\), yielding \(q = 1\).
Again: to count the number of restrictions, count the number of equals signs it takes to express them.
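To make the single-equals-sign case concrete, one way to impose \(\beta_1 + \beta_2 = 1\) by hand is substitution: setting \(\beta_2 = 1 - \beta_1\) turns the regression into \(Y - X_2 = \beta_0 + \beta_1(X_1 - X_2) + \beta_3 X_3 + U\). The sketch below uses simulated data in which the restriction holds; in practice a helper like car::linearHypothesis() automates this, but rolling it by hand shows the mechanics:

```r
# Imposing beta1 + beta2 = 1 (so q = 1) by substitution.
set.seed(2023)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.6 * x1 + 0.4 * x2 + 0.2 * x3 + rnorm(n)   # 0.6 + 0.4 = 1: restriction true
unres <- lm(y ~ x1 + x2 + x3)
res   <- lm(I(y - x2) ~ I(x1 - x2) + x3)   # beta2 = 1 - beta1 substituted in
SSR_u_sub <- sum(residuals(unres)^2)
SSR_r_sub <- sum(residuals(res)^2)
q <- 1; k <- 3
F_stat <- ((SSR_r_sub - SSR_u_sub) / SSR_u_sub) * (n - k - 1) / q
p_val  <- 1 - pf(F_stat, q, n - k - 1)     # p-value for the single restriction
```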
Now we know how to determine \(q\), but the question remains: why does it enter the F-test statistic?
Above we discussed why \(SSR_u\) cannot exceed \(SSR_r\) in a given dataset.
Now we have to think about what happens when sampling uncertainty enters the picture.
The sum of squared residuals measures how well a linear regression model fits the observed dataset.
Crucially, this is the very same dataset that was used to calculate the regression slope and intercept.
In effect, we have used the data twice: first to determine the parameter values that minimize the sum of squared vertical deviations, and then to assess how well our regression fits the data, as measured by those same vertical deviations.
The danger lurking here is a phenomenon called overfitting.
We're not really interested in how well the regression fits this dataset; what we want to know is how well it would help us predict future observations.
In-sample fit, as measured by SSR or related quantities, can be shown to be an over-optimistic measure of out-of-sample fit.
This is well known to machine learning practitioners, who typically use one dataset to fit their models, the training data, and a separate dataset, the test data, to evaluate their predictive performance.
In general, the more "flexible" the model, the worse the overfitting problem becomes.
Because adding restrictions reduces a model's flexibility, this creates a challenge for any procedure that compares the in-sample fit of two regressions.
Because it is less flexible, we should expect the restricted regression to fit the sample data less well than the unrestricted regression even when the restrictions are true in the population.
To drive this point home, I generated a dataset called sim_dat in which \(Y_i = \alpha + \epsilon_i\) where \(\epsilon_i \sim \text{Normal}(0, 1)\).
I then simulated a large number of regressors \((X_{i1}, X_{i2}, \dots, X_{iq})\) completely independently of \(Y_i\).
(For the simulation code, see the appendix below.)
In the population from which I simulated my data, none of these regressors contains any information for predicting \(Y\).
Nevertheless, if \(q\) is relatively large compared to the sample size \(n\), some of these regressors will appear to be correlated with \(Y\) in the observed data, purely because of sampling variability.
In this example I set \(n = 100\) and \(q = 50\).
Fitting a restricted regression with only an intercept, and an unrestricted regression that includes all 50 regressors from sim_dat, we obtain
reg_sim_unrestricted <- lm(y ~ ., sim_dat)
reg_sim_restricted <- lm(y ~ 1, sim_dat)
SSR_sim_r <- sum(residuals(reg_sim_restricted)^2)
SSR_sim_u <- sum(residuals(reg_sim_unrestricted)^2)
(SSR_sim_r - SSR_sim_u) / SSR_sim_u
## [1] 0.9144509
Although the restrictions are true in this simulation study, the restricted SSR is 91% larger than the unrestricted SSR purely because of sampling variability.
The F-statistic explicitly takes this phenomenon into account through the scaling factor \((n - k - 1)/q\).
What matters is not the sample size per se, but the sample size relative to the number of restrictions imposed by the restricted regression.
While We're Here, We Might as Well Carry Out the Test!
Now that we understand why the F-test statistic takes the form it does, let's carry out the F-test in each of the two examples from above: midterms and sim_dat.
Under the null hypothesis that the constraints imposed by the restricted regression are correct, the F-test statistic follows an \(F(q, n - k - 1)\) distribution.
For the midterms dataset, our test statistic is
F_midterms <- ((SSR_r - SSR_u) / SSR_u) * (71 - 1 - 1) / 1
F_midterms
## [1] 28.40158
while the 10%, 5% and 1% critical values for an \(F(1, 69)\) distribution are
alpha <- c(0.1, 0.05, 0.01)
qf(1 - alpha, df1 = 1, df2 = 69)
## [1] 2.779684 3.979807 7.017078
The associated p-value is
1 - pf(F_midterms, df1 = 1, df2 = 69)
## [1] 1.173624e-06
so we resoundingly reject the restriction: first midterm scores clearly do help to predict second midterm scores.
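As an aside, R's anova() function carries out this whole comparison in one line: given a pair of nested fits, it reports exactly the F-statistic we built by hand. A self-contained sketch with simulated data standing in for the midterms (variable names made up for illustration):

```r
# anova() on nested models reproduces the hand-rolled F-statistic.
set.seed(314)
n <- 71
x <- rnorm(n, mean = 80, sd = 10)
y <- 30 + 0.6 * x + rnorm(n, sd = 10)
res_fit   <- lm(y ~ 1)
unres_fit <- lm(y ~ x)
ssr_r <- sum(residuals(res_fit)^2)
ssr_u <- sum(residuals(unres_fit)^2)
F_by_hand <- ((ssr_r - ssr_u) / ssr_u) * (n - 1 - 1) / 1
F_anova   <- anova(res_fit, unres_fit)$F[2]   # second row holds the comparison
all.equal(F_by_hand, F_anova)   # TRUE
```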
The sim_dat example gives a very different result.
The test statistic in this example is well below any standard critical value, and the p-value is very large:
F_sim <- ((SSR_sim_r - SSR_sim_u) / SSR_sim_u) * (100 - 50 - 1) / 50
F_sim
## [1] 0.8961619
qf(1 - alpha, df1 = 50, df2 = 49)
## [1] 1.444392 1.604442 1.957803
1 - pf(F_sim, df1 = 50, df2 = 49)
## [1] 0.6497498
In this case we would fail to reject the restrictions.
Indeed, the restrictions are true: my simulation generated regressors that are completely independent of \(Y\)!
The Bottom Line
The F-statistic is a product of two factors.
The first factor measures how much larger the sum of squared residuals becomes, in proportional terms, when we impose the restrictions. We use a relative comparison to eliminate units from the problem. The second factor accounts for sampling variability. The larger the sample size \(n\) relative to the number of restrictions \(q\), the more we "inflate" the value of the first factor. The only thing you need to memorize is the degrees of freedom correction: \((n - k - 1)\).
Appendix: Code
I used the following code to generate my plot comparing the SSR of the restricted and unrestricted regression models in the midterm exams dataset from above:
par(mfrow = c(1, 2))
plot(Midterm2 ~ Midterm1, data = midterms, main = 'Unrestricted', pch = 20)
abline(coef(unrestricted), lwd = 2, col = 'blue')
with(midterms, segments(x0 = Midterm1, y0 = Midterm2,
x1 = Midterm1, y1 = fitted(unrestricted),
col = 'blue', lty = 2, lwd = 1))
text(90, 50, bquote(SSR == .(round(SSR_u))))
plot(Midterm2 ~ Midterm1, data = midterms, main = 'Restricted', pch = 20)
with(midterms, segments(x0 = Midterm1, y0 = Midterm2,
x1 = Midterm1, y1 = fitted(restricted),
col = 'purple', lty = 2, lwd = 1))
abline(h = coef(restricted), lwd = 2, col = 'purple')
text(90, 50, bquote(SSR == .(round(SSR_r))))
par(mfrow = c(1,1))
and the following code to generate the data contained in sim_dat:
set.seed(3817)
n <- 100
q <- 50
y <- 0.5 + rnorm(n)
x <- matrix(rnorm(n * q), n, q)
colnames(x) <- paste0('x', 1:q)
sim_dat <- data.frame(x, y)
