Why is the lasso interesting?
The least absolute shrinkage and selection operator (lasso) estimates model coefficients, and these estimates can be used to select which covariates should be included in a model. The lasso is used for outcome prediction and for inference about causal parameters. In this post, we provide an introduction to the lasso and discuss using the lasso for prediction. In the next post, we discuss using the lasso for inference about causal parameters.
The lasso is most useful when a few of many potential covariates affect the outcome and it is important to include only the covariates that have an effect. "Few" and "many" are defined relative to the sample size. In the example discussed below, we observe the most recent health-inspection scores for 600 restaurants, and we have 100 covariates that could potentially affect each one's score. We have too many potential covariates because we cannot reliably estimate 100 coefficients from 600 observations. We believe that only about 10 of the covariates are important, and we feel that 10 covariates are "a few" relative to 600 observations.
Given that only a few of the many covariates affect the outcome, the problem is that we do not know which covariates are important and which are not. The lasso produces estimates of the coefficients and solves this covariate-selection problem.
There are technical terms for our example situation. A model with more covariates than you can reliably estimate from the available sample size is known as a high-dimensional model. The assumption that the number of coefficients that are nonzero in the true model is small relative to the sample size is known as a sparsity assumption. More realistically, the approximate sparsity assumption requires that the number of nonzero coefficients in the model that best approximates the real world be small relative to the sample size.
In these technical terms, the lasso is most useful when estimating the coefficients in a high-dimensional, approximately sparse, model.
High-dimensional models are nearly ubiquitous in prediction problems and in models that use flexible functional forms. In many cases, the many potential covariates are created from polynomials, splines, or other functions of the original covariates. In other cases, the many potential covariates come from administrative data, social media, or other sources that naturally produce huge numbers of potential covariates.
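As a concrete illustration of the first case, a few raw covariates can be expanded into a long covariate list with Stata's factor-variable notation. The sketch below is hypothetical (the outcome y and the covariates x1, x2, and x3 are not in our data); it simply shows how quadratic terms and pairwise interactions inflate the number of coefficients the lasso must consider.
. * hypothetical example: y, x1, x2, and x3 are not in hsafety2.dta
. * c.(x1 x2 x3)##c.(x1 x2 x3) expands to the main effects, the squares,
. * and the pairwise interactions of the three raw covariates
. lasso linear y c.(x1 x2 x3)##c.(x1 x2 x3), rseed(12345)
With more raw variables, or with splines instead of quadratics, the covariate count grows quickly, which is how flexible functional forms turn into high-dimensional models.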
Predicting restaurant inspection scores
We use a series of examples to make our discussion of the lasso more accessible. These examples use some simulated data from the following problem. A health inspector in a small U.S. city wants to use social-media reviews to predict the health-inspection scores of restaurants. The inspector plans to add surprise inspections for the restaurants with the lowest predicted health scores, using our predictions.
hsafety2.dta has 1 observation for each of 600 restaurants, and the score from the most recent inspection is in score. The percentage of a restaurant's social-media reviews that contain a word like "dirty" could predict the inspection score. We identified 50 words, 30 word pairs, and 20 phrases whose occurrence percentages in reviews written in the three months prior to an inspection could predict the inspection score. The occurrence percentages of the 50 words are in word1–word50. The occurrence percentages of the 30 word pairs are in wpair1–wpair30. The occurrence percentages of the 20 phrases are in phrase1–phrase20.
Researchers widely use the following steps to find the best predictor.
- Divide the sample into training and validation subsamples.
- Use the training data to estimate the model parameters of each of the competing estimators.
- Use the validation data to estimate the out-of-sample mean squared error (MSE) of the predictions produced by each competing estimator.
- The best predictor is the estimator that produces the smallest out-of-sample MSE.
The ordinary least-squares (OLS) estimator is frequently included as a benchmark estimator when it is feasible. We begin the process by splitting the sample and computing the OLS estimates.
In the output below, we read the data into memory and use splitsample with the option split(.75 .25) to generate the variable sample, which is 1 for 75% of the sample and 2 for the remaining 25% of the sample. The assignment of each observation in sample to 1 or 2 is random, but the rseed option makes the random assignment reproducible.
. use hsafety2

. splitsample , generate(sample) split(.75 .25) rseed(12345)

. label define slabel 1 "Training" 2 "Validation"

. label values sample slabel

. tabulate sample

     sample |      Freq.     Percent        Cum.
------------+-----------------------------------
   Training |        450       75.00       75.00
 Validation |        150       25.00      100.00
------------+-----------------------------------
      Total |        600      100.00
The one-way tabulation of sample produced by tabulate verifies that sample contains the requested 75%–25% division.
Next, we compute the OLS estimates using the data in the training sample and store the results in memory as ols.
. quietly regress score word1-word50 wpair1-wpair30 phrase1-phrase20
>     if sample==1

. estimates store ols
Now, we use lassogof with the option over(sample) to compute the in-sample (Training) and out-of-sample (Validation) estimates of the MSE.
. lassogof ols, over(sample)

Penalized coefficients
-------------------------------------------------------------
Name             sample |        MSE    R-squared        Obs
------------------------+------------------------------------
ols                     |
               Training |   24.43515       0.5430        450
             Validation |   35.53149       0.2997        150
-------------------------------------------------------------
As expected, the estimated MSE is much smaller in the Training subsample than in the Validation sample. The out-of-sample estimate of the MSE is the more reliable estimator of the prediction error; see, for example, chapters 1, 2, and 3 in Hastie, Tibshirani, and Friedman (2009).
Next, we introduce the lasso and compare its estimated out-of-sample MSE with the one produced by OLS.
What is a lasso?
The lasso is an estimator of the coefficients in a model. What makes the lasso special is that some of the coefficient estimates are exactly zero, while others are not. The lasso selects covariates by excluding the covariates whose estimated coefficients are zero and including the covariates whose estimates are not zero. There are no standard errors for the lasso estimates. The lasso's ability to work as a covariate-selection method makes it a nonstandard estimator and prevents the estimation of standard errors. In the next post, we discuss how to use the lasso to answer inferential questions.
Tibshirani (1996) derived the lasso, and Hastie, Tibshirani, and Wainwright (2015) provide a textbook introduction.
The remainder of this section provides some details about the mechanics of how the lasso produces its coefficient estimates. There are different versions of the lasso for linear and nonlinear models. Versions of the lasso for linear models, logistic models, and Poisson models are available in Stata 16. We discuss only the lasso for the linear model, but the points we make generalize to the lasso for nonlinear models.
Like many estimators, the lasso for linear models solves an optimization problem. Specifically, the linear lasso point estimates \(\widehat{\boldsymbol{\beta}}\) are given by
$$
\widehat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}}
\left\{
\frac{1}{2n} \sum_{i=1}^n\left(y_i - {\bf x}_i\boldsymbol{\beta}'\right)^2
+\lambda\sum_{j=1}^p\omega_j\vert\beta_j\vert
\right\}
$$
where
- \(\lambda>0\) is the lasso penalty parameter,
- \(y\) is the outcome variable,
- \({\bf x}\) contains the \(p\) potential covariates,
- \(\boldsymbol{\beta}\) is the vector of coefficients on \({\bf x}\),
- \(\beta_j\) is the \(j\)th element of \(\boldsymbol{\beta}\),
- the \(\omega_j\) are parameter-level weights known as penalty loadings, and
- \(n\) is the sample size.
There are two terms in this optimization problem: the least-squares fit measure
$$\frac{1}{2n} \sum_{i=1}^n\left(y_i - {\bf x}_i\boldsymbol{\beta}'\right)^2$$
and the penalty term
$$\lambda\sum_{j=1}^p\omega_j\vert\beta_j\vert$$
The parameters \(\lambda\) and the \(\omega_j\) are called "tuning" parameters. They specify the weight applied to the penalty term. When \(\lambda=0\), the linear lasso reduces to the OLS estimator. As \(\lambda\) increases, the magnitude of all the estimated coefficients is "shrunk" toward zero. This shrinkage occurs because the cost of each nonzero \(\widehat{\beta}_j\) increases with the penalty term, which increases as \(\lambda\) increases.
The penalty term includes the absolute value of each \(\beta_j\). The absolute-value function has a kink, sometimes called a check, at zero. The kink in the contribution of each coefficient to the penalty term causes some of the estimated coefficients to be exactly zero at the optimal solution. See section 2.2 of Hastie, Tibshirani, and Wainwright (2015) for more details.
There is a value \(\lambda_{\rm max}\) for which all the estimated coefficients are exactly zero. As \(\lambda\) decreases from \(\lambda_{\rm max}\), the number of nonzero coefficient estimates increases. For \(\lambda\in(0,\lambda_{\rm max})\), some of the estimated coefficients are exactly zero and some of them are not zero. When you use the lasso for covariate selection, covariates with estimated coefficients of zero are excluded, and covariates with estimated coefficients that are not zero are included.
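For reference, \(\lambda_{\rm max}\) has a simple closed form in the linear case. The expression below follows from the first-order conditions of the optimization problem above; it is a standard derivation stated here under the assumption that \(y\) has been centered, not a formula quoted from the Stata documentation:
$$
\lambda_{\rm max} = \max_{1\leq j\leq p}\,\frac{1}{n\,\omega_j}\left\vert\sum_{i=1}^n x_{ij}\,y_i\right\vert
$$
Any \(\lambda\geq\lambda_{\rm max}\) sets every \(\widehat{\beta}_j\) to zero, which is why the "first lambda" reported in the lasso output below has zero nonzero coefficients.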
That the number of potential covariates \(p\) can be greater than the sample size \(n\) is a much-discussed advantage of the lasso. It is important to remember that the approximate sparsity assumption requires that the number of covariates that belong in the model (\(s\)) be small relative to \(n\).
Selecting the lasso tuning parameters
The tuning parameters must be selected before using the lasso for prediction or model selection. The most common methods for selecting the tuning parameters are cross-validation (CV), the adaptive lasso, and plug-in methods. In addition, \(\lambda\) is sometimes set by hand in a sensitivity analysis.
CV finds the \(\lambda\) that minimizes the out-of-sample MSE of the predictions. The mechanics of CV mimic the process of using split samples to find the best out-of-sample predictor. The details are presented in the appendix.
CV is the default method for selecting the tuning parameters in the lasso command. In the output below, we use lasso to estimate the coefficients in the model for score, using the training sample. We specified the option rseed() to make our CV results reproducible.
. lasso linear score word1-word50 wpair1-wpair30 phrase1-phrase20
>     if sample==1, nolog rseed(12345)

Lasso linear model                          No. of obs        =        450
                                            No. of covariates =        100
Selection: Cross-validation                 No. of CV folds   =         10

--------------------------------------------------------------------------
         |                                No. of      Out-of-      CV mean
         |                               nonzero       sample   prediction
      ID |      Description      lambda     coef.   R-squared        error
---------+----------------------------------------------------------------
       1 |     first lambda    3.271123         0      0.0022       53.589
      25 |     lambda before   .3507518        22      0.3916     32.53111
    * 26 |   selected lambda    .319592        25      0.3917     32.52679
      27 |      lambda after   .2912003        26      0.3914     32.53946
      30 |       last lambda   .2202824        30      0.3794     33.18254
--------------------------------------------------------------------------
* lambda selected by cross-validation.

. estimates store cv
We specified the option nolog to suppress the CV log over the candidate values of \(\lambda\). The output shows that CV selected a \(\lambda\) for which 25 of the 100 covariates have nonzero coefficients. We used estimates store to store these results in memory under the name cv.
We use cvplot to plot the CV function.
. cvplot, minmax
The CV function appears somewhat flat near the optimal \(\lambda\), which implies that nearby values of \(\lambda\) would produce similar out-of-sample MSEs.
The number of included covariates can vary substantially over the flat part of the CV function. We can investigate this variation in the number of selected covariates using a table known as a lasso knot table. In the jargon of lasso, a knot is a value of \(\lambda\) at which a covariate is added to or removed from the set of covariates with nonzero coefficients. We use lassoknots to display the table of knots.
. lassoknots
-------------------------------------------------------------------------------------
       |                     No. of     CV mean |
       |                    nonzero       pred. |  Variables (A)dded, (R)emoved,
    ID |      lambda          coef.       error |   or left (U)nchanged
-------+---------------------------------------+-------------------------------------
     2 |    2.980526              2     52.2861 |  A  phrase3 phrase4
     3 |    2.715744              3    50.48463 |  A  phrase5
     4 |    2.474485              4    48.55981 |  A  word3
     6 |    2.054361              5    44.51782 |  A  phrase6
     9 |    1.554049              6    40.23385 |  A  wpair3
    10 |    1.415991              8    39.04494 |  A  wpair2 phrase2
    12 |    1.175581              9      36.983 |  A  word2
    14 |    .9759878             10    35.42697 |  A  word31
    16 |    .8102822             11     34.2115 |  A  word19
    17 |     .738299             12    33.75501 |  A  word4
    21 |    .5088809             14    32.74808 |  A  word14 phrase7
    22 |    .4636733             17    32.64679 |  A  word32 wpair19 wpair26
    23 |    .4224818             19    32.56572 |  A  wpair15 wpair25
    24 |    .3849497             22    32.53301 |  A  wpair24 phrase13 phrase14
  * 26 |     .319592             25    32.52679 |  A  word25 word30 phrase8
    27 |    .2912003             26    32.53946 |  A  wpair11
    29 |    .2417596             27    32.86193 |  A  wpair17
    30 |    .2202824             30    33.18254 |  A  word23 word38 wpair4
-------------------------------------------------------------------------------------
* lambda selected by cross-validation.
The CV function is minimized at the \(\lambda\) with ID=26, and the lasso includes 25 covariates at this \(\lambda\) value. The flat part of the CV function includes the \(\lambda\) values with ID \(\in\{21,22,23,24,26,27\}\). Only 14 covariates are included by the lasso using the \(\lambda\) at ID=21. We will explore this observation using sensitivity analysis below.
CV tends to include extra covariates whose coefficients are zero in the model that best approximates the process that generated the data. This can affect the prediction performance of the CV-based lasso, and it can affect the performance of inferential methods that use a CV-based lasso for model selection. The adaptive lasso is a multistep version of CV. It was designed to exclude some of these extra covariates.
The first step of the adaptive lasso is CV. The second step does CV among the covariates selected in the first step. In this second step, the penalty loadings are \(\omega_j=1/\vert\widehat{\boldsymbol{\beta}}_j\vert\), where \(\widehat{\boldsymbol{\beta}}_j\) are the penalized estimates from the first step. Covariates with smaller-magnitude coefficients are more likely to be excluded in the second step. See Zou (2006) and Bühlmann and Van de Geer (2011) for more about the adaptive lasso and the tendency of the CV-based lasso to overselect. Also see Chetverikov, Liao, and Chernozhukov (2019) for formal results for the CV lasso and results that could explain this overselection tendency.
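Putting the two pieces together, the second step solves the same problem as before, but only over the covariates selected in the first step and with the data-dependent loadings. The display below merely restates the earlier objective with those choices made explicit, where \(\widehat{S}\) denotes the set of first-step selections and \(\widehat{\beta}_j^{(1)}\) the first-step penalized estimates:
$$
\widehat{\boldsymbol{\beta}}^{\rm\,adaptive} = \arg\min_{\boldsymbol{\beta}\,:\,\beta_j=0\ {\rm for}\ j\notin\widehat{S}}
\left\{
\frac{1}{2n}\sum_{i=1}^n\left(y_i - {\bf x}_i\boldsymbol{\beta}'\right)^2
+\lambda\sum_{j\in\widehat{S}}\frac{\vert\beta_j\vert}{\vert\widehat{\beta}_j^{(1)}\vert}
\right\}
$$
Covariates whose first-step estimates are small in magnitude receive large loadings, so they are penalized more heavily in the second step.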
We specify the option selection(adaptive) below to cause lasso to use the adaptive lasso instead of CV to select the tuning parameters. We use estimates store to store the results under the name adaptive.
. lasso linear score word1-word50 wpair1-wpair30 phrase1-phrase20
>     if sample==1, nolog rseed(12345) selection(adaptive)

Lasso linear model                          No. of obs         =        450
                                            No. of covariates  =        100
Selection: Adaptive                         No. of lasso steps =          2

Final adaptive step results
--------------------------------------------------------------------------
         |                                No. of      Out-of-      CV mean
         |                               nonzero       sample   prediction
      ID |      Description      lambda     coef.   R-squared        error
---------+----------------------------------------------------------------
      31 |     first lambda    124.1879         0      0.0037     53.66569
      77 |     lambda before   1.719861        12      0.4238     30.81155
    * 78 |   selected lambda   1.567073        12      0.4239      30.8054
      79 |      lambda after   1.427859        14      0.4237     30.81533
     128 |       last lambda    .0149585       22      0.4102     31.53511
--------------------------------------------------------------------------
* lambda selected by cross-validation in final adaptive step.

. estimates store adaptive
We see that the adaptive lasso included 12 covariates instead of 25.
Plug-in methods tend to be even more parsimonious than the adaptive lasso. Plug-in methods find the value of \(\lambda\) that is large enough to dominate the estimation noise. The plug-in method chooses the \(\omega_j\) to normalize the scores of the (unpenalized) fit measure for each parameter. Given the normalized scores, it chooses a value of \(\lambda\) that is greater than the largest normalized score with a probability that is close to 1.
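To give a sense of what such a rule looks like, one representative choice from this literature, in the spirit of Belloni et al. (2012), is shown below. The exact constants and the scaling of \(\lambda\) differ across papers and software implementations, so treat this display as an illustration rather than the formula the command uses:
$$
\lambda = \frac{c}{\sqrt{n}}\,\Phi^{-1}\!\left(1-\frac{\gamma}{2p}\right),
\qquad c=1.1,\qquad \gamma=0.1/\ln\{\max(p,n)\}
$$
where \(\Phi^{-1}(\cdot)\) is the standard normal quantile function and the penalty loadings \(\omega_j\) scale each score by an estimate of its standard deviation. The \(1-\gamma\) quantile is what makes \(\lambda\) exceed the largest normalized score "with a probability that is close to 1".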
The plug-in-based lasso is much faster than the CV-based lasso and the adaptive lasso. In practice, the plug-in-based lasso tends to include the important covariates, and it is very good at not including covariates that do not belong in the model that best approximates the data. The plug-in-based lasso does risk missing some covariates with large coefficients and finding only some of the covariates with small coefficients. See Belloni, Chernozhukov, and Wei (2016) and Belloni et al. (2012) for details and formal results.
We specify the option selection(plugin) below to cause lasso to use the plug-in method to select the tuning parameters. We use estimates store to store the results under the name plugin.
. lasso linear score word1-word50 wpair1-wpair30 phrase1-phrase20
>     if sample==1, selection(plugin)

Computing plugin lambda ...
Iteration 1:   lambda = .1954567   no. of nonzero coef. = 8
Iteration 2:   lambda = .1954567   no. of nonzero coef. = 9
Iteration 3:   lambda = .1954567   no. of nonzero coef. = 9

Lasso linear model                          No. of obs        =        450
                                            No. of covariates =        100
Selection: Plugin                           heteroskedastic

--------------------------------------------------------------------------
         |                                No. of
         |                               nonzero    In-sample
      ID |      Description      lambda     coef.   R-squared          BIC
---------+----------------------------------------------------------------
     * 1 |   selected lambda    .1954567         9      0.3524     2933.203
--------------------------------------------------------------------------
* lambda selected by plugin formula assuming heteroskedastic errors.

. estimates store plugin
The plug-in-based lasso included 9 of the 100 covariates, far fewer than the CV-based lasso or the adaptive lasso included.
Comparing the predictors
We now have four different predictors for score: OLS, the CV-based lasso, the adaptive lasso, and the plug-in-based lasso. The three lasso methods could predict score using the penalized coefficients estimated by lasso, or they could predict score using the unpenalized coefficients estimated by OLS, including only the covariates selected by the lasso. The predictions that use the penalized lasso estimates are known as the lasso predictions, and the predictions that use the unpenalized coefficients are known as the postselection predictions, or the postlasso predictions.
For linear models, Belloni and Chernozhukov (2013) present conditions under which the postselection predictions perform at least as well as the lasso predictions. Heuristically, one expects the lasso predictions from a CV-based lasso to perform better than the postselection predictions because CV chooses \(\lambda\) to make the best lasso predictions. Analogously, one expects the postselection predictions from the plug-in-based lasso to perform better than the lasso predictions because the plug-in tends to select a set of covariates close to those that best approximate the process that generated the data.
In practice, we estimate the out-of-sample MSE of the predictions for all estimators using both the lasso predictions and the postselection predictions. We select the one that produces the lowest out-of-sample MSE of the predictions.
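If you want the predicted values themselves rather than just the fit statistics reported by lassogof below, predict after lasso can compute either kind of prediction. The sketch below assumes the stored cv results are made active first; yhat_lasso and yhat_post are hypothetical variable names.
. * make the CV-based lasso results active before predicting
. estimates restore cv
. predict double yhat_lasso if sample==2, penalized      // lasso (penalized) predictions
. predict double yhat_post  if sample==2, postselection  // postselection (postlasso) predictions
lassogof computes the MSE and R-squared of these two kinds of predictions for us, so we use it directly in what follows.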
In the output below, we use lassogof to compare the out-of-sample prediction performance of OLS and the lasso predictions from the three lasso methods.
. lassogof ols cv adaptive plugin if sample==2

Penalized coefficients
-------------------------------------------------
Name         |        MSE    R-squared        Obs
-------------+-----------------------------------
ols          |   35.53149       0.2997        150
cv           |   27.83779       0.4513        150
adaptive     |   27.83465       0.4514        150
plugin       |   32.29911       0.3634        150
-------------------------------------------------
For these data, the lasso predictions from the adaptive lasso performed a little better than the lasso predictions from the CV-based lasso.
In the output below, we compare the out-of-sample prediction performance of OLS and the predictions from the three lasso methods using the postselection coefficient estimates.
. lassogof ols cv adaptive plugin if sample==2, postselection

Postselection coefficients
-------------------------------------------------
Name         |        MSE    R-squared        Obs
-------------+-----------------------------------
ols          |   35.53149       0.2997        150
cv           |   27.87639       0.4506        150
adaptive     |   27.79562       0.4522        150
plugin       |   26.50811       0.4775        150
-------------------------------------------------
It is not surprising that the plug-in-based lasso produces the smallest out-of-sample MSE. The plug-in method tends to select covariates whose postselection estimates do a good job of approximating the data.
The real competition tends to be between the lasso predictions from the best of the penalized estimators and the postselection estimates from the plug-in-based lasso. In this case, the postselection estimates from the plug-in-based lasso produced the better out-of-sample predictions, and we would use these results to predict score.
The elastic net and ridge regression
The elastic net extends the lasso by using a more general penalty term. The elastic net was originally motivated as a method that would produce better predictions and model selection when the covariates are highly correlated. See Zou and Hastie (2005) for details.
The linear elastic net solves
$$
\widehat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}}
\left\{
\frac{1}{2n} \sum_{i=1}^n\left(y_i - {\bf x}_i\boldsymbol{\beta}'\right)^2
+\lambda\left[
\alpha\sum_{j=1}^p\vert\beta_j\vert
+ \frac{(1-\alpha)}{2}
\sum_{j=1}^p\beta_j^2
\right]
\right\}
$$
where \(\alpha\) is the elastic-net penalty parameter. Setting \(\alpha=0\) produces ridge regression. Setting \(\alpha=1\) produces the lasso.
The elasticnet command selects \(\alpha\) and \(\lambda\) by CV. The option alpha() specifies the candidate values for \(\alpha\).
. elasticnet linear score word1-word50 wpair1-wpair30 phrase1-phrase20
>     if sample==1, alpha(.25 .5 .75) nolog rseed(12345)

Elastic net linear model                    No. of obs        =        450
                                            No. of covariates =        100
Selection: Cross-validation                 No. of CV folds   =         10

-------------------------------------------------------------------------------
               |                                No. of      Out-of-     CV mean
               |                               nonzero       sample  prediction
alpha       ID |      Description      lambda     coef.   R-squared       error
---------------+---------------------------------------------------------------
0.750          |
             1 |     first lambda    13.08449         0      0.0062    53.79915
            39 |     lambda before   .4261227        24      0.3918    32.52101
          * 40 |   selected lambda   .3882671        25      0.3922    32.49847
            41 |      lambda after   .3537745        27      0.3917    32.52821
            44 |       last lambda   .2676175        34      0.3788    33.21631
---------------+---------------------------------------------------------------
0.500          |
            45 |     first lambda    13.08449         0      0.0062    53.79915
            84 |       last lambda   .3882671        34      0.3823    33.02645
---------------+---------------------------------------------------------------
0.250          |
            85 |     first lambda    13.08449         0      0.0058    53.77755
           120 |       last lambda   .5633091        54      0.3759      33.373
-------------------------------------------------------------------------------
* alpha and lambda selected by cross-validation.

. estimates store enet
We see that the elastic net selected 25 of the 100 covariates.
For comparison, we also use elasticnet to perform ridge regression, with the penalty parameter selected by CV.
. elasticnet linear score word1-word50 wpair1-wpair30 phrase1-phrase20
>     if sample==1, alpha(0) nolog rseed(12345)

Elastic net linear model                    No. of obs        =        450
                                            No. of covariates =        100
Selection: Cross-validation                 No. of CV folds   =         10

-------------------------------------------------------------------------------
               |                                No. of      Out-of-     CV mean
               |                               nonzero       sample  prediction
alpha       ID |      Description      lambda     coef.   R-squared       error
---------------+---------------------------------------------------------------
0.000          |
             1 |     first lambda    3271.123       100      0.0062    53.79914
            90 |     lambda before    .829349       100      0.3617    34.12734
          * 91 |   selected lambda   .7556719       100      0.3621     34.1095
            92 |      lambda after   .6885401       100      0.3620    34.11367
           100 |       last lambda   .3271123       100      0.3480    34.86129
-------------------------------------------------------------------------------
* alpha and lambda selected by cross-validation.

. estimates store ridge
Ridge regression does not perform model selection and thus includes all the covariates.
We now compare the out-of-sample predictive ability of the CV-based lasso, the adaptive lasso, the elastic net, ridge regression, and the plug-in-based lasso using the lasso predictions. (For the elastic net and ridge regression, the "lasso predictions" are made using the coefficient estimates produced by the penalized estimator.)
. lassogof cv adaptive enet ridge plugin if sample==2

Penalized coefficients
-------------------------------------------------
Name         |        MSE    R-squared        Obs
-------------+-----------------------------------
cv           |   27.83779       0.4513        150
adaptive     |   27.83465       0.4514        150
enet         |   27.77314       0.4526        150
ridge        |   29.47745       0.4190        150
plugin       |   32.29911       0.3634        150
-------------------------------------------------
In this case, the penalized elastic-net coefficient estimates predict best out of sample among the penalized estimators. The postselection predictions produced by the plug-in-based lasso perform best overall, which can be seen by comparing the output above with the output below.
. lassogof cv adaptive enet plugin if sample==2, postselection

Postselection coefficients
-------------------------------------------------
Name         |        MSE    R-squared        Obs
-------------+-----------------------------------
cv           |   27.87639       0.4506        150
adaptive     |   27.79562       0.4522        150
enet         |   27.87639       0.4506        150
plugin       |   26.50811       0.4775        150
-------------------------------------------------
So we would use the postselection coefficient estimates from the plug-in-based lasso to predict score.
Sensitivity analysis
Sensitivity analysis is sometimes performed to see whether a small change in the tuning parameters leads to a large change in prediction performance. When looking at the lassoknots output produced by the CV-based lasso, we noted that a small increase in the CV function produced by the penalized estimates could come with a significant reduction in the number of selected covariates. Restoring the cv estimates and repeating the lassoknots output, we see the following.
. estimates restore cv
(results cv are active now)
. lassoknots
-------------------------------------------------------------------------------------
       |                     No. of     CV mean |
       |                    nonzero       pred. |  Variables (A)dded, (R)emoved,
    ID |      lambda          coef.       error |   or left (U)nchanged
-------+---------------------------------------+-------------------------------------
     2 |    2.980526              2     52.2861 |  A  phrase3 phrase4
     3 |    2.715744              3    50.48463 |  A  phrase5
     4 |    2.474485              4    48.55981 |  A  word3
     6 |    2.054361              5    44.51782 |  A  phrase6
     9 |    1.554049              6    40.23385 |  A  wpair3
    10 |    1.415991              8    39.04494 |  A  wpair2 phrase2
    12 |    1.175581              9      36.983 |  A  word2
    14 |    .9759878             10    35.42697 |  A  word31
    16 |    .8102822             11     34.2115 |  A  word19
    17 |     .738299             12    33.75501 |  A  word4
    21 |    .5088809             14    32.74808 |  A  word14 phrase7
    22 |    .4636733             17    32.64679 |  A  word32 wpair19 wpair26
    23 |    .4224818             19    32.56572 |  A  wpair15 wpair25
    24 |    .3849497             22    32.53301 |  A  wpair24 phrase13 phrase14
  * 26 |     .319592             25    32.52679 |  A  word25 word30 phrase8
    27 |    .2912003             26    32.53946 |  A  wpair11
    29 |    .2417596             27    32.86193 |  A  wpair17
    30 |    .2202824             30    33.18254 |  A  word23 word38 wpair4
-------------------------------------------------------------------------------------
* lambda selected by cross-validation.
lasso selected the \(\lambda\) with ID=26 and 25 covariates. We now use lassoselect to specify that the \(\lambda\) with ID=21 be the selected \(\lambda\), and we store the results under the name hand.
. lassoselect id = 21
ID = 21  lambda = .5088809 selected

. estimates store hand
We now compute the out-of-sample MSE produced by the postselection estimates of the lasso whose \(\lambda\) has ID=21. The results are not wildly different, and we would stay with the predictions produced by the postselection estimates from the plug-in-based lasso.
. lassogof hand plugin if sample==2, postselection

Postselection coefficients
-------------------------------------------------
Name         |        MSE    R-squared        Obs
-------------+-----------------------------------
hand         |   27.71925       0.4537        150
plugin       |   26.50811       0.4775        150
-------------------------------------------------
Conclusion
This post has presented an introduction to the lasso and to the elastic net, and it has illustrated how to use them for prediction. There is much more information available in the Stata 16 LASSO manual. The next post will discuss using the lasso for inference about causal parameters.
References
Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen. 2012. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80: 2369–2429.
Belloni, A., and V. Chernozhukov. 2013. Least squares after model selection in high-dimensional sparse models. Bernoulli 19: 521–547.
Belloni, A., V. Chernozhukov, and Y. Wei. 2016. Post-selection inference for generalized linear models with many controls. Journal of Business & Economic Statistics 34: 606–619.
Bühlmann, P., and S. Van de Geer. 2011. Statistics for High-Dimensional Data: Methods, Theory and Applications. Berlin: Springer.
Chetverikov, D., Z. Liao, and V. Chernozhukov. 2019. On cross-validated Lasso. arXiv Working Paper No. arXiv:1605.02214. http://arxiv.org/abs/1605.02214.
Hastie, T., R. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer.
Hastie, T., R. Tibshirani, and M. Wainwright. 2015. Statistical Learning with Sparsity: The Lasso and Generalizations. Boca Raton, FL: CRC Press.
Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 58: 267–288.
Zou, H. 2006. The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101: 1418–1429.
Zou, H., and T. Hastie. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B 67: 301–320.
Appendix: K-fold cross-validation
Cross-validation finds the value of \(\lambda\) in a grid of candidate values \(\{\lambda_1, \lambda_2, \ldots, \lambda_Q\}\) that minimizes the MSE of the out-of-sample predictions. Cross-validation sets the \(\omega_j\) to 1 or to user-specified values.
After you specify the grid, the sample is partitioned into \(K\) nonoverlapping subsets. For each grid value \(\lambda_q\), the out-of-sample squared errors are computed using the following steps.
- For each \(k\in\{1,2,\ldots,K\}\),
- using the data not in partition \(k\), estimate the penalized coefficients \(\widehat{\boldsymbol{\beta}}\) with \(\lambda=\lambda_q\), and
- using the data in partition \(k\), compute the out-of-sample squared errors.
The mean of these out-of-sample squared errors estimates the out-of-sample MSE of the predictions. The cross-validation function traces these out-of-sample MSE values over the grid of candidate values for \(\lambda\). The \(\lambda_q\) that produces the smallest estimated out-of-sample MSE minimizes the cross-validation function and is selected.
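In symbols, writing \(\widehat{\boldsymbol{\beta}}^{(-k)}(\lambda_q)\) for the penalized estimates computed without partition \(k\), the steps above amount to evaluating (this display is just a restatement of the algorithm in this appendix, not a formula quoted from the manual)
$$
{\rm CV}(\lambda_q) = \frac{1}{n}\sum_{k=1}^{K}\;\sum_{i\,\in\,{\rm partition}\ k}
\left(y_i - {\bf x}_i\widehat{\boldsymbol{\beta}}^{(-k)}(\lambda_q)'\right)^2
$$
and selecting the \(\lambda_q\) that minimizes \({\rm CV}(\lambda_q)\).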
