\(\newcommand{\xb}{{\bf x}}
\newcommand{\gb}{{\bf g}}
\newcommand{\Hb}{{\bf H}}
\newcommand{\Gb}{{\bf G}}
\newcommand{\Eb}{{\bf E}}
\newcommand{\betab}{\boldsymbol{\beta}}\)I want to write ado-commands to estimate the parameters of an exponential conditional mean (ECM) model and of a probit conditional mean (PCM) model by nonlinear least squares (NLS). Before I can write these commands, I need to show how to trick optimize() into performing the Gauss–Newton algorithm and apply this trick to these two problems.
This is the twenty-sixth post in the series Programming an estimation command in Stata. I recommend that you start at the beginning. See Programming an estimation command in Stata: A map to posted entries for a map to all the posts in this series.
Gauss–Newton algorithm
Gauss–Newton algorithms frequently perform better than other Newton-type algorithms for solving NLS minimization problems because they use an expected Hessian instead of a full Hessian.
Recall that Newton-type algorithms get the next guess of the parameter estimates from an update rule of the form
$$
\betab_{s+1} = \betab_s - \lambda\Hb_s^{-1}\gb_s
$$
as I discussed in Programming an estimation command in Stata: A review of nonlinear optimization using Mata. The objective function in NLS problems is
$$
\min_{\betab} \frac{1}{2} \sum_{i=1}^N \left[y_i-f(\xb_i,\betab)\right]^2
$$
The Gauss–Newton algorithm uses
$$
\betab_{s+1} = \betab_s - \lambda\Gb_s^{-1}\gb_s
$$
where
$$
\Gb_s = -
\sum_{i=1}^N \frac{\partial f(\xb_i,\betab)}{\partial \betab\,'}
\frac{\partial f(\xb_i,\betab)}{\partial \betab}
\hspace{1cm}\mbox{with } \betab \mbox{ evaluated at } \betab_s
$$
Using \(\Gb_s\) instead of \(\Hb_s\) causes Gauss–Newton to perform better than other Newton-type algorithms because the first term in
\begin{align*}
\Hb_s &=
\sum_{i=1}^N \left[y_i-f(\xb_i,\betab)\right]
\frac{\partial^2 f(\xb_i,\betab)}{\partial\betab\,'\partial \betab}
-
\sum_{i=1}^N \frac{\partial f(\xb_i,\betab)}{\partial \betab\,'}
\frac{\partial f(\xb_i,\betab)}{\partial \betab}
\end{align*}
with \(\betab\) evaluated at \(\betab_s\) has mean \({\bf 0}\), because the residuals \([y_i-f(\xb_i,\betab)]\) have conditional mean zero when the conditional mean function is correctly specified. Much of the literature on this algorithm exploits the fact that \(\Gb_s^{-1}\gb_s\) can be obtained from an OLS regression of the residuals \([y_i-f(\xb_i,\betab)]\) on the columns of \(\frac{\partial f(\xb_i,\betab)}{\partial \betab\,'}\) (with \(\betab\) evaluated at \(\betab_s\)) because
$$
\gb_s = \sum_{i=1}^N \left[y_i-f(\xb_i,\betab)\right]
\frac{\partial f(\xb_i,\betab)}{\partial \betab\,'}
\hspace{1cm}\mbox{with } \betab \mbox{ evaluated at } \betab_s
$$
See Cameron and Trivedi (2005, section 10.3.6) and Wooldridge (2010, section 12.7.3) for examples of this approach. While this approach is used in nl, I will trick optimize() into doing Gauss–Newton by specifying \(\Gb_s\) instead of \(\Hb_s\) as my Hessian.
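To make the update rule concrete, here is a minimal Mata sketch of one Gauss–Newton step with \(\lambda=1\). It is only an illustration under assumed names, not code used below: b holds the current \(\betab_s\) as a row vector, r holds the column vector of residuals, and df holds the matrix whose \(i\)th row is \(\frac{\partial f(\xb_i,\betab)}{\partial \betab\,'}\), all evaluated at \(\betab_s\).
mata:
// One Gauss-Newton step with lambda = 1:
//   b_{s+1} = b_s - Gb_s^{-1} gb_s = b_s + [sum_i df_i'df_i]^{-1} [sum_i df_i'r_i]
// The correction term equals the OLS coefficients from regressing r on df
real rowvector GNStep(real rowvector b, real colvector r, real matrix df)
{
    return(b + (invsym(quadcross(df, df))*quadcross(df, r))')
}
end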
ECM by NLS
In code block 1, I use optimize() to fit the accident data to an ECM model,
$$
\Eb[{\tt accidents}|{\tt cvalue},{\tt tickets}] =
\exp(\beta_1 {\tt cvalue} + \beta_2 {\tt tickets} + \beta_0)
$$
mata:
void MYNLExp(real scalar todo, real vector b,   ///
             real vector y, real matrix X,      ///
             val, grad, hess)
{
    real vector r, f, xb
    real matrix df

    xb  = X*b'
    f   = exp(X*b')
    r   = y - f
    val = -(r:^2)
    df  = f:*X

    if (todo>=1) {
        grad = r:*df
    }
    if (todo==2) {
        hess = -1*quadcross(df, df)
    }
}
y = st_data(., "accidents")
X = st_data(., "cvalue tickets")
n = rows(y)
X = X, J(n, 1, 1)
p = cols(X)
S = optimize_init()
optimize_init_argument(S, 1, y)
optimize_init_argument(S, 2, X)
optimize_init_evaluator(S, &MYNLExp())
optimize_init_conv_nrtol(S, 1e-11)
optimize_init_params(S, J(1, 3, .01))
optimize_init_evaluatortype(S, "gf2")
bh = optimize(S)
M  = invsym(-1*optimize_result_Hessian(S))
sb = (-1/(n-p))*optimize_result_value(S)
V  = sb*M
"Point estimates"
bh
"VCE"
V
end
Lines 2–21 contain the evaluator function for the NLS problem. This code should be familiar from the Poisson regression command that I previously discussed. Note that line 19 defines \(\Gb_s\) to be the Hessian.
Lines 22–26 copy the data from Stata to Mata. Lines 28–34 use optimize() to solve the NLS problem. Lines 35–37 compute the estimator of the VCE based on correct specification and independently and identically distributed errors; see Wooldridge (2010, 417). Lines 38–41 display the results.
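In formulas, the quantities computed on lines 35–37 are (restating the code in the notation above)
$$
\widehat{\sigma}^2 = \frac{1}{N-p}\sum_{i=1}^N \left[y_i-f(\xb_i,\widehat{\betab})\right]^2
\hspace{1cm}\mbox{and}\hspace{1cm}
\widehat{\bf V} = \widehat{\sigma}^2
\left[\sum_{i=1}^N
\frac{\partial f(\xb_i,\widehat{\betab})}{\partial \betab\,'}
\frac{\partial f(\xb_i,\widehat{\betab})}{\partial \betab}
\right]^{-1}
$$
Because a gf2 evaluator returns observation-level values that optimize() sums, optimize_result_value(S) is \(-\sum_{i=1}^N r_i^2\), so sb recovers \(\widehat{\sigma}^2\); and because the evaluator stores \(\Gb_s\) as the Hessian, M recovers the bracketed inverse.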
Example 1 runs this code.
Example 1: Gauss–Newton in optimize() for ECM model
. use accident3
. do nls1
. mata:
------------------------------------------------- mata (type end to exit) ---------
: void MYNLExp(real scalar todo, real vector b, ///
> real vector y, real matrix X, ///
> val, grad, hess)
> {
> real vector r, f, xb
> real matrix df
>
> xb = X*b'
> f = exp(X*b')
> r = y - f
> val = -(r:^2)
> df = f:*X
>
> if (todo>=1) {
> grad = r:*df
> }
> if (todo==2) {
> hess = -1*quadcross(df, df)
> }
> }
note: variable xb set but not used
: y = st_data(., "accidents")
: X = st_data(., "cvalue tickets")
: n = rows(y)
: X = X, J(n, 1, 1)
: p = cols(X)
: S = optimize_init()
: optimize_init_argument(S, 1, y)
: optimize_init_argument(S, 2, X)
: optimize_init_evaluator(S, &MYNLExp())
: optimize_init_conv_nrtol(S, 1e-11)
: optimize_init_params(S, J(1, 3, .01))
: optimize_init_evaluatortype(S, "gf2")
: bh = optimize(S)
Iteration 0: f(p) = -2530.846
Iteration 1: f(p) = -1116.4901
Iteration 2: f(p) = -248.56923
Iteration 3: f(p) = -225.91644
Iteration 4: f(p) = -225.89573
Iteration 5: f(p) = -225.89573
Iteration 6: f(p) = -225.89573
: M = invsym(-1*optimize_result_Hessian(S))
: sb = (-1/(n-p))*optimize_result_value(S)
: V = sb*M
: "Level estimates"
Level estimates
: bh
                 1              2              3
    +----------------------------------------------+
  1 |   .1759434081    1.447671532    -7.66060808  |
    +----------------------------------------------+
: "VCE"
VCE
: V
[symmetric]
                 1              2              3
    +----------------------------------------------+
  1 |   .0010491815                                 |
  2 |  -.0000111792     .001112881                  |
  3 |  -.0019055019   -.0075744389    .0554943947   |
    +----------------------------------------------+
: end
--------------------------------------------------------------------------------
.
end of do-file
For comparison, I use nl to replicate these results.
Example 2: ECM model by nl
. nl (accidents = exp({b1}*cvalue + {b2}*tickets + {b0})) , eps(1e-11)
(obs = 505)
Iteration 0: residual SS = 1123.962
Iteration 1: residual SS = 380.2557
Iteration 2: residual SS = 232.1493
Iteration 3: residual SS = 225.9077
Iteration 4: residual SS = 225.8957
Iteration 5: residual SS = 225.8957
Iteration 6: residual SS = 225.8957
Iteration 7: residual SS = 225.8957
Iteration 8: residual SS = 225.8957
Iteration 9: residual SS = 225.8957
Iteration 10: residual SS = 225.8957
Iteration 11: residual SS = 225.8957
      Source |       SS           df       MS
-------------+----------------------------------   Number of obs  =       505
       Model |  2266.1043          3   755.368089   R-squared      =    0.9094
    Residual |  225.89573        502   .449991501   Adj R-squared  =    0.9088
-------------+----------------------------------   Root MSE       =  .6708141
       Total |       2492        505   4.93465347   Res. dev.      =  1026.863
------------------------------------------------------------------------------
   accidents |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         /b1 |   .1759434   .0323911     5.43   0.000     .1123047    .2395822
         /b2 |   1.447671   .0333597    43.40   0.000      1.38213    1.513213
         /b0 |  -7.660608   .2355717   -32.52   0.000    -8.123436    -7.19778
------------------------------------------------------------------------------
. matrix list e(V)
symmetric e(V)[3,3]
                 b1:        b2:        b0:
               _cons      _cons      _cons
b1:_cons   .00104918
b2:_cons  -.00001118  .00111287
b0:_cons   -.0019055 -.00757439  .05549405
As expected, the point estimates and the estimates of the VCE are essentially the same.
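To quantify "essentially the same," one could compare the two VCE estimates with Mata's mreldif(), which returns the maximum relative difference between the elements of two matrices. This quick check assumes that V is still in Mata's memory from running nls1.do:
. mata: mreldif(st_matrix("e(V)"), V)
A result near zero, reflecting only convergence-tolerance and rounding differences, confirms the agreement.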
I now implement the PCM model
$$
\Eb[{\tt hadaccident}|{\tt cvalue},{\tt tickets}] =
\Phi(\beta_1 {\tt cvalue} + \beta_2 {\tt tickets} + \beta_0)
$$
in code block 2, where \(\Phi()\) is the standard-normal cumulative distribution function.
mata:
void MYNLProbit(real scalar todo, real vector b,   ///
                real vector y, real matrix X,      ///
                val, grad, hess)
{
    real vector r, f, xb
    real matrix df

    xb  = X*b'
    f   = normal(xb)
    r   = y - f
    val = -(r:^2)
    df  = normalden(xb):*X

    if (todo>=1) {
        grad = r:*df
    }
    if (todo==2) {
        hess = -1*quadcross(df, df)
    }
}
y = st_data(., "hadaccident")
X = st_data(., "cvalue tickets")
n = rows(y)
X = X, J(n, 1, 1)
p = cols(X)
S = optimize_init()
optimize_init_argument(S, 1, y)
optimize_init_argument(S, 2, X)
optimize_init_evaluator(S, &MYNLProbit())
optimize_init_conv_nrtol(S, 1e-11)
optimize_init_params(S, J(1, 3, .01))
optimize_init_evaluatortype(S, "gf2")
bh = optimize(S)
M  = invsym(-1*optimize_result_Hessian(S))
sb = (-1/(n-p))*optimize_result_value(S)
V  = sb*M
"Point estimates"
bh
"VCE"
V
end
nls2.do is almost identical to nls1.do. The differences are that lines 2–21 define MYNLProbit() instead of MYNLExp(), that lines 10 and 13 define the function and the first derivative for the PCM model instead of for the ECM model, that line 22 specifies the binary dependent variable hadaccident instead of the accident count accidents, and that line 31 specifies that optimize() use MYNLProbit() instead of MYNLExp() as the evaluator function.
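For reference, the first derivative used to construct df in MYNLProbit() follows from the chain rule,
$$
\frac{\partial \Phi(\xb_i\betab\,')}{\partial \betab}
= \phi(\xb_i\betab\,')\xb_i
$$
where \(\phi()\) is the standard-normal density function, which is why line 13 multiplies each row of X by normalden(xb).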
Example 3 runs this code.
Example 3: Gauss–Newton in optimize() for PCM model
. generate hadaccident = accidents>0
. do nls2
. mata:
------------------------------------------------- mata (type end to exit) ---------
: void MYNLProbit(real scalar todo, real vector b, ///
> real vector y, real matrix X, ///
> val, grad, hess)
> {
> real vector r, f, xb
> real matrix df
>
> xb = X*b'
> f = normal(xb)
> r = y - f
> val = -(r:^2)
> df = normalden(xb):*X
>
> if (todo>=1) {
> grad = r:*df
> }
> if (todo==2) {
> hess = -1*quadcross(df, df)
> }
> }
: y = st_data(., "hadaccident")
: X = st_data(., "cvalue tickets")
: n = rows(y)
: X = X, J(n, 1, 1)
: p = cols(X)
: S = optimize_init()
: optimize_init_argument(S, 1, y)
: optimize_init_argument(S, 2, X)
: optimize_init_evaluator(S, &MYNLProbit())
: optimize_init_conv_nrtol(S, 1e-11)
: optimize_init_params(S, J(1, 3, .01))
: optimize_init_evaluatortype(S, "gf2")
: bh = optimize(S)
Iteration 0: f(p) = -132.90997
Iteration 1: f(p) = -16.917203
Iteration 2: f(p) = -10.995001
Iteration 3: f(p) = -10.437501
Iteration 4: f(p) = -10.427738
Iteration 5: f(p) = -10.427156
Iteration 6: f(p) = -10.427123
Iteration 7: f(p) = -10.427121
Iteration 8: f(p) = -10.42712
Iteration 9: f(p) = -10.42712
Iteration 10: f(p) = -10.42712
Iteration 11: f(p) = -10.42712
: M = invsym(-1*optimize_result_Hessian(S))
: sb = (-1/(n-p))*optimize_result_value(S)
: V = sb*M
: "Level estimates"
Level estimates
: bh
                 1              2              3
    +----------------------------------------------+
  1 |   .3616312823    2.177513743   -10.95168163  |
    +----------------------------------------------+
: "VCE"
VCE
: V
[symmetric]
                 1              2              3
    +----------------------------------------------+
  1 |   .0084311958                                 |
  2 |   .0102702556    .0389739985                  |
  3 |  -.0648856339   -.2061800064    1.114408707   |
    +----------------------------------------------+
: end
--------------------------------------------------------------------------------
.
end of do-file
For comparison, I use nl to replicate the results.
Example 4: PCM model by nl
. nl (hadaccident = normal({b1}*cvalue + {b2}*tickets + {b0})) , eps(1e-11)
(obs = 505)
Iteration 0: residual SS = 29.15659
Iteration 1: residual SS = 18.64138
Iteration 2: residual SS = 12.69845
Iteration 3: residual SS = 10.74838
Iteration 4: residual SS = 10.44729
Iteration 5: residual SS = 10.42855
Iteration 6: residual SS = 10.42723
Iteration 7: residual SS = 10.42713
Iteration 8: residual SS = 10.42712
Iteration 9: residual SS = 10.42712
Iteration 10: residual SS = 10.42712
Iteration 11: residual SS = 10.42712
Iteration 12: residual SS = 10.42712
Iteration 13: residual SS = 10.42712
Iteration 14: residual SS = 10.42712
Iteration 15: residual SS = 10.42712
Iteration 16: residual SS = 10.42712
Iteration 17: residual SS = 10.42712
Iteration 18: residual SS = 10.42712
Iteration 19: residual SS = 10.42712
Iteration 20: residual SS = 10.42712
Iteration 21: residual SS = 10.42712
Iteration 22: residual SS = 10.42712
      Source |       SS           df       MS
-------------+----------------------------------   Number of obs  =       505
       Model |   49.57288          3   16.5242932   R-squared      =    0.8262
    Residual |   10.42712        502   .020771156   Adj R-squared  =    0.8252
-------------+----------------------------------   Root MSE       =   .144122
       Total |         60        505   .118811881   Res. dev.      =  -526.347
------------------------------------------------------------------------------
 hadaccident |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         /b1 |    .361632   .0918214     3.94   0.000     .1812304    .5420337
         /b2 |   2.177515   .1974173    11.03   0.000     1.789649    2.565381
         /b0 |  -10.95169    1.05565   -10.37   0.000    -13.02572   -8.877652
------------------------------------------------------------------------------
. matrix list e(V)
symmetric e(V)[3,3]
                 b1:        b2:        b0:
               _cons      _cons      _cons
b1:_cons   .00843118
b2:_cons   .01027015  .03897358
b0:_cons  -.06488509 -.20617778  1.1143968
As expected, the point estimates and the estimates of the VCE are essentially the same.
Done and undone
I showed how to trick optimize() into performing the Gauss–Newton algorithm and how to compute an estimator of the VCE based on correct specification and independently and identically distributed errors. In the next post, I discuss ado-commands that implement these estimators.
References
Cameron, A. C., and P. K. Trivedi. 2005. Microeconometrics: Methods and Applications. Cambridge: Cambridge University Press.
Wooldridge, J. M. 2010. Econometric Analysis of Cross Section and Panel Data. 2nd ed. Cambridge, Massachusetts: MIT Press.
