Sunday, February 15, 2026

Programming an estimation command in Stata: A review of nonlinear optimization using Mata


\(\newcommand{\betab}{\boldsymbol{\beta}}
\newcommand{\xb}{{\bf x}}
\newcommand{\yb}{{\bf y}}
\newcommand{\gb}{{\bf g}}
\newcommand{\Hb}{{\bf H}}
\newcommand{\thetab}{\boldsymbol{\theta}}
\newcommand{\Xb}{{\bf X}}
\)I review the theory behind nonlinear optimization and get more practice in Mata programming by implementing an optimizer in Mata. In real problems, I recommend using the optimize() function or the moptimize() function instead of the one I describe here. In subsequent posts, I will discuss optimize() and moptimize(). This post will help you develop your Mata programming skills and will improve your understanding of how optimize() and moptimize() work.

This is the seventeenth post in the series Programming an estimation command in Stata. I recommend that you start at the beginning. See Programming an estimation command in Stata: A map to posted entries for a map to all the posts in this series.

A quick review of nonlinear optimization

We want to maximize a real-valued function \(Q(\thetab)\), where \(\thetab\) is a \(p\times 1\) vector of parameters. Minimization is done by maximizing \(-Q(\thetab)\). We require that \(Q(\thetab)\) be twice continuously differentiable, so that we can use a second-order Taylor series to approximate \(Q(\thetab)\) in a neighborhood of the point \(\thetab_s\),

\[
Q(\thetab) \approx Q(\thetab_s) + \gb_s'(\thetab -\thetab_s)
+ \frac{1}{2} (\thetab -\thetab_s)'\Hb_s (\thetab -\thetab_s)
\tag{1}
\]

where \(\gb_s\) is the \(p\times 1\) vector of first derivatives of \(Q(\thetab)\) evaluated at \(\thetab_s\) and \(\Hb_s\) is the \(p\times p\) matrix of second derivatives of \(Q(\thetab)\) evaluated at \(\thetab_s\), known as the Hessian matrix.

Nonlinear maximization algorithms begin with a vector of initial values and produce a sequence of updated values that converge to the parameter vector that maximizes the objective function. The algorithms I discuss here can only find local maxima. The function in figure 1 has a local maximum at .2 and another at 1.5. The global maximum is at .2.

Figure 1: Local maxima

Each update is produced by finding the \(\thetab\) that maximizes the approximation on the right-hand side of equation (1) and letting it be \(\thetab_{s+1}\). To find the \(\thetab\) that maximizes the approximation, we set to \({\bf 0}\) the derivative of the right-hand side of equation (1) with respect to \(\thetab\),

\[
\gb_s + \Hb_s (\thetab -\thetab_s) = {\bf 0}
\tag{2}
\]

Replacing \(\thetab\) with \(\thetab_{s+1}\) and solving yields the update rule for \(\thetab_{s+1}\).

\[
\thetab_{s+1} = \thetab_s - \Hb_s^{-1} \gb_s
\tag{3}
\]

Note that the update is uniquely defined only if the Hessian matrix \(\Hb_s\) is full rank. To ensure that we have a local maximum, we require that the Hessian be negative definite at the optimum, which also implies that the symmetric Hessian is full rank.
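
To see the update rule in equation (3) in action, here is a small sketch of one NR step for the one-parameter function \(Q(\theta)=-(\theta-2)^2\); the toy function q() below is only an illustration and is not part of the algorithm developed in this post.

mata:
// Toy example: Q(theta) = -(theta - 2)^2
// first derivative: -2*(theta - 2); second derivative: -2
real scalar q(real scalar theta)
{
    return(-(theta - 2)^2)
}

theta_s  = 0                          // current value
g_s      = -2*(theta_s - 2)           // gradient at theta_s
H_s      = -2                         // Hessian at theta_s (a scalar here)
theta_s1 = theta_s - g_s/H_s          // equation (3)
theta_s1                              // 2, the maximizer
q(theta_s1) > q(theta_s)              // 1, so the step increased Q()
end

Because this toy \(Q(\theta)\) is exactly quadratic, the approximation in equation (1) is exact and a single NR step lands on the maximizer.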

The update rule in equation (3) does not guarantee that \(Q(\thetab_{s+1})>Q(\thetab_s)\). We want to accept only those \(\thetab_{s+1}\) that do produce such an increase, so in practice, we use

\[
\thetab_{s+1} = \thetab_s - \lambda \Hb_s^{-1} \gb_s
\tag{4}
\]

where \(\lambda\) is the step size. In the algorithm presented here, we start with \(\lambda\) equal to \(1\) and, if necessary, decrease \(\lambda\) until we find a value that yields an increase.

The previous sentence is vague. I clarify it by writing an algorithm in Mata. Suppose that real scalar Q(real vector theta) is a Mata function that returns the value of the objective function at a value of the parameter vector theta. For the moment, suppose that g_s is the vector of derivatives at the current theta, denoted by theta_s, and that Hi_s is the inverse of the Hessian matrix at theta_s. These definitions allow us to define the update function

Code block 1: Candidate rule for parameter vector


real vector tupdate(                 ///
	real scalar lambda,          ///
	real vector theta_s,         ///
	real vector g_s,             ///
	real matrix Hi_s)
{
	return (theta_s - lambda*Hi_s*g_s)
}

For specified values of lambda, theta_s, g_s, and Hi_s, tupdate() returns a candidate value for theta_s1. But we accept only candidate values of theta_s1 that yield an increase, so instead of using tupdate() to get an update, we could use GetUpdate().

Code block 2: Update function for parameter vector


real vector GetUpdate(            ///
    real vector theta_s,          ///
    real vector g_s,              ///
    real matrix Hi_s)
{
    lambda = 1
    theta_s1 = tupdate(lambda, theta_s, g_s, Hi_s)
    while ( Q(theta_s1) < Q(theta_s) ) {
        lambda   = lambda/2
        theta_s1 = tupdate(lambda, theta_s, g_s, Hi_s)
    }
    return(theta_s1)
}

GetUpdate() begins by getting a candidate value for theta_s1 when lambda = 1. GetUpdate() returns this candidate theta_s1 if it produces an increase in Q(). Otherwise, GetUpdate() divides lambda by 2 and gets another candidate theta_s1 until it finds a candidate that produces an increase in Q(). GetUpdate() returns the first candidate that produces an increase in Q().

While these functions clarify the ambiguities in the original vague statement, GetUpdate() makes the unwise assumption that there is always a lambda for which the candidate theta_s1 produces an increase in Q(). The version of GetUpdate() in code block 3 does not make this assumption; it exits with an error if lambda is too small, less than \(10^{-11}\).

Code block 3: A better update function for parameter vector


real vector GetUpdate(            ///
    real vector theta_s,          ///
    real vector g_s,              ///
    real matrix Hi_s)
{
    lambda = 1
    theta_s1 = tupdate(lambda, theta_s, g_s, Hi_s)
    while ( Q(theta_s1) < Q(theta_s) ) {
        lambda   = lambda/2
        if (lambda < 1e-11) {
            printf("{red}Cannot find parameters that produce an increase.\n")
            exit(error(3360))
        }
        theta_s1 = tupdate(lambda, theta_s, g_s, Hi_s)
    }
    return(theta_s1)
}
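
As a quick usage sketch, suppose that tupdate() from code block 1 and the GetUpdate() from code block 3 have been entered in the same Mata session, together with the made-up objective function Q() below, whose maximum is at \((1, -1)'\); its gradient and inverse Hessian at the starting point are typed in by hand here, rather than computed.

mata:
// made-up objective with maximum at (1 \ -1)
real scalar Q(real vector theta)
{
    return(-(theta[1] - 1)^2 - (theta[2] + 1)^2)
}

theta_s = (0 \ 0)                  // current parameter values
g_s     = (2 \ -2)                 // gradient of Q() at theta_s
Hi_s    = (-.5, 0 \ 0, -.5)        // inverse of the Hessian of Q()
GetUpdate(theta_s, g_s, Hi_s)      // returns (1 \ -1); lambda = 1 already increases Q()
end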

An outline of our algorithm for nonlinear optimization is the following:

  1. Select initial values for the parameter vector.
  2. If the current parameters set the vector of derivatives of Q() to zero, go to (3); otherwise go to (A).
    A. Use GetUpdate() to get new parameter values.
    B. Calculate g_s and Hi_s at the parameter values from (A).
    C. Go to (2).
  3. Display results.

Code block 4 contains a Mata version of this algorithm.

Code block 4: Pseudocode for Newton–Raphson algorithm


theta_s  =  J(p, 1, .01)
GetDerives(theta_s, g_s, Hi_s)
gz = g_s'*Hi_s*g_s
while (abs(gz) > 1e-13) {
	theta_s = GetUpdate(theta_s, g_s, Hi_s)
	GetDerives(theta_s, g_s, Hi_s)
	gz      = g_s'*Hi_s*g_s
	printf("gz is now %8.7g\n", gz)
}
printf("Converged value of theta is\n")
theta_s

Line 1 puts the vector of starting values, a \(p\times 1\) vector with each element equal to .01, into theta_s. Line 2 uses GetDerives() to put the vector of first derivatives into g_s and the inverse of the Hessian matrix into Hi_s. In GetDerives(), I use cholinv() to calculate Hi_s. cholinv() returns missing values if the matrix is not positive definite. By calculating Hi_s = -1*cholinv(-H_s), I ensure that Hi_s contains missing values when the Hessian is not negative definite and full rank.

Line 3 calculates how different the vector of first derivatives is from 0. Instead of using a sum of squares, available from g_s'g_s, I weight the first derivatives by the inverse of the Hessian matrix, which puts the \(p\) first derivatives on a similar scale and ensures that the Hessian matrix is negative definite at convergence. (If the Hessian matrix is not negative definite, GetDerives() will put a matrix of missing values into Hi_s, which causes gz=., which will exceed the tolerance.)
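
The role that cholinv() plays as a check can be seen in a short interactive sketch; the matrices below are made up for illustration and are not part of the algorithm.

mata:
// cholinv() requires a positive-definite argument;
// otherwise it returns a matrix of missing values
A = (2, 1 \ 1, 2)        // positive definite
B = (1, 3 \ 3, 1)        // indefinite
cholinv(A)               // the ordinary inverse of A
Bi = cholinv(B)
missing(Bi)              // 4: every element of Bi is missing
end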

To flesh out the details, we need a specific problem. Consider maximizing the log-likelihood function of a Poisson model, which has a simple functional form. The contribution of each observation to the log-likelihood is

\[
f_i(\betab) = y_i{\bf x}_i\betab - \exp({\bf x}_i\betab) - \ln( y_i !)
\]

where \(y_i\) is the dependent variable, \({\bf x}_i\) is the vector of covariates, and \(\betab\) is the vector of parameters that we pick to maximize the log-likelihood function given by \(F(\betab) =\sum_i f_i(\betab)\). I could drop \(\ln(y_i!)\), because it does not depend on the parameters. I include it to make the value of the log-likelihood function the same as that reported by Stata. Stata includes these terms so that log-likelihood-function values are comparable across models.

The pll() function in code block 5 computes the Poisson log-likelihood function from the vector of observations on the dependent variable y, the matrix of observations on the covariates X, and the vector of parameter values b.

Code block 5: A function for the Poisson log-likelihood function


// Compute Poisson log-likelihood 
mata:
real scalar pll(real vector y, real matrix X, real vector b)
{
    real vector  xb

    xb = X*b
    return(sum(-exp(xb) + y:*xb - lnfactorial(y)))
}
end

The vector of first derivatives is

\[
\frac{\partial~F(\betab)}{\partial \betab}
= \sum_{i=1}^N (y_i - \exp(\xb_i\betab))\xb_i
\]

which I can compute in Mata as quadcolsum((y-exp(X*b)):*X), and the Hessian matrix is

\[
\label{eq:H}
\sum_{i=1}^N\frac{\partial^2~f(\xb_i,\betab)}{\partial\betab
\partial\betab^\prime}
= - \sum_{i=1}^N\exp(\xb_i\betab)\xb_i^\prime \xb_i
\tag{5}
\]

which I can compute in Mata as -quadcross(X, exp(X*b), X).
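
Before wiring these expressions into a full program, it is worth checking the analytic gradient against a central finite-difference approximation of pll(). Here is a minimal sketch of such a check, assuming that pll() from code block 5 has been entered and that y and X already hold the data (for example, loaded as in code block 6 below); pgrad() is just a helper name introduced for this check.

mata:
// analytic gradient of the Poisson log-likelihood function
real vector pgrad(real vector y, real matrix X, real vector b)
{
    return((quadcolsum((y - exp(X*b)):*X))')
}

b  = J(cols(X), 1, .01)
h  = 1e-6
ga = pgrad(y, X, b)                  // analytic gradient
gn = J(rows(b), 1, .)                // numerical gradient
for (j = 1; j <= rows(b); j++) {
    e     = J(rows(b), 1, 0)
    e[j]  = h
    gn[j] = (pll(y, X, b + e) - pll(y, X, b - e))/(2*h)
}
mreldif(ga, gn)                      // should be very small
end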

Here is some code that implements this Newton–Raphson (NR) algorithm for the Poisson regression problem.

Code block 6: pnr1.do
(Uses accident3.dta)


// Newton-Raphson for Poisson log-likelihood
clear all
use accident3

mata:
real scalar pll(real vector y, real matrix X, real vector b)
{
    real vector  xb
    xb = X*b
    return(sum(-exp(xb) + y:*xb - lnfactorial(y)))
}

void GetDerives(real vector y, real matrix X, real vector theta, g, Hi)
{
	real vector exb

	exb = exp(X*theta)
	g   =  (quadcolsum((y - exb):*X))'
	Hi  = quadcross(X, exb, X)
	Hi  = -1*cholinv(Hi)
}

real vector tupdate(                 ///
	real scalar lambda,          ///
	real vector theta_s,         ///
	real vector g_s,             ///
	real matrix Hi_s)
{
	return (theta_s - lambda*Hi_s*g_s)
}

real vector GetUpdate(            ///
    real vector y,                ///
    real matrix X,                ///
    real vector theta_s,          ///
    real vector g_s,              ///
    real matrix Hi_s)
{
    real scalar lambda
    real vector theta_s1

    lambda = 1
    theta_s1 = tupdate(lambda, theta_s, g_s, Hi_s)
    while ( pll(y, X, theta_s1) <= pll(y, X, theta_s) ) {
        lambda   = lambda/2
        if (lambda < 1e-11) {
            printf("{crimson}Can not discover parameters that produce a rise.n")
            exit(error(3360))
        }
        theta_s1 = tupdate(lambda, theta_s, g_s, Hi_s)
    }
    return(theta_s1)
}


y = st_data(., "accidents")
X = st_data(., "cvalue youngsters visitors")
X = X,J(rows(X), 1, 1)

b  =  J(cols(X), 1, .01)
GetDerives(y, X, b, g=., Hi=.)
gz = .
while (abs(gz) > 1e-11) {
	bs1 = GetUpdate(y, X, b, g, Hi)
	b   = bs1
	GetDerives(y, X, b, g, Hi)
	gz = g'*Hi*g
	printf("gz is now %8.7g\n", gz)
}
printf("Converged worth of beta isn")
b

end

Line 3 reads in the downloadable accident3.dta dataset before dropping down to Mata. I use variables from this dataset on lines 56 and 57.

Lines 6–11 define pll(), which returns the value of the Poisson log-likelihood function, given the vector of observations on the dependent variable y, the matrix of covariate observations X, and the current parameters b.

Lines 13–21 put the vector of first derivatives in g and the inverse of the Hessian matrix in Hi. Equation 5 specifies a matrix that is negative definite, as long as the covariates are not linearly dependent. As discussed above, cholinv() returns a matrix of missing values if the matrix is not positive definite. I multiply the right-hand side on line 20 by \(-1\) instead of on line 19.

Lines 23–30 implement the tupdate() function previously discussed.

Lines 32–53 implement the GetUpdate() function previously discussed, with the caveats that this version handles the data and uses pll() to compute the value of the objective function.

Lines 56–58 get the data from Stata and column-join a vector of ones to X for the constant term.

Lines 60–71 implement the NR algorithm discussed above for this Poisson regression problem.

Running pnr1.do produces

Example 1: NR algorithm for Poisson


. do pnr1

. // Newton-Raphson for Poisson log-likelihood
. clear all

. use accident3

. 
. mata:

[Output Omitted]

: b  =  J(cols(X), 1, .01)

: GetDerives(y, X, b, g=., Hi=.)

: gz = .

: while (abs(gz) > 1e-11) {
>         bs1 = GetUpdate(y, X, b, g, Hi)
>         b   = bs1
>         GetDerives(y, X, b, g, Hi)
>         gz = g'*Hi*g
>         printf("gz is now %8.7g\n", gz)
> }
gz is now -119.201
gz is now -26.6231
gz is now -2.02142
gz is now -.016214
gz is now -1.3e-06
gz is now -8.3e-15

: printf("Converged worth of beta isn")
Converged worth of beta is

: b
                  1
    +----------------+
  1 |  -.6558870685  |
  2 |  -1.009016966  |
  3 |   .1467114652  |
  4 |   .5743541223  |
    +----------------+

: 
: end
--------------------------------------------------------------------------------
. 
end of do-file

The point estimates in example 1 are equal to those produced by poisson.

Example 2: poisson results

. poisson accidents cvalue kids traffic

Iteration 0:   log likelihood = -555.86605  
Iteration 1:   log likelihood =  -555.8154  
Iteration 2:   log likelihood = -555.81538  

Poisson regression                              Number of obs     =        505
                                                LR chi2(3)        =     340.20
                                                Prob > chi2       =     0.0000
Log likelihood = -555.81538                      Pseudo R2         =     0.2343

------------------------------------------------------------------------------
   accidents |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      cvalue |  -.6558871   .0706484    -9.28   0.000    -.7943553   -.5174188
        kids |  -1.009017   .0807961   -12.49   0.000    -1.167374   -.8506594
     traffic |   .1467115   .0313762     4.68   0.000     .0852153    .2082076
       _cons |    .574354   .2839515     2.02   0.043     .0178193    1.130889
------------------------------------------------------------------------------

Done and undone

I implemented a simple nonlinear optimizer to practice Mata programming and to review the theory behind nonlinear optimization. In future posts, I implement a command for Poisson regression that uses optimize().


