Sunday, October 19, 2025

Approximate statistical tests for comparing binary classifier error rates using H2OML


Motivation

You have just trained a gradient boosting machine (GBM) and a random forest (RF) classifier on your data using Stata’s new h2oml command suite. Your GBM model achieves 87% accuracy on the testing data, and your RF model, 85%. It looks as if GBM is the preferred classifier, right? Not so fast.

Why accuracy alone isn’t enough

Accuracy, area under the curve, and root mean squared error are popular metrics, but they provide only point estimates. These numbers reflect how well a model performed on one specific testing sample, but they don’t account for the variability that can arise from sample to sample. In other words, they don’t answer this key question: Will the difference in performance between these methods hold at the population level, or could it have occurred by chance only in this particular testing dataset?

When comparing methods like GBM and RF, a few percentage points in performance might not be compelling on their own. Without considering how much the results might vary across different samples, it is hard to tell whether one method consistently outperforms the other or whether the observed difference is merely a product of random variation in the data. Statistical tests are essential in this regard, as they provide a framework for assessing whether the observed differences are likely to persist in the population.

Introduction

A common practice in machine learning for evaluating classifiers is to split the dataset into either a three-way holdout (training, validation, and testing sets) or a two-way holdout (training and testing sets). The validation set (for three-way splits) or cross-validation (for two-way splits) is used to tune the model, while the testing set evaluates the final performance. For details, see Model selection in machine learning in [H2OML] Intro.

However, a subtle but critical drawback of relying on a single test set is random variation in the selection of the testing data. Specifically, even if two classifiers perform identically on the entire population, one may appear superior because of chance fluctuations in the sampled testing data. This is especially problematic with small testing sets.

To address this, statistical tests are recommended in the literature (Dietterich 1998; Alpaydin 1998; Raschka 2018). In this post, we explore the following question: Given two machine learning methods and a training set, how can we test whether the classifiers exhibit the same error rate on unseen data?

We focus on two tests: the McNemar test (McNemar 1947) and the combined \(5 \times 2\) cross-validated (\(5 \times 2\) CV) F test (Alpaydin 1998). Using Stata and its h2oml suite, we will demonstrate their application. The post is structured as follows: First, we introduce both tests conceptually; then, we transition to practical implementation in Stata.

Statistical tests

In binary classification, the performance of a model can be evaluated using the misclassification error rate, which is the proportion of incorrect predictions among all predictions. Let true positives (TP) and true negatives (TN) denote the number of correctly classified positive and negative cases, respectively. Let false positives (FP) and false negatives (FN) denote the number of misclassified negative and positive cases, respectively. The misclassification error rate is defined as
\[
e = \frac{\text{FP} + \text{FN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} \tag{1}\label{eq:errrate}
\]

Conversely, the accuracy of the model, which measures the proportion of correct predictions, is given by
\[
\text{acc} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} = 1 - e \tag{2}\label{eq:accuracy}
\]

For details, see [H2OML] metric_option. These metrics are fundamental for assessing the quality of predictions made by methods such as RFs or GBMs.
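As a quick numeric illustration of equations (1) and (2), both metrics can be computed with Stata scalars; the confusion-matrix counts below are made up for this sketch:

. scalar TP = 50
. scalar TN = 300
. scalar FP = 30
. scalar FN = 20
. di (FP + FN)/(TP + TN + FP + FN)    // error rate e, equation (1)
.125
. di (TP + TN)/(TP + TN + FP + FN)    // accuracy = 1 - e, equation (2)
.875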

McNemar’s test

McNemar’s test is a nonparametric test for paired comparisons that can be used to assess whether two classification methods differ in performance on the same testing set.

Let \(n_{ij}\) denote the number of instances for which classifier A’s (for example, GBM) prediction was \(i\) (\(i=1\) for a correct prediction or \(i=0\) for an incorrect prediction) and classifier B’s (for example, RF) prediction was \(j\) (\(j=1\) for a correct prediction or \(j=0\) for an incorrect prediction). The \(2 \times 2\) contingency table is

Table 1: Information needed to conduct McNemar’s test for comparing two binary classifiers’ error rates

              B incorrect    B correct
A incorrect   \(n_{00}\)     \(n_{01}\)
A correct     \(n_{10}\)     \(n_{11}\)

We are interested in the off-diagonal elements: \(n_{01}\) (A incorrect, B correct) and \(n_{10}\) (A correct, B incorrect). These values represent the disagreements between the classifiers.

The null hypothesis \(H_0\) is that the two classifiers have the same error rate:
\[
H_0 : P(\text{A incorrect, B correct}) = P(\text{A correct, B incorrect})
\quad \text{or} \quad n_{01} = n_{10}
\]

Under the null hypothesis, the number of disagreements \(n_{01} + n_{10}\) follows a binomial distribution with equal probability of either outcome. For large sample sizes, the binomial distribution can be approximated by a chi-squared distribution with 1 degree of freedom.

The McNemar test statistic is
\[
\chi^2 = \frac{(n_{01} - n_{10})^2}{n_{01} + n_{10}}
\]

This statistic is approximately chi-squared distributed with 1 degree of freedom under the null hypothesis. See Unstratified matched case–control data (mcc and mcci) in [R] epitab for more details.
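As a minimal sketch of the computation, the statistic and its p-values can be evaluated directly with Stata’s built-in functions; the disagreement counts below anticipate the ones we obtain in the application later in this post:

. scalar n01 = 22
. scalar n10 = 17
. di %6.4f (n01 - n10)^2/(n01 + n10)                   // McNemar chi-squared statistic
0.6410
. di %6.4f chi2tail(1, (n01 - n10)^2/(n01 + n10))      // approximate p-value
0.4233
. di %6.4f min(1, 2*binomialtail(n01 + n10, max(n01, n10), 0.5))   // exact binomial version
0.5224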

Combined 5 x 2 CV F test

The \(5 \times 2\) CV F test is a statistical method for comparing the performance of two supervised classification methods. It is designed to test the null hypothesis
\[
H_0: \text{The two classifiers have equal generalization error}
\]
and is built upon Dietterich’s \(5 \times 2\) CV paired t test (Dietterich 1998). Alpaydin (1998) identified instability in the original test due to the arbitrary choice of one of 10 possible test statistics and proposed a combined F test that aggregates over all of them for robustness.

We perform 5 replications of 2-fold cross-validation, yielding 10 distinct test sets. Let \(p_i^{(j)}\) denote the difference in error rates between the two classifiers on fold \(j = 1, 2\) of replication \(i = 1, \dots, 5\). That is,
\[
p_i^{(j)} = e_{i,A}^{(j)} - e_{i,B}^{(j)} = \text{acc}_{i,B}^{(j)} - \text{acc}_{i,A}^{(j)} \tag{3}\label{eq:pij}
\]
where \(e_{i,A}^{(j)}\) and \(e_{i,B}^{(j)}\) are the misclassification error rates of classifiers A and B, respectively, on the \(j\)th fold of the \(i\)th replication [as defined in \eqref{eq:errrate}] and \(\text{acc}_{i,A}^{(j)}\) and \(\text{acc}_{i,B}^{(j)}\) are the corresponding accuracy values [as defined in \eqref{eq:accuracy}].

For each replication \(i\), we compute the average,
\[
\bar{p}_i = \frac{p_i^{(1)} + p_i^{(2)}}{2}
\]
and the estimate of the variance:
\[
s_i^2 = (p_i^{(1)} - \bar{p}_i)^2 + (p_i^{(2)} - \bar{p}_i)^2 = \frac{(p_i^{(1)} - p_i^{(2)})^2}{2} \tag{4}\label{eq:var}
\]
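The second equality in \eqref{eq:var} is easy to verify numerically; the two fold differences below are made-up values for this sketch:

. scalar p1 = 0.02                      // hypothetical p_i^(1)
. scalar p2 = -0.01                     // hypothetical p_i^(2)
. scalar pbar = (p1 + p2)/2
. di (p1 - pbar)^2 + (p2 - pbar)^2      // left-hand side of (4)
.00045
. di (p1 - p2)^2/2                      // right-hand side of (4)
.00045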

Original 5 x 2 CV t test (for reference)

Dietterich (1998) proposed the t statistic

\[
t = \frac{p_1^{(1)}}{\sqrt{\frac{1}{5} \sum_{i=1}^{5} s_i^2}}
\]

This uses only 1 of the 10 possible \(p_i^{(j)}\) values, which introduces randomness based on the choice of fold order.

Combined 5 x 2 CV F-test derivation

To improve robustness, the combined F test aggregates all 10 squared differences \((p_i^{(j)})^2\) and all 5 variances \(s_i^2\).

Define
\[
N = \sum_{i=1}^{5} \sum_{j=1}^{2} \left( p_i^{(j)} \right)^2
\quad\text{and}\quad
M = \sum_{i=1}^{5} s_i^2 \tag{5}\label{eq:NandM}
\]

Under the null hypothesis and an (approximate) independence assumption, we have
\[
F = \frac{N / 10}{M / 5} = \frac{\sum_{i=1}^{5} \sum_{j=1}^{2} \left( p_i^{(j)} \right)^2}{2 \sum_{i=1}^{5} s_i^2} \tag{6}\label{eq:Fstat}
\]

This statistic is approximately F-distributed with \((10, 5)\) degrees of freedom.

In summary, the combined \(5 \times 2\) CV F test improves upon Dietterich’s original t test by

  • using all 10 fold differences instead of just 1,
  • reducing sensitivity to the order of folds or replications, and
  • providing better control of type I error and improved statistical power.

Implementation in Stata

We begin our analysis by loading attrition.dta and generating a new variable, logincome, that stores the log of monthly income. This is a common transformation used to normalize skewed variables before modeling.

. use https://www.stata.com/users/assaad_dallakyan/attrition, clear
. gen logincome = log(monthlyincome)

We then initialize the H2O cluster using h2o init, put the current dataset into an H2O frame, attrition, and make it the current H2O frame.

. h2o init
. _h2oframe put, into(attrition) current

We split attrition.dta into training (70%) and testing (30%) frames using random seed 19 for reproducibility. Then we set train as the current working frame for model training.

. _h2oframe split attrition, into(train test) split(0.7 0.3) rseed(19) replace
. _h2oframe change train

For convenience, we define a global macro, predictors, that includes the full set of predictors for the model. These cover a range of personal and job-related features, such as education, job satisfaction, work-life balance, and demographic details.

. global predictors age education employeenumber environmentsat
> jobinvolvement jobsatisfaction logincome numcompaniesworked
> performance relationshipsat totalworkingyears worklifebalance
> yearsatcompany yearsincurrentrole yearswithcurrmanager
> businesstravel gender jobrole maritalstatus

McNemar’s test

We first train a GBM classifier using the training dataset. Once the model is trained, we specify that the test frame should be used for subsequent postestimation commands, display the confusion matrix, and generate predictions. The predicted classes are stored in variable attrition_gbm in the testing frame test, and the model is stored under the name gbm for future comparison. For simplicity, for both the GBM and RF classifiers, we used the default values for all hyperparameters and did not perform tuning. However, in real-world applications, we would more likely want to compare the best models obtained after hyperparameter tuning; see Hyperparameter tuning in [H2OML] Intro for more details about tuning.

. h2oml gbbinclass attrition $predictors, h2orseed(19)
(output omitted)

. h2omlpostestframe test
(testing frame test is now active for h2oml postestimation)

. h2omlestat confmatrix

Confusion matrix using H2O
Testing frame: test

           |      Predicted
 attrition |         No        Yes |  Total  Error    Rate
-----------+-----------------------+----------------------
        No |        318         33 |    351     33    .094
       Yes |         48         32 |     80     48      .6
-----------+-----------------------+----------------------
     Total |        366         65 |    431     81    .188

Note: Probability threshold .254 that maximizes F1 metric
      used for classification.


. h2omlpredict attrition_gbm, class

Progress (%): 0 100

. h2omlest store gbm

Across all 431 observations in the testing dataset, there were 81 misclassifications, giving an overall error rate of 0.188.
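As a quick check, the overall rate in the bottom row of the confusion matrix is just equation (1) applied to the off-diagonal counts (the same check works for the RF matrix below):

. di (33 + 48)/431    // (FP + FN)/total for the GBM confusion matrix
.18793503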

We repeat the same procedure for an RF classifier. The predictions are stored in variable attrition_rf, and the model is stored as rf.

. h2oml rfbinclass attrition $predictors, h2orseed(19)
(output omitted)

. h2omlpostestframe test
(testing frame test is now active for h2oml postestimation)

. h2omlestat confmatrix

Confusion matrix using H2O
Testing frame: test

           |      Predicted
 attrition |         No        Yes |  Total  Error    Rate
-----------+-----------------------+----------------------
        No |        276         75 |    351     75    .214
       Yes |         29         51 |     80     29    .362
-----------+-----------------------+----------------------
     Total |        305        126 |    431    104    .241

Note: Probability threshold .21 that maximizes F1 metric
      used for classification.


. h2omlpredict attrition_rf, class

Progress (%): 0 100

. h2omlest store rf

Across all 431 observations in the testing dataset, there were 104 misclassifications, giving an overall error rate of 0.241. At first glance, it appears that GBM outperforms RF in terms of predictive accuracy (0.188 versus 0.241 error rates). However, this difference may not be indicative of a difference in the population. This highlights the importance of supplementing accuracy metrics with proper statistical testing, as we do next with McNemar’s test and the \(5 \times 2\) CV F test.

To perform McNemar’s test, we bring the test data and predictions back into Stata (via _h2oframe get) for further statistical analysis. We encode the string-valued categorical predictions and outcome into numeric variables and drop the original string versions.

. clear
. _h2oframe get attrition attrition_gbm attrition_rf using test
. encode attrition, gen(nattrition)
. encode attrition_gbm, gen(nattrition_gbm)
. encode attrition_rf, gen(nattrition_rf)
. drop attrition attrition_gbm attrition_rf
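Before tabulating, it is worth confirming that the imported predictions reproduce the misclassification counts from the confusion matrices above. This check assumes encode assigned matching value labels to all three variables, which holds here because each takes only the values No and Yes:

. count if nattrition != nattrition_gbm   // GBM misclassifications
  81
. count if nattrition != nattrition_rf    // RF misclassifications
  104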

The next step is to produce a three-way table that cross-tabulates the true values with both models’ predictions. From the results, we identify the counts needed (shown in table 1) for McNemar’s test and store them in local macros.

. table (nattrition_gbm) (nattrition nattrition_rf), nototal

---------------------------------------------------
               |              nattrition
               |         No               Yes
               |   nattrition_rf     nattrition_rf
               |      No      Yes      No      Yes
---------------+-----------------------------------
nattrition_gbm |
  No           |     303       17       41        8
  Yes          |       9       22        5       26
---------------------------------------------------
. local n00 = 22 + 41   // No. of obs. misclassified by both GBM and RF
. local n01 = 17 + 5    // No. of obs. misclassified by RF but not by GBM
. local n10 = 9 + 8     // No. of obs. misclassified by GBM but not by RF
. local n11 = 303 + 26  // No. of obs. classified correctly by both
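Alternatively, rather than reading the counts off the table by hand, one could compute them directly with count; the following sketch produces the same four frequencies:

. quietly count if nattrition_gbm != nattrition & nattrition_rf != nattrition
. local n00 = r(N)    // 63: misclassified by both
. quietly count if nattrition_gbm == nattrition & nattrition_rf != nattrition
. local n01 = r(N)    // 22: misclassified by RF but not by GBM
. quietly count if nattrition_gbm != nattrition & nattrition_rf == nattrition
. local n10 = r(N)    // 17: misclassified by GBM but not by RF
. quietly count if nattrition_gbm == nattrition & nattrition_rf == nattrition
. local n11 = r(N)    // 329: classified correctly by both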

We then run mcci to compute the McNemar statistic using these frequencies.

. mcci `n00' `n01' `n10' `n11'

                 |        Controls        |
Cases            |   Exposed   Unexposed  |      Total
-----------------+------------------------+-----------
         Exposed |        63          22  |         85
       Unexposed |        17         329  |        346
-----------------+------------------------+-----------
           Total |        80         351  |        431

McNemar's chi2(1) =      0.64    Prob > chi2 = 0.4233
Exact McNemar significance probability      = 0.5224

The result does not provide evidence to reject the null hypothesis, suggesting no performance difference.

For models that are computationally expensive to train, Dietterich (1998) recommended McNemar’s test as the method of choice. For models that can be trained multiple times (for example, 10 times), he recommended the \(5 \times 2\) CV \(t\) test because it is slightly more powerful than McNemar’s test. Next, we describe how to implement the \(5 \times 2\) CV \(F\) test in Stata, which is an improved version of the \(5 \times 2\) CV \(t\) test.

Combined 5 x 2 CV F test

We start by switching to the frame that contains the full dataset (attrition). We then initialize scalars to accumulate \(N\) and \(M\) [see \eqref{eq:NandM}], which are used to compute the F statistic in \eqref{eq:Fstat}.

. _h2oframe change attrition
. scalar N = 0
. scalar M = 0

We then perform 5 iterations, where in each iteration we randomly split the dataset into two equal halves, train and test. To ensure reproducibility, we first set a seed in Stata and then generate pseudorandom numbers using runiformint(). We extract digits from this number to form a new seed, which we pass to H2O’s pseudorandom-number generator via the rseed() option of the _h2oframe split command. Note that this procedure differs from the one we advised against in the [R] set seed entry. In this case, because H2O’s pseudorandom-number generator is unrelated to Stata’s, there is no risk of the generator converging to a cycle. We then train GBM and RF on each half and evaluate them on the other, recording their accuracy (computed via the h2omlestat threshmetric command). We compute the difference in performance for each fold (\(p_i^{(j)}, j = 1, 2\)) and store the results in scalars pi1 and pi2. Then we calculate the variance and accumulate the squared differences and variances across all replications. These are then used to calculate the F statistic.

. set seed 19
. forvalues i = 1(1)5 {
  2.         local split_seed = runiformint(1, 50000)
  3.         _h2oframe split attrition, into(train test) split(0.5 0.5) rseed(`split_seed') replace
  4.         quietly {
  5.                 _h2oframe change train
  6.                 h2oml gbbinclass attrition $predictors, h2orseed(19) validframe(test)
  7.                 h2omlestat threshmetric
  8.                 scalar accA_1 = r(threshmetric)[4,1]   // Accuracy of A (GBM) on 1st fold
  9.
.                    h2oml rfbinclass attrition $predictors, h2orseed(19) validframe(test)
 10.                 h2omlestat threshmetric
 11.                 scalar accB_1 = r(threshmetric)[4,1]   // Accuracy of B (RF) on 1st fold
 12.
.                    _h2oframe change test
 13.                 h2oml gbbinclass attrition $predictors, h2orseed(19) validframe(train)
 14.                 h2omlestat threshmetric
 15.                 scalar accA_2 = r(threshmetric)[4,1]   // Accuracy of A (GBM) on 2nd fold
 16.
.                    h2oml rfbinclass attrition $predictors, h2orseed(19) validframe(train)
 17.                 h2omlestat threshmetric
 18.                 scalar accB_2 = r(threshmetric)[4,1]   // Accuracy of B (RF) on 2nd fold
 19.
                     // Compute the difference in performance
.                    scalar pi1 = accA_1 - accB_1                   // Equation (3)
 20.                 scalar pi2 = accA_2 - accB_2
 21.                 scalar variance = (pi1 - pi2)^2 / 2            // Equation (4)
 22.                 scalar N = N + pi1^2 + pi2^2                   // Equation (5)
 23.                 scalar M = M + variance                        // Equation (5)
 24.         }
 25. }
. scalar f_stat = N / (2 * M)                                       // Equation (6)
. scalar p_value = Ftail(10, 5, f_stat)
. di p_value
.19382379

The result of this test corroborates the result of McNemar’s test. There is no evidence to suggest that the methods perform differently.
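Equivalently, assuming the scalars from the loop above are still in memory, one can check whether the observed statistic falls below the conventional 5% critical value of the F(10, 5) distribution (about 4.74):

. di f_stat < invFtail(10, 5, 0.05)    // 1 if we fail to reject H0 at the 5% level
1

Because the reported p-value (0.194) exceeds 0.05, the comparison returns 1, consistent with failing to reject the null hypothesis.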

References
Alpaydin, E. 1998. Combined 5×2 cv F test for comparing supervised classification learning algorithms. https://api.semanticscholar.org/CorpusID:6872443.

Dietterich, T. G. 1998. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation 10: 1895–1923. https://doi.org/10.1162/089976698300017197.

McNemar, Q. 1947. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12: 153–157. https://doi.org/10.1007/BF02295996.

Raschka, S. 2018. Model evaluation, model selection, and algorithm selection in machine learning. arXiv:1811.12808 [cs.LG]. https://doi.org/10.48550/arXiv.1811.12808.


