Fairness in machine learning: Equalized Odds

May 26, 2026


I recently dipped my toes into fairness concepts in machine learning. What does being fair mean, practically? Is it fair that I was not born with the physique required to qualify for the NBA?

Absolutely it’s fair.

In the lottery of life, my genes were chosen in the same way as they were for everyone else who was born. Fairness does NOT mean “treating everyone the same”, but rather having the same starting point in that nature-lottery.

That said, in reality it’s not only nature that is calling the shots. We the people also play an important part. When it comes to high-stakes predictions (say who gets a loan, or who passes a medical screening) we would like to enforce the same starting point for everyone. Regrettably, applying the same model to everyone doesn’t directly imply equal treatment. Why? Because we don’t treat everyone the same. Simply speaking, models are trained on real-world data produced by… us. And, since we don’t treat everyone the same, our models inevitably reflect that reality. We can only ask so much from our models (see my short rant about the bias-in-AI misconception in that regard).

When we discuss fairness in machine learning, a solid “referee” is a concept called Equalized Odds. Intuitively it means “the same error rates for everyone” or, loosely, “equal accuracy across groups”; if two people are equally qualified for a loan, they should have the same chance of being correctly approved on the one hand, or carry the same risk of being incorrectly rejected on the other hand. Note the moral stance here: we acknowledge that our classifier will make mistakes, but we want the mistakes to be distributed in a particular way (fairly). By contrast, we should not accept a model that assigns one group a higher chance of misclassification, just because. So Equalized Odds is about equalizing the conditional error rates.

Formally speaking, equalizing the error behavior across groups means that for a protected attribute $A$ (say gender, or race), a prediction $\hat{Y}$ and a true outcome $Y$, Equalized Odds is equivalent to:

$$P(\hat{Y} = 1 \mid A = a, Y = y) = P(\hat{Y} = 1 \mid A = b, Y = y), \quad y \in \{0, 1\}$$

for any two groups $a$ and $b$. In words, your gender ( $A$ ) should not matter. People with the same true outcome $(Y = y)$ should have the same probability of being correctly or incorrectly classified by the model. Equalized Odds forces both error types to be balanced across groups. This is an attractive concept in many domains where different error types are socially costly:

In lending: false positives (giving a loan that defaults) vs false negatives (denying a loan that would have been repaid).
In medical screening: false positives (unnecessary anxiety/tests) vs false negatives (missed disease).
In policing: false positives (unwarranted scrutiny) vs false negatives (missed threats).

Equalized Odds says: whatever we’re doing, we shouldn’t systematically impose one type of mistake more heavily on one group than another, conditional on the truth $(Y = y)$ . Also note that Equalized Odds is a distributional constraint; it doesn’t guarantee fairness per individual. It’s fairness in aggregate conditional error rates (read: on average, not per individual).
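
To make the criterion concrete, here is a minimal sketch that computes the conditional error rates per group and the between-group gaps. The function names, the group labels "a" and "b" and the toy data are all made up for illustration, and the sketch assumes every group contains both outcomes.

```python
from collections import defaultdict

def group_rates(y_true, y_pred, group):
    """Return {group: (TPR, FPR)} from parallel lists of 0/1 labels, 0/1 predictions and group ids."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for y, h, g in zip(y_true, y_pred, group):
        if y == 1:
            key = "tp" if h == 1 else "fn"
        else:
            key = "fp" if h == 1 else "tn"
        counts[g][key] += 1
    rates = {}
    for g, c in counts.items():
        tpr = c["tp"] / (c["tp"] + c["fn"])   # P(prediction = 1 | Y = 1, group = g)
        fpr = c["fp"] / (c["fp"] + c["tn"])   # P(prediction = 1 | Y = 0, group = g)
        rates[g] = (tpr, fpr)
    return rates

def equalized_odds_gaps(rates):
    """Largest between-group difference in TPR and in FPR (both are 0 under Equalized Odds)."""
    tprs = [r[0] for r in rates.values()]
    fprs = [r[1] for r in rates.values()]
    return max(tprs) - min(tprs), max(fprs) - min(fprs)

# Toy data: two hypothetical groups with four positives and four negatives each
y_true = [1, 1, 1, 1, 0, 0, 0, 0,   1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0,   1, 1, 0, 0, 1, 0, 0, 0]
group = ["a"] * 8 + ["b"] * 8

rates = group_rates(y_true, y_pred, group)
tpr_gap, fpr_gap = equalized_odds_gaps(rates)
print(rates)               # {'a': (0.75, 0.25), 'b': (0.5, 0.25)}
print(tpr_gap, fpr_gap)    # 0.25 0.0
```

In this toy case the false positive rates already match, but group "b" is correctly approved less often, so Equalized Odds is violated on the true positive side only.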
Now for the part statisticians have learned to expect:

💡 constraints are by no means free.

Equalized Odds may force you to give up some predictive performance. To see why, and to understand how to apply Equalized Odds, we first take a look at our model’s probability score, which indicates the likelihood of an individual falling into one class or another. Your logistic regression (say) outputs a score (probability) rather than a rigid decision (classification). Usually for binary classification we use a 0.5 threshold to determine the class (e.g. qualified or rejected for a loan). Since Equalized Odds means the error rates across groups are identical, we look at the two types of errors (typically called Type 1 and Type 2 errors) for each group:

False Positives (e.g., mistakenly approving a risky candidate) and
False Negatives (e.g., mistakenly denying a qualified candidate).
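
As a small illustration of how the cutoff trades one error type for the other, here is a sketch on made-up scores, where a "positive" means the applicant is predicted to repay. The function name and all numbers are hypothetical.

```python
def error_rates(scores, labels, threshold):
    """False positive rate and false negative rate when predicting 1 for score >= threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    negatives = labels.count(0)
    positives = labels.count(1)
    return fp / negatives, fn / positives

# Hypothetical scores (estimated probability of repaying) and true outcomes
scores = [0.05, 0.20, 0.35, 0.45, 0.55, 0.60, 0.70, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

for t in (0.3, 0.5, 0.75):
    fpr, fnr = error_rates(scores, labels, t)
    print(f"threshold={t:.2f}  FPR={fpr:.2f}  FNR={fnr:.2f}")
# threshold=0.30  FPR=0.60  FNR=0.00
# threshold=0.50  FPR=0.40  FNR=0.20
# threshold=0.75  FPR=0.00  FNR=0.40
```

Raising the cutoff lowers the false positive rate and raises the false negative rate, so a single cutoff fixes both error rates at once for each group.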

Enter the Receiver Operating Characteristic (ROC) curve of the score. The ROC curve captures the false positive rate and the true positive rate (equivalently, one minus the false negative rate) at different cutoff points. What we need to search for is a threshold which delivers the same error rates for both error types and for both groups, like so:

In the hypothetical graph above we see that there is a point at which the error rates of the two groups are the same. Specifically, this happens when the false positive rate is roughly 30% and the false negative rate (which is the complement of the TPR on the Y-axis) is about 20%. The exact classification cutoff is not shown in this chart (so don’t get confused), and in general it need not be the same number for both groups, but whatever that cutoff is, it satisfies the Equalized Odds criterion of fairness. At this specific cutoff, both demographic groups have identical error rates, effectively neutralizing gender when it comes to that attribute’s predictive power.

The tricky part is of course to find that cutoff point.
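
One naive way to look for such a point is a brute-force grid search over one cutoff per group. The sketch below is only an illustration on synthetic data, not the ROC-intersection approach used in the practical example; the groups, the score distributions, the grid and the tolerance `tol` are all assumptions made here.

```python
import random

def tpr_fpr(scores, labels, threshold):
    """TPR and FPR when predicting 1 for score >= threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return tp / pos, fp / neg

def make_group(n, shift, rng):
    """Synthetic group: alternating 0/1 labels, Gaussian scores clipped to [0, 1]."""
    labels = [i % 2 for i in range(n)]
    scores = [min(1.0, max(0.0, rng.gauss(0.35 + 0.3 * y + shift, 0.15))) for y in labels]
    return scores, labels

rng = random.Random(0)                       # fixed seed, arbitrary choice
groups = {"a": make_group(400, 0.0, rng),    # hypothetical group a
          "b": make_group(400, 0.1, rng)}    # group b: same shape, scores shifted upwards

grid = [k / 100 for k in range(5, 96)]       # candidate cutoffs 0.05 ... 0.95
rates = {g: {t: tpr_fpr(s, y, t) for t in grid} for g, (s, y) in groups.items()}

def gap(ta, tb):
    """Largest between-group difference in TPR or FPR for cutoffs ta (group a) and tb (group b)."""
    (tpr_a, fpr_a), (tpr_b, fpr_b) = rates["a"][ta], rates["b"][tb]
    return max(abs(tpr_a - tpr_b), abs(fpr_a - fpr_b))

def error(ta, tb):
    """Overall error rate; both groups have equal size and a 50% base rate in this toy setup."""
    (tpr_a, fpr_a), (tpr_b, fpr_b) = rates["a"][ta], rates["b"][tb]
    return 0.25 * ((1 - tpr_a) + fpr_a + (1 - tpr_b) + fpr_b)

pairs = [(ta, tb) for ta in grid for tb in grid]
tol = 0.02                                   # how close the rates must be to count as equal
candidates = [pr for pr in pairs if gap(*pr) <= tol]
if not candidates:                           # fall back to the closest pair if none is within tol
    candidates = [min(pairs, key=lambda pr: gap(*pr))]
best_ta, best_tb = min(candidates, key=lambda pr: error(*pr))

print("gap with 0.5 for both groups:", round(gap(0.5, 0.5), 3))
print("chosen cutoffs:", best_ta, best_tb, "gap:", round(gap(best_ta, best_tb), 3))
```

Among the cutoff pairs that nearly equalize both rates, the sketch keeps the one with the lowest overall error, because trivial pairs that approve almost everyone or almost no one also equalize the rates.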

Practical example

Let’s use the Adult dataset for our example. It’s a classic supervised-learning benchmark. Each row is one person with demographic and employment features (e.g., age, education, occupation, hours per week), and the label is whether annual income is above $50K. Gender is also an attribute used for prediction, and while it may matter, we want to disregard that sensitive attribute as a predictor of high income. So we can use Equalized Odds to make the two types of error rates match across groups. Below is the Python code used to calculate the specific cutoff point where the error rates of the two groups (men and women) intersect.

The first chunk creates some needed functions, loads the data and estimates a basic logistic regression; read it at your leisure. The second chunk shows the results.


import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_curve
from sklearn.model_selection import train_test_split

np.set_printoptions(precision=3,suppress=True)
pd.set_option("display.precision",3)
pd.options.display.float_format="{:.3f}".format

# Load the Adult data and code the label as 1 for income above 50K
X,y=fetch_openml("adult",version=2,as_frame=True,return_X_y=True)
df=pd.concat([X,y.rename("income")],axis=1).dropna()
df["income"]=(df["income"]==">50K").astype(int)
X=df.drop(columns=["income"])
y=df["income"]
X=pd.get_dummies(X,drop_first=True)

# Three splits: fit the model, calibrate the cutoffs, test out-of-sample
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=0,
    stratify=y)
X_tr,X_cal,y_tr,y_cal=train_test_split(X_train,y_train,test_size=0.3,random_state=1,
    stratify=y_train)
model=LogisticRegression(max_iter=2000,solver="liblinear")
model.fit(X_tr,y_tr)

# Scores are the predicted probabilities of the positive class
s_cal=model.predict_proba(X_cal)[:,1]
s_test=model.predict_proba(X_test)[:,1]
df_cal=X_cal.copy()
df_cal["y"]=y_cal.values
df_cal["s"]=s_cal
df_test=X_test.copy()
df_test["y"]=y_test.values
df_test["s"]=s_test
df_test["y_hat"]=(df_test["s"]>=0.5).astype(int)

def roc_by_group(df,group_col,group_val):
    d=df[df[group_col]==group_val]
    y=d["y"].to_numpy()
    s=d["s"].to_numpy()
    fpr,tpr,thr=roc_curve(y,s)
    return fpr,tpr,thr,d

def segment_intersections(fpr0,tpr0,thr0,fpr1,tpr1,thr1):
    # Find all points where a segment of ROC curve 0 crosses a segment of ROC curve 1
    cand=[]
    for i in range(len(fpr0)-1):
        a0=np.array([fpr0[i],tpr0[i]])
        b0=np.array([fpr0[i+1],tpr0[i+1]])
        u=b0-a0
        for j in range(len(fpr1)-1):
            a1=np.array([fpr1[j],tpr1[j]])
            b1=np.array([fpr1[j+1],tpr1[j+1]])
            v=b1-a1
            # Solve a0+p*u=a1+q*v for the positions p and q along the two segments
            M=np.column_stack([u,-v])
            det=np.linalg.det(M)
            if abs(det)<1e-12:
                continue  # parallel segments, no unique crossing
            rhs=a1-a0
            sol=np.linalg.solve(M,rhs)
            p=float(sol[0])
            q=float(sol[1])
            if 0<=p<=1 and 0<=q<=1:
                pt=a0+p*u
                cand.append((pt[0],pt[1],i,p,j,q))
    return cand

def total_error_at_point(df,group_col,fpr_star,tpr_star):
    # Overall error if both groups operate at the ROC point (fpr_star,tpr_star)
    errs=[]
    wts=[]
    for gv in [0,1]:
        d=df[df[group_col]==gv]
        y=d["y"].to_numpy()
        pi=y.mean()  # base rate of the positive class in this group
        err=pi*(1-tpr_star)+(1-pi)*fpr_star
        errs.append(err)
        wts.append(len(d))
    wts=np.array(wts,dtype=float)
    wts=wts/wts.sum()
    return float(wts[0]*errs[0]+wts[1]*errs[1])

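rng=np.random.default_rng(0)  # random generator used below; the seed is an arbitrary choice

def apply_random_threshold(scores,ta,tb,p):
    # Use cutoff ta with probability p and cutoff tb otherwise, drawn per observation
    z=rng.random(len(scores))<p
    yhat=np.zeros(len(scores),dtype=int)
    yhat[z]=(scores[z]>=ta).astype(int)
    yhat[~z]=(scores[~z]>=tb).astype(int)
    return yhat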

def rates(d):
    # True positive rate and false positive rate from the y and y_hat columns
    y=d["y"].to_numpy()
    h=d["y_hat"].to_numpy()
    tp=((h==1)&(y==1)).sum()
    fp=((h==1)&(y==0)).sum()
    fn=((h==0)&(y==1)).sum()
    tn=((h==0)&(y==0)).sum()
    tpr=tp/(tp+fn) if (tp+fn)>0 else np.nan
    fpr=fp/(fp+tn) if (fp+tn)>0 else np.nan
    return float(tpr),float(fpr)

Results:
For easier exposition, and since the goal is to predict whether an individual earns more than 50K or not, let us simply refer to those earning more than 50K as rich, and refer to the rest as poor. If you want to equalize the odds, so to speak, the chart above shows that you need to deviate from the “built-in” 0.5 cutoff point. How? Such that the FPR is more or less zero (no poor person is classified as rich), and the false negative rate is about 80% (so regrettably many rich people are wrongly classified as poor). You can see below that the model’s accuracy is not bad at around 80%, but of course it is driven by classifying the majority as poor (which they are..).

You can read the exact details in the code below, but in short:

We moved from the original TPR, which was 26%, to a TPR of about 17%; the change in FPR was small.
We moved from a threshold of 0.5 for everyone (both men and women) to an individual threshold per group of about 0.7 (females) and 0.68 (males).

More things to say:

As you can see from the graph, there are other points at which the ROC curves intersect, but there is an underlying rule to choose the one which has the best overall accuracy (makes sense).
Although it seems logical that accuracy would decrease overall, this is not necessarily the case. The reason is simple: we calibrate on training data, while accuracy is computed out-of-sample.
The code includes several other nuances. Since each time you move the cutoff there is a “bunch” of observations that “flip” simultaneously rather than a single observation, the ROC curve is not smooth but rather a step function, so we always have two error rates: before and after the change in the cutoff. But since we have many observations (so not continuous, but almost) we can simply consider the average here.


# About 25% earn more than 50K:
print(df_test[["y"]].mean())
# y   0.248
# dtype: float64

# And accuracy is around 80%:
df_base=df_test.copy()
df_base["y_hat"]=(df_base["s"]>=0.5).astype(int)
acc_before=accuracy_score(df_base["y"],df_base["y_hat"])
print(f"Accuracy before EO: {acc_before:.3f}")
# Accuracy before EO: 0.789

overall_tpr,overall_fpr=rates(df_test)
print("Overall TPR:",overall_tpr) # missing about 75% of the rich people
print("Overall FPR:",overall_fpr) # but almost never calling a poor person rich

# The following is without any correction applied:
# Overall TPR: 0.262
# Overall FPR: 0.037

if "sex_Male" in df_test.columns:
    g0=df_test[df_test["sex_Male"]==0]
    g1=df_test[df_test["sex_Male"]==1]
    tpr0,fpr0=rates(g0)
    tpr1,fpr1=rates(g1)
    print("Female TPR:",tpr0)
    print("Female FPR:",fpr0)
    print("Male TPR:",tpr1)
    print("Male FPR:",fpr1)

# Female TPR: 0.291
# Female FPR: 0.035
# Male TPR: 0.257
# Male FPR: 0.038
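
As a quick sanity check on the numbers printed above, accuracy can be rebuilt from the base rate and the two error rates, using the same weighting as total_error_at_point. This is only back-of-the-envelope arithmetic on the rounded figures; the variable names are chosen here.

```python
# Rounded figures reported in the output above
base_rate = 0.248    # share of the test set earning more than 50K
tpr = 0.262          # overall true positive rate at the 0.5 cutoff
fpr = 0.037          # overall false positive rate at the 0.5 cutoff

# Misses among the rich plus false alarms among the poor
error = base_rate * (1 - tpr) + (1 - base_rate) * fpr
accuracy = 1 - error

# Trivial benchmark: label everybody as poor
majority_accuracy = 1 - base_rate

print(f"accuracy from the rates: {accuracy:.3f}")           # 0.789
print(f"always-poor benchmark:   {majority_accuracy:.3f}")  # 0.752
```

The reconstructed 0.789 matches the reported accuracy, and it sits only a few points above the always-poor benchmark, which is what "driven by classifying the majority as poor" means in practice.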

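fpr0,tpr0,thr0,_=roc_by_group(df_cal,"sex_Male",0)
fpr1,tpr1,thr1,_=roc_by_group(df_cal,"sex_Male",1)

# Candidate points where the two group ROC curves cross
cand=segment_intersections(fpr0,tpr0,thr0,fpr1,tpr1,thr1)

best=None
best_err=np.inf

# Keep the crossing point with the lowest overall error on the calibration split
for fpr_star,tpr_star,i,p,j,q in cand:
    err=total_error_at_point(df_cal,"sex_Male",fpr_star,tpr_star)
    if err<best_err:
        best_err=err
        best=(fpr_star,tpr_star,i,p,j,q)
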
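
The helper apply_random_threshold suggests that a point lying between two vertices of the ROC step function is reached by randomizing between the two neighbouring cutoffs. The sketch below illustrates that idea on made-up numbers, in plain Python. How the selected point is mapped to actual cutoffs in this post's code is an assumption here, and all names and values in the block are hypothetical.

```python
import random

def fpr_tpr(scores, labels, threshold):
    """(FPR, TPR) when predicting 1 for score >= threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    return fp / neg, tp / pos

def randomized_predictions(scores, t_a, t_b, p, rng):
    """Apply cutoff t_a with probability p and cutoff t_b otherwise, drawn per observation."""
    return [int(s >= (t_a if rng.random() < p else t_b)) for s in scores]

scores = [0.10, 0.25, 0.45, 0.50, 0.55, 0.45, 0.50, 0.58, 0.75, 0.90]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
t_high, t_low, p = 0.6, 0.4, 0.3     # two neighbouring cutoffs and the weight on the higher one

fpr_hi, tpr_hi = fpr_tpr(scores, labels, t_high)   # (0.0, 0.4)
fpr_lo, tpr_lo = fpr_tpr(scores, labels, t_low)    # (0.6, 1.0)

# Expected operating point: a weighted average of the two ROC vertices
target_fpr = p * fpr_hi + (1 - p) * fpr_lo
target_tpr = p * tpr_hi + (1 - p) * tpr_lo

# Monte Carlo check that the randomized rule lands on that point on average
rng = random.Random(0)
n_rep = 2000
neg, pos = labels.count(0), labels.count(1)
avg_fpr = avg_tpr = 0.0
for _ in range(n_rep):
    y_hat = randomized_predictions(scores, t_high, t_low, p, rng)
    avg_fpr += sum(h for h, y in zip(y_hat, labels) if y == 0) / neg / n_rep
    avg_tpr += sum(h for h, y in zip(y_hat, labels) if y == 1) / pos / n_rep

print(f"target:    FPR={target_fpr:.2f}  TPR={target_tpr:.2f}")   # FPR=0.42  TPR=0.82
print(f"simulated: FPR={avg_fpr:.2f}  TPR={avg_tpr:.2f}")
```

With many observations the two neighbouring cutoffs are nearly identical, which is consistent with simply averaging them as described above.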
Summary
This implementation of Equalized Odds is useful. I find it particularly appealing in that it operates purely as a post-processing step, so we are free to use whichever classifier we choose, without any complexity add-ons.
Of course, there are other notions of fairness and more sophisticated ways to implement them, but this is a good enough starting point for understanding what research recommends we do, if we are to be fair. Below are a couple of papers that served as inspiration.
References

Hardt, M., Price, E., & Srebro, N. (2016). Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems, 29.

Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861-874.
