Saturday, March 14, 2026

We Used 5 Outlier Detection Strategies on a Actual Dataset: They Disagreed on 96% of Flagged Samples



Picture by Creator

 

Introduction

 

All tutorials on knowledge science make detecting outliers seem like fairly simple. Take away all values higher than three commonplace deviations; that is all there’s to it. However when you begin working with an precise dataset the place the distribution is skewed and a stakeholder asks, “Why did you take away that knowledge level?” you all of the sudden notice you do not have a very good reply.

So we ran an experiment. We examined 5 of probably the most generally used outlier detection strategies on an actual dataset (6,497 Portuguese wines) to seek out out: do these strategies produce constant outcomes?

They did not. What we realized from the disagreement turned out to be extra helpful than something we may have picked up from a textbook.

 

Outlier Detection Methods
Picture by Creator

 

We constructed this evaluation as an interactive Strata pocket book, a format you should utilize in your personal experiments utilizing the Information Undertaking on StrataScratch. You possibly can view and run the total code right here.

 

Setting Up

 
Our knowledge comes from the Wine High quality Dataset, publicly obtainable by means of UCI’s Machine Studying Repository. It accommodates physicochemical measurements from 6,497 Portuguese “Vinho Verde” wines (1,599 purple, 4,898 white), together with high quality rankings from skilled tasters.

We chosen it for a number of causes. It is manufacturing knowledge, not one thing generated artificially. The distributions are skewed (6 of 11 options have skewness ( > 1 )), so the information don’t meet textbook assumptions. And the standard rankings allow us to test if the detected “outliers” present up extra amongst wines with uncommon rankings.

Under are the 5 strategies we examined:

 
Outlier Detection Methods
 

Discovering the First Shock: Inflated Outcomes From A number of Testing

 
Earlier than we may evaluate strategies, we hit a wall. With 11 options, the naive strategy (flagging a pattern based mostly on an excessive worth in at the least one function) produced extraordinarily inflated outcomes.

IQR flagged about 23% of wines as outliers. Z-Rating flagged about 26%.

When practically 1 in 4 wines get flagged as outliers, one thing is off. Actual datasets don’t have 25% outliers. The issue was that we had been testing 11 options independently, and that inflates the outcomes.

The maths is easy. If every function has lower than a 5% chance of getting a “random” excessive worth, then with 11 unbiased options:
[ P(text{at least one extreme}) = 1 – (0.95)^{11} approx 43% ]

In plain phrases: even when each function is completely regular, you’d anticipate practically half your samples to have at the least one excessive worth someplace simply by random likelihood.

To repair this, we modified the requirement: flag a pattern solely when at the least 2 options are concurrently excessive.

 
Outlier Detection Methods
 
Altering min_features from 1 to 2 modified the definition from “any function of the pattern is excessive” to “the pattern is excessive throughout multiple function.”

Here is the repair in code:

# Rely excessive options per pattern
outlier_counts = (np.abs(z_scores) > 3.5).sum(axis=1)
outliers = outlier_counts >= 2

 

Evaluating 5 Strategies on 1 Dataset

 
As soon as the multiple-testing repair was in place, we counted what number of samples every methodology flagged:

 
Outlier Detection Methods
 
Here is how we arrange the ML strategies:

from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
 
iforest = IsolationForest(contamination=0.05, random_state=42)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)

 

Why do the ML strategies all present precisely 5%? Due to the contamination parameter. It requires them to flag precisely that proportion. It is a quota, not a threshold. In different phrases, Isolation Forest will flag 5% no matter whether or not your knowledge accommodates 1% true outliers or 20%.

 

Discovering the Actual Distinction: They Establish Totally different Issues

 
Here is what shocked us most. After we examined how a lot the strategies agreed, the Jaccard similarity ranged from 0.10 to 0.30. That is poor settlement.

Out of 6,497 wines:

  • Solely 32 samples (0.5%) had been flagged by all 4 main strategies
  • 143 samples (2.2%) had been flagged by 3+ strategies
  • The remaining “outliers” had been flagged by just one or 2 strategies

You may suppose it is a bug, nevertheless it’s the purpose. Every methodology has its personal definition of “uncommon”:

 
Outlier Detection Methods
 
If a wine has residual sugar ranges considerably larger than common, it is a univariate outlier (Z-Rating/IQR will catch it). But when it is surrounded by different wines with comparable sugar ranges, LOF will not flag it. It is regular throughout the native context.

So the true query is not “which methodology is finest?” It is “what sort of uncommon am I looking for?”

 

Checking Sanity: Do Outliers Correlate With Wine High quality?

 
The dataset consists of skilled high quality rankings (3-9). We needed to know: do detected outliers seem extra often amongst wines with excessive high quality rankings?

 
Outlier Detection Methods
 
Excessive-quality wines had been twice as prone to be consensus outliers. That is a very good sanity test. In some circumstances, the connection is evident: a wine with method an excessive amount of risky acidity tastes vinegary, will get rated poorly, and will get flagged as an outlier. The chemistry drives each outcomes. However we won’t assume this explains each case. There may be patterns we’re not seeing, or confounding elements we’ve not accounted for.

 

Making Three Choices That Formed Our Outcomes

 
Outlier Detection Methods
 

// 1. Utilizing Sturdy Z-Rating Relatively Than Customary Z-Rating

A Customary Z-Rating makes use of the imply and commonplace deviation of the information, each of that are affected by the outliers current in our dataset. A Sturdy Z-Rating as an alternative makes use of the median and Median Absolute Deviation (MAD), neither of which is affected by outliers.

Because of this, the Customary Z-Rating recognized 0.8% of the information as outliers, whereas the Sturdy Z-Rating recognized 3.5%.

# Sturdy Z-Rating utilizing median and MAD
median = np.median(knowledge, axis=0)
mad = np.median(np.abs(knowledge - median), axis=0)
robust_z = 0.6745 * (knowledge - median) / mad

 

// 2. Scaling Crimson And White Wines Individually

Crimson and white wines have totally different baseline ranges of chemical compounds. For instance, when combining purple and white wines right into a single dataset, a purple wine that has completely common chemistry relative to different purple wines could also be recognized as an outlier based mostly solely on its sulfur content material in comparison with the mixed imply of purple and white wines. Subsequently, we scaled every wine kind individually utilizing the median and Interquartile Vary (IQR) of every wine kind, after which mixed the 2.

# Scale every wine kind individually
from sklearn.preprocessing import RobustScaler
scaled_parts = []
for wine_type in ['red', 'white']:
    subset = df[df['type'] == wine_type][features]
    scaled_parts.append(RobustScaler().fit_transform(subset))

 

// 3. Figuring out When To Exclude A Technique

Elliptic Envelope assumes your knowledge follows a multivariate regular distribution. Ours did not. Six of 11 options had skewness above 1, and one function hit 5.4. We saved the Elliptic Envelope within the comparability for completeness, however left it out of the consensus vote.

 

Figuring out Which Technique Performs Finest For This Wine Dataset

 

Outlier Detection Methods
Picture by Creator

 

Can we decide a “winner” given the traits of our knowledge (heavy skewness, blended inhabitants, no recognized floor reality)?

Sturdy Z-Rating, IQR, Isolation Forest, and LOF all deal with skewed knowledge fairly nicely. If compelled to choose one, we would go along with Isolation Forest: no distribution assumptions, considers all options directly, and offers with blended populations gracefully.

However no single methodology does every thing:

  • Isolation Forest can miss outliers which are solely excessive on one function (Z-Rating/IQR catches these)
  • Z-Rating/IQR can miss outliers which are uncommon throughout a number of options (multidimensional outliers)

The higher strategy: use a number of strategies and belief the consensus. The 143 wines flagged by 3 or extra strategies are much more dependable than something flagged by a single methodology alone.

Here is how we calculated consensus:

# Rely what number of strategies flagged every pattern
consensus = zscore_out + iqr_out + iforest_out + lof_out
high_confidence = df[consensus >= 3]  # Recognized by 3+ strategies

 

With out floor reality (as in most real-world initiatives), methodology settlement is the closest measure of confidence.

 

Understanding What All This Means For Your Personal Tasks

 
Outline your downside earlier than choosing your methodology. What sort of “uncommon” are you truly searching for? Information entry errors look totally different from measurement anomalies, and each look totally different from real uncommon circumstances. The kind of downside factors to totally different strategies.

Test your assumptions. In case your knowledge is closely skewed, the Customary Z-Rating and Elliptic Envelope will steer you fallacious. Take a look at your distributions earlier than committing to a way.

Use a number of strategies. Samples flagged by three or extra strategies with totally different definitions of “outlier” are extra reliable than samples flagged by only one.

Do not assume all outliers ought to be eliminated. An outlier may very well be an error. It may be your most fascinating knowledge level. Area data makes that decision, not algorithms.

 

Concluding Remarks

 
The purpose right here is not that outlier detection is damaged. It is that “outlier” means various things relying on who’s asking. Z-Rating and IQR catch values which are excessive on a single dimension. Isolation Forest and LOF discover samples that stand out of their total sample. Elliptic Envelope works nicely when your knowledge is definitely Gaussian (ours wasn’t).

Work out what you are actually searching for earlier than you decide a way. And in the event you’re undecided? Run a number of strategies and go along with the consensus.

 

FAQs

 

// 1. Figuring out Which Method I Ought to Begin With

place to start is with the Isolation Forest method. It doesn’t assume how your knowledge is distributed and makes use of all your options on the similar time. Nonetheless, if you wish to determine excessive values for a selected measurement (comparable to very hypertension readings), then Z-Rating or IQR could also be extra appropriate for that.

 

// 2. Selecting a Contamination Price For Scikit-learn Strategies

It relies on the issue you are attempting to unravel. A generally used worth is 5% (or 0.05). However needless to say contamination is a quota. Because of this 5% of your samples can be labeled as outliers, no matter whether or not there truly are 1% or 20% true outliers in your knowledge. Use a contamination fee based mostly in your data of the proportion of outliers in your knowledge.

 

// 3. Eradicating Outliers Earlier than Splitting Practice/take a look at Information

No. You must match an outlier-detection mannequin to your coaching dataset, after which apply the educated mannequin to your testing dataset. When you do in any other case, your take a look at knowledge is influencing your preprocessing, which introduces leakage.

 

// 4. Dealing with Categorical Options

The strategies lined right here work on numerical knowledge. There are three doable options for categorical options:

  • encode your categorical variables and proceed;
  • use a method designed for mixed-type knowledge (e.g. HBOS);
  • run outlier detection on numeric columns individually and use frequency-based strategies for categorical ones.

 

// 5. Figuring out If A Flagged Outlier Is An Error Or Simply Uncommon

You can not decide from the algorithm alone when an recognized outlier represents an error versus when it’s merely uncommon. It flags what’s uncommon, not what’s fallacious. For instance, a wine that has an especially excessive residual sugar content material may be a knowledge entry error, or it may be a dessert wine that’s meant to be that candy. Finally, solely your area experience can present a solution. When you’re not sure, mark it for overview slightly than eradicating it mechanically.
 
 

Nate Rosidi is a knowledge scientist and in product technique. He is additionally an adjunct professor instructing analytics, and is the founding father of StrataScratch, a platform serving to knowledge scientists put together for his or her interviews with actual interview questions from prime corporations. Nate writes on the newest traits within the profession market, offers interview recommendation, shares knowledge science initiatives, and covers every thing SQL.



Related Articles

Latest Articles