When your model says “this customer’s probability of leaving is 0.4” and your code does predict(X) >= 0.5, you have just made a pricing decision: you decided that the cost of sending a retention offer to a customer who would have stayed is exactly equal to the cost of losing one who would have left. On the IBM Telco dataset (arguably the most-recycled churn dataset on Kaggle and GitHub), that call is wrong by a factor of 13.
I assembled a corpus of 36 publicly available IBM Telco churn analyses (Kaggle notebooks, GitHub repositories, blog posts, peer-reviewed papers), and the reporting pattern is striking: roughly 9 in 10 report classification accuracy or F1, just over one in seven report a profit curve, and none use survival analysis to compute lifetime value.
The result is a literature in which the same dataset has been re-modelled hundreds of times, and every default-threshold model leaves money on the table: about $86 per customer in avoidable burn on the standard 20% test split. Scaled to a 100,000-subscriber book with the same churn profile, that would represent $8.6 million in recoverable cost. The IBM Telco churn rate (about 26.6% annual) is unusually high, and a healthier B2C SaaS book with 5–8% annual churn would see the per-customer figure drop by roughly 3–4×, so what stays invariant across any cost-sensitive setting is not the headline dollar amount but the asymmetry: it is 13× more expensive to miss a churner than to over-treat a loyalist.
This piece lays out three things, in order: first, what the IBM Telco literature reports and what it leaves out; second, how to compute the dollar cost of a misclassification using public 2026 B2C SaaS benchmarks and Kaplan-Meier survival analysis, with no hand-waved CAC; third, why the textbook Bayes-optimal threshold formula loses to a brute-force sweep when the model is trained on SMOTE-balanced data, and what to do about it.
Every number in this article is reproducible from the scripts linked at the end.
1. The 36-article gap
The IBM Telco Customer Churn dataset is small (7,032 cleaned rows), tidy, labelled, and has been the canonical introductory churn dataset on Kaggle for nearly a decade.
To get a feel for what the public corpus actually measures, I indexed 36 analyses across Kaggle, GitHub, and the major data-science blogs, scoring each one on ten reporting dimensions ranging from F1 score to a CAC-and-LTV-grounded profit curve.
The pattern that emerged is in Figure 2.

Three findings worth pulling out:
- Saturated: F1, accuracy, AUC, confusion-matrix screenshots, and SMOTE-vs-no-SMOTE comparisons appear in 80–90% of the corpus, with hyperparameter tuning via Optuna or grid search a near-universal trope.
- Uncommon: a profit curve (total dollar cost of misclassifications as a function of decision threshold) appears in fewer than 15% of the analyses I reviewed, and when it does appear, the FN/FP cost numbers are usually picked from a textbook example without anchoring them to real CAC or LTV.
- Absent: none of the 36 analyses I indexed computes customer lifetime value via survival analysis on tenure; most either skip LTV entirely or use the steady-state Skok formula LTV = ARPU / monthly_churn_rate, which assumes a homogeneous customer base, a strong claim for a dataset where contract type, payment method, and tenure all materially shift retention.
Skipping survival analysis matters because the threshold decision is a function of LTV: if you misjudge LTV by 2x you misjudge the cost of a missed churner by 2x, and the cost-optimal threshold moves with it.
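To see how far the homogeneity assumption can move LTV, here is a minimal sketch that applies the steady-state formula to a blended churn rate and then to two segments separately. The 1% and 5% monthly churn rates and the 50/50 customer mix are made-up illustrative values, not estimates from the Telco data; only the $64.80 ARPU comes from this article, and the function name is hypothetical.

```python
# Steady-state LTV = ARPU / monthly_churn_rate, applied two ways.
# The churn rates and the 50/50 mix are hypothetical, chosen only to illustrate.
ARPU = 64.80  # mean MonthlyCharges quoted in the article


def steady_state_ltv(arpu: float, monthly_churn_rate: float) -> float:
    """Steady-state lifetime value: monthly revenue times expected lifetime (1 / churn)."""
    return arpu / monthly_churn_rate


segments = {"contract_locked": 0.01, "month_to_month": 0.05}  # assumed monthly churn
weights = {"contract_locked": 0.5, "month_to_month": 0.5}     # assumed customer mix

# Option 1: average the churn rates first, then apply the formula once.
blended_rate = sum(weights[s] * segments[s] for s in segments)
ltv_blended = steady_state_ltv(ARPU, blended_rate)

# Option 2: apply the formula per segment, then average the LTVs.
ltv_segmented = sum(weights[s] * steady_state_ltv(ARPU, segments[s]) for s in segments)

print(f"blended churn {blended_rate:.2%}: LTV = ${ltv_blended:,.0f}")
print(f"segment-weighted LTV = ${ltv_segmented:,.0f}")
print(f"ratio = {ltv_segmented / ltv_blended:.2f}x")
```

Because 1/x is convex, averaging the churn rate first can only understate the average of the segment LTVs, so the blended figure is the smaller of the two whenever the segments differ.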
The next two sections build the missing piece, then plug it back into the threshold problem.
2. The cost of an error, in dollars
Three numbers determine the dollar cost of every prediction (ARPU, gross margin, and CAC); two come straight out of the dataset, one from public 2026 industry benchmarks.
import pandas as pd

df = pd.read_csv("telco.csv")
# TotalCharges contains blank strings; coerce them to NaN so dropna() removes those rows
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df = df.dropna().reset_index(drop=True)
arpu = df["MonthlyCharges"].mean()  # $64.80
mean_tenure = df["tenure"].mean()  # 32.42 months
churn_rate = (df["Churn"] == "Yes").mean()  # 26.58%
realised_ltv_churned = df.loc[df["Churn"] == "Yes", "TotalCharges"].mean()  # $1,531.80
ARPU and tenure are dataset-native: mean monthly charge is $64.80, mean observed tenure is 32.4 months, and the realised LTV of churners is $1,531.80 (just the average TotalCharges over customers who already left), so holding ARPU fixed, three common LTV framings give very different ceilings:

None of these three is the right answer, and the first one is actively wrong. ARPU × mean_tenure is the framing most tutorials reach for, and it is broken at the root. ARPU is not a property of a customer; it is the joint outcome of every feature that also drives churn: contract type, payment method, product bundle, household structure. The revenue leaves with the customer, so ARPU and tenure are not independent quantities, and the textbook decomposition LTV ≈ E[ARPU] × E[lifetime] is only valid when Cov(ARPU, lifetime) = 0. In any churn dataset worth modelling that covariance is non-zero (if it were zero, the model would have nothing to predict), and a customer with $80/month MonthlyCharges and 8 months of tenure is not a “high-revenue customer”; their $80 and their 8 months are two readings of the same underlying risk profile.
Layer on the fact that ARPU varies within a single customer’s lifetime (a promo-onboarding cohort paying $30/month for the first three months and $80/month thereafter, a long-tenured loyalty customer grandfathered into a $50/month rate while new customers pay $90/month for the same product), and multiplying the mean of one distribution by the mean of another describes a customer who exists nowhere in the data.
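The covariance point can be checked numerically. The sketch below uses synthetic data (not the Telco file) in which one hypothetical latent risk score drives both the monthly charge and the tenure, then compares the mean of per-customer revenue with the product of the two means. The distributions and coefficients are arbitrary assumptions picked only to make the two quantities correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical latent risk in [0, 1]: riskier customers pay more per month and leave sooner.
risk = rng.uniform(0.0, 1.0, size=n)
monthly_charge = 40.0 + 50.0 * risk + rng.normal(0.0, 5.0, size=n)
tenure_months = np.maximum(1.0, 60.0 * (1.0 - risk) + rng.normal(0.0, 6.0, size=n))

# Simplification: each customer pays a flat charge for their whole tenure.
per_customer_revenue = monthly_charge * tenure_months

mean_of_product = per_customer_revenue.mean()
product_of_means = monthly_charge.mean() * tenure_months.mean()
covariance = np.cov(monthly_charge, tenure_months, ddof=0)[0, 1]

# Identity: E[XY] = E[X] * E[Y] + Cov(X, Y), so the naive product is off by the covariance.
print(f"mean of product  : {mean_of_product:,.2f}")
print(f"product of means : {product_of_means:,.2f}")
print(f"covariance       : {covariance:,.2f}")
```

In this synthetic setup the covariance is negative, so the product of means overstates average revenue; with a positive covariance it would understate it. Either way the gap is exactly the covariance term the decomposition ignores.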
The Skok formula at least uses a steady-state churn rate, but it assumes that rate is constant forever. The realised LTV is a real number, but it only describes the customers you have already lost.
None of the three tells you when a new customer breaks even on acquisition cost; for that you need a real retention curve, which is the next section, but first a word on CAC.
For B2C SaaS in the 2026 benchmarks, CAC ranges from about $68 (eCommerce) to over $200 (fintech), with mid-market subscription products clustering around $150 ([1], [2]); telecom subscriber acquisition cost is materially higher ($300+ once handset subsidies are amortised), so $150 is a conservative anchor for this dataset, and choosing a higher number would only make the burn calculation in Section 4 larger.

Gross margin for B2C SaaS sits in the 70–85% band, with 75% as the typical midpoint that matches David Skok’s modelling assumptions for steady-state SaaS economics ([3]).
That gives us the building blocks for the cost of a single prediction error.
CAC = 150
ARPU = 64.80
REMAINING_TENURE_MONTHS = 18
FN_COST = CAC + ARPU * REMAINING_TENURE_MONTHS # $1,316.40
FP_COST = 100 # typical campaign cost (midpoint)
ratio = FN_COST / FP_COST # 13.2 : 1
A false negative (predicting that a customer will stay when they actually leave) costs you the new acquisition spend ($150) plus 18 months of foregone revenue ($64.80 × 18 = $1,166.40), for $1,316.40 total, while a false positive (flagging a customer as a churn risk when they were going to stay) costs roughly $100 of campaign and discount expense, leaving a cost ratio of 13.2 to 1.
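The ratio depends heavily on the assumed 18 months of remaining tenure. A minimal sketch of that sensitivity, using this article's $150 CAC, $64.80 ARPU, and $100 false-positive cost; the 12- and 24-month alternatives are illustrative values, not figures from the analysis, and the helper names are hypothetical.

```python
CAC = 150.0
ARPU = 64.80
FP_COST = 100.0


def fn_cost(remaining_months: int, cac: float = CAC, arpu: float = ARPU) -> float:
    """Cost of a missed churner: re-acquisition spend plus foregone revenue."""
    return cac + arpu * remaining_months


def cost_ratio(remaining_months: int, fp_cost: float = FP_COST) -> float:
    """How many false positives one false negative is worth."""
    return fn_cost(remaining_months) / fp_cost


for months in (12, 18, 24):  # 18 is the article's value; 12 and 24 are hypothetical
    print(f"{months} months: FN cost = ${fn_cost(months):,.2f}, "
          f"ratio = {cost_ratio(months):.1f} : 1")
```

The asymmetry survives across the whole range; what changes is how extreme it is, which is why the remaining-tenure assumption deserves the same scrutiny as the CAC.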
A note on what the framework is and isn’t. The $150 CAC and the $100 false-positive cost in this article are placeholders; CAC varies materially by acquisition channel, and the $100 is shorthand for whatever your real retention intervention costs: a discount, a CSM call, a bundle upgrade, a product investigation. None of these are interchangeable, and a blanket discount is not a retention strategy: it is a deferral mechanism that retains customers only until the discount expires while paying to retain customers who were never going to leave (and, worse, training them to expect the next discount). Real retention strategy maps churn drivers (a customer leaving because backup_online keeps failing is retained by fixing backup_online, not by a 10% off email) and allocates budget toward product improvement, with the campaign cost as a short-term bridge while engineering catches up. The profit curve here is a threshold-setting tool that operates after you have decided your retention playbook (what intervention applies to whom, at what real cost); it is not a substitute for that decision. Treat the $150 and the $100 as a single representative pair; segment them, and the framework segments with them.
That ratio is the entire reason threshold = 0.5 is the wrong default: the decision boundary should reflect the asymmetry, and we’ll get to the exact formula, but first comes the LTV piece.
3. The LTV profit curve
Most churn writeups treat lifetime value as a static dollar amount you multiply by a hazard rate.
Survival analysis does better.
It measures retention directly from the data and turns LTV into a curve: the cumulative contribution margin per customer as a function of months since acquisition, starting at −CAC on day zero (you’ve paid to acquire the customer and earned nothing) and climbing as each surviving month adds ARPU × gross_margin × P(still alive) to the balance.
The Kaplan-Meier estimator does the heavy lifting, with tenure as the duration and Churn == "Yes" as the event, producing the overall curve in Figure 4.
from lifelines import KaplanMeierFitter
import numpy as np
CAC, GROSS_MARGIN, HORIZON = 150, 0.75, 72
kmf = KaplanMeierFitter().fit(df["tenure"], (df["Churn"] == "Yes").astype(int))
months = np.arange(0, HORIZON + 1)
S = kmf.survival_function_at_times(months).values  # P(still subscribed) at each month
monthly = ARPU * GROSS_MARGIN * S  # expected contribution margin per month
ltv_curve = np.cumsum(monthly) - CAC
ltv_curve[0] = -CAC  # day zero: only CAC is sunk
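For readers who want to see what the estimator computes, here is a minimal NumPy sketch of the Kaplan-Meier product-limit formula: S(t) is the product of (1 − d_i / n_i) over event times t_i ≤ t, where d_i is the number of churn events at t_i and n_i is the number of customers still at risk. It is an illustration on a toy array, not a replacement for lifelines, and the function name is hypothetical.

```python
import numpy as np


def kaplan_meier(durations, events):
    """Return (event_times, survival) for the product-limit estimator.

    durations: observed tenure per customer
    events:    1 if the customer churned at that tenure, 0 if still active (censored)
    """
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(durations[events == 1])
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(durations >= t)                     # n_i: still observed just before t
        churned = np.sum((durations == t) & (events == 1))   # d_i: churn events at t
        s *= 1.0 - churned / at_risk
        survival.append(s)
    return event_times, np.array(survival)


# Toy cohort: four customers, one of them censored at month 2.
times, surv = kaplan_meier([1, 2, 2, 3], [1, 1, 0, 1])
for t, s in zip(times, surv):
    print(f"month {t:.0f}: S(t) = {s:.3f}")
```

Censored customers (still active at their last observed tenure) stay in the at-risk count up to that tenure and then drop out without counting as churn, which is what separates this from a naive churned-by-month fraction.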

Three readings worth pulling out:
- Breakeven at month 3 (in expectation, not per customer): across the original acquired cohort, the survival-weighted cumulative contribution covers the $150 acquisition cost by month 3, and CAC payback under twelve months is the David Skok rule of thumb that Telco beats by a factor of 4. This is the right number for budgeting cohort-level retention spend, but it is a cohort average that hides bimodal variance: a customer who churns in month 1 individually contributes one month of margin ($48.60) and never recoups their CAC, while a 70-month survivor contributes well over $3,400. The Kaplan-Meier weighting bakes those early losses in correctly; they just don’t get a star on the curve.
- LTV at the 72-month horizon ≈ $2,527 per customer: combined with the $150 CAC, that’s an LTV:CAC ratio of about 17.8:1, well above the 3:1 floor most SaaS investors look for, and a useful sanity check that the dataset describes a healthy unit-economics business rather than a death-spiral one.
- Churn-reduction uplift is modest at the cohort level: a 10% reduction in churn lifts terminal LTV by ~2.8% and a 20% reduction lifts it by ~5.7%, so the lift is real but not heroic, and the acquisition decision matters more than the retention intervention.
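The breakeven reading can be pulled out of any survival curve in a few lines. The sketch below follows the same cumulative-margin convention as the LTV code above, but substitutes a made-up geometric survival curve (constant 2% monthly churn) for the fitted Kaplan-Meier curve, so its outputs are not the Telco figures; the helper names are hypothetical.

```python
import numpy as np

CAC, ARPU, GROSS_MARGIN, HORIZON = 150.0, 64.80, 0.75, 72


def ltv_curve_from_survival(survival, cac=CAC, arpu=ARPU, margin=GROSS_MARGIN):
    """Cumulative contribution margin per acquired customer, net of CAC."""
    monthly = arpu * margin * np.asarray(survival, dtype=float)
    curve = np.cumsum(monthly) - cac
    curve[0] = -cac  # day zero: only CAC is sunk
    return curve


def breakeven_month(curve):
    """First month at which cumulative margin is non-negative, or None if never."""
    above = np.nonzero(curve >= 0)[0]
    return int(above[0]) if above.size else None


months = np.arange(0, HORIZON + 1)
survival = 0.98 ** months  # assumed stand-in: constant 2% monthly churn
curve = ltv_curve_from_survival(survival)

print("breakeven month:", breakeven_month(curve))
print(f"LTV at month {HORIZON}: ${curve[-1]:,.0f}")
```

Swapping in a segment-specific survival array (for example one per contract type) reuses the same two functions unchanged.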
Segmenting the same calculation by Contract (Figure 5) is where the framework earns its keep.

A two-year-contract customer is worth about $3,372 over the 72-month horizon, while a month-to-month customer is worth $1,620 (less than half), with the same ARPU and the same CAC, so the entire delta is retention; from a marketing-spend perspective, the right acquisition target (the customer you should be willing to pay more to acquire) is the contract-locked one, even though they look “less profitable per month” in any given snapshot.
This is the kind of decision the standard IBM Telco analysis cannot make, because it never computes survival-conditional LTV in the first place.
4. The classification profit curve
With FN cost, FP cost, and survival-based LTV in hand, the threshold question becomes a one-dimensional optimization: train a model, get predicted probabilities on the test set, sweep the threshold from 0 to 1, compute total dollar cost at each threshold, and pick the minimum.
The model here is a tuned XGBoost trained with SMOTE on the train fold only, the standard Telco recipe.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier
y = (df["Churn"] == "Yes").astype(int)
X = pd.get_dummies(df.drop(columns=["customerID", "Churn"]),
                   drop_first=True).astype(float)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
# Fit the scaler and SMOTE on the train fold only; the test fold keeps its natural class balance
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)
X_tr_b, y_tr_b = SMOTE(random_state=42).fit_resample(X_tr_s, y_tr)
model = XGBClassifier(n_estimators=400, max_depth=5,
                      learning_rate=0.05,
                      subsample=0.9, colsample_bytree=0.9,
                      random_state=42, eval_metric="logloss")
model.fit(X_tr_b, y_tr_b)
probs = model.predict_proba(X_te_s)[:, 1]
# Sweep the decision threshold and total the dollar cost of errors at each value
thresholds = np.linspace(0.01, 0.99, 99)
totals = []
for t in thresholds:
    pred = (probs >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, pred, labels=[0, 1]).ravel()
    totals.append(fn * 1316.40 + fp * 100)  # FN_COST and FP_COST from Section 2
best_t = thresholds[int(np.argmin(totals))]  # empirical cost-minimising threshold
The result is in Figure 6.

The numbers, on a 1,407-row test set:

Moving from 0.5 to the empirical minimum recovers $121,160 on the test set, which is $86.11 per customer, and applying that to a 100,000-subscriber book gives the headline $8.6M; your mileage will vary with your CAC, your ARPU, and your retention curve, but the multiplier (13× more expensive to miss a churner than to over-treat a loyalist) is what makes the gap large.
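The scaling arithmetic behind the headline figure, spelled out using only the numbers quoted above (the 100,000-subscriber book is the hypothetical one from the introduction):

```python
test_set_saving = 121_160   # dollars recovered on the test set (quoted in the article)
test_set_rows = 1_407       # customers in the 20% test split
book_size = 100_000         # hypothetical subscriber book with the same churn profile

saving_per_customer = test_set_saving / test_set_rows
book_saving = saving_per_customer * book_size

print(f"per customer: ${saving_per_customer:.2f}")
print(f"per {book_size:,} subscribers: ${book_saving / 1e6:.1f}M")
```

The extrapolation is linear, so it inherits every assumption in the test-set figure: the same churn rate, the same cost pair, and the same model quality.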
When the textbook formula loses to the threshold sweep
Open any cost-sensitive classification reference (Provost and Fawcett’s Data Science for Business is the canonical one [4]) and you will find the Bayes-optimal threshold formula:
t* = C_FP / (C_FP + C_FN)
Plug in our cost ratio: t* = 100 / (100 + 1316.40) ≈ 0.0706, which is correct math, and on a model with calibrated probabilities that threshold minimizes expected cost; but the sweep gives t = 0.03, and at that threshold the test-set cost is $8,661 lower than at 0.07, so where does the gap come from?
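The formula follows from comparing expected costs under the cost model used here, where only errors cost money: with a calibrated churn probability p, treating the customer risks (1 − p) · C_FP and not treating risks p · C_FN, so treating is cheaper once p ≥ C_FP / (C_FP + C_FN). A small sketch with this article's costs; the function names are hypothetical.

```python
def bayes_optimal_threshold(c_fp: float, c_fn: float) -> float:
    """Probability above which acting is cheaper in expectation (assumes calibrated p)."""
    return c_fp / (c_fp + c_fn)


def expected_costs(p: float, c_fp: float, c_fn: float) -> tuple[float, float]:
    """Expected cost of (treating, not treating) a customer with churn probability p."""
    return (1.0 - p) * c_fp, p * c_fn


C_FP, C_FN = 100.0, 1316.40
t_star = bayes_optimal_threshold(C_FP, C_FN)
print(f"t* = {t_star:.4f}")

# At the threshold the two expected costs coincide; above it, treating is cheaper.
treat, ignore = expected_costs(t_star, C_FP, C_FN)
print(f"at t*: treat = ${treat:.2f}, ignore = ${ignore:.2f}")
```

The derivation only holds if p is a real probability, which is exactly the assumption the rest of this section examines.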
The Bayes-optimal formula assumes the model’s predicted probabilities are calibrated: a prediction of 0.5 should correspond to a 50% true churn probability. Our model, however, is trained on a SMOTE-balanced set, which inflates the minority class to 50% during training, and tree-based learners then output probabilities biased toward higher values, with the model’s “0.07” mapping to roughly the true 0.03 in calibrated probability space. The textbook formula isn’t wrong; it’s being applied to an out-of-spec input.
There are two clean fixes:
- Calibrate the probabilities first: apply Platt scaling or isotonic regression on a held-out set, then use the Bayes-optimal threshold on the calibrated output, with scikit-learn’s CalibratedClassifierCV doing it in a single line.
- Skip calibration and sweep: it’s cheap, it tolerates calibration drift, and on a small dataset like Telco the test-set sweep is more reliable than a calibration model fit on a few hundred held-out rows.
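A minimal sketch of the first option, on synthetic data rather than the Telco file. It uses scikit-learn's GradientBoostingClassifier as a stand-in for the XGBoost model and make_classification to generate an imbalanced toy problem; both are assumptions for illustration, and the resulting numbers say nothing about the Telco results. Note that the calibrator is fitted on data with the natural class balance, not on a resampled fold.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

C_FP, C_FN = 100.0, 1316.40
t_star = C_FP / (C_FP + C_FN)  # Bayes-optimal threshold, valid for calibrated probabilities

# Synthetic imbalanced problem (roughly 27% positives), a stand-in for the churn data.
X, y = make_classification(n_samples=3000, n_features=10, weights=[0.73, 0.27],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                          random_state=0)

# Isotonic calibration fitted by internal 5-fold cross-validation on the train fold.
base = GradientBoostingClassifier(n_estimators=50, random_state=0)
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5)
calibrated.fit(X_tr, y_tr)

# Apply the formula threshold to the calibrated probabilities and cost the errors.
p_cal = calibrated.predict_proba(X_te)[:, 1]
flagged = p_cal >= t_star
fn = int(np.sum(~flagged & (y_te == 1)))
fp = int(np.sum(flagged & (y_te == 0)))
print(f"threshold {t_star:.4f}: FN = {fn}, FP = {fp}, "
      f"cost = ${fn * C_FN + fp * C_FP:,.0f}")
```

Running the threshold sweep on the same calibrated probabilities is a quick cross-check: if the two thresholds still disagree badly, the calibration step has not done its job.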
In practice, for production systems with regular re-training, the sweep is what most teams ship; the formula is the right thing to teach (with calibration as the asterisk), and both should appear in any honest writeup of a cost-sensitive churn model.
Neither one shows up in the IBM Telco corpus I indexed.
5. What the next IBM Telco article should report
Four concrete shifts would make the next 36 IBM Telco analyses more useful than the last 36:
Report a profit curve, not a confusion matrix. F1 at threshold 0.5 is a tournament metric (useful for ranking models when you have to pick one, useless for deciding how to ship one), and the curve in Figure 6 has more decision-relevant information than every accuracy comparison in the corpus combined.
Anchor LTV in survival analysis, not steady-state assumptions. Kaplan-Meier on tenure is 30 lines of Python; the breakeven number, the LTV at horizon, and the contract-segmented curve give marketing operations a usable budget to spend on retention plus a defensible answer to “which customer should we acquire harder?”, and the Skok formula remains a fine sanity check rather than the load-bearing LTV estimate.
Disclose the calibration assumption when you quote a Bayes-optimal threshold. Either calibrate first or note explicitly that the threshold reported is the empirical minimum from a sweep, and Wang et al. ([5]) make a closely related argument with a more elaborate metric (e-Profit) that uses survival analysis end-to-end, with the same core idea.
Segment the intervention, not just the score. A profit curve assumes a single FP cost (the price of “treating” one false alarm), but in practice the cheapest intervention for a high-value loyalist is different from the cheapest for an at-risk new account: a bundle upgrade for one, a price-pain investigation for another, do-nothing for a third; segment-aware FP costs (and segment-aware thresholds) are the natural follow-up to the framework here.
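One way to make the segmentation point concrete is to give each segment its own cost pair and derive a threshold per segment. In the sketch below the segment names and dollar figures are invented placeholders, and the thresholds assume calibrated churn probabilities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentCosts:
    fp_cost: float  # cost of intervening on a customer who would have stayed
    fn_cost: float  # cost of missing a customer who leaves

    @property
    def threshold(self) -> float:
        """Cost-minimising threshold for calibrated churn probabilities."""
        return self.fp_cost / (self.fp_cost + self.fn_cost)


# Hypothetical segments and costs, for illustration only.
segments = {
    "high_value_loyalist": SegmentCosts(fp_cost=40.0, fn_cost=3000.0),
    "at_risk_new_account": SegmentCosts(fp_cost=150.0, fn_cost=900.0),
}


def should_intervene(segment: str, churn_probability: float) -> bool:
    """Flag the customer when the probability clears the segment's own threshold."""
    return churn_probability >= segments[segment].threshold


for name, costs in segments.items():
    print(f"{name}: threshold = {costs.threshold:.3f}")

# The same 5% churn probability can lead to different decisions in different segments.
print(should_intervene("high_value_loyalist", 0.05),
      should_intervene("at_risk_new_account", 0.05))
```

With uncalibrated scores the same structure still works if each segment's threshold comes from its own sweep instead of the formula.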
I went into this expecting the gap would be in the modeling. It isn’t.
The IBM Telco dataset has been mined to bedrock for predictive accuracy, and what it can still teach is whether our pipelines lead to good decisions, not just accurate predictions.
That requires three things: dollar costs on errors, real retention curves on customers, and an honest threshold on the classifier. Four scripts and a Kaplan-Meier fit get you there.
References
[1] Genesys Growth Marketing, Customer Acquisition Cost Benchmarks: 44 Statistics Every Marketing Leader Should Know in 2026 (2026), genesysgrowth.com.
[2] Proven SaaS, CAC Payback Benchmarks 2026: SaaS Customer Acquisition Cost (2026), proven-saas.com.
[3] D. Skok, SaaS Metrics 2.0: Detailed Definitions (2014, updated 2024), forentrepreneurs.com.
[4] F. Provost and T. Fawcett, Data Science for Business (2013), O’Reilly Media, ch. 7–8.
[5] Y. Wang, S. Albrecht, et al., e-Profit: A Business-Aligned Evaluation Metric for Profit-Sensitive Customer Churn Prediction (2025), arXiv:2507.08860.
[6] C. Davidson-Pilon, lifelines: survival analysis in Python (2019), Journal of Open Source Software 4(40), 1317.
[7] W. Verbeke, T. Verdonck and S. Maldonado, Profit-driven decision trees for churn prediction (2018), European Journal of Operational Research, 284(3).
[8] N. El Attar and M. El-Hajj, A systematic review of customer churn prediction approaches in telecommunications (2026), Frontiers in Artificial Intelligence.
Code, data, and reproducible scripts for every figure are available on request. The dataset is the IBM Telco Customer Churn dataset, fully synthetic sample data published by IBM in its official repository (github.com/IBM/telco-customer-churn-on-icp4d) under the Apache License 2.0, which permits use, derivative analysis, and publication with attribution. The data is synthetic and contains no real customers or PII.
