It began with a one-line message from the finance team on a Tuesday afternoon: A handful of customers had been charged twice that day, and one was disputing a duplicate charge with their bank.
I went straight to the monitoring, expecting to find something broken. Instead, everything looked healthy: By the system’s own records, every order had been paid exactly once. It took the team a month of digging through production incidents to close the gap between a dashboard that said “all good” and a customer billed twice.
I’ve since seen this kind of failure across several payment systems, some handling hundreds of thousands of transactions a day. What follows is a composite and doesn’t describe any single system or organization. The numbers, timings and identifying details have been changed to keep anything proprietary out.
The retry that charged twice
A customer clicked Pay; the order service called the payment service, which called the external provider. The provider charged the card for $200 and recorded a success on its side.
The only thing that went wrong was timing. The provider was under load and took just over 3 seconds to respond. The client gave up after 2 seconds, a default inherited from internal service calls and never tuned for payments. From the caller’s side, the call had simply failed, so nothing was marked as paid.
The retry logic did what it was built to do and sent the request again. The provider saw what looked like a fresh payment and took the money a second time. The database recorded one payment: the retry. The first charge lived only in the provider’s records, invisible to us until the complaints came in.
Violetta Pidvolotska
The duplicates were refunded within hours, before the dispute could become a chargeback. Understanding what had actually failed took much longer.
Later, we widened the timeout well past the provider’s slowest healthy responses, but as long as a retry can trigger a second charge, a longer timeout only makes the double charge rarer.
The real mistake was older than the limit itself. The system had been told that a timeout means failure.
The third state
We tend to think of a network call as having two outcomes: it worked or it didn’t. A timeout is the third. The request may never have arrived. It may have finished its work and lost the response on the way back. That’s what bit us. Or it may still be running. From the caller’s side, you can’t tell which.
Code rarely has a separate path for “unknown.” It gets lumped in with failure and the failure path retries. When the request moves money, that’s how you charge someone twice.
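One way to give “unknown” its own path is to make it a distinct value that the caller has to handle. The sketch below is illustrative only and is not the code from any of these systems; `Outcome`, `ProviderDeclined` and `classify` are hypothetical names, and it assumes the client surfaces a timeout as Python’s built-in `TimeoutError`.

```python
import enum


class Outcome(enum.Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"      # the provider answered and did not charge
    UNKNOWN = "unknown"    # no answer: the charge may or may not have landed


class ProviderDeclined(Exception):
    """Hypothetical error: the provider answered and refused the charge."""


def classify(call):
    """Run `call` and map what happened onto three outcomes instead of two."""
    try:
        call()
        return Outcome.SUCCEEDED
    except ProviderDeclined:
        return Outcome.FAILED
    except TimeoutError:
        # No answer is not a "no". A blind retry from here can charge twice.
        return Outcome.UNKNOWN
```

A caller that receives `Outcome.UNKNOWN` has to decide what to do next explicitly, rather than falling into the retry path by default.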
A slow service shows up as rising response times, an unreliable one as errors. A double charge shows up as a success, and nobody noticed until a customer did.
Timeouts, which I’ve written about before, turn silent hangs into visible failures. And visible failures get retried, which is how we arrived at idempotency.
“Exactly-once” gets used as if it were a setting you could turn on. You cannot promise exactly-once delivery across an unreliable network, as Tyler Treat explains. What you can promise is exactly-once effects: The request may arrive twice, while the charge happens once.
My first instinct was to stop retrying payments automatically and it helped. But not every retry is ours to switch off: The customer refreshes the page, or a retry policy somewhere in the infrastructure resends on its own.
The assumptions beneath the key
The standard remedy is an idempotency key: The caller attaches a unique value to one attempt at an operation and sends the same value on every retry. A new key gets processed and its result stored; a familiar one gets the stored result back, so the retry has no further effect. Brandur Leach’s walkthrough of Stripe-like idempotency keys in Postgres lays the pattern out end to end.
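In its simplest form, the pattern looks something like the sketch below. It is a deliberately naive, single-threaded illustration with an in-memory dictionary; `IdempotentCharger` and `fake_provider` are hypothetical names, and a real store would have to survive restarts and concurrent callers.

```python
import uuid


class IdempotentCharger:
    """In-memory illustration only; not safe for concurrent callers."""

    def __init__(self, charge_fn):
        self._charge_fn = charge_fn   # hypothetical call out to the provider
        self._results = {}            # idempotency key -> stored result

    def charge(self, key, amount_cents):
        if key in self._results:
            return self._results[key]  # familiar key: replay, no new charge
        result = self._charge_fn(amount_cents)
        self._results[key] = result    # new key: process and store the result
        return result


provider_calls = []


def fake_provider(amount_cents):
    provider_calls.append(amount_cents)
    return {"status": "succeeded", "amount_cents": amount_cents}


charger = IdempotentCharger(fake_provider)
key = str(uuid.uuid4())                # one key per attempt at an operation
first = charger.charge(key, 20_000)
retry = charger.charge(key, 20_000)    # the retry sends the same key
```

After both calls, `provider_calls` holds a single entry and `retry` is the stored result of the first call.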
The key was shipped and the duplicates stopped. But we relaxed too early. The key turned out to be the easy part.
A key like this rests on four assumptions. I’ve since turned them into a checklist I call the four-assumptions test:
- Claim. Claiming a key is just a matter of checking it’s free first.
- Intent. The identical key at all times carries the identical intent.
- Memory. Whatever a key remembers is safe to replay.
- Boundary. Nothing behind the key lies beyond your control.
Over the following month, all four broke: The race in a load test, the other three in production.
Two requests, same millisecond
In a load test, two requests with the same key arrived in the same millisecond. Each checked for the key; neither found it and both started processing.

“Check whether the key exists, then write it” is a race like any other and it broke the claim assumption. We fixed it by flipping the order: now writing the key is the check. Every request writes it as “started” and the database lets only one claim win. The safeguard:
-- Try to claim the key; the UNIQUE index lets only one caller win.
INSERT INTO operations (idempotency_key, state) VALUES (:key, 'started')
ON CONFLICT (idempotency_key) DO NOTHING;
The insert touches one row or none, and that count tells you which path you’re on. One row means you won: Call the provider, then mark the row ‘completed’ and save the response. None means you lost: Read the row and return its saved response, or tell the caller to retry later if it is still ‘started’.
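The two paths can be sketched end to end with Python’s built-in `sqlite3` module, which accepts the same `ON CONFLICT ... DO NOTHING` clause (SQLite 3.24 or later). This is a minimal sketch under stated assumptions, not the production code: the `response` column, the `claim` and `handle` functions and the `"retry-later"` return value are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE operations ("
    " idempotency_key TEXT PRIMARY KEY,"   # uniqueness is what settles the race
    " state TEXT NOT NULL,"
    " response TEXT)"
)


def claim(conn, key):
    """Return True if this caller won the claim for `key`."""
    cur = conn.execute(
        "INSERT INTO operations (idempotency_key, state) VALUES (?, 'started') "
        "ON CONFLICT (idempotency_key) DO NOTHING",
        (key,),
    )
    conn.commit()              # the claim is durable before any provider call
    return cur.rowcount == 1   # one row inserted: won; zero rows: lost


def handle(conn, key, call_provider):
    if claim(conn, key):
        response = call_provider()
        conn.execute(
            "UPDATE operations SET state = 'completed', response = ? "
            "WHERE idempotency_key = ?",
            (response, key),
        )
        conn.commit()
        return response
    state, response = conn.execute(
        "SELECT state, response FROM operations WHERE idempotency_key = ?",
        (key,),
    ).fetchone()
    # Replay a finished result; otherwise the winner is still in flight.
    return response if state == "completed" else "retry-later"
```

Calling `handle` twice with the same key invokes `call_provider` once; the second call reads the stored response instead.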
One detail is easy to get wrong: Commit the claim before the provider call goes out. Otherwise, a crash rolls it back and erases the only record that a charge may be in flight.
The harder case is a winning request that crashes mid-charge: Its key is stuck at “started,” and every retry is told to wait for an answer that will never come. A stuck claim is the same unknown all over again: Once it has sat in “started” longer than any healthy call could take, ask the provider what actually happened before anyone charges again.
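The decision for a retry that finds the key in “started” might be sketched as follows. Everything here is an assumption made for illustration: the 30-second threshold, the `lookup_charge` callable standing in for a provider status query, and the choice to reopen the claim when the provider reports no charge.

```python
STUCK_AFTER_SECONDS = 30.0  # assumed: well past the slowest healthy provider call


def resolve_started(claimed_at, now, lookup_charge):
    """Decide what to tell a retry that found the key in 'started'.

    `lookup_charge` is a hypothetical provider query that returns the
    charge if one exists for this operation, or None if it does not.
    """
    if now - claimed_at < STUCK_AFTER_SECONDS:
        return ("retry-later", None)   # the winner may still be working
    charge = lookup_charge()           # ask before anyone charges again
    if charge is not None:
        return ("completed", charge)   # the charge landed: record and replay it
    return ("claimable", None)         # assumed policy: nothing landed, reopen
```

The provider is consulted only once the claim is older than the threshold; before that, the retry is simply told to come back later.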
Same key, different request
The second gap appeared a week into production and broke the intent assumption: A caller reused one key for two different requests, $200 and $500, and the system returned the first request’s saved response without noticing the amount had changed.

We fixed it by storing a fingerprint of the request’s contents next to the key, on the same insert, so a request that loses the claim race can still compare its fingerprint against the winner’s. If the fingerprints match, it’s a genuine retry. If they don’t, the key was reused for a different operation and we reject it.
That fix promptly rejected a valid retry. We were fingerprinting the entire request, including a timestamp that changed between attempts and fields that arrived in a different order, so the fingerprints didn’t match.
A fingerprint has to capture what a request means rather than how its bytes are arranged. Hash a hand-picked list of business fields and you risk a silent collision: The one field nobody remembered to add lets two different requests match. Hash the whole request minus known noise like timestamps and the failure is loud instead: A missed volatile field rejects a valid retry. We chose loud, and the fix came down to two lines:
intent = drop_fields(request.json, volatile={"client_ts", "trace_id"})  # strip known noise only
fingerprint = sha256(canonical_json(intent))  # canonical form: keys sorted, numbers and spacing normalized
Even “canonical” hides decisions. RFC 8785 pins them down, but it runs every number through an IEEE 754 double, which loses precision on large values, so money amounts are safer as strings or integer cents. Change the canonical form and every stored fingerprint stops matching, so we version it and store the version next to the fingerprint.
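A runnable version of those two lines could look like the sketch below. It is an approximation under stated assumptions: `canonical_json` here relies on `json.dumps` with sorted keys and fixed separators, which is simpler than RFC 8785 and not a substitute for it; `drop_fields` strips top-level fields only; and the `v1:` prefix is just one hypothetical way to keep the version next to the fingerprint.

```python
import hashlib
import json

FINGERPRINT_VERSION = 1                      # bump when the canonical form changes
VOLATILE_FIELDS = {"client_ts", "trace_id"}  # known noise only


def drop_fields(payload, volatile):
    """Remove known-volatile top-level fields; everything else stays in."""
    return {k: v for k, v in payload.items() if k not in volatile}


def canonical_json(payload):
    # Sorted keys and fixed separators make field order and spacing irrelevant.
    return json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def fingerprint(payload):
    intent = drop_fields(payload, VOLATILE_FIELDS)
    digest = hashlib.sha256(canonical_json(intent).encode("utf-8")).hexdigest()
    return f"v{FINGERPRINT_VERSION}:{digest}"


# Amounts are integer cents, so no floating-point rounding is involved.
first = {"amount_cents": 20000, "currency": "USD", "client_ts": "10:00:00"}
retry = {"client_ts": "10:00:03", "currency": "USD", "amount_cents": 20000}
reused = {"amount_cents": 50000, "currency": "USD", "client_ts": "10:00:05"}
```

Here `fingerprint(first)` equals `fingerprint(retry)` despite the new timestamp and field order, while `fingerprint(reused)` differs because the amount changed.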
The error we cached
The third gap came in through support: A customer hit an insufficient-funds decline, added money, tried again with the same key and got the old “insufficient funds” back. The provider was never asked. The system had been caching every response, declines included, so the failure stuck to the key.

That forced the question behind the memory assumption: What is a key allowed to remember? The rule we landed on: cache only success.
A soft decline or a validation error releases the claim instead: The row flips back to claimable, fingerprint kept. The next attempt reclaims it with an update that only one retry can win, and the customer who adds money gets a live attempt instead of a replay. Hard declines are the exception: A stolen-card response is final and that claim stays closed.
On a timeout, we don’t know whether the charge landed, so we ask the provider whether the payment already went through and act on the answer.
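Those rules can be written down as a small state mapping plus a conditional update. The sketch below is an illustration under assumptions, not the original implementation: the `Result` categories, the state names `claimable` and `closed`, and the `reclaim` function are hypothetical, and it uses `sqlite3` only to stay self-contained.

```python
import enum
import sqlite3


class Result(enum.Enum):
    SUCCESS = "success"
    SOFT_DECLINE = "soft_decline"          # e.g. insufficient funds
    VALIDATION_ERROR = "validation_error"
    HARD_DECLINE = "hard_decline"          # e.g. stolen card
    TIMEOUT = "timeout"


def next_state(result):
    """Map a provider result to the state the key's row should move to."""
    if result is Result.SUCCESS:
        return "completed"   # the only outcome whose response is replayed
    if result in (Result.SOFT_DECLINE, Result.VALIDATION_ERROR):
        return "claimable"   # release the claim; the fingerprint stays on the row
    if result is Result.HARD_DECLINE:
        return "closed"      # final: the claim stays closed
    return "started"         # unknown: keep the claim and ask the provider


def reclaim(conn, key):
    """Return True if this retry won the released claim for `key`."""
    cur = conn.execute(
        "UPDATE operations SET state = 'started' "
        "WHERE idempotency_key = ? AND state = 'claimable'",
        (key,),
    )
    conn.commit()
    return cur.rowcount == 1  # only the first updater sees the row as claimable
```

The `WHERE state = 'claimable'` condition is what lets a single retry win: once the first update flips the row back to `started`, a second one matches nothing.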
Where the guarantee runs out
The first three gaps were on endpoints under the team’s control. The fourth surfaced during reconciliation: A charge on an older provider’s statement with no matching internal record. That provider had no idempotency keys, and the guarantee had reached its boundary. We couldn’t make it safe to call twice.
We got as close as we could: A pending record before the call, a status check before retrying, reconciliation to catch and refund whatever slips through. A window remains where the charge has landed and our record doesn’t know it yet. We kept shrinking that window, but we never managed to close it.
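The first two of those safeguards might be sketched as below. This is a toy model under loud assumptions: `records` is a plain dictionary standing in for the internal table, and `FakeLegacyProvider` is a hypothetical stand-in whose `find_charge` lookup by order ID is assumed; a real provider’s status check may work differently or lag behind the charge.

```python
class FakeLegacyProvider:
    """Hypothetical provider with no idempotency keys, only a status lookup."""

    def __init__(self):
        self.charges = {}
        self.lose_next_response = True

    def find_charge(self, order_id):
        return self.charges.get(order_id)

    def charge(self, order_id, amount_cents):
        self.charges.setdefault(order_id, []).append(amount_cents)
        if self.lose_next_response:
            self.lose_next_response = False
            raise TimeoutError("charged, but the response was lost")


def charge_once(records, order_id, amount_cents, provider):
    if records.get(order_id) == "paid":
        return "paid"
    if order_id in records and provider.find_charge(order_id) is not None:
        # A pending record exists and the provider already has the charge.
        records[order_id] = "paid"
        return "paid"
    records[order_id] = "pending"             # written before the call goes out
    provider.charge(order_id, amount_cents)   # may raise after charging
    records[order_id] = "paid"
    return "paid"
```

If the first call times out after the provider has charged, the record stays `pending`; the next call checks status, finds the charge and marks the order paid without charging again. The remaining window is the case where the status check itself does not yet show a charge that has landed.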
The database that holds the keys forces a decision of its own: When it’s down, you either stop taking payments or take them unprotected. That choice is a business call. For a low-stakes write, cleaning up a rare duplicate can cost less than turning customers away. A payment is not low stakes, so we fail closed and stop taking payments until the store is back: A lost sale we can recover, and we had just spent a month learning what duplicates cost.
Questions I ask in design reviews
For anything that stores or changes data, I ask three questions:
- What happens if this runs twice? Ask it out loud for every write.
- Can we prove the answer? Run it twice in tests, in sequence and in parallel; the second run should change nothing.
- Where does the truth live when systems disagree? For payments, it’s the provider because their records show whether money actually moved. Settle whose answer wins before an incident does.
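The second question translates directly into a test. The sketch below uses a toy `Ledger` whose lock stands in for the database’s unique index; the class, the `run_in_parallel` helper and the thread count are assumptions made for illustration, and a real test would exercise the actual write path.

```python
import threading


class Ledger:
    """Toy write path: records at most one charge per idempotency key."""

    def __init__(self):
        self._lock = threading.Lock()   # stands in for a unique index
        self.charges = {}

    def charge(self, key, amount_cents):
        with self._lock:
            if key not in self.charges:
                self.charges[key] = amount_cents
            return self.charges[key]


def run_in_parallel(fn, times):
    barrier = threading.Barrier(times)  # release every thread at once

    def worker():
        barrier.wait()
        fn()

    threads = [threading.Thread(target=worker) for _ in range(times)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


ledger = Ledger()

# In sequence: the second run should change nothing.
ledger.charge("key-seq", 20000)
snapshot = dict(ledger.charges)
ledger.charge("key-seq", 20000)
assert ledger.charges == snapshot

# In parallel: eight simultaneous attempts leave exactly one charge.
run_in_parallel(lambda: ledger.charge("key-par", 20000), times=8)
assert ledger.charges == {"key-seq": 20000, "key-par": 20000}
```

The barrier matters: without it the threads tend to run one after another, and the parallel test quietly becomes a second sequential one.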
The key is a good idea and, in anything that moves money, a necessary one. It’s just not a guarantee. The guarantee is the design around it: A claim that cannot race, an intent the fingerprint confirms, a memory that keeps only what’s safe to replay and a boundary you have mapped in advance. That’s the four-assumptions test. Every assumption gets tested eventually: You do it at design time or production does it for you.
This article is published as part of the Foundry Expert Contributor Network.
