At 03:14 UTC last November, our queue caught fire. Not metaphorically: a lit-up Redis cluster, a 40× spike in deliveries, three PagerDuty escalations, and a finance team in three timezones wondering why invoice.paid events were forty minutes late.
This is how we redesigned the webhook system afterward, and the principles we now ship as defaults.
What broke
A poorly handled retry loop on a downstream merchant compounded into our own queue. The merchant's endpoint started returning 503 for unrelated reasons. Our retry logic used fixed 30-second intervals with no jitter, so retries across all 200K paying merchants fired in lockstep and landed at the same instant. The merchant's outage cascaded into ours. Classic.
The four properties we now require
We sat down and wrote them on a whiteboard. They're not negotiable.
- Idempotent. Receiving the same event twice must be safe. Always.
- Signed. Every envelope must carry a verifiable signature. No exceptions.
- Backoff with jitter. Retries from different merchants must never land at the same instant.
- Observable. We and the merchant can see what was sent, what was received, and what failed.
Idempotency
Every webhook carries an event_id (UUID v4) and an idempotency_key derived from the underlying business action. The merchant stores each event_id they've successfully processed; if one appears again, they no-op.
If you process by event_id you'll be correct. If you process by payload.invoice_id you'll be correct most of the time, until you're not.
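On the receiving side, the no-op rule can be sketched roughly as follows. This is a minimal illustration, not our reference implementation: the table name processed_events and the helpers make_store and handle_once are hypothetical, and SQLite stands in for whatever database the merchant already runs.

```python
import sqlite3


def make_store(path=":memory:"):
    """Open a store with one row per processed event_id (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)"
    )
    conn.commit()
    return conn


def handle_once(conn, event, apply):
    """Run apply(event) at most once per event_id. Returns True if it ran."""
    try:
        # One transaction: if apply() raises, the insert is rolled back,
        # so a later retry of the same event is processed normally.
        with conn:
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)",
                (event["event_id"],),
            )
            apply(event)
    except sqlite3.IntegrityError:
        # The primary key already exists: this is a duplicate delivery.
        return False
    return True
```

The dedupe key is event_id, not payload.invoice_id: two distinct events can legitimately refer to the same invoice. The rollback only protects side effects that live in the same database; anything apply() does elsewhere needs its own idempotency.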
Signed envelopes
Every webhook ships with an X-Halfin-Signature header. We sign the raw body, not the parsed JSON, because parsing introduces ambiguity (key order, whitespace, escape sequences). Verify on raw bytes.
We migrated from HMAC-SHA256 to Ed25519 in Q2. HMAC is still supported for older keys, but new keys mint Ed25519 by default. Verification got cheaper, key rotation got cleaner.
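For the older HMAC-SHA256 keys, verifying on raw bytes might look like the sketch below. It assumes the X-Halfin-Signature header carries a hex-encoded digest of the raw body, which is an assumption made for illustration; the function name verify_hmac_signature is also hypothetical. Ed25519 verification follows the same shape but needs a cryptography library outside the standard library, so it is not shown.

```python
import hashlib
import hmac


def verify_hmac_signature(raw_body: bytes, header_value: str, secret: bytes) -> bool:
    """Check a hex-encoded HMAC-SHA256 signature against the raw request body.

    Assumption: the header value is the hex digest and nothing else.
    """
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    received = header_value.strip().lower()
    # compare_digest runs in constant time to avoid leaking how many
    # leading characters of the signature matched.
    return hmac.compare_digest(expected.encode(), received.encode())
```

Pass in the bytes exactly as they came off the socket. Parsing the JSON and re-serializing it before verification can change whitespace or key order, and the signature will no longer match.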
Exponential backoff with jitter
The retry curve is now min(60s × 2^attempt, 24h) + jitter(0–30s). Jitter is the operative word. A thundering herd of 200K simultaneous retries turns any minor partner outage into a self-inflicted DDoS.
We retry up to 16 times. After that, the event lands in the dead-letter queue, where merchants can replay it manually from the dashboard.
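As a rough sketch, the curve above can be computed like this. Two details are assumptions, since the formula alone does not settle them: attempt is treated as zero-based, and the jitter is drawn uniformly. The constant and function names are illustrative.

```python
import random

BASE_SECONDS = 60             # first backoff step
CAP_SECONDS = 24 * 60 * 60    # 24h ceiling on the exponential part
MAX_JITTER_SECONDS = 30       # jitter(0-30s)
MAX_ATTEMPTS = 16             # after this, the event is dead-lettered


def retry_delay(attempt, rng=random):
    """Seconds to wait before retry number `attempt` (assumed zero-based)."""
    backoff = min(BASE_SECONDS * 2 ** attempt, CAP_SECONDS)
    # Jitter is added after the cap, so even capped retries stay spread out.
    return backoff + rng.uniform(0, MAX_JITTER_SECONDS)


def retry_schedule(rng=random):
    """Delays for all attempts before the event goes to the dead-letter queue."""
    return [retry_delay(attempt, rng) for attempt in range(MAX_ATTEMPTS)]
```

Under the zero-based assumption, the exponential term reaches the 24h cap at attempt 11, because 60 × 2^11 = 122,880 seconds exceeds 86,400. Every later attempt waits 24 hours plus jitter.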
Observability
Both sides can see:
- The exact bytes we sent (downloadable from the dashboard for 30 days)
- Every delivery attempt with its HTTP response code
- The signature verification status
This sounds boring. It's the single biggest reduction in support load we've shipped.
What we'd do again
- Sign raw bytes, not parsed objects.
- Pick UUIDs that are obviously UUIDs. Don't reuse the underlying entity ID as the event ID.
- Document the retry curve in the same place you document the endpoint format. Half of merchant integration bugs come from assuming the retry timing.
What we'd do differently
- We waited too long to ship a replay UI. Building it earlier would have saved us thousands of support tickets.
- The original "deliveries" table was indexed for write-heavy load, not for read-heavy debugging. We rebuilt it.
- We should have shipped Ed25519 from day one. The migration cost months.
A note on testing
The hardest webhook bug to find is the one your test suite passes on. Our internal test pattern: for every emitted event, we run an "evil twin" replay that drops it, duplicates it, reorders it, and corrupts the signature. If the receiving side stays correct under all four, the integration is real.
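The four mutations can be generated mechanically. The sketch below is not our internal harness, just one way to express the idea: it assumes each delivery is a dict with at least event_id and signature keys, and the names corrupt and evil_twins are hypothetical.

```python
import copy


def corrupt(signature):
    """Change the last character so the signature can no longer verify."""
    last = "0" if signature[-1] != "0" else "1"
    return signature[:-1] + last


def evil_twins(stream, index):
    """Return four mutated copies of `stream` targeting the event at `index`.

    Assumes 0 <= index < len(stream). The input list is never modified.
    """
    if len(stream) < 2:
        raise ValueError("need at least two events to reorder")
    base = copy.deepcopy(stream)

    dropped = base[:index] + base[index + 1:]
    duplicated = base[:index + 1] + base[index:]

    # Swap with the next event, or the previous one if this is the last.
    neighbour = index + 1 if index + 1 < len(base) else index - 1
    reordered = list(base)
    reordered[index], reordered[neighbour] = reordered[neighbour], reordered[index]

    bad_signature = copy.deepcopy(base)
    bad_signature[index]["signature"] = corrupt(bad_signature[index]["signature"])

    variants = {
        "dropped": dropped,
        "duplicated": duplicated,
        "reordered": reordered,
        "bad_signature": bad_signature,
    }
    # Deep-copy each variant so a handler mutating one cannot affect another.
    return {name: copy.deepcopy(events) for name, events in variants.items()}
```

Feed each variant to the receiver and assert on the end state. The duplicated and reordered streams should converge to the same state as the original; the bad_signature stream should reject exactly one delivery; the dropped stream should leave a gap that a later replay can fill.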
Webhooks aren't a feature. They're the contract.
A. Ribeiro, halfin platform team
