halfin journal · Apr 14, 2026 · 11 min read

Engineering

Designing webhooks that survive everything

Idempotency keys, signed envelopes, exponential backoff with jitter, and the moment your queue catches fire at 03:14 UTC. Lessons from the halfin event bus.

A. Ribeiro · Platform

[Cover image. Photo: Scott Rodgerson on Unsplash]

At 03:14 UTC last November, our queue caught fire. Not literally, but close: a lit-up Redis cluster, a 40× spike in deliveries, three PagerDuty escalations, and a finance team spread across three timezones wondering why invoice.paid events were forty minutes late.

This is how we redesigned the webhook system afterward, and the principles we now ship as defaults.

What broke

A poorly handled retry loop against a downstream merchant compounded into a backlog in our own queue. The merchant's endpoint started returning 503 for unrelated reasons. Our retry logic used fixed 30-second intervals with no jitter, so every retry from 200K paying merchants hit them at the same instant. The merchant's outage cascaded into ours. Classic.

[Figure: Queue depth, the night of (03:14 UTC)]

The four properties we now require

We sat down and wrote them on a whiteboard. They're not negotiable.

Idempotent. Receiving the same event twice must be safe. Always.
Signed. Every envelope must carry a verifiable signature. No exceptions.
Backoff with jitter. No two retries, for any two merchants, should land at the same instant.
Observable. We and the merchant can see what was sent, what was received, and what failed.

Idempotency

Every webhook carries an event_id (UUID v4) and an idempotency_key derived from the underlying business action. The merchant stores every event_id they've successfully processed; if one appears again, they no-op.

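If you deduplicate by event_id you'll be correct. If you deduplicate by payload.invoice_id you'll be correct most of the time, until you're not: one invoice can emit more than one distinct event, and keying on the invoice would silently drop the later ones.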
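The no-op rule can be sketched on the receiving side as follows. This is a minimal illustration, not halfin's implementation: the `processed_events` table, the `open_store` and `handle_webhook` names, and the choice of SQLite are all assumptions made for the example. The idea it shows is that recording the event_id and applying the event happen in one transaction, so a failed attempt leaves no trace and a later retry is processed normally.

```python
import json
import sqlite3
from typing import Callable


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a (hypothetical) store of event_ids that were processed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)"
    )
    return conn


def handle_webhook(
    conn: sqlite3.Connection,
    raw_body: bytes,
    apply_event: Callable[[dict], None],
) -> str:
    """Process an event once; treat a repeated event_id as a no-op."""
    event = json.loads(raw_body)
    event_id = event["event_id"]
    # `with conn` commits on success and rolls back if apply_event raises,
    # so the event_id is only remembered when processing succeeded.
    # Side effects outside this database are not rolled back.
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,),
        )
        if cur.rowcount == 0:  # primary key already present: seen before
            return "duplicate"
        apply_event(event)
    return "processed"


if __name__ == "__main__":
    store = open_store()
    body = json.dumps({"event_id": "evt-1", "type": "invoice.paid"}).encode()
    print(handle_webhook(store, body, lambda e: None))  # first delivery
    print(handle_webhook(store, body, lambda e: None))  # redelivery
```

Keying the table on event_id rather than on anything inside the payload is the design choice that matters here; the storage engine is incidental.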

Signed envelopes

Every webhook ships with an X-Halfin-Signature header. We sign the raw body, not the parsed JSON, because parsing and re-serializing introduces ambiguity (key order, whitespace, escape sequences). Verify on raw bytes.

We migrated from HMAC-SHA256 to Ed25519 in Q2. HMAC is still supported for older keys, but new keys are minted as Ed25519 by default. Verification got cheaper, key rotation got cleaner.
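To make the raw-bytes rule concrete, here is a small sketch of the HMAC-SHA256 path using only the Python standard library. It makes several assumptions that the article does not state: that the header value is a hex digest, and the helper names `sign_raw_body` and `verify_raw_body`. Ed25519 verification needs a public-key library and is not shown, but the same rule applies: verify the bytes that arrived, not a re-serialized copy.

```python
import hashlib
import hmac
import json


def sign_raw_body(secret: bytes, raw_body: bytes) -> str:
    """Hex HMAC-SHA256 over the exact bytes on the wire (encoding assumed)."""
    return hmac.new(secret, raw_body, hashlib.sha256).hexdigest()


def verify_raw_body(secret: bytes, raw_body: bytes, signature_header: str) -> bool:
    """Constant-time comparison against the received signature header."""
    expected = sign_raw_body(secret, raw_body)
    return hmac.compare_digest(expected.encode(), signature_header.encode())


if __name__ == "__main__":
    secret = b"example-secret"  # placeholder, not a real key
    raw = b'{"event_id": "evt-1",  "amount": 10}'  # note the double space
    header = sign_raw_body(secret, raw)

    # Verifying the bytes as received succeeds.
    print(verify_raw_body(secret, raw, header))

    # Parsing and re-serializing normalizes the whitespace, so the bytes
    # differ from what was signed and verification fails.
    reserialized = json.dumps(json.loads(raw)).encode()
    print(verify_raw_body(secret, reserialized, header))
```

In a web framework this means reading the request body as bytes before any JSON middleware touches it.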

[Figure: Signature verification cost, HMAC vs Ed25519]

Exponential backoff with jitter

The retry curve is now min(60s × 2^attempt, 24h) + jitter(0–30s). Jitter is the operative word. A thundering herd of 200K simultaneous retries turns any minor partner outage into a self-inflicted DDoS.

We retry up to 16 times. After that, the event lands in the dead-letter queue, where merchants can replay it manually from the dashboard.
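The curve and the retry limit can be written down in a few lines. The sketch below follows the formula as stated, with two assumptions the article leaves open: attempts are numbered from 0, and jitter is drawn uniformly from the 0 to 30 second range. The function name `retry_delay` and the constants are illustrative.

```python
import random
from typing import Optional

BASE_SECONDS = 60
CAP_SECONDS = 24 * 60 * 60      # 24h ceiling on the exponential term
MAX_JITTER_SECONDS = 30
MAX_RETRIES = 16


def retry_delay(attempt: int, rng: Optional[random.Random] = None) -> Optional[float]:
    """Seconds to wait before retry number `attempt` (0-based, assumed).

    Returns None once the retry budget is spent, which is the signal to
    move the event to the dead-letter queue.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    if attempt >= MAX_RETRIES:
        return None
    source = rng if rng is not None else random
    backoff = min(BASE_SECONDS * 2 ** attempt, CAP_SECONDS)
    return backoff + source.uniform(0, MAX_JITTER_SECONDS)


if __name__ == "__main__":
    for attempt in range(MAX_RETRIES + 1):
        delay = retry_delay(attempt)
        label = "dead-letter" if delay is None else f"{delay:.1f}s"
        print(attempt, label)
```

Under these assumptions the exponential term is 60s on the first retry, reaches 61,440s (about 17 hours) at attempt 10, and is capped at 24h from attempt 11 onward.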

Observability

Both sides can see:

The exact bytes we sent (downloadable from the dashboard for 30 days)
Every delivery attempt with its HTTP response code
The signature verification status

This sounds boring. It's the single biggest reduction in support load we've shipped.
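One way to picture what gets recorded per delivery is a small immutable record. This is a hypothetical shape, not the halfin schema: the `DeliveryAttempt` class, its field names, and the `failed_attempts` helper are assumptions for illustration, and how the signature verification status is reported back is not specified in this article.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DeliveryAttempt:
    """One delivery attempt as both sides might see it (illustrative)."""
    event_id: str
    attempt: int
    raw_body: bytes                    # the exact bytes that were sent
    status_code: Optional[int]         # None if no HTTP response arrived
    signature_verified: Optional[bool] # None if the status is unknown


def failed_attempts(log: List[DeliveryAttempt]) -> List[DeliveryAttempt]:
    """Attempts that got no response or a non-2xx response."""
    return [
        a for a in log
        if a.status_code is None or not 200 <= a.status_code < 300
    ]


if __name__ == "__main__":
    log = [
        DeliveryAttempt("evt-1", 0, b"{}", 503, None),
        DeliveryAttempt("evt-1", 1, b"{}", None, None),
        DeliveryAttempt("evt-1", 2, b"{}", 200, True),
    ]
    for a in failed_attempts(log):
        print(a.event_id, a.attempt, a.status_code)
```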

What we'd do again

Sign raw bytes, not parsed objects.
Pick event IDs that are obviously UUIDs. Don't reuse the underlying entity ID as the event ID.
Document the retry curve in the same place you document the endpoint format. Half of merchant integration bugs come from wrong assumptions about retry timing.

What we'd do differently

We waited too long to ship a replay UI. Building it earlier would have saved us thousands of support tickets.
The original "deliveries" table was indexed for write-heavy load, not for read-heavy debugging. We rebuilt it.
We should have shipped Ed25519 from day one. The migration cost months.

A note on testing

The hardest webhook bug to find is the one your test suite passes on. Our internal test pattern: for every emitted event, we run an "evil twin" replay that drops it, duplicates it, reorders it, and corrupts the signature. If the receiving side stays correct under all four, the integration is real.
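The four mutations can be generated mechanically from a recorded stream of deliveries. The sketch below is one possible reading of the pattern, not our internal harness: a delivery is modeled as a (raw_body, signature) pair, and the names `evil_twins` and `corrupt` are made up for the example. Each variant is then replayed against the receiver, which should end in the same state as it would after the clean stream (or, for the dropped variant, recover once the event is redelivered).

```python
from typing import Dict, List, Tuple

Delivery = Tuple[bytes, str]  # (raw body, signature header), assumed shape


def corrupt(signature: str) -> str:
    """Return a signature guaranteed to differ in its first character."""
    first = "0" if signature[:1] != "0" else "1"
    return first + signature[1:]


def evil_twins(stream: List[Delivery], i: int) -> Dict[str, List[Delivery]]:
    """Build the four hostile variants of `stream` around delivery `i`."""
    body, signature = stream[i]

    reordered = list(stream)
    if len(stream) > 1:
        # Swap with the next delivery, or the previous one at the end.
        j = i + 1 if i + 1 < len(stream) else i - 1
        reordered[i], reordered[j] = reordered[j], reordered[i]

    corrupted = list(stream)
    corrupted[i] = (body, corrupt(signature))

    return {
        "dropped": stream[:i] + stream[i + 1:],
        "duplicated": stream[:i + 1] + [stream[i]] + stream[i + 1:],
        "reordered": reordered,
        "corrupted": corrupted,
    }


if __name__ == "__main__":
    stream = [(b"a", "aa"), (b"b", "bb"), (b"c", "cc")]
    for name, variant in evil_twins(stream, 1).items():
        print(name, variant)
```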

Webhooks aren't a feature. They're the contract.

A. Ribeiro, halfin platform team
