Webhook Design: Event-Driven API Integration
Webhooks are a fundamental building block of modern event-driven architectures. They allow one system to notify another in real time when something happens, without the receiving system needing to constantly poll for updates. From payment notifications in Stripe to push events in GitHub, webhooks power millions of integrations across the internet. This guide covers webhook design from the ground up — architecture, payload design, security, reliability, and production-grade patterns. For broader event-driven concepts, see Event-Driven Architecture.
What Are Webhooks?
A webhook is an HTTP callback: when an event occurs in a source system, it sends an HTTP POST request to a URL configured by the receiving system. The receiver processes the payload and responds with a status code indicating success or failure. Unlike traditional APIs where the client pulls data, webhooks push data from the producer to the consumer.
// Webhook flow
1. Consumer registers a URL: https://myapp.com/webhooks/payments
2. Event occurs in the provider (e.g., payment completed)
3. Provider sends HTTP POST to the registered URL with event data
4. Consumer processes the payload and returns 200 OK
5. If the consumer fails, the provider retries according to its retry policy
Webhooks vs Polling vs Server-Sent Events
There are several approaches to keeping systems in sync. Understanding when to use each is critical for building efficient integrations.
| Approach | Direction | Latency | Efficiency | Complexity | Best For |
|---|---|---|---|---|---|
| Short Polling | Client pulls | High (polling interval) | Low (many empty responses) | Low | Simple status checks |
| Long Polling | Client pulls | Medium | Medium | Medium | Near real-time when WebSockets unavailable |
| Server-Sent Events | Server pushes | Low | High | Medium | Live feeds, dashboards |
| WebSockets | Bidirectional | Very Low | High | High | Chat, gaming, live collaboration |
| Webhooks | Server pushes | Low | Very High | Medium | System-to-system integration, event notification |
Webhooks are ideal for system-to-system communication where events are infrequent or unpredictable. Polling wastes resources when events are rare. SSE and WebSockets require persistent connections, which are impractical for server-to-server integrations across the internet. For real-time client-facing patterns, see Polling, SSE, and WebSockets.
Webhook Design Patterns
Fat vs Thin Payloads
A key design decision is how much data to include in the webhook payload.
Fat payloads include all relevant data in the webhook itself:
{
  "event": "order.completed",
  "timestamp": "2024-06-15T14:30:00Z",
  "data": {
    "orderId": "ord_abc123",
    "customer": {
      "id": "cust_xyz",
      "name": "Alice Smith",
      "email": "alice@example.com"
    },
    "items": [
      { "sku": "WIDGET-01", "quantity": 2, "price": 29.99 },
      { "sku": "GADGET-05", "quantity": 1, "price": 49.99 }
    ],
    "total": 109.97,
    "currency": "USD"
  }
}
Thin payloads include only the event type and resource identifier, requiring the consumer to fetch details:
{
  "event": "order.completed",
  "timestamp": "2024-06-15T14:30:00Z",
  "data": {
    "orderId": "ord_abc123"
  }
}
// Consumer then calls GET /orders/ord_abc123 to get full details
| Approach | Pros | Cons |
|---|---|---|
| Fat Payload | Self-contained, fewer API calls, faster processing | Larger payloads, potential data staleness, more data exposure |
| Thin Payload | Smaller payloads, always fresh data, less data exposure | Requires API callback, higher latency, more coupled |
Most production systems use a hybrid approach: include commonly needed fields in the payload but keep it under a reasonable size limit (typically under 64 KB).
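The hybrid approach can be sketched as a builder that sends the full object when it fits and falls back to a thin payload when it does not. This is a minimal illustration, not a prescribed implementation: `buildPayload` and `MAX_PAYLOAD_BYTES` are hypothetical names, and the 64 KB limit is simply the figure mentioned above.

```javascript
// Hypothetical helper: builds a fat payload, falling back to a thin one
// when the serialized body would exceed the size limit.
const MAX_PAYLOAD_BYTES = 64 * 1024; // assumed limit, taken from the text

function buildPayload(eventType, order, maxBytes = MAX_PAYLOAD_BYTES) {
  const base = { event: eventType, timestamp: new Date().toISOString() };
  const fat = { ...base, data: order };
  if (Buffer.byteLength(JSON.stringify(fat), 'utf8') <= maxBytes) {
    return fat;
  }
  // Too large: send only the identifier; the consumer fetches the rest.
  return { ...base, data: { orderId: order.orderId } };
}
```

A consumer of such a payload should be prepared for either shape, for example by fetching the order whenever fields it needs are missing.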
Event Envelope Structure
A well-designed webhook payload follows a consistent envelope structure that makes it easy for consumers to route and process events.
{
  "id": "evt_1234567890",
  "type": "invoice.payment_succeeded",
  "apiVersion": "2024-01-15",
  "created": "2024-06-15T14:30:00Z",
  "data": {
    "object": {
      "id": "inv_abc123",
      "amount": 5000,
      "currency": "usd",
      "status": "paid"
    },
    "previousAttributes": {
      "status": "open"
    }
  }
}
Key envelope fields include: a unique event ID for idempotency, the event type for routing, a timestamp, the API version that generated the event, and the event data. Including previousAttributes helps consumers understand what changed without maintaining their own state. For API versioning strategies, see API Versioning.
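As a rough sketch of how the envelope is produced and consumed, the provider can wrap each resource in the same structure and the consumer can dispatch on the `type` field. The names `buildEnvelope` and `routeEvent` are illustrative, and the `evt_` prefix plus UUID format is an assumption rather than a requirement.

```javascript
const crypto = require('crypto');

// Hypothetical provider-side helper: wraps a resource in the envelope shown above.
function buildEnvelope(type, object, previousAttributes = {}) {
  return {
    id: `evt_${crypto.randomUUID()}`, // unique ID consumers can deduplicate on
    type,
    apiVersion: '2024-01-15',
    created: new Date().toISOString(),
    data: { object, previousAttributes },
  };
}

// Hypothetical consumer-side router: handlers is a Map of event type -> function.
// Unknown event types return null instead of throwing.
function routeEvent(event, handlers) {
  const handler = handlers.get(event.type);
  return handler ? handler(event.data) : null;
}
```

Using a `Map` for the handlers avoids accidental matches on inherited object properties when the event type comes from an untrusted request.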
Webhook Security
Webhooks introduce a security surface that must be carefully managed. Since anyone can send an HTTP request to your webhook endpoint, you must verify that incoming webhooks are authentic.
HMAC Signature Verification
The most common approach is HMAC (Hash-based Message Authentication Code) signing. The provider computes a hash of the payload using a shared secret and includes it in a header. The consumer recomputes the hash and compares.
// Provider side: signing the webhook
const crypto = require('crypto');

function signPayload(rawBody, secret) {
  // Sign the exact string that will be sent as the request body
  return crypto.createHmac('sha256', secret).update(rawBody).digest('hex');
}

// Serialize once, then sign and send that same string so the consumer's hash matches
const rawBody = JSON.stringify(payload);
const signature = signPayload(rawBody, webhookSecret);
// Header: X-Webhook-Signature: sha256=a1b2c3d4e5...
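// Consumer side: verifying the webhook
const crypto = require('crypto');
const express = require('express');
const app = express();

function verifyWebhook(rawBody, signature, secret) {
  const expected = crypto.createHmac('sha256', secret)
    .update(rawBody)
    .digest('hex');
  const receivedBuf = Buffer.from(signature, 'utf8');
  const expectedBuf = Buffer.from(expected, 'utf8');
  // timingSafeEqual throws if the lengths differ, so check them first
  if (receivedBuf.length !== expectedBuf.length) return false;
  // Use timing-safe comparison to prevent timing attacks
  return crypto.timingSafeEqual(receivedBuf, expectedBuf);
}

// express.raw keeps req.body as a Buffer, so the signed bytes are left untouched
app.post('/webhooks', express.raw({ type: 'application/json' }), (req, res) => {
  // A missing header is treated as an invalid signature rather than crashing
  const header = req.headers['x-webhook-signature'] || '';
  const signature = header.replace('sha256=', '');
  if (!verifyWebhook(req.body, signature, process.env.WEBHOOK_SECRET)) {
    return res.status(401).json({ error: 'Invalid signature' });
  }
  // Process the verified webhook
  const event = JSON.parse(req.body.toString('utf8'));
  handleEvent(event);
  res.status(200).json({ received: true });
});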
Additional Security Measures
- TLS only — Always use HTTPS endpoints. Reject any webhook registrations with HTTP URLs.
- Timestamp validation — Include a timestamp in the signed payload and reject webhooks older than 5 minutes to prevent replay attacks.
- IP allowlisting — If the provider publishes its IP ranges, restrict incoming traffic to those ranges as an additional layer.
- Secret rotation — Support multiple active secrets during rotation periods so webhooks are not disrupted when secrets are changed.
- Webhook registration verification — When a consumer registers a URL, send a verification challenge (like a GET request with a challenge token) to confirm they own the endpoint.
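Timestamp validation and secret rotation from the list above can be combined in one verification step. The sketch below assumes the provider signs the string `<timestamp>.<rawBody>` and sends the timestamp as Unix seconds; that format, and the names `sign`, `safeEqual`, and `verifyTimestamped`, are assumptions for illustration, so check your provider's documentation for its actual scheme.

```javascript
const crypto = require('crypto');

const TOLERANCE_SECONDS = 5 * 60; // reject anything older than 5 minutes

// Signs "<timestamp>.<rawBody>" so the timestamp cannot be changed independently.
function sign(timestamp, rawBody, secret) {
  return crypto.createHmac('sha256', secret)
    .update(`${timestamp}.${rawBody}`)
    .digest('hex');
}

// Constant-time string comparison that tolerates different lengths.
function safeEqual(a, b) {
  const bufA = Buffer.from(a, 'utf8');
  const bufB = Buffer.from(b, 'utf8');
  return bufA.length === bufB.length && crypto.timingSafeEqual(bufA, bufB);
}

// secrets: every currently active secret, so old and new both verify during rotation.
// timestamp and nowSeconds are Unix seconds (numbers).
function verifyTimestamped(rawBody, timestamp, signature, secrets,
                           nowSeconds = Math.floor(Date.now() / 1000)) {
  if (!Number.isFinite(timestamp) || Math.abs(nowSeconds - timestamp) > TOLERANCE_SECONDS) {
    return false; // stale or malformed timestamp: possible replay
  }
  return secrets.some((secret) => safeEqual(sign(timestamp, rawBody, secret), signature));
}
```

During a rotation the consumer would pass both the old and the new secret, then drop the old one once the provider has switched over.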
Retry Policies
Network failures, temporary outages, and deployment windows mean webhook deliveries will sometimes fail. A robust retry policy is essential for reliability.
Exponential Backoff with Jitter
// Retry schedule with exponential backoff
Attempt 1: Immediate
Attempt 2: 1 minute (+ random jitter 0-30s)
Attempt 3: 5 minutes (+ random jitter 0-60s)
Attempt 4: 30 minutes (+ random jitter 0-120s)
Attempt 5: 2 hours (+ random jitter 0-300s)
Attempt 6: 8 hours
Attempt 7: 24 hours
After 7 failures: Move to dead letter queue
Jitter prevents thundering herd problems when many webhooks fail simultaneously (e.g., during a consumer outage). For more on message reliability patterns, see Message Queues.
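The schedule above can be expressed as a small lookup plus a random offset. This is a sketch under the assumption that jitter is drawn uniformly in whole seconds; `SCHEDULE` and `retryDelay` are hypothetical names, and the random source is injectable so the function can be tested deterministically.

```javascript
// Base delays and jitter windows (seconds) copied from the schedule above.
// Index 0 is the delay before attempt 1, index 1 before attempt 2, and so on.
const SCHEDULE = [
  { base: 0, jitter: 0 },
  { base: 60, jitter: 30 },
  { base: 300, jitter: 60 },
  { base: 1800, jitter: 120 },
  { base: 7200, jitter: 300 },
  { base: 28800, jitter: 0 },
  { base: 86400, jitter: 0 },
];

// Returns the delay in seconds before the given attempt (1-based),
// or null when retries are exhausted and the event belongs in the DLQ.
function retryDelay(attempt, random = Math.random) {
  const step = SCHEDULE[attempt - 1];
  if (!step) return null;
  // random() is in [0, 1), so the offset is an integer from 0 to step.jitter
  return step.base + Math.floor(random() * (step.jitter + 1));
}
```

A scheduler would call `retryDelay(n)` after the failure of attempt `n - 1` and move the event to the dead letter queue when it returns `null`.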
Response Code Handling
| Response Code | Action | Rationale |
|---|---|---|
| 200-299 | Success, no retry | Consumer acknowledged receipt |
| 301, 308 | Follow redirect, update URL | Endpoint has moved permanently |
| 400 | No retry, alert consumer | Malformed request (likely a bug) |
| 401, 403 | No retry, alert consumer | Authentication or authorization failure |
| 404 | No retry, disable webhook | Endpoint does not exist |
| 410 | No retry, disable webhook | Endpoint explicitly gone |
| 429 | Retry with backoff, respect Retry-After | Consumer is rate limiting |
| 500-599 | Retry with backoff | Server error, likely transient |
| Timeout | Retry with backoff | Network issue or slow processing |
For rate limiting best practices on the consumer side, see Rate Limiting.
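On the provider side, the table above reduces to a single classification function. The sketch below uses hypothetical action names (`success`, `retry`, `alert`, `disable`, `follow_redirect`) and assumes a timeout is represented as `null`; status codes the table does not list are treated as non-retryable, which is a design choice rather than a rule.

```javascript
// Maps a delivery outcome to the action in the table above.
// status is the HTTP status code, or null when the request timed out.
function classifyResponse(status) {
  if (status === null) return 'retry';
  if (status >= 200 && status <= 299) return 'success';
  if (status === 301 || status === 308) return 'follow_redirect';
  if (status === 404 || status === 410) return 'disable';
  if (status === 400 || status === 401 || status === 403) return 'alert';
  if (status === 429 || (status >= 500 && status <= 599)) return 'retry';
  return 'alert'; // assumption: unlisted codes are surfaced, not retried
}
```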
Idempotency
Because webhooks can be retried, consumers must handle duplicate deliveries gracefully. Idempotency means processing the same event multiple times produces the same result as processing it once.
// Idempotent webhook handler using event ID
const express = require('express');
const app = express();
const processedEvents = new Set(); // In production, use Redis or a database

// express.json() parses the body; signature verification is omitted here for brevity
app.post('/webhooks', express.json(), (req, res) => {
  const event = req.body;
  // Check if we already processed this event
  if (processedEvents.has(event.id)) {
    return res.status(200).json({ received: true, duplicate: true });
  }
  // Process the event
  try {
    handleEvent(event);
    // Record the ID only after success so a failed attempt can be retried
    processedEvents.add(event.id);
    res.status(200).json({ received: true });
  } catch (err) {
    res.status(500).json({ error: 'Processing failed' });
  }
});
-- Database-backed idempotency check (PostgreSQL)
CREATE TABLE processed_webhooks (
  event_id VARCHAR(255) PRIMARY KEY,
  event_type VARCHAR(100) NOT NULL,
  processed_at TIMESTAMP DEFAULT NOW(),
  payload JSONB
);

-- Insert with conflict handling
INSERT INTO processed_webhooks (event_id, event_type, payload)
VALUES ('evt_123', 'payment.completed', '{"amount": 5000}')
ON CONFLICT (event_id) DO NOTHING
RETURNING event_id;
If the INSERT returns a row, the event is new and should be processed. If it returns nothing, the event was already processed and should be skipped.
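In application code, that check might look like the sketch below. It assumes `db` exposes `query(text, values)` resolving to an object with a `rows` array, as a node-postgres `Pool` does; `processOnce` and `handleEvent` are hypothetical names.

```javascript
// db is assumed to expose query(text, values) returning { rows }, as a
// node-postgres Pool does; handleEvent is a placeholder for business logic.
async function processOnce(db, event, handleEvent) {
  const result = await db.query(
    `INSERT INTO processed_webhooks (event_id, event_type, payload)
     VALUES ($1, $2, $3)
     ON CONFLICT (event_id) DO NOTHING
     RETURNING event_id`,
    [event.id, event.type, JSON.stringify(event.data)]
  );
  if (result.rows.length === 0) {
    return 'duplicate'; // an earlier delivery already claimed this event
  }
  await handleEvent(event);
  return 'processed';
}
```

One caveat with this shape: if `handleEvent` throws after the row is inserted, the event stays marked as processed. Running the insert and the business logic in the same database transaction is one way to avoid that.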
Dead Letter Queues
When all retry attempts are exhausted, events should not be silently dropped. A dead letter queue (DLQ) captures failed deliveries for manual inspection, debugging, and reprocessing.
// Dead letter queue entry
{
  "eventId": "evt_1234567890",
  "webhookUrl": "https://consumer.com/webhooks",
  "payload": { "type": "order.completed", "data": { "orderId": "ord_abc" } },
  "attempts": 7,
  "lastAttemptAt": "2024-06-16T14:30:00Z",
  "lastResponseCode": 503,
  "lastError": "Service Unavailable",
  "firstAttemptAt": "2024-06-15T14:30:00Z"
}
Provide a dashboard or API for consumers to view and replay failed webhooks from the DLQ. This turns permanent failures into temporary ones. For queue design patterns, see Message Queues.
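A replay API can be thin. The following in-memory sketch shows the two operations such an API needs; `DeadLetterQueue` and its methods are hypothetical, and a real DLQ would sit in a durable store rather than a `Map`.

```javascript
// Hypothetical in-memory DLQ; a real one would live in a durable store.
class DeadLetterQueue {
  constructor() {
    this.entries = new Map(); // eventId -> DLQ entry
  }

  // delivery: { eventId, webhookUrl, payload, attempts, lastResponseCode, lastError, firstAttemptAt }
  add(delivery, now = new Date()) {
    this.entries.set(delivery.eventId, { ...delivery, lastAttemptAt: now.toISOString() });
  }

  // Removes the entry and hands it to `send` for a fresh delivery attempt.
  // Returns false when the event is not in the queue.
  replay(eventId, send) {
    const entry = this.entries.get(eventId);
    if (!entry) return false;
    this.entries.delete(eventId);
    send(entry.webhookUrl, entry.payload);
    return true;
  }
}
```

Because a replayed event keeps its original event ID, consumers with idempotent handlers can accept replays safely.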
Rate Limiting Webhook Delivery
When a burst of events occurs (e.g., a bulk import triggers thousands of webhooks), you can overwhelm the consumer. Implement rate limiting on the provider side to protect consumers.
- Per-consumer rate limits — Limit to N webhook deliveries per second per consumer endpoint.
- Batch mode — When events accumulate faster than the rate limit, batch multiple events into a single delivery.
- Priority queues — Assign higher priority to critical event types (payment events over analytics events).
- Backpressure signals — If a consumer returns 429, dynamically reduce the delivery rate for that consumer.
// Batched webhook delivery
{
  "batch": true,
  "events": [
    { "id": "evt_001", "type": "order.created", "data": { "orderId": "ord_a" } },
    { "id": "evt_002", "type": "order.created", "data": { "orderId": "ord_b" } },
    { "id": "evt_003", "type": "order.created", "data": { "orderId": "ord_c" } }
  ]
}
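One common way to enforce a per-consumer limit is a token bucket, sketched below. The technique is a choice made for this example, not something the patterns above require; `TokenBucket`, `canDeliver`, and the default of 10 deliveries per second are assumptions.

```javascript
// Hypothetical per-consumer token bucket: allows `ratePerSecond` deliveries
// per second on average, with bursts up to `capacity`.
class TokenBucket {
  constructor(ratePerSecond, capacity, nowMs = Date.now()) {
    this.ratePerSecond = ratePerSecond;
    this.capacity = capacity;
    this.tokens = capacity;
    this.lastRefillMs = nowMs;
  }

  // Returns true if a delivery may be sent now, false if it should wait or be batched.
  tryAcquire(nowMs = Date.now()) {
    const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.ratePerSecond);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// One bucket per consumer endpoint URL.
const buckets = new Map();
function canDeliver(url, nowMs = Date.now()) {
  if (!buckets.has(url)) buckets.set(url, new TokenBucket(10, 10, nowMs));
  return buckets.get(url).tryAcquire(nowMs);
}
```

When `canDeliver` returns false, the provider can hold the event and later fold it into a batched delivery like the one shown above. A 429 from the consumer could be handled by lowering `ratePerSecond` on that consumer's bucket.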
Monitoring and Observability
Webhooks fail silently if you do not monitor them. Both providers and consumers need visibility into webhook health.
Provider-Side Metrics
- Delivery success rate per consumer and event type
- Average delivery latency (time from event to successful delivery)
- Retry rate and retry depth distribution
- Dead letter queue depth and growth rate
- Consumer endpoint response time distribution
Consumer-Side Metrics
- Webhook receipt rate (events per minute)
- Processing success/failure rate
- Processing latency (time to handle each webhook)
- Duplicate event rate (idempotency filter hit rate)
- Queue depth if using an internal processing queue
Alerting Rules
# Example alerting rules (Prometheus-style)
- alert: WebhookDeliveryFailureRate
  expr: rate(webhook_delivery_failures_total[5m]) / rate(webhook_deliveries_total[5m]) > 0.05
  for: 10m
  annotations:
    summary: "Webhook delivery failure rate exceeds 5%"
- alert: WebhookDLQGrowing
  expr: webhook_dlq_depth > 100
  for: 30m
  annotations:
    summary: "Dead letter queue depth exceeds 100 events"
- alert: WebhookConsumerSlow
  # histogram_quantile operates on the rate of the histogram's _bucket series
  expr: histogram_quantile(0.95, sum by (le) (rate(webhook_consumer_response_seconds_bucket[5m]))) > 10
  for: 5m
  annotations:
    summary: "Consumer p95 response time exceeds 10 seconds"
Real-World Webhook Implementations
Stripe
Stripe is often considered the gold standard for webhook design. Key features include:
- Events are signed with HMAC-SHA256 using a per-endpoint signing secret
- Timestamps are included in the signature to prevent replay attacks
- Fat payloads with the full object state
- Up to 3 days of automatic retries with exponential backoff
- Event types use a dot-notation hierarchy (e.g., payment_intent.succeeded)
- A dashboard for monitoring delivery status and manually retrying failed events
- Test mode webhooks for development
GitHub
GitHub webhooks power CI/CD pipelines, bots, and integrations worldwide:
- Events are signed with HMAC-SHA256 in the X-Hub-Signature-256 header
- Thin-ish payloads with essential data (full repository, sender, and action details)
- Events include an X-GitHub-Event header for easy routing
- Delivery UUIDs in X-GitHub-Delivery for idempotency
- A delivery log with request/response details in the webhook settings UI
- Ping events on webhook creation to verify the endpoint
- Support for organization-level and repository-level webhooks
Twilio
Twilio uses webhooks for SMS, voice, and messaging notifications:
- Supports both request body signing and URL validation
- Uses X-Twilio-Signature with HMAC-SHA1
- Provides fallback URLs for when the primary webhook fails
- Status callback webhooks track message delivery states
- Supports both synchronous responses (TwiML) and asynchronous processing
Consumer Best Practices
- Respond quickly — Return 200 within 5-10 seconds. Offload heavy processing to a background queue to avoid timeouts.
- Use a queue — Write the webhook payload to an internal message queue and return 200 immediately. Process asynchronously.
- Implement idempotency — Always check the event ID before processing. Duplicate deliveries are a certainty, not an edge case.
- Verify signatures — Never process unverified webhooks. Always validate the HMAC signature before any business logic.
- Handle unknown event types gracefully — Return 200 for event types you do not handle. Returning an error triggers unnecessary retries.
- Log everything — Log the full request (headers, body, timestamp) for debugging delivery issues.
- Implement reconciliation — Periodically poll the provider API to catch any missed webhooks. Webhooks are best-effort; they should not be your only data sync mechanism.
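Several of these practices (respond quickly, use a queue, tolerate unknown event types) fit together in a receive-then-process split. The sketch below uses an in-process array as a stand-in for a real message queue; `receive`, `drain`, and `KNOWN_TYPES` are hypothetical names, and signature verification is assumed to have happened before `receive` is called.

```javascript
// Hypothetical in-process queue; production systems would use a durable queue.
const queue = [];

// Called by the HTTP handler after signature verification.
// Does only cheap work so the 200 response goes out quickly.
function receive(rawBody) {
  let event;
  try {
    event = JSON.parse(rawBody);
  } catch {
    return { status: 400, body: { error: 'Invalid JSON' } };
  }
  queue.push(event);
  return { status: 200, body: { received: true } };
}

// Worker: drains the queue outside the request path and returns
// how many events were actually handled.
const KNOWN_TYPES = new Set(['order.completed']);
function drain(handleEvent) {
  let handled = 0;
  while (queue.length > 0) {
    const event = queue.shift();
    if (!KNOWN_TYPES.has(event.type)) continue; // unknown types were acknowledged; skip them
    handleEvent(event);
    handled += 1;
  }
  return handled;
}
```

The HTTP handler returns whatever `receive` reports, so the provider sees a 200 even for event types the consumer ignores, and slow business logic never delays the response.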
Provider Best Practices
- Sign every payload — Use HMAC-SHA256 at minimum. Include a timestamp in the signature.
- Use unique event IDs — Every event should have a globally unique identifier for idempotency.
- Implement exponential backoff — Never hammer a failing endpoint with rapid retries.
- Provide a test mode — Let developers trigger test events without real data.
- Offer event filtering — Let consumers subscribe only to event types they care about.
- Build a delivery dashboard — Show delivery history, response codes, and retry status.
- Support secret rotation — Allow consumers to update their signing secret without downtime.
- Document thoroughly — Document every event type, payload schema, retry policy, and security requirement. See API Design Best Practices for documentation guidelines.
Frequently Asked Questions
Are webhooks guaranteed to be delivered?
No. Webhooks provide at-least-once delivery at best. Network failures, consumer outages, and provider issues can all cause missed deliveries. This is why you should implement reconciliation: periodically poll the provider API to verify you have received all expected events. Treat webhooks as a performance optimization over polling, not as a guaranteed message bus. For guaranteed delivery, consider using a message queue like RabbitMQ or Kafka.
How long should webhook processing take?
Your webhook endpoint should return a 2xx response within 5-10 seconds. Most providers have a timeout of 10-30 seconds, after which they consider the delivery failed and schedule a retry. To handle long-running processing, immediately write the event to an internal queue and return 200. Process the event asynchronously from the queue.
How do I test webhooks locally?
Use tunneling tools like ngrok, Cloudflare Tunnel, or localtunnel to expose your local development server to the internet. Most providers also offer CLI tools for forwarding webhooks locally (e.g., stripe listen --forward-to localhost:3000/webhooks). You can also use the provider test mode or replay events from the delivery dashboard.
Should I use webhooks or message queues?
Webhooks are ideal for cross-organization or cross-platform integrations where you cannot share a message queue. They work over standard HTTP and require no shared infrastructure. Message queues are better for internal system communication where you need guaranteed delivery, ordering, and can share infrastructure. Many systems use both: webhooks for external integrations and message queues internally. See Message Queues for an in-depth comparison.
How do I handle webhook ordering?
Webhooks are generally not guaranteed to arrive in order, especially with retries. Include a sequence number or timestamp in the payload so consumers can detect out-of-order delivery. For critical ordering requirements, include the resource version or timestamp and let the consumer discard events with an older version than what they have already processed. For strict ordering requirements, consider using a message queue with ordered delivery instead.
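The version-based approach can be sketched as a small guard in front of the handler. It assumes each event carries `data.object.id` and a monotonically increasing `data.object.version`; both field names and `shouldApply` are illustrative, and the in-memory `Map` stands in for persistent storage.

```javascript
// Tracks the newest version seen per resource and drops anything older.
// Assumes each event carries data.object.id and an increasing data.object.version.
const latestVersion = new Map();

function shouldApply(event) {
  const { id, version } = event.data.object;
  const seen = latestVersion.get(id);
  if (seen !== undefined && version <= seen) {
    return false; // stale or duplicate: a newer state was already applied
  }
  latestVersion.set(id, version);
  return true;
}
```

Because the comparison is `<=`, the same guard also filters exact redeliveries of the latest version.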