Webhook Design: Event-Driven API Integration

Webhooks are a fundamental building block of modern event-driven architectures. They allow one system to notify another in real time when something happens, without the receiving system needing to constantly poll for updates. From payment notifications in Stripe to push events in GitHub, webhooks power millions of integrations across the internet. This guide covers webhook design from the ground up — architecture, payload design, security, reliability, and production-grade patterns. For broader event-driven concepts, see Event-Driven Architecture.

What Are Webhooks?

A webhook is an HTTP callback: when an event occurs in a source system, it sends an HTTP POST request to a URL configured by the receiving system. The receiver processes the payload and responds with a status code indicating success or failure. Unlike traditional APIs where the client pulls data, webhooks push data from the producer to the consumer.

// Webhook flow
1. Consumer registers a URL: https://myapp.com/webhooks/payments
2. Event occurs in the provider (e.g., payment completed)
3. Provider sends HTTP POST to the registered URL with event data
4. Consumer processes the payload and returns 200 OK
5. If the consumer fails, the provider retries according to its retry policy

Webhooks vs Polling vs Server-Sent Events

There are several approaches to keeping systems in sync. Understanding when to use each is critical for building efficient integrations.

Approach	Direction	Latency	Efficiency	Complexity	Best For
Short Polling	Client pulls	High (polling interval)	Low (many empty responses)	Low	Simple status checks
Long Polling	Client pulls	Medium	Medium	Medium	Near real-time when WebSockets unavailable
Server-Sent Events	Server pushes	Low	High	Medium	Live feeds, dashboards
WebSockets	Bidirectional	Very Low	High	High	Chat, gaming, live collaboration
Webhooks	Server pushes	Low	Very High	Medium	System-to-system integration, event notification

Webhooks are ideal for system-to-system communication where events are infrequent or unpredictable. Polling wastes resources when events are rare. SSE and WebSockets require persistent connections, which are impractical for server-to-server integrations across the internet. For real-time client-facing patterns, see Polling, SSE, and WebSockets.

Webhook Design Patterns

Fat vs Thin Payloads

A key design decision is how much data to include in the webhook payload.

Fat payloads include all relevant data in the webhook itself:

{
  "event": "order.completed",
  "timestamp": "2024-06-15T14:30:00Z",
  "data": {
    "orderId": "ord_abc123",
    "customer": {
      "id": "cust_xyz",
      "name": "Alice Smith",
      "email": "alice@example.com"
    },
    "items": [
      { "sku": "WIDGET-01", "quantity": 2, "price": 29.99 },
      { "sku": "GADGET-05", "quantity": 1, "price": 49.99 }
    ],
    "total": 109.97,
    "currency": "USD"
  }
}

Thin payloads include only the event type and resource identifier, requiring the consumer to fetch details:

{
  "event": "order.completed",
  "timestamp": "2024-06-15T14:30:00Z",
  "data": {
    "orderId": "ord_abc123"
  }
}
// Consumer then calls GET /orders/ord_abc123 to get full details

Approach	Pros	Cons
Fat Payload	Self-contained, fewer API calls, faster processing	Larger payloads, potential data staleness, more data exposure
Thin Payload	Smaller payloads, always fresh data, less data exposure	Requires API callback, higher latency, more coupled

A common compromise is a hybrid approach: include the fields consumers need most often in the payload, but cap its size (for example, under 64 KB) and leave everything else to an API call.
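
As a rough sketch of the hybrid approach, the function below sends a fat payload when it fits and falls back to a thin one when the serialized size exceeds a limit. The name buildPayload, the maxBytes parameter, and the 64 KB default are illustrative assumptions, not part of any provider's API.

```javascript
// Hypothetical helper: prefer a fat payload, fall back to thin when too large.
const MAX_PAYLOAD_BYTES = 64 * 1024; // assumed limit, taken from the guideline above

function buildPayload(eventType, resource, maxBytes = MAX_PAYLOAD_BYTES) {
  const base = { event: eventType, timestamp: new Date().toISOString() };
  const fat = { ...base, data: resource };

  // Measure the serialized size in bytes, not characters
  if (Buffer.byteLength(JSON.stringify(fat), 'utf8') <= maxBytes) {
    return fat;
  }

  // Too large: send only the identifier and let the consumer fetch the rest
  return { ...base, data: { orderId: resource.orderId } };
}
```

A consumer that receives the thin form would then call GET /orders/{orderId}, as in the thin payload example above.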

Event Envelope Structure

A well-designed webhook payload follows a consistent envelope structure that makes it easy for consumers to route and process events.

{
  "id": "evt_1234567890",
  "type": "invoice.payment_succeeded",
  "apiVersion": "2024-01-15",
  "created": "2024-06-15T14:30:00Z",
  "data": {
    "object": {
      "id": "inv_abc123",
      "amount": 5000,
      "currency": "usd",
      "status": "paid"
    },
    "previousAttributes": {
      "status": "open"
    }
  }
}

Key envelope fields include: a unique event ID for idempotency, the event type for routing, a timestamp, the API version that generated the event, and the event data. Including previousAttributes helps consumers understand what changed without maintaining their own state. For API versioning strategies, see API Versioning.
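
One way a consumer might use the envelope is to validate the required fields and then dispatch on the type field. The sketch below assumes a hypothetical handler registry and a routeEvent function; neither name comes from a real library.

```javascript
// Hypothetical handler registry keyed by the envelope's "type" field.
const handlers = new Map([
  ['invoice.payment_succeeded', (event) => `invoice ${event.data.object.id} paid`],
]);

function routeEvent(event, registry = handlers) {
  // The ID (for idempotency) and the type (for routing) are the minimum we need
  if (!event || typeof event.id !== 'string' || typeof event.type !== 'string') {
    throw new Error('Malformed envelope: id and type are required');
  }

  const handler = registry.get(event.type);
  // Unknown types are acknowledged rather than treated as errors
  if (!handler) {
    return { handled: false };
  }
  return { handled: true, result: handler(event) };
}
```

Keeping the routing table separate from the handlers makes it easy to add event types without touching the endpoint code.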

Webhook Security

Webhooks introduce a security surface that must be carefully managed. Since anyone can send an HTTP request to your webhook endpoint, you must verify that incoming webhooks are authentic.

HMAC Signature Verification

The most common approach is HMAC (Hash-based Message Authentication Code) signing. The provider computes a hash of the payload using a shared secret and includes it in a header. The consumer recomputes the hash and compares.

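// Provider side: signing the webhook
const crypto = require('crypto');

// body must be the exact serialized string that is sent on the wire;
// the consumer recomputes the HMAC over the raw bytes it receives
function signPayload(body, secret) {
  const hmac = crypto.createHmac('sha256', secret);
  hmac.update(body);
  return hmac.digest('hex');
}

// Serialize once, sign that string, and send the same string as the request body
const body = JSON.stringify(payload);
const signature = signPayload(body, webhookSecret);
// Header: X-Webhook-Signature: sha256=a1b2c3d4e5...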

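// Consumer side: verifying the webhook
const crypto = require('crypto');
const express = require('express');

const app = express();

function verifyWebhook(rawBody, signature, secret) {
  const expected = crypto.createHmac('sha256', secret)
    .update(rawBody)
    .digest('hex');

  const received = Buffer.from(signature, 'utf8');
  const computed = Buffer.from(expected, 'utf8');

  // timingSafeEqual throws when the lengths differ, so check the length first,
  // then use the timing-safe comparison to prevent timing attacks
  return received.length === computed.length &&
    crypto.timingSafeEqual(received, computed);
}

// express.raw keeps req.body as a Buffer, so the HMAC covers the exact bytes sent
app.post('/webhooks', express.raw({ type: 'application/json' }), (req, res) => {
  // A missing header is treated as an invalid signature instead of crashing
  const header = req.headers['x-webhook-signature'] || '';
  const signature = header.replace('sha256=', '');
  if (!verifyWebhook(req.body, signature, process.env.WEBHOOK_SECRET)) {
    return res.status(401).json({ error: 'Invalid signature' });
  }
  // Process the verified webhook (handleEvent is application-specific)
  const event = JSON.parse(req.body.toString('utf8'));
  handleEvent(event);
  res.status(200).json({ received: true });
});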

Additional Security Measures

TLS only — Always use HTTPS endpoints. Reject any webhook registrations with HTTP URLs.
Timestamp validation — Include a timestamp in the signed payload and reject webhooks older than 5 minutes to prevent replay attacks.
IP allowlisting — If the provider publishes its IP ranges, restrict incoming traffic to those ranges as an additional layer.
Secret rotation — Support multiple active secrets during rotation periods so webhooks are not disrupted when secrets are changed.
Webhook registration verification — When a consumer registers a URL, send a verification challenge (like a GET request with a challenge token) to confirm they own the endpoint.
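
Timestamp validation and secret rotation from the list above can be combined in one verification step. The following is a minimal sketch, assuming the provider signs the string "<timestamp>.<body>" and sends the timestamp alongside the signature; that format and the function names are assumptions for illustration, so check your provider's documentation for its actual scheme.

```javascript
const crypto = require('crypto');

const TOLERANCE_SECONDS = 5 * 60; // reject deliveries more than 5 minutes old

// Sign "<timestamp>.<body>" so the timestamp cannot be changed independently
function signWithTimestamp(body, secret, timestamp) {
  return crypto.createHmac('sha256', secret)
    .update(`${timestamp}.${body}`)
    .digest('hex');
}

function safeEqual(a, b) {
  const bufA = Buffer.from(a, 'utf8');
  const bufB = Buffer.from(b, 'utf8');
  // timingSafeEqual requires equal lengths
  return bufA.length === bufB.length && crypto.timingSafeEqual(bufA, bufB);
}

// secrets: every currently active secret (old and new during a rotation).
// nowSeconds is injectable so the check can be tested deterministically.
function verifyTimestamped(body, timestamp, signature, secrets,
                           nowSeconds = Math.floor(Date.now() / 1000)) {
  if (!Number.isFinite(timestamp)) return false;
  if (Math.abs(nowSeconds - timestamp) > TOLERANCE_SECONDS) return false;
  return secrets.some((secret) =>
    safeEqual(signature, signWithTimestamp(body, secret, timestamp)));
}
```

During a rotation the consumer passes both the old and the new secret; once the provider has switched over, the old one is dropped from the list.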

Retry Policies

Network failures, temporary outages, and deployment windows mean webhook deliveries will sometimes fail. A robust retry policy is essential for reliability.

Exponential Backoff with Jitter

// Retry schedule with exponential backoff
Attempt 1: Immediate
Attempt 2: 1 minute (+ random jitter 0-30s)
Attempt 3: 5 minutes (+ random jitter 0-60s)
Attempt 4: 30 minutes (+ random jitter 0-120s)
Attempt 5: 2 hours (+ random jitter 0-300s)
Attempt 6: 8 hours
Attempt 7: 24 hours
After 7 failures: Move to dead letter queue

Jitter prevents thundering herd problems when many webhooks fail simultaneously (e.g., during a consumer outage). For more on message reliability patterns, see Message Queues.
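
The schedule above could be encoded roughly as follows. This sketch assumes each delay is measured from the previous failed attempt; the names RETRY_SCHEDULE and nextRetryDelay are hypothetical, and the random source is injectable so the jitter can be tested.

```javascript
// Base delay and maximum jitter in seconds, mirroring the schedule above.
// Index 0 is the wait before attempt 2; attempt 1 is sent immediately.
const RETRY_SCHEDULE = [
  { delay: 60, jitter: 30 },
  { delay: 300, jitter: 60 },
  { delay: 1800, jitter: 120 },
  { delay: 7200, jitter: 300 },
  { delay: 28800, jitter: 0 },
  { delay: 86400, jitter: 0 },
];

// failedAttempts: how many delivery attempts have failed so far (1 or more).
// Returns the seconds to wait before the next attempt, or null when the
// event should move to the dead letter queue.
function nextRetryDelay(failedAttempts, random = Math.random) {
  if (!Number.isInteger(failedAttempts) || failedAttempts < 1) {
    throw new RangeError('failedAttempts must be a positive integer');
  }
  const step = RETRY_SCHEDULE[failedAttempts - 1];
  if (!step) return null;
  // Uniform jitter in whole seconds from 0 to step.jitter inclusive
  return step.delay + Math.floor(random() * (step.jitter + 1));
}
```

Because each consumer's retries get a different random offset, deliveries that failed together do not all come back at the same instant.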

Response Code Handling

Response Code	Action	Rationale
200-299	Success, no retry	Consumer acknowledged receipt
301, 308	Follow redirect, update URL	Endpoint has moved permanently
400	No retry, alert consumer	Malformed request (likely a bug)
401, 403	No retry, alert consumer	Authentication or authorization failure
404	No retry, disable webhook	Endpoint does not exist
410	No retry, disable webhook	Endpoint explicitly gone
429	Retry with backoff, respect Retry-After	Consumer is rate limiting
500-599	Retry with backoff	Server error, likely transient
Timeout	Retry with backoff	Network issue or slow processing

For rate limiting best practices on the consumer side, see Rate Limiting.
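
On the provider side, the response code table above might translate into a small classification function like the one below. The action names and the handling of codes not listed in the table are assumptions made for this sketch.

```javascript
// Maps a delivery outcome to an action, following the table above.
// status is the HTTP status code, or null when the request timed out.
function classifyResponse(status) {
  if (status === null) return 'retry'; // timeout
  if (status >= 200 && status <= 299) return 'success';
  if (status === 301 || status === 308) return 'follow_redirect';
  if (status === 404 || status === 410) return 'disable';
  if (status === 400 || status === 401 || status === 403) return 'alert';
  if (status === 429) return 'retry'; // caller should also honor Retry-After
  if (status >= 500 && status <= 599) return 'retry';
  return 'alert'; // assumption: codes not listed in the table are not retried
}

// Retry-After is either a number of seconds or an HTTP date.
// Returns the wait in seconds, or null if the header is absent or unparseable.
function retryAfterSeconds(headerValue, nowMs = Date.now()) {
  if (!headerValue) return null;
  const trimmed = headerValue.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed);
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return null;
  return Math.max(0, Math.ceil((dateMs - nowMs) / 1000));
}
```

When a 429 carries a Retry-After value, the provider can use the larger of that value and its own backoff delay.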

Idempotency

Because webhooks can be retried, consumers must handle duplicate deliveries gracefully. Idempotency means processing the same event multiple times produces the same result as processing it once.

// Idempotent webhook handler using event ID
const processedEvents = new Set(); // In production, use Redis or a database

// express.json() parses the body so event.id is available
// (signature verification is omitted here for brevity)
app.post('/webhooks', express.json(), async (req, res) => {
  const event = req.body;

  // Check if we already processed this event
  if (processedEvents.has(event.id)) {
    return res.status(200).json({ received: true, duplicate: true });
  }

  // Process the event; record the ID only after processing succeeds
  try {
    await handleEvent(event);
    processedEvents.add(event.id);
    res.status(200).json({ received: true });
  } catch (err) {
    // A 5xx response tells the provider to retry this delivery
    res.status(500).json({ error: 'Processing failed' });
  }
});

-- Database-backed idempotency check (PostgreSQL)
CREATE TABLE processed_webhooks (
  event_id VARCHAR(255) PRIMARY KEY,
  event_type VARCHAR(100) NOT NULL,
  processed_at TIMESTAMP DEFAULT NOW(),
  payload JSONB
);

-- Insert with conflict handling
INSERT INTO processed_webhooks (event_id, event_type, payload)
VALUES ('evt_123', 'payment.completed', '{"amount": 5000}')
ON CONFLICT (event_id) DO NOTHING
RETURNING event_id;

If the INSERT returns a row, the event is new and should be processed. If it returns nothing, the event was already processed and should be skipped.
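
From application code, the insert-and-check step might look like the sketch below. It assumes a database client whose query(text, values) method resolves to an object with a rows array, which is the shape node-postgres (pg) uses; the client is passed in as a parameter, and claimEvent is a hypothetical name.

```javascript
// db: any client exposing query(text, values) that resolves to { rows }.
// Returns true if this delivery is the first one seen for the event ID.
async function claimEvent(db, event) {
  const result = await db.query(
    `INSERT INTO processed_webhooks (event_id, event_type, payload)
     VALUES ($1, $2, $3)
     ON CONFLICT (event_id) DO NOTHING
     RETURNING event_id`,
    [event.id, event.type, JSON.stringify(event.data)]
  );
  // One returned row means the insert succeeded, so the event is new
  return result.rows.length === 1;
}
```

Because the row is written before the event is handled, a crash during processing would leave the event marked as done. Running the insert and the business logic in the same database transaction is one way to avoid that.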

Dead Letter Queues

When all retry attempts are exhausted, events should not be silently dropped. A dead letter queue (DLQ) captures failed deliveries for manual inspection, debugging, and reprocessing.

// Dead letter queue entry
{
  "eventId": "evt_1234567890",
  "webhookUrl": "https://consumer.com/webhooks",
  "payload": { "type": "order.completed", "data": { "orderId": "ord_abc" } },
  "attempts": 7,
  "lastAttemptAt": "2024-06-16T14:30:00Z",
  "lastResponseCode": 503,
  "lastError": "Service Unavailable",
  "firstAttemptAt": "2024-06-15T14:30:00Z"
}

Provide a dashboard or API for consumers to view and replay failed webhooks from the DLQ. This turns permanent failures into temporary ones. For queue design patterns, see Message Queues.
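
A replayable DLQ can be sketched with an in-memory map. The DeadLetterQueue class and its method names are hypothetical, delivery is modeled as a synchronous caller-supplied function for brevity, and a real system would persist entries and deliver asynchronously.

```javascript
// In-memory stand-in for a dead letter queue; entries use the shape shown above.
class DeadLetterQueue {
  constructor() {
    this.entries = new Map(); // eventId -> entry
  }

  add(entry) {
    this.entries.set(entry.eventId, entry);
  }

  // Entries for one consumer endpoint, e.g. to back a dashboard view
  list(webhookUrl) {
    return [...this.entries.values()].filter((e) => e.webhookUrl === webhookUrl);
  }

  // deliver(url, payload) is supplied by the caller and returns true on success.
  // The entry is removed only after a successful redelivery.
  replay(eventId, deliver) {
    const entry = this.entries.get(eventId);
    if (!entry) return false;
    const ok = deliver(entry.webhookUrl, entry.payload);
    if (ok) {
      this.entries.delete(eventId);
    } else {
      entry.attempts += 1;
    }
    return ok;
  }
}
```

Leaving failed replays in the queue means a consumer can fix its endpoint and try again without losing the event.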

Rate Limiting Webhook Delivery

When a burst of events occurs (e.g., a bulk import triggers thousands of webhooks), you can overwhelm the consumer. Implement rate limiting on the provider side to protect consumers.

Per-consumer rate limits — Limit to N webhook deliveries per second per consumer endpoint.
Batch mode — When events accumulate faster than the rate limit, batch multiple events into a single delivery.
Priority queues — Assign higher priority to critical event types (payment events over analytics events).
Backpressure signals — If a consumer returns 429, dynamically reduce the delivery rate for that consumer.

// Batched webhook delivery
{
  "batch": true,
  "events": [
    { "id": "evt_001", "type": "order.created", "data": { "orderId": "ord_a" } },
    { "id": "evt_002", "type": "order.created", "data": { "orderId": "ord_b" } },
    { "id": "evt_003", "type": "order.created", "data": { "orderId": "ord_c" } }
  ]
}
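
Per-consumer rate limits are often implemented with a token bucket. The sketch below is one possible version, with a hypothetical TokenBucket class and an injectable clock so the refill logic can be tested; the rate and capacity values are placeholders.

```javascript
// Token bucket: holds up to `capacity` tokens, refilled at `ratePerSecond`.
// Each delivery consumes one token; with no token left, the delivery waits.
class TokenBucket {
  constructor(ratePerSecond, capacity, now = () => Date.now()) {
    this.ratePerSecond = ratePerSecond;
    this.capacity = capacity;
    this.now = now;
    this.tokens = capacity;
    this.lastRefill = now();
  }

  tryRemoveToken() {
    // Refill based on the time elapsed since the last call
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.ratePerSecond);
    this.lastRefill = current;

    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// One bucket per consumer endpoint (10 deliveries per second is a placeholder)
const buckets = new Map();
function canDeliver(endpointUrl) {
  if (!buckets.has(endpointUrl)) {
    buckets.set(endpointUrl, new TokenBucket(10, 10));
  }
  return buckets.get(endpointUrl).tryRemoveToken();
}
```

When canDeliver returns false, the provider can hold the event and later send it as part of a batch like the one above. Lowering ratePerSecond after a 429 is one way to apply the backpressure signal.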

Monitoring and Observability

Webhooks fail silently if you do not monitor them. Both providers and consumers need visibility into webhook health.

Provider-Side Metrics

Delivery success rate per consumer and event type
Average delivery latency (time from event to successful delivery)
Retry rate and retry depth distribution
Dead letter queue depth and growth rate
Consumer endpoint response time distribution

Consumer-Side Metrics

Webhook receipt rate (events per minute)
Processing success/failure rate
Processing latency (time to handle each webhook)
Duplicate event rate (idempotency filter hit rate)
Queue depth if using an internal processing queue

Alerting Rules

# Example alerting rules (Prometheus-style)
- alert: WebhookDeliveryFailureRate
  expr: rate(webhook_delivery_failures_total[5m]) / rate(webhook_deliveries_total[5m]) > 0.05
  for: 10m
  annotations:
    summary: "Webhook delivery failure rate exceeds 5%"

- alert: WebhookDLQGrowing
  expr: webhook_dlq_depth > 100
  for: 30m
  annotations:
    summary: "Dead letter queue depth exceeds 100 events"

- alert: WebhookConsumerSlow
  expr: histogram_quantile(0.95, sum by (le) (rate(webhook_consumer_response_seconds_bucket[5m]))) > 10
  for: 5m
  annotations:
    summary: "Consumer p95 response time exceeds 10 seconds"

Real-World Webhook Implementations

Stripe

Stripe is often considered the gold standard for webhook design. Key features include:

Events are signed with HMAC-SHA256 using a per-endpoint signing secret
Timestamps are included in the signature to prevent replay attacks
Fat payloads with the full object state
Up to 3 days of automatic retries with exponential backoff
Event types use a dot-notation hierarchy (e.g., payment_intent.succeeded)
A dashboard for monitoring delivery status and manually retrying failed events
Test mode webhooks for development

GitHub

GitHub webhooks power CI/CD pipelines, bots, and integrations worldwide:

Events are signed with HMAC-SHA256 in the X-Hub-Signature-256 header
Moderately detailed payloads that carry the repository, sender, and action details for each event
Events include a X-GitHub-Event header for easy routing
Delivery UUIDs in X-GitHub-Delivery for idempotency
A delivery log with request/response details in the webhook settings UI
Ping events on webhook creation to verify the endpoint
Support for organization-level and repository-level webhooks

Twilio

Twilio uses webhooks for SMS, voice, and messaging notifications:

Supports both request body signing and URL validation
Uses X-Twilio-Signature with HMAC-SHA1
Provides fallback URLs for when the primary webhook fails
Status callback webhooks track message delivery states
Supports both synchronous responses (TwiML) and asynchronous processing

Consumer Best Practices

Respond quickly — Return 200 within 5-10 seconds. Offload heavy processing to a background queue to avoid timeouts.
Use a queue — Write the webhook payload to an internal message queue and return 200 immediately. Process asynchronously.
Implement idempotency — Always check the event ID before processing. Duplicate deliveries are a certainty, not an edge case.
Verify signatures — Never process unverified webhooks. Always validate the HMAC signature before any business logic.
Handle unknown event types gracefully — Return 200 for event types you do not handle. Returning an error triggers unnecessary retries.
Log everything — Log the full request (headers, body, timestamp) for debugging delivery issues.
Implement reconciliation — Periodically poll the provider API to catch any missed webhooks. Webhooks are best-effort; they should not be your only data sync mechanism.
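
The "respond quickly", "use a queue", and "handle unknown event types" practices above fit together as in the following sketch. The in-memory array stands in for a durable queue, and receiveWebhook, drainQueue, and KNOWN_TYPES are hypothetical names; signature verification and idempotency checks are omitted to keep the example short.

```javascript
// In-memory stand-ins; a real deployment would use a durable message queue.
const queue = [];
const KNOWN_TYPES = new Set(['order.completed', 'invoice.payment_succeeded']);

// Request path: decide the HTTP status to return and do no business logic.
function receiveWebhook(event) {
  if (!event || typeof event.id !== 'string' || typeof event.type !== 'string') {
    return 400;
  }
  // Unknown types are acknowledged so the provider does not retry them
  if (!KNOWN_TYPES.has(event.type)) {
    return 200;
  }
  queue.push(event);
  return 200;
}

// Worker path: drain the queue outside the request/response cycle.
function drainQueue(handle) {
  let processed = 0;
  while (queue.length > 0) {
    handle(queue.shift());
    processed += 1;
  }
  return processed;
}
```

Since the endpoint only validates and enqueues, its response time stays well under typical provider timeouts regardless of how long the handler takes.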

Provider Best Practices

Sign every payload — Use HMAC-SHA256 at minimum. Include a timestamp in the signature.
Use unique event IDs — Every event should have a globally unique identifier for idempotency.
Implement exponential backoff — Never hammer a failing endpoint with rapid retries.
Provide a test mode — Let developers trigger test events without real data.
Offer event filtering — Let consumers subscribe only to event types they care about.
Build a delivery dashboard — Show delivery history, response codes, and retry status.
Support secret rotation — Allow consumers to update their signing secret without downtime.
Document thoroughly — Document every event type, payload schema, retry policy, and security requirement. See API Design Best Practices for documentation guidelines.

Frequently Asked Questions

Are webhooks guaranteed to be delivered?

No. Webhooks provide at-least-once delivery at best. Network failures, consumer outages, and provider issues can all cause missed deliveries. This is why you should implement reconciliation: periodically poll the provider API to verify you have received all expected events. Treat webhooks as a performance optimization over polling, not as a guaranteed message bus. For guaranteed delivery, consider using a message queue like RabbitMQ or Kafka.
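
The comparison at the heart of reconciliation is small. This sketch assumes the events have already been fetched from the provider's API into an array of objects with an id field, and that the consumer keeps a set of the event IDs it has received; findMissedEvents is a hypothetical name.

```javascript
// providerEvents: events fetched from the provider's API (assumed shape: { id, ... }).
// seenIds: Set of event IDs already received through webhooks.
// Returns the events that never arrived and still need processing.
function findMissedEvents(providerEvents, seenIds) {
  return providerEvents.filter((event) => !seenIds.has(event.id));
}
```

Missed events can then go through the same handler as webhook deliveries, so the idempotency check applies to both paths.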

How long should webhook processing take?

Your webhook endpoint should return a 2xx response within 5-10 seconds. Most providers have a timeout of 10-30 seconds, after which they consider the delivery failed and schedule a retry. To handle long-running processing, immediately write the event to an internal queue and return 200. Process the event asynchronously from the queue.

How do I test webhooks locally?

Use tunneling tools like ngrok, Cloudflare Tunnel, or localtunnel to expose your local development server to the internet. Most providers also offer CLI tools for forwarding webhooks locally (e.g., stripe listen --forward-to localhost:3000/webhooks). You can also use the provider test mode or replay events from the delivery dashboard.

Should I use webhooks or message queues?

Webhooks are ideal for cross-organization or cross-platform integrations where you cannot share a message queue. They work over standard HTTP and require no shared infrastructure. Message queues are better for internal system communication where you need guaranteed delivery, ordering, and can share infrastructure. Many systems use both: webhooks for external integrations and message queues internally. See Message Queues for an in-depth comparison.

How do I handle webhook ordering?

Webhooks are generally not guaranteed to arrive in order, especially with retries. Include a sequence number or timestamp in the payload so consumers can detect out-of-order delivery. For critical ordering requirements, include the resource version or timestamp and let the consumer discard events with an older version than what they have already processed. For strict ordering requirements, consider using a message queue with ordered delivery instead.
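
Discarding stale events by version can be sketched as follows. The resourceId and version fields and the applyIfNewer function are assumptions for this example; an in-memory map stands in for whatever store holds the consumer's current state.

```javascript
// resourceId -> highest version applied so far
const latestVersion = new Map();

// Applies the event only if it is newer than what has already been processed.
// Returns true if applied, false if the event was stale and discarded.
function applyIfNewer(event, apply) {
  const { resourceId, version } = event;
  const current = latestVersion.get(resourceId);
  if (current !== undefined && version <= current) {
    return false;
  }
  apply(event);
  latestVersion.set(resourceId, version);
  return true;
}
```

A late retry of an old event is then dropped, and a duplicate of the latest event is dropped as well because its version is not greater than the stored one.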
