Design a Notification System
A notification system is a critical component of virtually every modern application. It delivers timely information to users through multiple channels — push notifications, email, SMS, and in-app alerts. At scale, this system must handle billions of notifications daily with proper prioritization, rate limiting, template management, and delivery tracking. This guide covers the complete architecture for a production-grade notification platform.
1. Requirements
Functional Requirements
- Send notifications through multiple channels: push (iOS/Android), email, SMS, in-app.
- Support various notification types: transactional (order confirmation), marketing (promotions), social (likes, follows), system (security alerts).
- Template-based notification content with dynamic variables.
- User notification preferences: opt-in/opt-out per channel and per type.
- Rate limiting: prevent notification fatigue by limiting frequency.
- Priority levels: critical notifications bypass rate limits.
- Delivery tracking: track sent, delivered, opened, clicked status.
- Scheduled notifications: send at a specific time or in the user's timezone.
Non-Functional Requirements
- High availability: 99.99% uptime. Critical notifications (security, payments) must always deliver.
- Scalability: Handle 10 billion+ notifications per day.
- Low latency: Real-time notifications delivered within 1 second.
- Reliability: At-least-once delivery with deduplication.
- Extensibility: Easy to add new channels (WhatsApp, Slack, etc.).
2. Capacity Estimation
| Metric | Estimate |
|---|---|
| Total notifications per day | 10 billion |
| Average QPS | 10B / 86,400 ≈ 115,000/sec |
| Peak QPS (3x) | ~350,000/sec |
| Push notifications per day | 5 billion (50%) |
| Emails per day | 3 billion (30%) |
| SMS per day | 1 billion (10%) |
| In-app per day | 1 billion (10%) |
| Average notification size | ~500 bytes |
| Daily data volume | 10B × 500B = 5 TB/day |
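The table's figures follow from simple back-of-envelope arithmetic; a quick sanity check (the traffic split, average size, and 3x peak factor are the assumptions from the table above):

```python
# Back-of-envelope capacity check for the estimates above.
DAILY_NOTIFICATIONS = 10_000_000_000  # 10 billion/day
SECONDS_PER_DAY = 86_400
AVG_SIZE_BYTES = 500
PEAK_FACTOR = 3

avg_qps = DAILY_NOTIFICATIONS / SECONDS_PER_DAY                # ~115,700/sec
peak_qps = avg_qps * PEAK_FACTOR                               # ~347,000/sec
daily_volume_tb = DAILY_NOTIFICATIONS * AVG_SIZE_BYTES / 1e12  # 5.0 TB/day

print(f"avg QPS:  {avg_qps:,.0f}")
print(f"peak QPS: {peak_qps:,.0f}")
print(f"daily volume: {daily_volume_tb:.1f} TB")
```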
3. High-Level Design
| Component | Responsibility |
|---|---|
| Notification API | Accept notification requests from internal services |
| Validation and Enrichment | Validate input, resolve templates, check preferences |
| Priority Queue | Separate queues by priority (critical, high, medium, low) |
| Rate Limiter | Enforce per-user and per-type notification limits |
| Channel Router | Route to appropriate delivery channel based on preferences |
| Push Worker | Delivers via APNs (iOS) and FCM (Android) |
| Email Worker | Delivers via email providers (SES, SendGrid) |
| SMS Worker | Delivers via SMS providers (Twilio, SNS) |
| In-App Worker | Delivers via WebSocket/SSE for in-app notifications |
| Template Service | Manages notification templates per type and locale |
| Delivery Tracker | Tracks delivery status, failures, retries |
| Analytics Service | Metrics on delivery rate, open rate, click rate |
4. Detailed Component Design
4.1 Notification Flow
- An internal service sends a notification request to the Notification API.
- The Validation Service checks the request, resolves the template, and looks up user preferences.
- If the user has opted out of this notification type or channel, the notification is dropped and recorded with status opted_out.
- The Rate Limiter checks whether the user has exceeded their notification limit; if so, the notification is dropped and recorded as rate_limited (critical priority bypasses this check).
- The notification is enqueued into the appropriate priority queue.
- The Channel Router determines which channels to use (push, email, SMS, in-app).
- Channel-specific workers dequeue and deliver through third-party providers.
- Delivery status is recorded in the Delivery Tracker.
- On failure, the notification is retried with exponential backoff.
// Notification API request
POST /api/v1/notifications
{
"type": "order_shipped",
"user_id": "user_12345",
"priority": "high",
"channels": ["push", "email"],
"template_data": {
"order_id": "ORD-789",
"tracking_number": "1Z999AA10123456784",
"estimated_delivery": "2025-01-28"
},
"idempotency_key": "order_shipped_ORD-789"
}
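The steps above can be sketched as a single handler. This is an illustrative outline, not a real API: the collaborators (templates, prefs, limiter, queues) are assumed interfaces standing in for the components in the high-level design.

```python
# Hypothetical end-to-end handler tying the flow steps together.
def handle_notification(request, templates, prefs, limiter, queues):
    # 1. Validate the request and resolve the template for this type.
    template = templates.get(request["type"])
    if template is None:
        return "failed"
    # 2. Keep only channels the user has not opted out of; drop if none remain.
    channels = [c for c in request["channels"]
                if prefs.is_enabled(request["user_id"], request["type"], c)]
    if not channels:
        return "opted_out"
    # 3. Enforce rate limits (critical priority bypasses them inside the limiter).
    if not limiter.check_rate_limit(request["user_id"], request["type"],
                                    request["priority"]):
        return "rate_limited"
    # 4. Enqueue onto the queue matching the request's priority.
    queues[request["priority"]].append({**request, "channels": channels})
    return "queued"
```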
4.2 Priority Queue Architecture
Different notification types have different urgency levels. Use separate message queues per priority:
| Priority | Examples | SLA | Rate Limit |
|---|---|---|---|
| Critical | Security alerts, 2FA codes, payment failures | <5 sec | No limit |
| High | Order updates, direct messages | <30 sec | 50/hour |
| Medium | Social interactions (likes, comments) | <5 min | 20/hour |
| Low | Marketing, recommendations, newsletters | <1 hour | 5/day |
Workers consume from high-priority queues first. Critical queue workers have dedicated capacity that is never shared with lower priorities.
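One simple way to express "consume from the highest-priority non-empty queue first" (in this shared-worker sketch; dedicated critical workers would poll only the critical queue):

```python
from collections import deque

PRIORITY_ORDER = ["critical", "high", "medium", "low"]

def next_notification(queues):
    """Pop from the highest-priority queue that has work, else None."""
    for priority in PRIORITY_ORDER:
        q = queues.get(priority)
        if q:
            return q.popleft()
    return None

queues = {p: deque() for p in PRIORITY_ORDER}
queues["low"].append("newsletter")
queues["critical"].append("2fa_code")
print(next_notification(queues))  # the critical item is served first
```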
4.3 Rate Limiting
class NotificationRateLimiter:
    GLOBAL_DAILY_LIMIT = 100  # max non-critical notifications per user per day
    DAY_SECONDS = 86400

    def __init__(self, redis_client):
        self.redis = redis_client

    def _incr_daily(self, key):
        # INCR, then EXPIRE on the first increment: a rolling daily counter.
        # Note: the two calls are not atomic; a Lua script (or SET NX EX)
        # would close the small window where the key could persist without a TTL.
        count = self.redis.incr(key)
        if count == 1:
            self.redis.expire(key, self.DAY_SECONDS)
        return count

    def check_rate_limit(self, user_id, notification_type, priority):
        # Critical notifications (security alerts, 2FA) bypass all limits.
        if priority == "critical":
            return True
        # Per-type limit (e.g., max 5 marketing emails per day).
        type_key = f"ratelimit:{user_id}:{notification_type}:daily"
        if self._incr_daily(type_key) > self.get_type_limit(notification_type):
            return False
        # Global per-user cap across all non-critical notifications.
        global_key = f"ratelimit:{user_id}:global:daily"
        return self._incr_daily(global_key) <= self.GLOBAL_DAILY_LIMIT
4.4 Template System
Templates separate content from delivery logic, enabling non-engineers to update notification copy:
// Template example (stored in database)
{
"template_id": "order_shipped",
"locale": "en-US",
"channels": {
"push": {
"title": "Your order is on its way!",
"body": "Order {{order_id}} shipped. Tracking: {{tracking_number}}. ETA: {{estimated_delivery}}."
},
"email": {
"subject": "Order {{order_id}} has shipped",
"html_body": "<h1>Great news!</h1><p>Your order {{order_id}} is on its way...</p>"
},
"sms": {
"body": "Your order {{order_id}} shipped. Track: {{tracking_url}}"
}
}
}
// Template rendering: unknown placeholders are left intact for easier debugging.
// Using `key in data` (not `data[key] ||`) so falsy values like 0 or "" still render.
function renderTemplate(template, data) {
  return template.replace(/\{\{(\w+)\}\}/g, (match, key) =>
    key in data ? String(data[key]) : match);
}
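A worker-side equivalent of the renderer in Python, using the same placeholder syntax as the templates above:

```python
import re

def render_template(template: str, data: dict) -> str:
    """Substitute {{name}} placeholders, leaving unknown ones intact."""
    return re.sub(r"\{\{(\w+)\}\}",
                  lambda m: str(data.get(m.group(1), m.group(0))),
                  template)

print(render_template("Order {{order_id}} shipped. ETA: {{estimated_delivery}}.",
                      {"order_id": "ORD-789", "estimated_delivery": "2025-01-28"}))
```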
4.5 Delivery Tracking
Track the lifecycle of each notification through status transitions:
| Status | Meaning |
|---|---|
| created | Notification request received |
| queued | Enqueued to priority queue |
| sent | Handed off to third-party provider (APNs, SES, Twilio) |
| delivered | Provider confirmed delivery to device/inbox |
| opened | User opened/read the notification |
| clicked | User clicked a link in the notification |
| failed | Delivery failed (invalid token, bounced email) |
| rate_limited | Dropped due to rate limiting |
| opted_out | Dropped because user opted out |
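The status table implies a small state machine. A sketch of the transitions the Delivery Tracker might enforce (the exact set of legal transitions here is an assumption):

```python
# Allowed status transitions for a notification's lifecycle (assumed policy).
TRANSITIONS = {
    "created":   {"queued", "rate_limited", "opted_out", "failed"},
    "queued":    {"sent", "failed"},
    "sent":      {"delivered", "failed"},
    "delivered": {"opened"},
    "opened":    {"clicked"},
}

def advance(current: str, nxt: str) -> str:
    """Move to the next status, rejecting illegal jumps (e.g., queued -> opened)."""
    if nxt not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {nxt}")
    return nxt

status = advance("created", "queued")
status = advance(status, "sent")
print(status)  # sent
```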
5. Database Schema
CREATE TABLE notification_templates (
id VARCHAR(50) PRIMARY KEY,
type VARCHAR(50) NOT NULL,
locale VARCHAR(10) DEFAULT 'en-US',
channel VARCHAR(20) NOT NULL,
title_template TEXT,
body_template TEXT NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
UNIQUE (type, locale, channel)
);
CREATE TABLE user_preferences (
user_id BIGINT NOT NULL,
notification_type VARCHAR(50) NOT NULL,
channel VARCHAR(20) NOT NULL,
enabled BOOLEAN DEFAULT TRUE,
PRIMARY KEY (user_id, notification_type, channel)
);
CREATE TABLE user_devices (
id BIGINT PRIMARY KEY,
user_id BIGINT NOT NULL,
platform VARCHAR(10) NOT NULL CHECK (platform IN ('ios','android','web')),
device_token TEXT NOT NULL,
is_active BOOLEAN DEFAULT TRUE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_devices_user ON user_devices(user_id, is_active);
CREATE TABLE notifications (
id BIGINT PRIMARY KEY,
user_id BIGINT NOT NULL,
type VARCHAR(50) NOT NULL,
channel VARCHAR(20) NOT NULL,
priority VARCHAR(10) NOT NULL CHECK (priority IN ('critical','high','medium','low')),
status VARCHAR(20) DEFAULT 'created' CHECK (status IN ('created','queued','sent',
    'delivered','opened','clicked','failed','rate_limited','opted_out')),
title TEXT,
body TEXT,
idempotency_key VARCHAR(128) UNIQUE,
retry_count INT DEFAULT 0,
scheduled_at TIMESTAMP,
sent_at TIMESTAMP,
delivered_at TIMESTAMP,
opened_at TIMESTAMP,
failed_reason TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_notif_user ON notifications(user_id, created_at DESC);
CREATE INDEX idx_notif_status ON notifications(status, created_at);
CREATE INDEX idx_notif_scheduled ON notifications(scheduled_at) WHERE status = 'created';
6. Key Trade-offs
| Decision | Trade-off |
|---|---|
| At-most-once vs at-least-once delivery | At-most-once risks missing notifications. At-least-once with idempotency keys prevents duplicates while ensuring delivery. Use idempotency_key to dedup retries. |
| Single queue vs priority queues | A single queue with priority tags is simpler but can starve critical notifications during surges. Separate queues per priority with dedicated workers ensure critical notifications are never delayed. |
| Sync vs async delivery | Synchronous delivery gives immediate feedback but blocks the caller. Asynchronous (queue-based) decouples producers from delivery, handles spikes via buffering, and enables retry logic. Always use async for notifications. |
| Build vs buy for channel delivery | Building direct SMTP/SMS delivery is complex and requires managing sender reputation. Using providers (SES, Twilio) is easier but adds cost and dependency. Most companies use providers for email/SMS and direct integration for push (APNs/FCM). |
7. Scaling Considerations
7.1 Queue Scaling
At ~350K notifications/sec peak, Kafka is a good fit. Use one topic per channel and priority so each channel's worker group scales independently via Kafka consumer groups. Key messages by user_id so a given user's notifications land on the same partition, preserving per-user ordering and simplifying deduplication.
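Partition assignment can be sketched as a stable hash of user_id, so all of one user's notifications land on the same partition and stay ordered. The topic naming convention below is an assumption for illustration:

```python
import hashlib

def partition_for(user_id: str, num_partitions: int) -> int:
    """Stable hash of user_id -> partition, so a user's events stay ordered."""
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def topic_for(channel: str, priority: str) -> str:
    # Assumed naming convention: one topic per channel/priority pair.
    return f"notifications.{channel}.{priority}"

print(topic_for("push", "critical"), partition_for("user_12345", 64))
```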
7.2 Third-Party Provider Rate Limits
APNs, FCM, SES, and Twilio all have rate limits. Use connection pooling and token management. Maintain multiple provider accounts and distribute load. Monitor provider health and failover between providers (e.g., SES to SendGrid for email).
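Provider failover can be as simple as trying an ordered list of senders and falling back on error. The provider interface here is hypothetical; real integrations would catch provider-specific exceptions rather than a bare Exception:

```python
def send_with_failover(message, providers):
    """Try providers in order (e.g., SES then SendGrid); return the one that worked."""
    errors = []
    for provider in providers:
        try:
            provider.send(message)        # hypothetical provider interface
            return provider.name
        except Exception as exc:          # production: catch provider-specific errors
            errors.append((provider.name, exc))
    raise RuntimeError(f"all providers failed: {errors}")
```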
7.3 Database Scaling
The notifications table grows by roughly 10B rows/day. Use time-partitioned tables (partition by month) and archive notifications older than 90 days to cold storage. Shard by user_id so real-time "my notifications" queries hit a single shard. Run analytics queries on a separate read replica or a data warehouse, not the primary.
7.4 Handling Spikes
Mass notifications (e.g., app-wide announcements) can generate billions of requests simultaneously. Pre-compute the recipient list, enqueue gradually (rate-controlled fan-out), and load balance across workers. Cache rendered templates to avoid re-rendering for each recipient.
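The rate-controlled fan-out described above can be sketched as chunking the precomputed recipient list and pacing the enqueue rate; the batch size and delay are illustrative knobs:

```python
import time

def fan_out(recipients, enqueue, batch_size=10_000, delay_sec=0.0):
    """Enqueue a mass notification in rate-controlled batches."""
    for i in range(0, len(recipients), batch_size):
        batch = recipients[i:i + batch_size]
        enqueue(batch)
        if delay_sec:
            time.sleep(delay_sec)  # pace the fan-out to protect downstream workers

batches = []
fan_out(list(range(25)), batches.append, batch_size=10)
print([len(b) for b in batches])  # 25 recipients enqueued in batches of 10
```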
8. Frequently Asked Questions
Q1: How do you prevent notification fatigue for users?
Multiple layers: (1) Per-type rate limits (e.g., max 5 marketing notifications per day). (2) Global per-user daily caps (e.g., max 100 notifications total per day). (3) Intelligent batching — aggregate similar notifications ("User A and 5 others liked your post"). (4) User-controlled preferences per channel and type. (5) ML-based send-time optimization — send when the user is most likely to engage.
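The batching idea in (3) can be sketched as grouping pending social events by post and collapsing them into one message; the message format is an assumption:

```python
from collections import defaultdict

def aggregate_likes(events):
    """Collapse individual 'like' events into one notification per post."""
    by_post = defaultdict(list)
    for event in events:
        by_post[event["post_id"]].append(event["actor"])
    messages = []
    for post_id, actors in by_post.items():
        if len(actors) == 1:
            messages.append(f"{actors[0]} liked your post")
        else:
            messages.append(f"{actors[0]} and {len(actors) - 1} others liked your post")
    return messages

events = [{"post_id": "p1", "actor": "User A"}] + \
         [{"post_id": "p1", "actor": f"User {c}"} for c in "BCDEF"]
print(aggregate_likes(events))  # ['User A and 5 others liked your post']
```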
Q2: How do you handle notification delivery failures?
Use exponential backoff with jitter for retries (1s, 2s, 4s, 8s, up to a maximum of 5 attempts). For push notifications, if a device token is invalid (APNs returns 410 Gone), mark the device as inactive and stop sending to it. For email bounces, distinguish hard bounces (permanent; remove the address) from soft bounces (temporary; retry). Track failure rates and alert on anomalies.
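The retry schedule described above, with full jitter (each delay drawn uniformly from [0, base * 2^attempt], capped), might look like:

```python
import random

def backoff_delays(max_retries=5, base=1.0, cap=60.0, seed=None):
    """Exponential backoff with full jitter: delay_n in [0, min(cap, base * 2**n)]."""
    rng = random.Random(seed)
    return [rng.uniform(0, min(cap, base * 2 ** attempt))
            for attempt in range(max_retries)]

print([round(d, 2) for d in backoff_delays(seed=42)])
```

Full jitter spreads retries out so a burst of simultaneous failures does not retry in lockstep and hammer the provider again.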
Q3: How does idempotency work for notifications?
Each notification request includes an idempotency_key (e.g., "order_shipped_ORD-789"). Before processing, the system checks if a notification with this key was already created. If yes, it returns the existing result without sending a duplicate. This is stored as a unique constraint in the database. The caller is responsible for generating meaningful, unique idempotency keys.
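A sketch of the idempotency check, with a dict standing in for the database table's unique constraint on idempotency_key (a real implementation would also persist the payload):

```python
class NotificationStore:
    """In-memory stand-in for a table with UNIQUE(idempotency_key)."""
    def __init__(self):
        self.by_key = {}
        self.next_id = 1

    def create_once(self, idempotency_key, payload):
        """Return (notification_id, created); duplicate keys return the original id."""
        if idempotency_key in self.by_key:
            return self.by_key[idempotency_key], False
        notif_id = self.next_id
        self.next_id += 1
        self.by_key[idempotency_key] = notif_id
        return notif_id, True

store = NotificationStore()
first = store.create_once("order_shipped_ORD-789", {"type": "order_shipped"})
dup = store.create_once("order_shipped_ORD-789", {"type": "order_shipped"})
print(first, dup)  # same id both times; only the first call creates a row
```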
Q4: How do you handle timezone-aware scheduled notifications?
Store the scheduled time in UTC. When scheduling, convert the user's desired local time to UTC using their timezone. A scheduler service polls for notifications where scheduled_at is in the past and status is "created," then enqueues them. For "send at 9 AM local time" across all users, pre-compute each user's UTC send time and store individually.
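Converting "send at 9 AM local time" to a stored UTC timestamp, using Python's standard zoneinfo module; the example date and timezone are illustrative:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def local_send_time_to_utc(date_str, hour, tz_name):
    """Convert a user's local send time (e.g., 9 AM) to UTC for the scheduler."""
    year, month, day = map(int, date_str.split("-"))
    local = datetime(year, month, day, hour, tzinfo=ZoneInfo(tz_name))
    return local.astimezone(timezone.utc)

# A New York user's 9 AM on 2025-01-28 is 14:00 UTC (EST is UTC-5 in January).
print(local_send_time_to_utc("2025-01-28", 9, "America/New_York"))
```

Because zoneinfo applies DST rules per date, the same 9 AM local time maps to a different UTC hour in summer, which is why the UTC send time must be computed per user and per date.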