Load Balancing: The Complete Guide for System Design
Load balancing is a critical component of any scalable system architecture. It distributes incoming network traffic across multiple servers to ensure no single server bears too much load, improving availability, reliability, and performance. Whether you are handling thousands or millions of requests per second, understanding load balancing concepts is essential for designing robust systems. This guide covers Layer 4 vs Layer 7 load balancing, algorithms, health checks, sticky sessions, and popular tools like HAProxy and Nginx.
What is Load Balancing?
A load balancer sits between clients and a pool of backend servers, distributing incoming requests across the available servers. It acts as a traffic cop, ensuring efficient utilization of server resources while maintaining high availability. If a server goes down, the load balancer automatically routes traffic to healthy servers.
Load balancers provide several key benefits: horizontal scalability (add more servers to handle more traffic), high availability (continue operating when servers fail), performance optimization (distribute load evenly), and SSL termination (offload encryption from backend servers).
Layer 4 vs Layer 7 Load Balancing
Load balancers operate at different layers of the OSI model, each with distinct capabilities and trade-offs.
Layer 4 (Transport Layer) Load Balancing
Layer 4 load balancers operate at the TCP/UDP level. They route traffic based on IP addresses and port numbers without inspecting the application-layer content. L4 load balancers are extremely fast because they simply forward packets without parsing HTTP headers, URLs, or cookies.
Layer 7 (Application Layer) Load Balancing
Layer 7 load balancers understand HTTP/HTTPS and can make routing decisions based on URL paths, HTTP headers, cookies, query parameters, and request body content. This enables sophisticated routing like content-based routing, A/B testing, and canary deployments.
| Feature | Layer 4 | Layer 7 |
|---|---|---|
| OSI Layer | Transport (TCP/UDP) | Application (HTTP/HTTPS) |
| Routing Based On | IP address, port number | URL, headers, cookies, content |
| Performance | Very fast (no content inspection) | Slower (parses application data) |
| SSL Termination | Typically no (forwards encrypted traffic; some L4 LBs such as AWS NLB can terminate TLS) | Yes (decrypts and inspects) |
| Content-Based Routing | No | Yes |
| WebSocket Support | Yes (transparent) | Yes (with upgrade handling) |
| Use Case | High-throughput TCP services, databases | Web applications, APIs, microservices |
| Examples | AWS NLB, HAProxy (TCP mode) | AWS ALB, Nginx, HAProxy (HTTP mode) |
Load Balancing Algorithms
Round Robin
The simplest algorithm. Requests are distributed sequentially across servers. Server 1 gets the first request, server 2 gets the second, and so on, cycling back to server 1 after the last server. Works best when all servers have equal capacity and requests are roughly uniform.
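The cycling behavior is easy to sketch in a few lines of Python (server addresses are placeholders):

```python
from itertools import cycle

# Hypothetical server pool for illustration.
servers = ["10.0.1.1", "10.0.1.2", "10.0.1.3"]
pool = cycle(servers)

# Each call to next() returns the next server in sequence,
# wrapping back to the first after the last.
assignments = [next(pool) for _ in range(5)]
print(assignments)
# ['10.0.1.1', '10.0.1.2', '10.0.1.3', '10.0.1.1', '10.0.1.2']
```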
Weighted Round Robin
Each server is assigned a weight proportional to its capacity. A server with weight 3 receives three times as many requests as a server with weight 1. Useful when servers have different hardware specifications.
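A naive sketch of the weighting (names and weights illustrative): repeating each server in the rotation `weight` times yields the stated 3:1 ratio. Production balancers such as nginx interleave picks more smoothly, but the long-run distribution is the same.

```python
from itertools import cycle

# (address, weight) pairs; weights are illustrative.
servers = [("big", 3), ("small", 1)]

# Naive weighted round robin: repeat each server `weight` times
# in the rotation, then cycle through it.
rotation = [addr for addr, w in servers for _ in range(w)]
pool = cycle(rotation)

window = [next(pool) for _ in range(8)]
print(window.count("big"), window.count("small"))  # 6 2
```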
Least Connections
Routes each new request to the server with the fewest active connections. This adapts naturally to varying request processing times and server capacities. Ideal when requests have significantly different processing times.
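The selection rule reduces to a minimum over in-flight connection counts; a minimal sketch (counts are illustrative):

```python
# Track active connections per server; placeholder counts.
active = {"s1": 12, "s2": 4, "s3": 9}

def pick_least_connections(active):
    # Route to the server with the fewest in-flight requests.
    return min(active, key=active.get)

target = pick_least_connections(active)
active[target] += 1  # connection opened; decrement on completion
print(target)  # s2
```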
Weighted Least Connections
Combines least connections with server weights. The algorithm considers both the current connection count and the server's weight to make routing decisions.
IP Hash
Computes a hash of the client's IP address to determine which server receives the request. The same client IP always maps to the same server, providing basic session affinity without cookies.
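A minimal sketch of the mapping, assuming a simple hash-modulo scheme:

```python
import hashlib

servers = ["10.0.1.1", "10.0.1.2", "10.0.1.3"]

def pick_by_ip(client_ip, servers):
    # Stable hash of the client IP: the same IP always maps to
    # the same index while the server list is unchanged.
    h = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
    return servers[h % len(servers)]

# Same client, same backend, every time.
assert pick_by_ip("203.0.113.7", servers) == pick_by_ip("203.0.113.7", servers)
```

Note that adding or removing a server changes the modulus and remaps most clients, which is why this scheme only provides affinity while the pool is stable.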
Least Response Time
Routes requests to the server with the fastest average response time and fewest active connections. This provides the best user experience but requires the load balancer to track response time metrics.
Random
Randomly selects a server for each request. Simple to implement and provides reasonable distribution. The "power of two random choices" variant picks two random servers and routes to the one with fewer connections, providing near-optimal distribution.
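The "power of two random choices" variant is short enough to sketch directly (connection counts are illustrative):

```python
import random

# Active connection counts per server (illustrative).
active = {"s1": 10, "s2": 3, "s3": 7, "s4": 5}

def power_of_two_choices(active):
    # Sample two distinct servers uniformly at random and route
    # to whichever has fewer active connections.
    a, b = random.sample(list(active), 2)
    return a if active[a] <= active[b] else b

target = power_of_two_choices(active)
active[target] += 1
```

The appeal of this variant is that it needs no global coordination: each pick looks at only two counters yet avoids the worst-case pile-ups of purely random selection.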
| Algorithm | Best For | Drawback |
|---|---|---|
| Round Robin | Uniform servers and requests | Does not account for server load |
| Weighted Round Robin | Heterogeneous server capacities | Static weights need manual tuning |
| Least Connections | Variable request processing times | Does not consider server capacity |
| IP Hash | Session affinity without cookies | Uneven distribution with NAT/proxies |
| Least Response Time | Best user experience | Higher LB overhead for tracking |
Health Checks
Health checks allow the load balancer to detect unhealthy servers and stop routing traffic to them. There are two main types:
Active Health Checks
The load balancer periodically sends probe requests to each backend server. If a server fails to respond or returns an error, it is marked as unhealthy and removed from the pool.
Passive Health Checks
The load balancer monitors actual client traffic for errors. If a server returns too many errors or timeouts during normal operations, it is marked as unhealthy.
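A minimal active probe can be sketched in Python with the standard library (URL and thresholds are illustrative):

```python
import urllib.request

def is_healthy(url, timeout=2):
    # Active probe: healthy only on a 200 response within the
    # timeout; any connection error, HTTP error, or timeout
    # counts as a failure (all raise OSError subclasses).
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

# In practice a server is removed only after N consecutive
# failures ("fall") and re-added after M consecutive successes
# ("rise"), to avoid flapping on a single slow response.
```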
```nginx
# Nginx load balancing configuration. Note: open-source Nginx does
# passive health checks via max_fails/fail_timeout; active checks
# (the health_check directive) require NGINX Plus.
upstream backend {
    least_conn;
    server 10.0.1.1:8080 weight=3 max_fails=3 fail_timeout=30s;
    server 10.0.1.2:8080 weight=2 max_fails=3 fail_timeout=30s;
    server 10.0.1.3:8080 weight=1 max_fails=3 fail_timeout=30s;
    server 10.0.1.4:8080 backup;  # Only used when all others are down
}

server {
    listen 80;
    server_name api.swehelper.com;

    location / {
        proxy_pass http://backend;
        # Retry the next upstream on connection errors and 5xx responses
        proxy_next_upstream error timeout http_500 http_502 http_503;
        proxy_connect_timeout 5s;
        proxy_read_timeout 30s;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    # Health check endpoint
    location /health {
        proxy_pass http://backend;
        access_log off;
    }
}
```
```haproxy
# HAProxy configuration with health checks
global
    maxconn 50000
    log stdout format raw local0

defaults
    mode http
    timeout connect 5s
    timeout client 30s
    timeout server 30s
    option httplog
    option dontlognull

frontend http_front
    bind *:80
    bind *:443 ssl crt /etc/ssl/certs/swehelper.pem
    # Route based on URL path (Layer 7)
    acl is_api path_beg /api/
    acl is_static path_beg /static/
    acl is_ws hdr(Upgrade) -i websocket
    use_backend api_servers if is_api
    use_backend static_servers if is_static
    use_backend ws_servers if is_ws
    default_backend web_servers

backend api_servers
    balance leastconn
    option httpchk GET /health
    http-check expect status 200
    server api1 10.0.1.1:8080 check inter 5s fall 3 rise 2 weight 3
    server api2 10.0.1.2:8080 check inter 5s fall 3 rise 2 weight 3
    server api3 10.0.1.3:8080 check inter 5s fall 3 rise 2 weight 1

backend static_servers
    # Referenced by the frontend above; HAProxy refuses to start if a
    # use_backend target is undefined. Addresses are illustrative.
    balance roundrobin
    server static1 10.0.4.1:8082 check

backend web_servers
    balance roundrobin
    option httpchk GET /
    cookie SERVERID insert indirect nocache
    server web1 10.0.2.1:3000 check cookie web1
    server web2 10.0.2.2:3000 check cookie web2

backend ws_servers
    balance source
    option httpchk GET /ws/health
    timeout tunnel 3600s
    server ws1 10.0.3.1:8081 check
    server ws2 10.0.3.2:8081 check

listen stats
    bind *:8404
    stats enable
    stats uri /stats
    stats refresh 10s
```
Sticky Sessions
Sticky sessions (session affinity) ensure that requests from the same client are always routed to the same backend server. This is necessary when the server maintains session state (like shopping carts or login sessions).
Methods for Implementing Sticky Sessions
Cookie-Based: The load balancer inserts a cookie identifying the backend server. Subsequent requests include this cookie, and the LB routes to the same server.
IP Hash: The client IP address is hashed to consistently map to the same server.
URL/Header-Based: Use a session token in the URL or a custom header for routing.
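The cookie-based method can be sketched as follows (the cookie name `SERVERID` mirrors the HAProxy example above; the pool and picker are illustrative):

```python
# Cookie-based stickiness sketch: the LB stamps the chosen server
# into a cookie on the first response and honors it afterwards.
pool = {"web1": "10.0.2.1:3000", "web2": "10.0.2.2:3000"}

def route(cookies, pick_new):
    server_id = cookies.get("SERVERID")
    if server_id in pool:
        return server_id, cookies               # honor existing affinity
    server_id = pick_new()                      # first visit: pick a server
    cookies = {**cookies, "SERVERID": server_id}  # sent via Set-Cookie
    return server_id, cookies

sid, cookies = route({}, pick_new=lambda: "web1")
sid2, _ = route(cookies, pick_new=lambda: "web2")
assert sid == sid2 == "web1"  # second request sticks despite a new pick
```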
However, sticky sessions reduce the effectiveness of load balancing and create problems during server failures (stuck sessions are lost). The preferred approach is to make your application stateless by storing session data in an external store like Redis or a database.
SSL Termination
SSL/TLS termination at the load balancer offloads the CPU-intensive encryption and decryption from backend servers. The load balancer handles HTTPS from clients and forwards unencrypted HTTP to backend servers within the trusted internal network. For end-to-end encryption, use SSL passthrough or re-encryption to the backend.
Global Server Load Balancing (GSLB)
GSLB distributes traffic across data centers in different geographic regions using DNS-based routing. It provides disaster recovery, latency-based routing, and geographic load distribution. Services like AWS Route 53, Cloudflare, and Azure Traffic Manager provide GSLB capabilities that integrate with CDN architectures.
Cloud Load Balancers
| Service | Layer | Key Features |
|---|---|---|
| AWS ALB | Layer 7 | Path-based routing, WebSocket, gRPC, WAF integration |
| AWS NLB | Layer 4 | Ultra-high performance, static IPs, TLS termination or passthrough |
| GCP Cloud Load Balancing | L4 and L7 | Global anycast, auto-scaling, CDN integration |
| Azure Load Balancer | Layer 4 | Zone-redundant, HA ports, outbound rules |
| Azure Application Gateway | Layer 7 | WAF, SSL termination, URL-based routing |
Load balancing works closely with reverse proxies, CDNs, and DNS to create a resilient, high-performance architecture. Understanding how these components interact is essential for system design. Test your server configurations using our API and Network Tools and explore more resources at our tools page.
Frequently Asked Questions
What happens when a load balancer itself fails?
Load balancers are typically deployed in a high-availability pair using active-passive or active-active configuration. If the primary load balancer fails, a secondary instance takes over automatically (using technologies like VRRP or floating IPs). Cloud-managed load balancers (AWS ALB, GCP LB) are inherently redundant and distributed, so they handle failover automatically without user intervention.
Should I use L4 or L7 load balancing?
Use L7 when you need content-based routing (different backends for different URL paths), SSL termination, HTTP header manipulation, or WebSocket handling. Use L4 when you need maximum throughput for non-HTTP protocols (database connections, game servers, custom TCP protocols) or when you do not need to inspect application content. Many architectures use both: L4 for initial traffic distribution and L7 for application-level routing.
How many backend servers do I need?
Start with at least 2 servers for redundancy (N+1 rule). Size based on your traffic patterns: measure requests per second, CPU utilization, and response times per server under load. Plan for peak traffic (which can be 3-10x average). Auto-scaling groups in cloud environments can dynamically adjust the number of servers based on metrics like CPU utilization or request queue depth.
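As a back-of-the-envelope sketch of that sizing (all numbers are illustrative; measure your own service under load):

```python
import math

peak_rps = 12_000        # expected peak requests per second
per_server_rps = 1_500   # sustainable RPS per server at target latency
headroom = 0.7           # run servers at ~70% of max capacity

needed = math.ceil(peak_rps / (per_server_rps * headroom))
total = needed + 1       # N+1: one spare for failures and deploys
print(needed, total)  # 12 13
```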
What is the difference between a load balancer and a reverse proxy?
A reverse proxy sits in front of servers and handles requests on their behalf, providing caching, SSL termination, and request routing. A load balancer is a specific function that distributes traffic across multiple servers. In practice, tools like Nginx and HAProxy serve as both reverse proxies and load balancers. The terms overlap significantly, but load balancing specifically refers to distributing traffic across multiple server instances.