Load Balancing: The Complete Guide for System Design
Load balancing is a critical component of any scalable system architecture. It distributes incoming network traffic across multiple servers to ensure no single server bears too much load, improving availability, reliability, and performance. Whether you are handling thousands or millions of requests per second, understanding load balancing concepts is essential for designing robust systems. This guide covers Layer 4 vs Layer 7 load balancing, algorithms, health checks, sticky sessions, and popular tools like HAProxy and Nginx.
What is Load Balancing?
A load balancer sits between clients and a pool of backend servers, distributing incoming requests across the available servers. It acts as a traffic cop, ensuring efficient utilization of server resources while maintaining high availability. If a server goes down, the load balancer automatically routes traffic to healthy servers.
Load balancers provide several key benefits: horizontal scalability (add more servers to handle more traffic), high availability (continue operating when servers fail), performance optimization (distribute load evenly), and SSL termination (offload encryption from backend servers).
Layer 4 vs Layer 7 Load Balancing
Load balancers operate at different layers of the OSI model, each with distinct capabilities and trade-offs.
Layer 4 (Transport Layer) Load Balancing
Layer 4 load balancers operate at the TCP/UDP level. They route traffic based on IP addresses and port numbers without inspecting the application-layer content. L4 load balancers are extremely fast because they simply forward packets without parsing HTTP headers, URLs, or cookies.
Layer 7 (Application Layer) Load Balancing
Layer 7 load balancers understand HTTP/HTTPS and can make routing decisions based on URL paths, HTTP headers, cookies, query parameters, and request body content. This enables sophisticated routing like content-based routing, A/B testing, and canary deployments.
| Feature | Layer 4 | Layer 7 |
|---|---|---|
| OSI Layer | Transport (TCP/UDP) | Application (HTTP/HTTPS) |
| Routing Based On | IP address, port number | URL, headers, cookies, content |
| Performance | Very fast (no content inspection) | Slower (parses application data) |
| SSL Termination | Typically no (forwards encrypted traffic; some L4 LBs such as AWS NLB can terminate TLS) | Yes (decrypts and inspects) |
| Content-Based Routing | No | Yes |
| WebSocket Support | Yes (transparent) | Yes (with upgrade handling) |
| Use Case | High-throughput TCP services, databases | Web applications, APIs, microservices |
| Examples | AWS NLB, HAProxy (TCP mode) | AWS ALB, Nginx, HAProxy (HTTP mode) |
Load Balancing Algorithms
Round Robin
The simplest algorithm. Requests are distributed sequentially across servers. Server 1 gets the first request, server 2 gets the second, and so on, cycling back to server 1 after the last server. Works best when all servers have equal capacity and requests are roughly uniform.
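The cycling behavior is easy to sketch in a few lines of Python (server addresses are placeholders):

```python
from itertools import cycle

# Hypothetical server pool for illustration.
servers = ["10.0.1.1", "10.0.1.2", "10.0.1.3"]
pool = cycle(servers)

# Each call to next() returns the next server in sequence,
# wrapping back to the first after the last.
assignments = [next(pool) for _ in range(5)]
print(assignments)
# ['10.0.1.1', '10.0.1.2', '10.0.1.3', '10.0.1.1', '10.0.1.2']
```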
Weighted Round Robin
Each server is assigned a weight proportional to its capacity. A server with weight 3 receives three times as many requests as a server with weight 1. Useful when servers have different hardware specifications.
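A naive sketch of the weighting (names and weights illustrative): repeating each server in the rotation `weight` times yields the stated 3:1 ratio. Production balancers such as nginx interleave picks more smoothly, but the long-run distribution is the same.

```python
from itertools import cycle

# (address, weight) pairs; weights are illustrative.
servers = [("big", 3), ("small", 1)]

# Naive weighted round robin: repeat each server `weight` times
# in the rotation, then cycle through it.
rotation = [addr for addr, w in servers for _ in range(w)]
pool = cycle(rotation)

window = [next(pool) for _ in range(8)]
print(window.count("big"), window.count("small"))  # 6 2
```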
Least Connections
Routes each new request to the server with the fewest active connections. This adapts naturally to varying request processing times and server capacities. Ideal when requests have significantly different processing times.
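The selection rule reduces to a minimum over in-flight connection counts; a minimal sketch (counts are illustrative):

```python
# Track active connections per server; placeholder counts.
active = {"s1": 12, "s2": 4, "s3": 9}

def pick_least_connections(active):
    # Route to the server with the fewest in-flight requests.
    return min(active, key=active.get)

target = pick_least_connections(active)
active[target] += 1  # connection opened; decrement on completion
print(target)  # s2
```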
Weighted Least Connections
Combines least connections with server weights. The algorithm considers both the current connection count and the server's weight to make routing decisions.
IP Hash
Computes a hash of the client's IP address to determine which server receives the request. The same client IP always maps to the same server, providing basic session affinity without cookies.
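A minimal sketch of the mapping, assuming a simple hash-modulo scheme:

```python
import hashlib

servers = ["10.0.1.1", "10.0.1.2", "10.0.1.3"]

def pick_by_ip(client_ip, servers):
    # Stable hash of the client IP: the same IP always maps to
    # the same index while the server list is unchanged.
    h = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
    return servers[h % len(servers)]

# Same client, same backend, every time.
assert pick_by_ip("203.0.113.7", servers) == pick_by_ip("203.0.113.7", servers)
```

Note that adding or removing a server changes the modulus and remaps most clients, which is why this scheme only provides affinity while the pool is stable.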
Least Response Time
Routes requests to the server with the fastest average response time and fewest active connections. This provides the best user experience but requires the load balancer to track response time metrics.
Random
Randomly selects a server for each request. Simple to implement and provides reasonable distribution. The "power of two random choices" variant picks two random servers and routes to the one with fewer connections, providing near-optimal distribution.
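The "power of two random choices" variant is short enough to sketch directly (connection counts are illustrative):

```python
import random

# Active connection counts per server (illustrative).
active = {"s1": 10, "s2": 3, "s3": 7, "s4": 5}

def power_of_two_choices(active):
    # Sample two distinct servers uniformly at random and route
    # to whichever has fewer active connections.
    a, b = random.sample(list(active), 2)
    return a if active[a] <= active[b] else b

target = power_of_two_choices(active)
active[target] += 1
```

The appeal of this variant is that it needs no global coordination: each pick looks at only two counters yet avoids the worst-case pile-ups of purely random selection.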
| Algorithm | Best For | Drawback |
|---|---|---|
| Round Robin | Uniform servers and requests | Does not account for server load |
| Weighted Round Robin | Heterogeneous server capacities | Static weights need manual tuning |
| Least Connections | Variable request processing times | Does not consider server capacity |
| IP Hash | Session affinity without cookies | Uneven distribution with NAT/proxies |
| Least Response Time | Best user experience | Higher LB overhead for tracking |
Health Checks
Health checks allow the load balancer to detect unhealthy servers and stop routing traffic to them. There are two main types:
Active Health Checks
The load balancer periodically sends probe requests to each backend server. If a server fails to respond or returns an error, it is marked as unhealthy and removed from the pool.
Passive Health Checks
The load balancer monitors actual client traffic for errors. If a server returns too many errors or timeouts during normal operations, it is marked as unhealthy.
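A minimal active probe can be sketched in Python with the standard library (URL and thresholds are illustrative):

```python
import urllib.request

def is_healthy(url, timeout=2):
    # Active probe: healthy only on a 200 response within the
    # timeout; any connection error, HTTP error, or timeout
    # counts as a failure (all raise OSError subclasses).
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

# In practice a server is removed only after N consecutive
# failures ("fall") and re-added after M consecutive successes
# ("rise"), to avoid flapping on a single slow response.
```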
```nginx
# Nginx load balancing configuration. Note: open-source Nginx does
# passive health checks via max_fails/fail_timeout; active checks
# (the health_check directive) require NGINX Plus.
upstream backend {
    least_conn;
    server 10.0.1.1:8080 weight=3 max_fails=3 fail_timeout=30s;
    server 10.0.1.2:8080 weight=2 max_fails=3 fail_timeout=30s;
    server 10.0.1.3:8080 weight=1 max_fails=3 fail_timeout=30s;
    server 10.0.1.4:8080 backup;  # Only used when all others are down
}

server {
    listen 80;
    server_name api.swehelper.com;

    location / {
        proxy_pass http://backend;
        # Retry the next upstream on connection errors and 5xx responses
        proxy_next_upstream error timeout http_500 http_502 http_503;
        proxy_connect_timeout 5s;
        proxy_read_timeout 30s;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    # Health check endpoint
    location /health {
        proxy_pass http://backend;
        access_log off;
    }
}
```
```haproxy
# HAProxy configuration with health checks
global
    maxconn 50000
    log stdout format raw local0

defaults
    mode http
    timeout connect 5s
    timeout client 30s
    timeout server 30s
    option httplog
    option dontlognull

frontend http_front
    bind *:80
    bind *:443 ssl crt /etc/ssl/certs/swehelper.pem
    # Route based on URL path (Layer 7)
    acl is_api path_beg /api/
    acl is_static path_beg /static/
    acl is_ws hdr(Upgrade) -i websocket
    use_backend api_servers if is_api
    use_backend static_servers if is_static
    use_backend ws_servers if is_ws
    default_backend web_servers

backend api_servers
    balance leastconn
    option httpchk GET /health
    http-check expect status 200
    server api1 10.0.1.1:8080 check inter 5s fall 3 rise 2 weight 3
    server api2 10.0.1.2:8080 check inter 5s fall 3 rise 2 weight 3
    server api3 10.0.1.3:8080 check inter 5s fall 3 rise 2 weight 1

backend static_servers
    # Referenced by the frontend above; HAProxy refuses to start if a
    # use_backend target is undefined. Addresses are illustrative.
    balance roundrobin
    server static1 10.0.4.1:8082 check

backend web_servers
    balance roundrobin
    option httpchk GET /
    cookie SERVERID insert indirect nocache
    server web1 10.0.2.1:3000 check cookie web1
    server web2 10.0.2.2:3000 check cookie web2

backend ws_servers
    balance source
    option httpchk GET /ws/health
    timeout tunnel 3600s
    server ws1 10.0.3.1:8081 check
    server ws2 10.0.3.2:8081 check

listen stats
    bind *:8404
    stats enable
    stats uri /stats
    stats refresh 10s
```
Sticky Sessions
Sticky sessions (session affinity) ensure that requests from the same client are always routed to the same backend server. This is necessary when the server maintains session state (like shopping carts or login sessions).
Methods for Implementing Sticky Sessions
Cookie-Based: The load balancer inserts a cookie identifying the backend server. Subsequent requests include this cookie, and the LB routes to the same server.
IP Hash: The client IP address is hashed to consistently map to the same server.
URL/Header-Based: Use a session token in the URL or a custom header for routing.
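The cookie-based method can be sketched as follows (the cookie name `SERVERID` mirrors the HAProxy example above; the pool and picker are illustrative):

```python
# Cookie-based stickiness sketch: the LB stamps the chosen server
# into a cookie on the first response and honors it afterwards.
pool = {"web1": "10.0.2.1:3000", "web2": "10.0.2.2:3000"}

def route(cookies, pick_new):
    server_id = cookies.get("SERVERID")
    if server_id in pool:
        return server_id, cookies               # honor existing affinity
    server_id = pick_new()                      # first visit: pick a server
    cookies = {**cookies, "SERVERID": server_id}  # sent via Set-Cookie
    return server_id, cookies

sid, cookies = route({}, pick_new=lambda: "web1")
sid2, _ = route(cookies, pick_new=lambda: "web2")
assert sid == sid2 == "web1"  # second request sticks despite a new pick
```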
However, sticky sessions reduce the effectiveness of load balancing and create problems during server failures (stuck sessions are lost). The preferred approach is to make your application stateless by storing session data in an external store like Redis or a database.
SSL Termination
SSL/TLS termination at the load balancer offloads the CPU-intensive encryption and decryption from backend servers. The load balancer handles HTTPS from clients and forwards unencrypted HTTP to backend servers within the trusted internal network. For end-to-end encryption, use SSL passthrough or re-encryption to the backend.
Global Server Load Balancing (GSLB)
GSLB distributes traffic across data centers in different geographic regions using DNS-based routing. It provides disaster recovery, latency-based routing, and geographic load distribution. Services like AWS Route 53, Cloudflare, and Azure Traffic Manager provide GSLB capabilities that integrate with CDN architectures.
Cloud Load Balancers
| Service | Layer | Key Features |
|---|---|---|
| AWS ALB | Layer 7 | Path-based routing, WebSocket, gRPC, WAF integration |
| AWS NLB | Layer 4 | Ultra-high performance, static IPs, TLS termination or passthrough |
| GCP Cloud Load Balancing | L4 and L7 | Global anycast, auto-scaling, CDN integration |
| Azure Load Balancer | Layer 4 | Zone-redundant, HA ports, outbound rules |
| Azure Application Gateway | Layer 7 | WAF, SSL termination, URL-based routing |
Load balancing works closely with reverse proxies, CDNs, and DNS to create a resilient, high-performance architecture. Understanding how these components interact is essential for system design. Test your server configurations using our API and Network Tools and explore more resources at our tools page.
Frequently Asked Questions
What happens when a load balancer itself fails?
Load balancers are typically deployed in a high-availability pair using active-passive or active-active configuration. If the primary load balancer fails, a secondary instance takes over automatically (using technologies like VRRP or floating IPs). Cloud-managed load balancers (AWS ALB, GCP LB) are inherently redundant and distributed, so they handle failover automatically without user intervention.
Should I use L4 or L7 load balancing?
Use L7 when you need content-based routing (different backends for different URL paths), SSL termination, HTTP header manipulation, or WebSocket handling. Use L4 when you need maximum throughput for non-HTTP protocols (database connections, game servers, custom TCP protocols) or when you do not need to inspect application content. Many architectures use both: L4 for initial traffic distribution and L7 for application-level routing.
How many backend servers do I need?
Start with at least 2 servers for redundancy (N+1 rule). Size based on your traffic patterns: measure requests per second, CPU utilization, and response times per server under load. Plan for peak traffic (which can be 3-10x average). Auto-scaling groups in cloud environments can dynamically adjust the number of servers based on metrics like CPU utilization or request queue depth.
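As a back-of-the-envelope sketch of that sizing (all numbers are illustrative; measure your own service under load):

```python
import math

peak_rps = 12_000        # expected peak requests per second
per_server_rps = 1_500   # sustainable RPS per server at target latency
headroom = 0.7           # run servers at ~70% of max capacity

needed = math.ceil(peak_rps / (per_server_rps * headroom))
total = needed + 1       # N+1: one spare for failures and deploys
print(needed, total)  # 12 13
```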
What is the difference between a load balancer and a reverse proxy?
A reverse proxy sits in front of servers and handles requests on their behalf, providing caching, SSL termination, and request routing. A load balancer is a specific function that distributes traffic across multiple servers. In practice, tools like Nginx and HAProxy serve as both reverse proxies and load balancers. The terms overlap significantly, but load balancing specifically refers to distributing traffic across multiple server instances.