Data Modeling: Designing Data Structures for Real-World Applications

Data modeling is the process of defining how data is structured, stored, and related. It bridges the gap between business requirements and database implementation. Whether you are working with relational, document, or graph databases, thoughtful data modeling determines your application's performance, scalability, and maintainability. This guide covers ER diagrams, modeling approaches for different database types, and practical design patterns.

Entity-Relationship (ER) Diagrams

ER diagrams are the standard visual tool for planning relational database structures. They define entities (things), attributes (properties), and relationships (connections) before writing a single line of SQL.

Core Concepts

Concept	Description	Example
Entity	A real-world object or concept	User, Product, Order
Attribute	A property of an entity	User.name, Product.price
Primary Key	Unique identifier for an entity	User.id, Order.id
Relationship	Association between entities	User PLACES Order
Cardinality	Number of instances in a relationship	1:1, 1:N, M:N

Cardinality Types

One-to-One (1:1): A user has one profile. Implement with a foreign key on either table or a shared primary key.

One-to-Many (1:N): An author writes many books. Implement with a foreign key on the "many" side (books.author_id).

Many-to-Many (M:N): Students enroll in many courses; courses have many students. Implement with a junction (bridge) table.

Relational Data Modeling

Relational modeling follows a structured process: identify entities, define attributes, establish relationships, determine cardinality, normalize to 3NF, and add indexes for performance.

-- Relational model for a blog platform
CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    username VARCHAR(50) UNIQUE NOT NULL,
    email VARCHAR(255) UNIQUE NOT NULL,
    bio TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE posts (
    id SERIAL PRIMARY KEY,
    author_id INT NOT NULL REFERENCES users(id),
    title VARCHAR(300) NOT NULL,
    content TEXT NOT NULL,
    status VARCHAR(20) DEFAULT 'draft',
    published_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE tags (
    id SERIAL PRIMARY KEY,
    name VARCHAR(50) UNIQUE NOT NULL
);

-- Many-to-many: Posts can have multiple tags
CREATE TABLE post_tags (
    post_id INT REFERENCES posts(id) ON DELETE CASCADE,
    tag_id INT REFERENCES tags(id) ON DELETE CASCADE,
    PRIMARY KEY (post_id, tag_id)
);

CREATE TABLE comments (
    id SERIAL PRIMARY KEY,
    post_id INT NOT NULL REFERENCES posts(id) ON DELETE CASCADE,
    author_id INT NOT NULL REFERENCES users(id),
    parent_comment_id INT REFERENCES comments(id),  -- Self-ref for nested comments
    content TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Indexes for common access patterns
CREATE INDEX idx_posts_author ON posts(author_id);
CREATE INDEX idx_posts_status ON posts(status) WHERE status = 'published';
CREATE INDEX idx_comments_post ON comments(post_id);
CREATE INDEX idx_comments_parent ON comments(parent_comment_id);

Document Data Modeling (MongoDB)

Document databases like MongoDB take a fundamentally different approach. Instead of normalizing into tables, you design documents around query patterns. The central question is: embed or reference?

Embedding vs Referencing

Pattern	When to Use	Example
Embed	Data is always accessed together, 1:1 or 1:few	User with address, Post with author name
Reference	Data is accessed independently, 1:many or M:N	User and orders, Post and comments

// MongoDB: Embedding (denormalized)
// Good when comments are always loaded with the post
{
  "_id": ObjectId("..."),
  "title": "Introduction to Graph Databases",
  "author": {
    "name": "Alice Johnson",
    "avatar": "/images/alice.jpg"
  },
  "content": "Graph databases store data as nodes and edges...",
  "tags": ["databases", "graph", "neo4j"],
  "comments": [
    {
      "author": "Bob",
      "text": "Great article!",
      "created_at": ISODate("2024-03-15")
    },
    {
      "author": "Carol",
      "text": "Very helpful explanation.",
      "created_at": ISODate("2024-03-16")
    }
  ]
}

// MongoDB: Referencing (normalized)
// Better when comments grow large or are accessed separately
// posts collection
{
  "_id": ObjectId("post-1"),
  "title": "Introduction to Graph Databases",
  "author_id": ObjectId("user-alice"),
  "content": "...",
  "tags": ["databases", "graph", "neo4j"]
}

// comments collection
{
  "_id": ObjectId("comment-1"),
  "post_id": ObjectId("post-1"),
  "author_id": ObjectId("user-bob"),
  "text": "Great article!",
  "created_at": ISODate("2024-03-15")
}

The 16MB Document Limit

MongoDB documents have a 16MB size limit. If embedded arrays can grow unboundedly (e.g., millions of comments on a viral post), you must use referencing. A common pattern is the subset pattern: embed the most recent 20 comments in the post document and store the rest in a separate collection.

Graph Data Modeling

Graph databases model data as nodes and relationships. The key design principle is: design for traversal patterns. Unlike relational modeling where you start with entities, graph modeling starts with the questions you need to answer.

// Graph model for a social media platform
// Nodes
(:User {name, email, joined_date})
(:Post {content, created_at})
(:Tag {name})
(:Location {city, country})

// Relationships
(:User)-[:FOLLOWS]->(:User)
(:User)-[:PUBLISHED]->(:Post)
(:User)-[:LIKES]->(:Post)
(:Post)-[:TAGGED_WITH]->(:Tag)
(:User)-[:LIVES_IN]->(:Location)

// Cypher: Create the graph
CREATE (alice:User {name: 'Alice', email: 'alice@example.com'})
CREATE (bob:User {name: 'Bob', email: 'bob@example.com'})
CREATE (post1:Post {content: 'Hello world!', created_at: datetime()})
CREATE (sf:Location {city: 'San Francisco', country: 'US'})
CREATE (alice)-[:FOLLOWS]->(bob)
CREATE (alice)-[:PUBLISHED]->(post1)
CREATE (bob)-[:LIKES]->(post1)
CREATE (alice)-[:LIVES_IN]->(sf)

Document Modeling Design Patterns

Polymorphic Pattern

Store different entity types in the same collection with a type discriminator. Useful for inheritance hierarchies.

// Polymorphic: Different product types in one collection
{
  "_id": "prod-1",
  "type": "book",
  "name": "System Design Interview",
  "price": 39.99,
  "author": "Alex Xu",
  "pages": 320,
  "isbn": "978-1736049709"
}

{
  "_id": "prod-2",
  "type": "electronics",
  "name": "Mechanical Keyboard",
  "price": 149.99,
  "brand": "Keychron",
  "switch_type": "Brown",
  "connectivity": ["USB-C", "Bluetooth"]
}

Bucket Pattern

Group related time-series data into buckets to reduce document count and improve query performance.

// Bucket pattern: Group sensor readings by hour
{
  "sensor_id": "temp-sensor-01",
  "bucket_start": ISODate("2024-03-15T10:00:00Z"),
  "bucket_end": ISODate("2024-03-15T11:00:00Z"),
  "count": 60,
  "readings": [
    {"time": ISODate("2024-03-15T10:00:00Z"), "value": 22.5},
    {"time": ISODate("2024-03-15T10:01:00Z"), "value": 22.6},
    {"time": ISODate("2024-03-15T10:02:00Z"), "value": 22.4}
  ],
  "summary": {
    "avg": 22.5,
    "min": 22.1,
    "max": 22.9
  }
}

Outlier Pattern

Handle exceptional documents differently. Most books have under 100 reviews, but bestsellers may have millions. Embed reviews for normal books; use a separate overflow collection for outliers.

Attribute Pattern

When entities have many optional attributes that are queried similarly, restructure them into an array of key-value pairs that can be indexed uniformly.

Schema-on-Read vs Schema-on-Write

Approach	Description	Database Examples
Schema-on-Write	Schema defined upfront; data validated on write	PostgreSQL, MySQL
Schema-on-Read	No schema enforced; structure interpreted at read time	MongoDB, Cassandra, data lakes

Schema-on-write provides safety (invalid data is rejected) but requires migrations for changes. Schema-on-read provides flexibility (any structure is accepted) but pushes validation to the application layer. In practice, most MongoDB applications use schema validation to get benefits of both approaches.

A practical example modeling a social media platform shows how different approaches handle the same domain:

-- Relational approach (PostgreSQL)
CREATE TABLE users (
    id BIGSERIAL PRIMARY KEY,
    username VARCHAR(50) UNIQUE NOT NULL,
    display_name VARCHAR(100),
    bio TEXT,
    follower_count INT DEFAULT 0,
    following_count INT DEFAULT 0,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE follows (
    follower_id BIGINT REFERENCES users(id),
    following_id BIGINT REFERENCES users(id),
    created_at TIMESTAMP DEFAULT NOW(),
    PRIMARY KEY (follower_id, following_id)
);

CREATE TABLE posts (
    id BIGSERIAL PRIMARY KEY,
    author_id BIGINT NOT NULL REFERENCES users(id),
    content TEXT NOT NULL,
    like_count INT DEFAULT 0,
    reply_count INT DEFAULT 0,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE likes (
    user_id BIGINT REFERENCES users(id),
    post_id BIGINT REFERENCES posts(id),
    created_at TIMESTAMP DEFAULT NOW(),
    PRIMARY KEY (user_id, post_id)
);

-- Feed query: Get posts from people I follow
SELECT p.id, p.content, u.username, p.like_count, p.created_at
FROM posts p
JOIN users u ON p.author_id = u.id
WHERE p.author_id IN (
    SELECT following_id FROM follows WHERE follower_id = 42
)
ORDER BY p.created_at DESC
LIMIT 20;

For related topics, see schema design patterns, SQL vs NoSQL, and graph databases.

Frequently Asked Questions

Should I embed or reference in MongoDB?

Embed when: data is always accessed together (user + address), the embedded data is bounded (max 5 addresses), and updates to the embedded data are infrequent. Reference when: data is accessed independently, the referenced data can grow unboundedly, or many documents reference the same data.

How do I model hierarchical data (categories, org charts)?

In SQL, common patterns include: adjacency list (parent_id column, simple but recursive queries needed), materialized path (store full path like "/electronics/phones/smartphones"), and nested sets (left/right values for fast subtree queries). In MongoDB, use the materialized path or store ancestors as an array. In a graph database, hierarchy is natural.

When should I use a graph model vs relational model?

Use a graph model when relationships ARE the data — social networks, recommendation engines, fraud detection, knowledge graphs. Use a relational model when data has clear structure, relationships are simple (1:1, 1:N), and queries are primarily aggregations, filtering, and sorting rather than traversals.

How do I evolve my data model without downtime?

In relational databases, use migration tools (Flyway, Alembic) with backward-compatible changes: add nullable columns, create new tables, backfill data, then switch application code. In MongoDB, the schema-flexible nature makes evolution easier — new fields can be added to new documents without affecting existing ones.

What is the difference between conceptual, logical, and physical data models?

The conceptual model defines business entities and relationships (no technical details). The logical model adds attributes, data types, and constraints (database-agnostic). The physical model includes database-specific implementations — table definitions, indexes, partitions, and storage configuration. Start with conceptual, refine to logical, then implement as physical.

Data Modeling: Designing Data Structures for Real-World Applications