Graph Databases: Modeling and Querying Connected Data

Graph databases store data as nodes (entities) and edges (relationships), making them well suited to highly connected data. While relational databases struggle with deep joins across many-to-many relationships, native graph databases follow each relationship in roughly constant time, so the cost of a query depends on the size of the neighborhood it explores rather than on the total dataset size. This guide covers the property graph model, Cypher queries, and real-world use cases from social networks to fraud detection.

Why Graph Databases?

Consider finding a friend-of-a-friend in a relational database with 1 million users. A SQL query requires self-joins on a relationship table, and performance degrades rapidly as the depth of traversal increases. In a graph database, this is a simple two-hop traversal whose cost depends on how many friends and friends-of-friends the user has, not on the total number of users, so it typically completes in milliseconds.

-- SQL: Friend-of-friend query (multiple JOINs, slow at scale)
SELECT DISTINCT f2.friend_id
FROM friendships f1
JOIN friendships f2 ON f1.friend_id = f2.user_id
WHERE f1.user_id = 42
AND f2.friend_id != 42
AND f2.friend_id NOT IN (SELECT friend_id FROM friendships WHERE user_id = 42);

-- With 1M users and 50M friendships, this can take seconds or minutes
-- Adding a third hop (friend-of-friend-of-friend) makes it even worse
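
To make the contrast concrete, here is a minimal Python sketch of the same friend-of-friend lookup over an in-memory adjacency structure. It is an illustration of the idea only, not how Neo4j is implemented: the `friends` dictionary, the `friends_of_friends` function, and the sample user IDs are all hypothetical. The point is that the loop only visits the neighbors of user 42 and their neighbors, so unrelated users never enter the computation.

```python
# Illustrative only: a dict of adjacency sets stands in for a graph store.
# All names and IDs here are hypothetical.

def friends_of_friends(friends, user):
    """Return users exactly two hops from `user`, excluding direct friends."""
    direct = friends.get(user, set())
    result = set()
    for friend in direct:                              # hop 1
        for candidate in friends.get(friend, set()):   # hop 2
            if candidate != user and candidate not in direct:
                result.add(candidate)
    return result

friends = {
    42: {7, 13},
    7: {42, 99, 13},
    13: {42, 100},
    99: {7},
    100: {13},
    500: {501},   # unrelated users are never touched by the traversal
    501: {500},
}

print(sorted(friends_of_friends(friends, 42)))  # [99, 100]
```

The work done is proportional to the number of friends and friends-of-friends of the starting user, which mirrors the filtering in the SQL query (distinct results, excluding the user and existing friends).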

The Property Graph Model

The property graph model is the most common graph data model. It consists of three elements:

Nodes (Vertices): Entities like people, products, or locations. Each node has labels (types) and properties (key-value attributes).
Relationships (Edges): Connections between nodes with a type, direction, and optional properties. A relationship always has a start node and end node.
Properties: Key-value pairs attached to both nodes and relationships.

Element	Example	Properties
Node (:Person)	Alice	name: "Alice", age: 30, city: "SF"
Node (:Movie)	The Matrix	title: "The Matrix", year: 1999
Relationship [:FRIENDS_WITH]	Alice → Bob	since: "2020-01-15"
Relationship [:RATED]	Alice → The Matrix	score: 5, date: "2024-03-01"
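
As a rough sketch of how these three elements fit together, the table rows can be modeled with two small Python dataclasses. The class names `Node` and `Relationship` and their fields are assumptions chosen for illustration; they are not the API of any particular graph database or driver.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    labels: set                                      # e.g. {"Person"}
    properties: dict = field(default_factory=dict)   # key-value attributes

@dataclass
class Relationship:
    type: str                                        # e.g. "FRIENDS_WITH"
    start: Node                                      # a relationship always has a start node
    end: Node                                        # ... and an end node
    properties: dict = field(default_factory=dict)

alice = Node({"Person"}, {"name": "Alice", "age": 30, "city": "SF"})
bob = Node({"Person"}, {"name": "Bob"})
matrix = Node({"Movie"}, {"title": "The Matrix", "year": 1999})

friends_with = Relationship("FRIENDS_WITH", alice, bob, {"since": "2020-01-15"})
rated = Relationship("RATED", alice, matrix, {"score": 5, "date": "2024-03-01"})

for rel in (friends_with, rated):
    target = rel.end.properties.get("name") or rel.end.properties.get("title")
    print(f"{rel.start.properties['name']} -[:{rel.type}]-> {target} {rel.properties}")
```

Note that properties live on both nodes and relationships, which is what distinguishes the property graph model from a plain adjacency list.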

Neo4j and Cypher Query Language

Neo4j is the most popular graph database, and Cypher is its declarative query language. Cypher uses ASCII-art syntax to represent graph patterns, making queries intuitive and readable.

Creating Data

// Cypher: Create nodes and relationships
CREATE (alice:Person {name: 'Alice', age: 30, city: 'San Francisco'})
CREATE (bob:Person {name: 'Bob', age: 28, city: 'New York'})
CREATE (carol:Person {name: 'Carol', age: 35, city: 'San Francisco'})
CREATE (matrix:Movie {title: 'The Matrix', year: 1999, genre: 'Sci-Fi'})
CREATE (inception:Movie {title: 'Inception', year: 2010, genre: 'Sci-Fi'})

// Create relationships
CREATE (alice)-[:FRIENDS_WITH {since: '2020-01-15'}]->(bob)
CREATE (bob)-[:FRIENDS_WITH {since: '2021-06-20'}]->(carol)
CREATE (alice)-[:RATED {score: 5}]->(matrix)
CREATE (bob)-[:RATED {score: 4}]->(matrix)
CREATE (carol)-[:RATED {score: 5}]->(inception)

Querying Patterns

// Find all of Alice's friends
MATCH (alice:Person {name: 'Alice'})-[:FRIENDS_WITH]->(friend)
RETURN friend.name, friend.city

// Friend-of-friend (2 hops) — effortless in a graph database
MATCH (alice:Person {name: 'Alice'})-[:FRIENDS_WITH*2]->(fof)
WHERE fof.name <> 'Alice'
RETURN DISTINCT fof.name

// Movies rated 5 stars by friends of Alice
MATCH (alice:Person {name: 'Alice'})-[:FRIENDS_WITH]->(friend)-[r:RATED]->(movie)
WHERE r.score = 5
RETURN friend.name, movie.title, r.score

// Shortest path between two people
MATCH path = shortestPath(
    (alice:Person {name: 'Alice'})-[:FRIENDS_WITH*..6]-(carol:Person {name: 'Carol'})
)
RETURN path, length(path)

// Find all people within 3 hops
MATCH (alice:Person {name: 'Alice'})-[:FRIENDS_WITH*1..3]->(connected)
RETURN DISTINCT connected.name, connected.city
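
For readers curious what a bounded shortest-path search involves, the following Python sketch runs a breadth-first search over undirected edges with a hop limit, similar in spirit to the `shortestPath` pattern with `*..6` above. It is a simplified illustration under stated assumptions (unweighted edges, node names as identifiers, a hypothetical `shortest_path` helper), not a description of Neo4j's internal algorithm.

```python
from collections import deque

def shortest_path(edges, start, goal, max_hops=6):
    """Breadth-first search over undirected edges.

    Returns the list of nodes on one shortest path, or None if the goal
    is not reachable within `max_hops` relationships.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        if len(path) - 1 >= max_hops:   # hop limit reached, do not expand further
            continue
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

# Same friendships as in the CREATE statements above.
edges = [("Alice", "Bob"), ("Bob", "Carol")]
path = shortest_path(edges, "Alice", "Carol")
print(path, len(path) - 1)  # ['Alice', 'Bob', 'Carol'] 2
```

Because breadth-first search explores nodes in order of distance, the first path that reaches the goal is guaranteed to be a shortest one in an unweighted graph.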

Use Case: Social Networks

Social networks are the classic graph database use case. Friend recommendations, mutual friends, influence analysis, and community detection are all natural graph operations.

// Recommend friends: People who are friends-of-friends but not yet direct friends
MATCH (user:Person {name: 'Alice'})-[:FRIENDS_WITH]->(friend)-[:FRIENDS_WITH]->(suggestion)
WHERE NOT (user)-[:FRIENDS_WITH]->(suggestion)
  AND suggestion <> user
RETURN suggestion.name, COUNT(friend) AS mutual_friends
ORDER BY mutual_friends DESC
LIMIT 10

// Find mutual friends between two people
MATCH (alice:Person {name: 'Alice'})-[:FRIENDS_WITH]->(mutual)<-[:FRIENDS_WITH]-(bob:Person {name: 'Bob'})
RETURN mutual.name

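// Highly connected people (more than 10 outgoing friendships): a starting point for influence analysis.
// Note: this counts connections per person; real community detection needs a dedicated algorithm such as Louvain.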
MATCH (p:Person)-[:FRIENDS_WITH]->(friend)
WITH p, COUNT(friend) AS connections
WHERE connections > 10
RETURN p.name, connections ORDER BY connections DESC
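
The friend-recommendation query above boils down to counting, for each friend-of-friend, how many of your friends lead to them. A minimal Python sketch of that counting logic follows. The `recommend_friends` function and the extra people ("Dan", "Erin") are hypothetical and exist only to make the example produce a ranking.

```python
from collections import Counter

def recommend_friends(friends_with, user, limit=10):
    """Rank friends-of-friends by the number of mutual friends.

    `friends_with` maps a person to the set of people they point to,
    matching the directed FRIENDS_WITH relationships in the Cypher query.
    """
    direct = friends_with.get(user, set())
    mutual_counts = Counter()
    for friend in direct:
        for suggestion in friends_with.get(friend, set()):
            # Skip the user themselves and people they are already friends with.
            if suggestion != user and suggestion not in direct:
                mutual_counts[suggestion] += 1
    return mutual_counts.most_common(limit)

friends_with = {
    "Alice": {"Bob", "Dan"},
    "Bob": {"Carol", "Erin"},
    "Dan": {"Carol"},
}

print(recommend_friends(friends_with, "Alice"))  # [('Carol', 2), ('Erin', 1)]
```

Carol ranks first because two of Alice's friends (Bob and Dan) are connected to her, which corresponds to `COUNT(friend) AS mutual_friends` in the query.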

Use Case: Recommendation Engines

Collaborative filtering naturally maps to graph traversal. Find users similar to you (shared ratings), then recommend items they liked that you have not seen.

// Collaborative filtering: Recommend movies
// "Users who liked what you liked also liked..."
MATCH (user:Person {name: 'Alice'})-[r1:RATED]->(movie)<-[r2:RATED]-(similar_user)
WHERE r1.score >= 4 AND r2.score >= 4
WITH user, similar_user, COUNT(movie) AS shared_likes
ORDER BY shared_likes DESC
LIMIT 5
MATCH (similar_user)-[r:RATED]->(recommendation:Movie)
WHERE r.score >= 4
  AND NOT EXISTS {
    MATCH (user)-[:RATED]->(recommendation)
  }
RETURN recommendation.title, COUNT(*) AS recommended_by
ORDER BY recommended_by DESC

Use Case: Fraud Detection

Fraud patterns often involve suspicious connections, such as shared addresses, phone numbers, devices, or IP addresses across accounts that should be independent. Graph queries surface these connections directly, without multi-table joins.

// Fraud detection: Find accounts sharing suspicious connections
MATCH (a1:Account)-[:USES_DEVICE]->(device)<-[:USES_DEVICE]-(a2:Account)
WHERE a1.id < a2.id  // compare IDs so each pair of accounts is reported once, not twice
WITH a1, a2, COUNT(device) AS shared_devices
WHERE shared_devices > 1
RETURN a1.id, a2.id, shared_devices

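// Find rings of accounts connected through shared attributes.
// The pattern is undirected: assuming USES_* relationships point from accounts
// to attribute nodes, a directed path could never lead back to the starting account.
MATCH ring = (a:Account)-[:USES_PHONE|USES_EMAIL|USES_ADDRESS*2..5]-(a)
RETURN ring, length(ring)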

// Identify suspicious money transfer chains
MATCH path = (source:Account)-[:TRANSFERRED*3..7]->(destination:Account)
WHERE source.risk_score > 0.8
RETURN path, reduce(total = 0, r IN relationships(path) | total + r.amount) AS chain_total
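
The shared-device check at the start of this section can be approximated in a few lines of Python, which may help when prototyping the rule before loading data into a graph. This is a sketch under assumptions: the `accounts_sharing_devices` function and the account and device identifiers are hypothetical, and the input is a flat list of (account, device) pairs standing in for `USES_DEVICE` relationships.

```python
from collections import defaultdict
from itertools import combinations

def accounts_sharing_devices(uses_device, min_shared=2):
    """Return {(account_a, account_b): shared_device_count} for suspicious pairs."""
    # Group accounts by the device they use.
    by_device = defaultdict(set)
    for account, device in uses_device:
        by_device[device].add(account)

    # Count, for every pair of accounts, how many devices they have in common.
    shared = defaultdict(int)
    for accounts in by_device.values():
        for a1, a2 in combinations(sorted(accounts), 2):
            shared[(a1, a2)] += 1

    return {pair: count for pair, count in shared.items() if count >= min_shared}

uses_device = [
    ("acct-1", "phone-A"), ("acct-2", "phone-A"),
    ("acct-1", "laptop-B"), ("acct-2", "laptop-B"),
    ("acct-3", "laptop-B"),
]

print(accounts_sharing_devices(uses_device))  # {('acct-1', 'acct-2'): 2}
```

Sorting each group before pairing means every pair appears once in a fixed order, and `min_shared=2` corresponds to the `shared_devices > 1` filter in the Cypher query.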

Use Case: Knowledge Graphs

Knowledge graphs model real-world entities and their relationships. Google's Knowledge Graph, Wikimedia's Wikidata, and enterprise knowledge management systems all use graph structures to connect information.

Graph Databases vs Relational JOINs

Aspect	Relational (SQL)	Graph (Neo4j)
Relationship traversal	JOIN (index lookup per hop, cost grows with table size)	Pointer follow (roughly O(1) per relationship)
Multi-hop queries	Multiple self-JOINs, slow	Variable-length paths, fast
Schema flexibility	Fixed schema, ALTER TABLE	Schema-optional, add properties anytime
Aggregations	Excellent (GROUP BY, window functions)	Basic aggregations
Transactions	Full ACID	ACID (Neo4j supports transactions)

Other Graph Databases

Amazon Neptune is a fully managed graph database supporting both property graphs (Gremlin and openCypher) and RDF (SPARQL). ArangoDB is a multi-model database supporting graphs, documents, and key-value in a single engine. JanusGraph is an open-source distributed graph database built on top of various storage backends.

For related database concepts, see our guides on SQL vs NoSQL, data modeling patterns, and schema design.

Frequently Asked Questions

When should I use a graph database instead of a relational database?

Use a graph database when your core queries involve traversing relationships — friend-of-friend, shortest path, pattern matching, and recommendation algorithms. If your queries are primarily table scans, aggregations, or simple JOINs on structured data, a relational database is the better choice. The rule of thumb: if you would need more than 3-4 JOINs in SQL to answer a question, consider a graph database.

Can Neo4j handle large datasets?

Neo4j can handle billions of nodes and relationships on a single server with enough memory. For larger datasets, Neo4j offers clustering with primary/secondary nodes for high availability and read scaling. However, graph databases generally do not shard data as easily as distributed NoSQL databases.

Is Cypher a standard query language?

Cypher originated in Neo4j, and the openCypher project makes the language available to other databases. Cypher itself is not an ISO standard, but it strongly influenced GQL (Graph Query Language), which ISO published as ISO/IEC 39075 in 2024. Gremlin (Apache TinkerPop), supported by Amazon Neptune among others, is an alternative graph query language.

Can I combine a graph database with a relational database?

Yes, and this is a common pattern. Store your core transactional data in PostgreSQL and sync graph-relevant data to Neo4j for relationship queries. For example, an e-commerce platform might keep orders in PostgreSQL but product recommendations in Neo4j. Tools like Neo4j Streams and Kafka connectors automate the synchronization.

What about performance for write-heavy workloads?

Graph databases are optimized for read-heavy, relationship-traversal workloads. Write throughput is generally lower than key-value stores or wide-column databases like Cassandra. If your workload is primarily write-heavy (logging, IoT ingestion), a graph database is not the right choice. Use it alongside other databases in a polyglot architecture.
