Neo4j
Graph Data Modeling
Departament de Ciències de la Computació
Graph Data Modeling
1. Introduction to Graph Data Modeling
2. Designing the Initial Graph Data Model
3. Graph Data Modeling Core Principles
4. Common Graph Structures
5. Refactoring and Evolving a Graph Data Model
Bases de Dades no Relacionals. Neo4j
What is Graph Data Modeling?
Graph data modeling is a collaborative effort by stakeholders, including developers
Stakeholders include business analysts, architects, managers, and project leaders
Stakeholders and developers analyze the application domain together
▪ They develop a data model
▪ Stakeholders must understand the domain and provide answers
Neo4j is a full-featured graph database
▪ It includes tools used to create property graphs
▪ It lets applications retrieve data for business use cases by traversing the graph
Neo4j Property Graph Model
• Nodes (Entities)
• Relationships
• Properties
• Labels
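The four building blocks can be seen together in one small statement. This is a minimal sketch: the labels and the OWNS type come from the traversal example in these slides, while the name and since properties and their values are assumptions made for illustration.

```cypher
// Illustrative sample data only; name and since are hypothetical properties.
CREATE (p:Person {name: 'Patrick Scott'})            // node with label Person and one property
CREATE (r:Residence {address: '475 Broad Street'})   // node with label Residence
CREATE (p)-[:OWNS {since: 2015}]->(r)                // relationship with a type, a direction and a property
```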
Graph Traversal
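// Anchor on the Residence with the given address, then follow incoming OWNS relationships to its owners
MATCH (r:Residence)<-[:OWNS]-(p:Person)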
WHERE r.address = '475 Broad Street'
RETURN p
2. Designing the Initial Graph Data Model
Designing the Initial Data Model
1. Understand the domain
2. Create high-level sample data
3. Define specific questions for the application
4. Identify entities
5. Identify connections between entities
6. Test the questions against the model
7. Test scalability
Identify Entities from Questions
Entities are the nouns in the application questions:
▪ What ingredients are used in a recipe?
▪ Who is married to this person?
o The generic nouns often become labels in the model
o Use domain knowledge when deciding how to further group or differentiate entities
Define Properties
Two purposes for properties:
1. Unique identification
2. Answering application questions
Any other property is decoration and should not be added
Properties are used for:
– Anchoring (where to begin the query)
– Traversing the graph (navigation)
– Returning data from the query
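The roles listed above can be sketched in Cypher. The constraint uses the syntax of recent Neo4j versions (older versions use a different form), and personId, since and the literal values are hypothetical names chosen for this illustration.

```cypher
// Unique identification: personId is an assumed identifier property.
CREATE CONSTRAINT person_id_unique IF NOT EXISTS
FOR (p:Person) REQUIRE p.personId IS UNIQUE;

// Anchoring: personId selects the start node.
// Traversing: the since property on OWNS filters which relationships are followed.
// Returning: address is the data handed back to the application.
MATCH (p:Person {personId: 'p-001'})-[o:OWNS]->(r:Residence)
WHERE o.since >= date('2020-01-01')
RETURN r.address;
```

A property that plays none of these roles in any application question would be decoration in the sense used above.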
Identify Connections Between Entities
Connections are the verbs in the application questions:
▪ What ingredients are used in a recipe?
▪ Who is married to this person?
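As a rough sketch, the nouns in the two questions become labels and the verbs become relationship types. The type names USED_IN and MARRIED_TO, the name property and all values are assumptions, not part of the original slides.

```cypher
// Hypothetical sample data: "used in" and "married to" become relationship types.
CREATE (:Ingredient {name: 'Flour'})-[:USED_IN]->(:Recipe {name: 'Pancakes'})
CREATE (:Person {name: 'Ann'})-[:MARRIED_TO]->(:Person {name: 'Bob'});

// What ingredients are used in a recipe?
MATCH (i:Ingredient)-[:USED_IN]->(:Recipe {name: 'Pancakes'})
RETURN i.name;

// Who is married to this person? (the pattern ignores direction)
MATCH (:Person {name: 'Ann'})-[:MARRIED_TO]-(spouse:Person)
RETURN spouse.name;
```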
Naming Relationships
▪ Stakeholders must agree on the name (type) of each relationship
▪ Avoid names that could be construed as nouns (e.g. email)
[Figure: a "Do not do this" example shown next to an "Instead do this" example]
Direction and Type
Direction and type are required for relationships
Select direction and type based on expected questions:
1. What episode follows ‘The Ark in Space’? (NEXT)
2. What episode came before ‘Genesis of the Daleks’? (PREVIOUS)
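The two questions could be answered as sketched below. The Episode label, the title property and the explicit PREVIOUS relationships are assumptions made for illustration.

```cypher
// Hypothetical Episode nodes linked in broadcast order.
CREATE (a:Episode {title: 'The Ark in Space'})
CREATE (b:Episode {title: 'The Sontaran Experiment'})
CREATE (c:Episode {title: 'Genesis of the Daleks'})
CREATE (a)-[:NEXT]->(b)-[:NEXT]->(c)
CREATE (c)-[:PREVIOUS]->(b)-[:PREVIOUS]->(a);

// 1. What episode follows 'The Ark in Space'?
MATCH (:Episode {title: 'The Ark in Space'})-[:NEXT]->(e:Episode)
RETURN e.title;

// 2. What episode came before 'Genesis of the Daleks'?
MATCH (:Episode {title: 'Genesis of the Daleks'})-[:PREVIOUS]->(e:Episode)
RETURN e.title;
```

The second question can also be answered without PREVIOUS by traversing NEXT against its direction, with the pattern `<-[:NEXT]-`, so storing both types is a design choice rather than a requirement.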
Node Fanout
Before fanout: a single Person node holds every property
Person
firstName: 'Patrick'
lastName: 'Scott'
age: 34
addr1: 'Flat 3B'
addr2: '83 Landor St'
city: 'Axebridge'
postalCode: 'DF3 0AS'
After fanout: (Person)-[:LIVES_AT]->(Residence)
Person
firstName: 'Patrick'
lastName: 'Scott'
age: 34
Residence
addr1: 'Flat 3B'
addr2: '83 Landor St'
city: 'Axebridge'
postalCode: 'DF3 0AS'
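A minimal Cypher sketch of the fanned-out version follows. The direction of LIVES_AT (from Person to Residence) is an assumption, since the extracted slide does not show it.

```cypher
// Fanned-out model: the address lives on its own Residence node.
CREATE (p:Person {firstName: 'Patrick', lastName: 'Scott', age: 34})
CREATE (r:Residence {addr1: 'Flat 3B', addr2: '83 Landor St', city: 'Axebridge', postalCode: 'DF3 0AS'})
CREATE (p)-[:LIVES_AT]->(r);

// The address is stored once and can be shared: who lives at this residence?
MATCH (r:Residence {postalCode: 'DF3 0AS'})<-[:LIVES_AT]-(p:Person)
RETURN p.firstName, p.lastName;
```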
How Much Node Fanout?
3. Graph Data Modeling Core Principles
Graph Modeling Core Principles
● Nodes
○ Uniqueness
○ Fanout
● Relationships
○ Naming best practices
○ Semantic redundancy
○ Types vs. Properties
● Properties
● Data object accessibility
Node Best Practices
Uniqueness of Nodes: Before
Notes:
▪ Country nodes are considered super nodes (a node with lots of fan-in or fan-out)
▪ Be careful when using them in a design
▪ Be aware of queries that might select all paths in or out of a super node
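One way to spot candidate super nodes is to count relationships per node. This is a sketch that assumes Country nodes carry a name property; note that the query itself touches every relationship of every Country node, so it is meant for occasional inspection rather than application use.

```cypher
// Lists the ten most connected Country nodes (name is an assumed property).
MATCH (c:Country)-[r]-()
RETURN c.name AS country, count(r) AS degree
ORDER BY degree DESC
LIMIT 10
```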
Node Best Practices
Uniqueness of Nodes: After
Complex Data
Use Fanout Judiciously for Complex Data
▪ Reduce property duplication
▪ Reduce gather-and-inspect
Best Practices for Modeling Relationships
Data models should address:
• Using specific relationship types
• Using types vs. properties
• Reducing symmetric relationships
Using Specific Relationship Types
But Not Too Specific
Do Not Use Symmetric Relationships
Semantics of Symmetry are Important
Using Types vs. Properties
Property Best Practices
▪ Property lookups have a cost
▪ Parsing a complex property adds more cost
▪ Anchors and properties used for traversal should be as simple as possible
▪ Identifiers, outputs, and decoration are OK as complex values
Best Practices for Data Accessibility
For each query, how much work must Neo4j do to evaluate if the traversal represents a “good” or a “bad” path?
Hierarchy of Accessibility
For each data object, how much work must Neo4j do to evaluate if this is a “good” path or a “bad” one?
Most accessible (least processing required)
Anchor node
1. Anchor node label; anchor node properties (indexed)
2. Relationship type
3. Anchor node properties (non-indexed)
Downstream nodes
4. Downstream node labels
5. Relationship properties; downstream node properties
Least accessible (most processing required)
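The hierarchy can be read off a single query. The sketch below annotates each filter with the level it relies on; personId, since and city are hypothetical properties, and personId is assumed to be indexed. Level 3 does not appear because the anchor property is assumed to have an index.

```cypher
// Hypothetical query annotated with accessibility levels.
MATCH (p:Person {personId: 'p-001'})      // 1. anchor node label + indexed anchor property
      -[o:OWNS]->                         // 2. relationship type
      (r:Residence)                       // 4. downstream node label
WHERE o.since >= date('2020-01-01')       // 5. relationship property
  AND r.city = 'Axebridge'                // 5. downstream node property
RETURN r.address
```

Filters at levels 1 and 2 prune paths early, while the level 5 filters are only evaluated after the relationship and the downstream node have been reached.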
4. Common Graph Structures
Common Graph Structures
● Intermediate node
● Linked list
● Timeline tree
● Multiple structures in a single model
Intermediate Nodes
Create intermediate nodes when you need to:
▪ Connect more than two nodes in a single context
▪ Relate something to a relationship
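A relationship can only connect two nodes, so a fact that involves a third thing is promoted to a node. In this sketch, a person's job at a company also involves a role; the labels, relationship types and values are all hypothetical.

```cypher
// The "worked at" fact becomes an Employment node that Person, Company and Role connect to.
CREATE (p:Person {name: 'Patrick Scott'})
CREATE (c:Company {name: 'Acme'})
CREATE (role:Role {title: 'Engineer'})
CREATE (e:Employment {startDate: date('2020-01-01')})
CREATE (p)-[:HAS_EMPLOYMENT]->(e)
CREATE (e)-[:AT_COMPANY]->(c)
CREATE (e)-[:IN_ROLE]->(role)
```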
Intermediate Nodes
Intermediate Nodes: Sharing Context
Intermediate Nodes: Sharing Data
Intermediate Nodes: Organizing Data
Linked Lists
[Figure: linked-list patterns, one of them marked "Do NOT"]
Interleaved Linked List
Head and Tail of Linked List
Some possible use cases:
▪ Add episodes as they are broadcast
▪ Maintain pointer to first and last episodes
▪ Find all broadcast episodes
▪ Find latest broadcast episode
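These use cases might look as follows in Cypher. The sketch assumes a Series node that keeps FIRST and LAST pointers into a NEXT-linked list of Episode nodes; the labels, types, properties and values are illustrative, and the statements assume such a list already exists.

```cypher
// Add a newly broadcast episode and move the LAST pointer to it.
MATCH (s:Series {name: 'Doctor Who'})-[oldLast:LAST]->(tail:Episode)
CREATE (new:Episode {title: 'Revenge of the Cybermen'})
CREATE (tail)-[:NEXT]->(new)
CREATE (s)-[:LAST]->(new)
DELETE oldLast;

// Find the latest broadcast episode through the LAST pointer.
MATCH (:Series {name: 'Doctor Who'})-[:LAST]->(latest:Episode)
RETURN latest.title;

// Find all broadcast episodes in order, starting from the FIRST pointer.
MATCH (:Series {name: 'Doctor Who'})-[:FIRST]->(head:Episode)
MATCH path = (head)-[:NEXT*0..]->(e:Episode)
RETURN e.title
ORDER BY length(path);
```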
Timeline Tree
Using Multiple Structures
Using the Timeline Tree
Using Intermediate Nodes
Using Linked Lists
5. Refactoring and Evolving a Graph Data Model
What is Refactoring?
Important: Your model depends on your data and your queries
Refactoring is the process of changing the data structure without altering its semantic meaning
Refactoring often involves moving data from one structure to another
Sometimes refactoring involves adding data from other sources
The most common type of refactoring restructures the graph around a property value:
– A property value is used to create a label, a node, or a relationship
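As a sketch of using a property value to create a node and a relationship, the statement below moves a city property into shared City nodes. The City label, the IN_CITY type and the choice of property are assumptions made for illustration.

```cypher
// Hypothetical refactor: turn the city property of Residence nodes into City nodes.
MATCH (r:Residence)
WHERE r.city IS NOT NULL
MERGE (c:City {name: r.city})      // one City node per distinct value
MERGE (r)-[:IN_CITY]->(c)          // link each residence to its city
REMOVE r.city                      // drop the now-redundant property
```

The meaning of the data is unchanged: every residence is still associated with the same city, only the structure that stores the association differs.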
Hierarchy of Accessibility (reminder)
For each data object, how much work must Neo4j do to evaluate if this is a “good” path or a “bad” one?
Most accessible (least processing required)
Anchor node
1. Anchor node label; anchor node properties (indexed)
2. Relationship type
3. Anchor node properties (non-indexed)
Downstream nodes
4. Downstream node labels
5. Relationship properties; downstream node properties
Least accessible (most processing required)
Why Refactor?
Data models can be optimized for:
– Query performance
– Model simplicity & intuitiveness
– Query simplicity (i.e., simpler Cypher strings)
– Easier data updates
Note: Improving behavior in one of these areas frequently involves sacrifices in others
Another important reason to refactor is to accommodate new application questions in the same model
Goal: Eliminate Duplicate Data in Properties
Refactor Example: Extracting Nodes From Properties
Goal: Use Labels Instead of Property Values
Refactor Example: Turn Property Values into Labels for Nodes
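A minimal sketch of this kind of refactor, assuming Person nodes that carry a hypothetical role property with the value 'Actor':

```cypher
// Nodes whose role property is 'Actor' gain an Actor label; the property is then removed.
MATCH (p:Person)
WHERE p.role = 'Actor'
SET p:Actor
REMOVE p.role
```

After this change a query can select actors by label, which sits higher in the hierarchy of accessibility than a non-indexed property.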
Goal: Use Nodes Instead of Properties for relationships
Possible dense node
Refactor: Extract Nodes from Relationship Properties
5. Refactoring and Evolving a Graph Data Model
– Example
Refactoring example: Modeling airline flights
Leonardo DiCaprio as Frank Abagnale in the Steven Spielberg movie “Catch Me If You Can”
Credit: Max De Marzi https://maxdemarzi.com/2015/08/26/modeling-airline-flights-in-neo4j/
Refactoring example: Modeling airline flights
Important: Your model depends on your data and your queries
Our data → Airports and Flights between them
Ask yourself:
• What are the entities?
• What are the connections between the entities?
• What properties do we need?
Initial Question for Our Model
▪ What flights will take me from Malmo to New York on Friday?
Initial Model
Question: What flights will take me from Malmo to New York on Friday?
Comment: The concept of a Flight is expressed as a relationship
The model can answer this question, so the model seems fine
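The initial model and the first question might be written as below. The CONNECTED_TO type, the property names, the airport codes, the times and the chosen date (1 March 2024, a Friday) are assumptions for illustration; the slides only state that a flight is a relationship between airports.

```cypher
// Hypothetical initial model: a flight is a relationship between two Airport nodes.
CREATE (mmx:Airport {code: 'MMX', city: 'Malmo'})
CREATE (jfk:Airport {code: 'JFK', city: 'New York'})
CREATE (mmx)-[:CONNECTED_TO {flightNumber: 'AY189', date: date('2024-03-01'),
                             departure: localtime('10:15'), arrival: localtime('13:40')}]->(jfk);

// Question 1: anchor on the origin airport, then filter its outgoing relationships by date.
MATCH (:Airport {city: 'Malmo'})-[f:CONNECTED_TO]->(:Airport {city: 'New York'})
WHERE f.date = date('2024-03-01')
RETURN f.flightNumber, f.departure;
```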
Initial Model
Question 1: What flights will take me from Malmo to New York on Friday?
Question 2: Mom is on flight AY189. When will she land?
Comment: To find flight AY189, we need to traverse every relationship in the graph,
because it is impossible to anchor on relationships. This query is very inefficient!
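Under the same assumed names (CONNECTED_TO, flightNumber, arrival), Question 2 would look roughly like this on the initial model:

```cypher
// No node to anchor on: every CONNECTED_TO relationship is inspected for the flight number.
MATCH (:Airport)-[f:CONNECTED_TO]->(destination:Airport)
WHERE f.flightNumber = 'AY189'
RETURN f.arrival, destination.code
```

Recent Neo4j versions also offer indexes on relationship properties, which may reduce this cost, but the modeling concerns on the following slides still apply.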
Initial Model
Question 1: What flights will take me from Malmo to New York on Friday?
More questions:
• What if we want to connect Customers or Staff to a flight? → Not possible!
• What if a flight was rerouted to another Airport due to weather? → Not possible!
Given some of the queries we imagine for our data, a flight really should be a node
Refactor: Create Intermediate Flight Nodes
Question 1: What flights will take me from Malmo to New York on Friday?
Question 2: Mom is on flight AY189. When will she land?
Adding Flight nodes lets us anchor on flight data, reducing traversal
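A sketch of this refactor, assuming the initial model stored flights as CONNECTED_TO relationships with flightNumber, date, departure and arrival properties. The ORIGINATES_FROM and LANDS_IN types are hypothetical names.

```cypher
// Each flight relationship becomes a Flight node connected to both airports.
MATCH (origin:Airport)-[old:CONNECTED_TO]->(destination:Airport)
CREATE (f:Flight {flightNumber: old.flightNumber, date: old.date,
                  departure: old.departure, arrival: old.arrival})
CREATE (f)-[:ORIGINATES_FROM]->(origin)
CREATE (f)-[:LANDS_IN]->(destination)
DELETE old;

// Question 2 now anchors directly on the Flight node.
MATCH (f:Flight {flightNumber: 'AY189'})-[:LANDS_IN]->(destination:Airport)
RETURN f.arrival, destination.code;
```

Customers, staff or a rerouted destination could now be attached to the Flight node with further relationships.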
Refactor: Create Intermediate Flight Nodes
Question 1: What flights will take me from Malmo to New York on Friday?
Question 2: Mom is on flight AY189. When will she land?
Adding Flight nodes lets us anchor on flight data, reducing traversal
Note: Airlines are required to publish flight plans 12 months in advance.
How much work must Neo4j do to answer Question 1?
• Neo4j must check every flight leaving Malmo, then consult the flight data.
Then we check which of those flights land in the desired place!
How can we elevate the flight date for better efficiency?
Refactor: Create AirportDay Intermediate Nodes
Question 1: What flights will take me from Malmo to New York on Friday?
Question 2: Mom is on flight AY189. When will she land?
Adding the Intermediate node AirportDay:
▪ It reduces the number of relationships in Airport nodes, since there are fewer days than flights
▪ We still need to check every AirportDay to find the right date, but the traversals are reduced
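With AirportDay nodes, Question 1 might be written as follows. HAS_DAY appears in the slides; the ORIGINATES_FROM and LANDS_IN types, the date property on AirportDay and the chosen date (a Friday) are assumptions, as is the idea that flights connect to the AirportDay at both ends and land on the same day.

```cypher
// Assumed shape: (Airport)-[:HAS_DAY]->(AirportDay)<-[:ORIGINATES_FROM]-(Flight)-[:LANDS_IN]->(AirportDay)
// Filter the origin's AirportDay nodes by date before touching any Flight.
MATCH (:Airport {city: 'Malmo'})-[:HAS_DAY]->(day:AirportDay)
WHERE day.date = date('2024-03-01')
MATCH (day)<-[:ORIGINATES_FROM]-(f:Flight)-[:LANDS_IN]->(:AirportDay)<-[:HAS_DAY]-(:Airport {city: 'New York'})
RETURN f.flightNumber, f.departure
```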
Refactor: Create AirportDay Intermediate Nodes
Question 1: What flights will take me from Malmo to New York on Friday?
Question 2: Mom is on flight AY189. When will she land?
Adding the Intermediate node AirportDay
When the model changes, we must check that older queries still work. What about Q1 and Q2? Both still work
But how can we reduce wasted traversal on dates even further?
Possible Refactor: Change Relationship Type to Date
Question 1: What flights will take me from Malmo to New York on Friday?
Question 2: Mom is on flight AY189. When will she land?
We make date a relationship type
▪ It hardly changes the model, but performance improves. Now we can traverse only to the relevant AirportDay, and Q2 is unaffected.
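A sketch of Question 1 with the date encoded in the relationship type. The naming scheme ON_2024_03_01 is an assumption (the slides do not show the actual type names), as are ORIGINATES_FROM and LANDS_IN and same-day arrival.

```cypher
// One relationship type per date replaces a generic type plus a date filter.
MATCH (:Airport {city: 'Malmo'})-[:ON_2024_03_01]->(day:AirportDay)
MATCH (day)<-[:ORIGINATES_FROM]-(f:Flight)-[:LANDS_IN]->(:AirportDay)<-[:ON_2024_03_01]-(:Airport {city: 'New York'})
RETURN f.flightNumber, f.departure
```

Because the relationship type is checked before any downstream node is visited, AirportDay nodes for other dates are never reached.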
Possible Refactor: Change Relationship Type to Date
Question 1: What flights will take me from Malmo to New York on Friday?
Question 2: Mom is on flight AY189. When will she land?
We make date a relationship type
Comment: Are Airport nodes necessary? If we remove them, then:
• We could remove a modest number of Airport nodes and many HAS_DAY relationships
Possible Refactor: Remove Airport Nodes
Question 1: What flights will take me from Malmo to New York on Friday?
Question 2: Mom is on flight AY189. When will she land?
We remove Airport nodes → it is less intuitive but more efficient
Comment: But what if no direct flight is available? How do we find an itinerary (connecting flights)?
The query must check each flight and its destinations, and second-order destinations... → Inefficient!
Refactor: Add Destination Intermediate Nodes
Question 1: What flights will take me from Malmo to New York on Friday?
Question 2: Mom is on flight AY189. When will she land?
Adding the intermediate node Destination → queries on destination are efficient
The scope of the graph grows proportionally to the number of Destinations served by an airport,
not the number of Flights. Airports have multiple flights per destination (at different times of day)
Comment: Does this refactor affect Q2? No
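With Destination nodes, Question 1 could be sketched as below for a model without Airport nodes. The HAS_DESTINATION and FLYING_TO types, the code and date properties and the chosen date are all hypothetical.

```cypher
// Assumed shape: (AirportDay)-[:HAS_DESTINATION]->(Destination)<-[:FLYING_TO]-(Flight)
// Go from the origin's day to the single matching Destination, then only to its flights.
MATCH (day:AirportDay {code: 'MMX', date: date('2024-03-01')})
      -[:HAS_DESTINATION]->(:Destination {code: 'JFK'})
      <-[:FLYING_TO]-(f:Flight)
RETURN f.flightNumber, f.departure
```

Flights to other destinations are never touched, and Question 2 still anchors on the Flight node by flight number.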