Graph model refactoring
Introduction
Building on the Cypher® basic concepts, this guide provides a worked example of changing a graph model. Upon finishing this guide, you should be able to evolve your graph model based on changing requirements.
Airports dataset
In this guide we’re going to use an airports dataset that contains connections between US airports in January 2008. We have the data in a CSV file, and this is the graph model that we’re going to import it into:
Before we import any data, we’re going to create a unique constraint on the Airport
label and code
property to ensure that we don’t accidentally import duplicate airports.
The following query creates this constraint:
CREATE CONSTRAINT airport_id
FOR (airport:Airport) REQUIRE airport.code IS UNIQUE
0 rows available after 86 ms, consumed after another 0 ms. Added 1 constraints |
And the following query loads the data from a CSV file using the LOAD CSV
tool:
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-contrib/training/master/modeling/data/flights_1k.csv" AS row
MERGE (origin:Airport {code: row.Origin})
MERGE (destination:Airport {code: row.Dest})
MERGE (origin)-[connection:CONNECTED_TO {
airline: row.UniqueCarrier,
flightNumber: row.FlightNum,
date: date({year: toInteger(row.Year), month: toInteger(row.Month), day: toInteger(row.DayofMonth)}),
cancelled: row.Cancelled,
diverted: row.Diverted}]->(destination)
ON CREATE SET connection.departure = localtime(apoc.text.lpad(row.CRSDepTime, 4, "0")),
connection.arrival = localtime(apoc.text.lpad(row.CRSArrTime, 4, "0"))
This query:
-
Creates a node with an
Airport
label with acode
property that has a value from theOrigin
column in the CSV file -
Creates a node with an
Airport
label with acode
property that has a value from theDest
column in the CSV file -
Creates a relationship of type
CONNECTED_TO
with several properties based on columns in the CSV file.
If we run this query we’ll see the following output:
Added 62 labels, created 62 nodes, set 7062 properties, created 1000 relationships, completed after 376 ms. |
This model is a good starter one, but there are some improvements that we can make.
Convert property to boolean
The diverted
and cancelled
properties on the CONNECTED_TO
relationships contain string values of 1
and 0
.
Since these values are representing booleans, we can use the apoc.refactor.normalizeAsBoolean
procedure to convert the values from strings to booleans.
The following query does the conversion for the diverted
property:
MATCH (:Airport)-[connectedTo:CONNECTED_TO]->(:Airport)
CALL apoc.refactor.normalizeAsBoolean(connectedTo, "diverted", ["1"], ["0"])
RETURN count(*)
count(*) |
---|
1000 |
And the following query does the conversion for the cancelled
property:
MATCH (origin:Airport)-[connectedTo:CONNECTED_TO]->(departure)
CALL apoc.refactor.normalizeAsBoolean(connectedTo, "cancelled", ["1"], ["0"])
RETURN count(*)
count(*) |
---|
1000 |
If we have a lot of relationships to update, we may get an OutOfMemory exception if we try to refactor them all in one transaction.
We can therefore process them in batches using the apoc.periodic.iterate
procedure.
The following query does this for the cancelled
and reverted
properties in the same query:
UNWIND ["cancelled", "reverted"] AS propertyToDelete
CALL apoc.periodic.iterate(
"MATCH (:Airport)-[connectedTo:CONNECTED_TO]->(:Airport) RETURN connectedTo",
"CALL apoc.refactor.normalizeAsBoolean(connectedTo, $propertyToDelete, ['1'], ['0'])
RETURN count(*)",
{params: {propertyToDelete: propertyToDelete}, batchSize: 100})
YIELD batches
RETURN propertyToDelete, batches
The apoc.periodic.iterate
procedure in this query takes in three parameters:
-
An outer Cypher query that finds and returns a stream of
CONNECTED_TO
relationships to be processed. -
An inner Cypher query that processes those
CONNECTED_TO
relationships, converting to boolean any values for the specified property on those relationships. It does this using theapoc.refactor.normalizeAsBoolean
procedure, which itself takes in several parameters:-
the entity on which the property exists
-
the name of the property to normalize
-
a list of values that should be considered
true
-
a list of values that should be considered
false
-
-
Configuration values for the procedure, including:
-
params
- parameters passed into those Cypher queries -
batchSize
- controls the number of inner statements that are run within a single transaction
-
When we run this query we see the following output:
propertyToDelete | batches |
---|---|
"cancelled" |
10 |
"reverted" |
10 |
Once we’ve done this, we can write the following query to return all cancelled connections:
MATCH (origin:Airport)-[connectedTo:CONNECTED_TO]->(destination)
WHERE connectedTo.cancelled
RETURN origin.code AS origin,
destination.code AS destination,
connectedTo.date AS date,
connectedTo.departure AS departure,
connectedTo.arrival AS arrival
origin | destination | date | departure | arrival |
---|---|---|---|---|
"LAS" |
"OAK" |
2008-01-03 |
07:00 |
08:30 |
"LAX" |
"SFO" |
2008-01-03 |
09:05 |
10:25 |
"LAX" |
"OAK" |
2008-01-03 |
11:00 |
12:15 |
"LAX" |
"SJC" |
2008-01-03 |
19:30 |
20:35 |
"LAX" |
"SFO" |
2008-01-03 |
16:20 |
17:40 |
"MDW" |
"STL" |
2008-01-03 |
11:10 |
12:15 |
"MDW" |
"BDL" |
2008-01-03 |
08:45 |
11:40 |
"MDW" |
"DTW" |
2008-01-03 |
06:00 |
08:05 |
"MDW" |
"STL" |
2008-01-03 |
14:45 |
15:50 |
"MDW" |
"BNA" |
2008-01-03 |
19:25 |
20:45 |
"OAK" |
"BUR" |
2008-01-03 |
13:10 |
14:15 |
"OAK" |
"BUR" |
2008-01-03 |
17:05 |
18:10 |
Create node from relationship
Next, imagine that we want to write a query that finds a specific flight.
This is quite difficult with our existing model because flights are represented as relationships.
We can evolve our model to create a Flight
node from the properties stored on the CONNECTED_TO
relationship.
The following query does this refactoring:
CALL apoc.periodic.iterate(
"MATCH (origin:Airport)-[connected:CONNECTED_TO]->(destination:Airport) RETURN origin, connected, destination",
"CREATE (flight:Flight {
date: connected.date,
airline: connected.airline,
number: connected.flightNumber,
departure: connected.departure,
arrival: connected.arrival,
cancelled: connected.cancelled,
diverted: connected.diverted
})
MERGE (origin)<-[:ORIGIN]-(flight)
MERGE (flight)-[:DESTINATION]->(destination)
DELETE connected",
{batchSize: 100})
As with our previous query, this query uses the apoc.periodic.iterate
procedure so that we can do the refactoring in batches rather than within a single transaction.
The procedure takes in three parameters:
-
An outer Cypher query that finds and returns a stream of
CONNECTED_TO
relationships, and origin and destination airports that need to be processed. -
An inner Cypher query that processes those entities, creating a node with the label
Flight
and creating relationships from that node to the origin and destination airports. -
batchSize
configuration, which sets to100
the number of inner statements that are run within a single transaction.
If we execute the query we’ll see the following output:
batches | total | timeTaken | committedOperations | failedOperations | failedBatches | retries | errorMessages | batch | operations | wasTerminated |
---|---|---|---|---|---|---|---|---|---|---|
10 |
1000 |
0 |
1000 |
0 |
0 |
0 |
{} |
{total: 10, committed: 10, failed: 0, errors: {}} |
{total: 1000, committed: 1000, failed: 0, errors: {}} |
FALSE |
We can also do this refactoring using the apoc.refactor.extractNode
procedure.
CALL apoc.periodic.iterate(
"MATCH (origin:Airport)-[connected:CONNECTED_TO]->(destination:Airport)
RETURN origin, connected, destination",
"CALL apoc.refactor.extractNode([connected], ['Flight'], 'DESTINATION', 'ORIGIN')
YIELD input, output, error
RETURN input, output, error",
{batchSize: 100});
This does the same as the previous query, but the outer Cypher query uses the apoc.refactor.extractNode
procedure to create the Flight
node and create relationships to origin and destination airports.
If we run this query we’ll see the following output:
batches | total | timeTaken | committedOperations | failedOperations | failedBatches | retries | errorMessages | batch | operations | wasTerminated |
---|---|---|---|---|---|---|---|---|---|---|
10 |
1000 |
0 |
1000 |
0 |
0 |
0 |
{} |
{total: 10, committed: 10, failed: 0, errors: {}} |
{total: 1000, committed: 1000, failed: 0, errors: {}} |
FALSE |
Create node from property
At the moment the airline for our flights is stored in the airline
property on Flight
nodes.
This means that if we wanted to return a stream of all airlines we’d need to scan through every flight and check the airline
property on each of those flights.
We can make it easier, and more efficient, to write this query by creating a node with an Airline
label for each airline:
Let’s first create a constraint on the Airline
label and name
property so that we don’t create duplicate airline nodes:
CREATE CONSTRAINT airline_id
FOR (airline:Airline) REQUIRE airline.name IS UNIQUE
0 rows available after 107 ms, consumed after another 0 ms. Added 1 constraints |
And now we can execute the following query to do the refactoring:
CALL apoc.periodic.iterate(
'MATCH (flight:Flight) RETURN flight',
'MERGE (airline:Airline {name:flight.airline})
MERGE (flight)-[:AIRLINE]->(airline)
REMOVE flight.airline',
{batchSize:10000, iterateList:true, parallel:false}
)
Again we’re using the apoc.periodic.iterate
procedure, with the following parameters:
-
An outer Cypher statement that returns a stream of
Flight
nodes to be processed -
An inner Cypher statementthat processes these flight nodes, creating
Airline
nodes based on flights'airline
property and created anAIRLINE
relationship from theFlight
to theAirline
node. We then remove theairline
property from theFlight
node.
If we run this query we’ll see the following output:
batches | total | timeTaken | committedOperations | failedOperations | failedBatches | retries | errorMessages | batch | operations | wasTerminated |
---|---|---|---|---|---|---|---|---|---|---|
1 |
1000 |
0 |
1000 |
0 |
0 |
0 |
{} |
{total: 1, committed: 1, failed: 0, errors: {}} |
{total: 1000, committed: 1000, failed: 0, errors: {}} |
FALSE |
We can then write the following query to find the airlines and number of flights involving each:
MATCH (airline:Airline)<-[:AIRLINE]-(:Flight)
RETURN airline.name AS airline, count(*) AS numberOfFlights
This does the same as the previous query, but the outer Cypher query uses the apoc.refactor.extractNode
procedure to create the Flight
node and create relationships to origin and destination airports.
If we run this query we’ll see the following output:
airline | numberOfFlights |
---|---|
"WN" |
1000 |
Resources
This guide has shown how to refactor a graph model, with help from procedures in the APOC Library. Below are some resources for learning more about refactoring in Neo4j:
Was this page helpful?