Exploring Stackoverflow
Every developer has a tab open Stack Overflow, the massively popular question-and-answer site. This dataset explores users, questions, answers, comments, and tags related to the neo4j
- and cypher
- tagged questions on Stack Overflow.
This guide will show you how to:
-
Create: load questions, answers, comments, tags, and users into the graph
-
Find: Find unanswered questions
-
Query: discover the most engaged users and most popular questions
Throughout the guide you’ll find Cypher statements that you can execute, by clicking on them and then executing them by hitting the run button.
The Model
Relationships
Users ask questions, comment, and provide answers.
(:User)-[:ASKED]→(:Question)
(:User)-[:COMMENTED]→(:Comment)
(:User)-[:PROVIDED]→(:Answer)
Answers attempt to answer questions, comments are made on questions, and questions are associated with tags.
(:Answer)-[:ANSWERED]→(:Question)
(:Comment)-[:COMMENTED_ON]→(:Question)
(:Question)-[:TAGGED]→(:Tag)
If you want to see it yourself, run:
CALL db.schema.visualization;
Load JSON Import
There is already data preloaded in this graph, if you want to extend (more pages or other tags) or update it with the most recent questions, please modify and run the statement below.
Update this dataset using apoc.load.json
. Run the query as-is to update or add a tag to the tags
array to extend.
// look for several pages of questions
WITH ["neo4j","cypher"] as tags
UNWIND tags as tagName
UNWIND range(1,2) as page
WITH "https://api.stackexchange.com/2.3/questions?page="+page+"&pagesize=25&order=desc&sort=creation&tagged="+tagName+"&site=stackoverflow&filter=!5-i6Zw8Y)4W7vpy91PMYsKM-k9yzEsSC1_Uxlf" as url
CALL apoc.load.json(url) YIELD value
CALL apoc.util.sleep(250) // careful with throttling
UNWIND value.items AS q
// create the questions
MERGE (question:Question {uuid:q.question_id})
ON CREATE SET question.title = q.title,
question.link = q.share_link,
question.creation_date = q.creation_date,
question.accepted_answer_id=q.accepted_answer_id,
question.view_count=q.view_count,
question.answer_count=q.answer_count,
question.body_markdown=q.body_markdown
// who asked the question
MERGE (owner:User {uuid:coalesce(q.owner.user_id,'deleted')})
ON CREATE SET owner.display_name = q.owner.display_name
MERGE (owner)-[:ASKED]->(question)
// what tags do the questions have
FOREACH (tagName IN q.tags |
MERGE (tag:Tag {name:tagName})
ON CREATE SET tag.link = "https://stackoverflow.com/questions/tagged/" + tag.name
MERGE (question)-[:TAGGED]->(tag))
// who answered the questions?
FOREACH (a IN q.answers |
MERGE (question)<-[:ANSWERED]-(answer:Answer {uuid:a.answer_id})
ON CREATE SET answer.is_accepted = a.is_accepted,
answer.link=a.share_link,
answer.title=a.title,
answer.body_markdown=a.body_markdown,
answer.score=a.score,
answer.favorite_score=a.favorite_score,
answer.view_count=a.view_count
MERGE (answerer:User {uuid:coalesce(a.owner.user_id,'deleted')})
ON CREATE SET answerer.display_name = a.owner.display_name
MERGE (answer)<-[:PROVIDED]-(answerer)
)
// who commented ont he question
FOREACH (c in q.comments |
MERGE (question)<-[:COMMENTED_ON]-(comment:Comment {uuid:c.comment_id})
ON CREATE SET comment.link=c.link, comment.score=c.score
MERGE (commenter:User {uuid:coalesce(c.owner.user_id,'deleted')})
ON CREATE SET commenter.display_name = c.owner.display_name
MERGE (comment)<-[:COMMENTED]-(commenter)
);
Read More: Import from StackOverflow API
Basic Queries
Labels in the graph and counts for each:
MATCH (n)
RETURN labels(n) as label, count(*) as freq
ORDER BY freq DESC;
Relationship-types in the graph and counts for eachabel:
MATCH ()-[r]->()
RETURN type(r) as type, count(*) as freq
ORDER BY freq DESC;
Top Tags
Which are the most popular tags?
MATCH (q:Question)-[:TAGGED]->(t:Tag)
RETURN t.name, count(q) AS questions
ORDER BY questions DESC
LIMIT 5;
Exploring Users
Who are the top users asking questions?
MATCH (u:User)-[:ASKED]->(q:Question)
RETURN u.display_name, count(*) AS questions
ORDER by questions DESC
LIMIT 10;
Who’s answering?
Ordered by number of answers
MATCH (u:User)-[:PROVIDED]->(a:Answer)-[:ANSWERED]->(q:Question)
RETURN u.display_name as user,COUNT(a) AS answers, avg(a.score) as avg_score
ORDER BY answers DESC LIMIT 10;
Ordered by max score, filtered for a particular tag
MATCH (u:User)-[:PROVIDED]->(a:Answer)-[:ANSWERED]->
(q:Question)-[:TAGGED]->(:Tag {name:"cypher"})
RETURN u.display_name as user,COUNT(a) AS answers, max(a.score) as max_score
ORDER BY max_score DESC LIMIT 10;
What’s the shortest path between users?
MATCH path = allShortestPaths(
(u1:User {display_name:"alexanoid"})-[*]-(u2:User {display_name:"InverseFalcon"})
)
RETURN path LIMIT 1;
User Engagement
User engagement over time:
MATCH (u:User)-[:PROVIDED]->()-[:ANSWERED]->
(q:Question)-[:TAGGED]->(t:Tag)
WHERE u.display_name = "InverseFalcon"
RETURN apoc.date.format(q.creation_date,'s','yyyy-MM') as month,
count(distinct q) as count, collect(distinct t.name) as tags
ORDER BY month asc
Unanswered Questions
What are the tags for unanswered questions?
MATCH (q:Question)-[:TAGGED]->(t:Tag)
WHERE NOT t.name IN ['neo4j','cypher']
AND NOT (q)<-[:ANSWERED]-()
RETURN t.name as tag, count(q) AS questions
ORDER BY questions DESC LIMIT 10;
How are tags related to other tags?
Tag correlations:
MATCH (t1:Tag)<-[:TAGGED]-()-[:TAGGED]->(t2:Tag)
WHERE id(t1) < id(t2) and t1.name <> 'neo4j' and t2.name <> 'neo4j'
RETURN t1.name, t2.name,count(*) as freq
ORDER BY freq desc LIMIT 10;
Virtual Graphs: Tags
Project tags via co-occurrence with virtual relationships.
MATCH (t1:Tag)<-[:TAGGED]-()-[:TAGGED]->(t2:Tag)
WHERE id(t1) < id(t2) and t1.name <> 'neo4j' and t2.name <> 'neo4j'
WITH t1, t2,count(*) as freq where freq > 3
RETURN t1,t2, apoc.create.vRelationship(t1,'OCCURRED',{freq:freq},t2) as rel
If you want to you can also materialize those relationships in the graph and then explore the data in Browser or Bloom visually forming clusters.
MATCH (t1:Tag)<-[:TAGGED]-()-[:TAGGED]->(t2:Tag)
WHERE id(t1) < id(t2) and t1.name <> 'neo4j' and t2.name <> 'neo4j'
WITH t1, t2,count(*) as freq where freq > 3
MERGE (t1)-[r:OCCURRED]-(t2) SET r.freq=freq
RETURN count(*)
Virtual Graphs: Social Network
We can do something similar for commenters - what users keep on running into each other in the comments?
MATCH p1=(u1:User)-[:COMMENTED]->(c1:Comment)-[:COMMENTED_ON]-(q:Question)
MATCH p2=(u2:User)-[:COMMENTED]->(c2:Comment)-[:COMMENTED_ON]-(q)
WHERE id(u1) < id(u2)
WITH u1, u2, count(distinct q) as freq
WHERE freq > 2
RETURN u1, u2, apoc.create.vRelationship(u1,'OCCURRED',{freq:freq},u2) as rel