In this week’s episode, Lju and Alexander explore a National Basketball Association (NBA) dataset, from FiveThirtyEight. Catch-up on the episode now!
This week we explored a NBA data set. The data consisted of teams and all of their matches, along with points scored, match dates and location, as well as pre and post match Elo ratings.
The following questions for the data set came up during the stream:
-
Historical results between two teams
-
The teams with the biggest Elo rating difference
-
Find out home/away matches
-
Can we predict results based on past experiencees?
You can view and play with the data model in Arrows App here.
One of the questions that came up was rather than using the above model, why didn’t we choose using PLAYED_IN_AWAY
and PLAYED_IN_HOME
relationship types instead. This is an option too - and if primarily the questions you are asking are around home/away matches, then this would give a slightly faster performance.
To keep the data model as simple as possible, we decided to keep the relationship between Team
and Match
generic for now. We’ll give an example query of how to bring back Home/Away teams shortly.
One thing to bear about the data is the use of _iscopy
property - you will spot that the rows 'repeat', but they are reflecting the view point of each of the teams, i.e. the home team as the primary team in the row, and the away team as the primary team. As a result, we can safely skip _iscopy=1
when loading the data.
To load the data into the database, we use the following query:
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-elo/nbaallelo.csv" AS row
//skipping doubles, also Aura Free limit of 50k nodes, ~ 100 teams
WITH row LIMIT 99900 WHERE row._iscopy="0"
MERGE (ht:Team {code:row.team_id})
ON CREATE SET ht.name = row.fran_id
MERGE (at:Team {code:row.opp_id})
ON CREATE SET at.name = row.opp_fran
CREATE (m:Match {id:row.game_id, venue:right(row.game_id, 3), date:row.date_game})
WITH ht, at, m, row.pts AS hpoints, row.opp_pts AS apoints,
row.elo_i AS hse, row.elo_n AS hee, row.opp_elo_i AS ase,
row.opp_elo_n AS aee
CREATE (ht)-[:PLAYED_IN {points:hpoints, startElo:hse, endElo:hee}]->(m)
CREATE (at)-[:PLAYED_IN {points:apoints, startElo:ase, endElo:aee}]->(m)
Note
|
When using LOAD CSV , integers and floats may be treated as strings. As a result, don’t forget to check, especially if you get unexpected results, e.g. row._iscopy="0" versus row._iscopy=0
|
Some of you may have spotted that all of our numerical data is being stored as strings. Whilst we could have sorted that at the load stage, we can also quickly handle it now. We’re going to change points to integers, and Elo ratings to floats, found as properties in the PLAYED_IN
relationships:
MATCH ()-[r:PLAYED_IN]->()
SET r.points = tointeger(r.points)
SET r.startElo = tofloat(r.startElo)
SET r.endElo = tofloat(r.endElo)
Question 1 - Historical results between two teams
//Look at the Huskies/Knicks rivalries
//Using the Match ID for the ISO 'format' date
MATCH (t1:Team {code:"TRH"})
MATCH (t2:Team {code:"NYK"})
MATCH (t1)-[r1]->(m)<-[r2]-(t2)
//Using the Match ID for the ISO 'format' date
RETURN left(m.id, size(m.id)-4) AS Date, t1.name, t2.name,
r1.points > r2.points AS `Huskies won`, m.venue ORDER BY Date
Question 2 - The teams with the biggest Elo rating difference
MATCH (t:Team)-[r]->(m)
WITH t.name AS name, min(r.endElo) AS lowestElo, max(r.endElo) AS highestElo
WITH name, lowestElo, highestElo, highestElo - lowestElo AS diffElo
RETURN name, lowestElo, highestElo, diffElo ORDER BY diffElo DESC
Question 3 - Find out home/away matches
MATCH (t:Team {name:"Sixers"})-[r:PLAYED_IN]->(m:Match)<-[r2:PLAYED_IN]-(opp:Team)
//match venue to team code for home matches, or <> for away matches
WHERE m.venue = t.code
RETURN left(m.id, size(m.id)-4) AS Date, opp.name AS `Played Home Against`,
r.points + " - " + r2.points AS `Score`
Question 4 - Can we predict results based on past experiences?
We'll come back to you on this one! Watch this space :)