Skip to content

Latest commit

 

History

History
122 lines (90 loc) · 8.84 KB

week-1.adoc

File metadata and controls

122 lines (90 loc) · 8.84 KB

Week 1 - The NBA data set

In this week’s episode, Lju and Alexander explore a National Basketball Association (NBA) dataset, from FiveThirtyEight. Catch-up on the episode now!

maxresdefault
Week 1 catch-up video

The data set

This week we explored a NBA data set. The data consisted of teams and all of their matches, along with points scored, match dates and location, as well as pre and post match Elo ratings.

img\img repo

The questions

The following questions for the data set came up during the stream:

  • Historical results between two teams

  • The teams with the biggest Elo rating difference

  • Find out home/away matches

  • Can we predict results based on past experiencees?

The data model

You can view and play with the data model in Arrows App here.

img\NBAGraph
Figure 1. The NBA data model we used in the stream

One of the questions that came up was rather than using the above model, why didn’t we choose using PLAYED_IN_AWAY and PLAYED_IN_HOME relationship types instead. This is an option too - and if primarily the questions you are asking are around home/away matches, then this would give a slightly faster performance.

To keep the data model as simple as possible, we decided to keep the relationship between Team and Match generic for now. We’ll give an example query of how to bring back Home/Away teams shortly.

Loading the data

One thing to bear about the data is the use of _iscopy property - you will spot that the rows 'repeat', but they are reflecting the view point of each of the teams, i.e. the home team as the primary team in the row, and the away team as the primary team. As a result, we can safely skip _iscopy=1 when loading the data.

img\img data

To load the data into the database, we use the following query:

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-elo/nbaallelo.csv" AS row
//skipping doubles, also Aura Free limit of 50k nodes, ~ 100 teams
WITH row LIMIT 99900 WHERE row._iscopy="0"
MERGE (ht:Team {code:row.team_id})
    ON CREATE SET ht.name = row.fran_id
MERGE (at:Team {code:row.opp_id})
    ON CREATE SET at.name = row.opp_fran
CREATE (m:Match {id:row.game_id, venue:right(row.game_id, 3), date:row.date_game})
WITH ht, at, m, row.pts AS hpoints, row.opp_pts AS apoints,
    row.elo_i AS hse, row.elo_n AS hee, row.opp_elo_i AS ase,
    row.opp_elo_n AS aee
CREATE (ht)-[:PLAYED_IN {points:hpoints, startElo:hse, endElo:hee}]->(m)
CREATE (at)-[:PLAYED_IN {points:apoints, startElo:ase, endElo:aee}]->(m)
Note
When using LOAD CSV, integers and floats may be treated as strings. As a result, don’t forget to check, especially if you get unexpected results, e.g. row._iscopy="0" versus row._iscopy=0

Querying the data

Some of you may have spotted that all of our numerical data is being stored as strings. Whilst we could have sorted that at the load stage, we can also quickly handle it now. We’re going to change points to integers, and Elo ratings to floats, found as properties in the PLAYED_IN relationships:

MATCH ()-[r:PLAYED_IN]->()
SET r.points = tointeger(r.points)
SET r.startElo = tofloat(r.startElo)
SET r.endElo = tofloat(r.endElo)

Question 1 - Historical results between two teams

//Look at the Huskies/Knicks rivalries
//Using the Match ID for the ISO 'format' date
MATCH (t1:Team {code:"TRH"})
MATCH (t2:Team {code:"NYK"})
MATCH (t1)-[r1]->(m)<-[r2]-(t2)
//Using the Match ID for the ISO 'format' date
RETURN left(m.id, size(m.id)-4) AS Date, t1.name, t2.name,
    r1.points > r2.points AS `Huskies won`, m.venue ORDER BY Date

Question 2 - The teams with the biggest Elo rating difference

MATCH (t:Team)-[r]->(m)
WITH t.name AS name, min(r.endElo) AS lowestElo, max(r.endElo) AS highestElo
WITH name, lowestElo, highestElo, highestElo - lowestElo AS diffElo
RETURN name, lowestElo, highestElo, diffElo ORDER BY diffElo DESC

Question 3 - Find out home/away matches

MATCH (t:Team {name:"Sixers"})-[r:PLAYED_IN]->(m:Match)<-[r2:PLAYED_IN]-(opp:Team)
//match venue to team code for home matches, or <> for away matches
WHERE m.venue = t.code
RETURN left(m.id, size(m.id)-4) AS Date, opp.name AS `Played Home Against`,
    r.points + " - " + r2.points AS `Score`

Question 4 - Can we predict results based on past experiences?

We'll come back to you on this one! Watch this space :)