Skip to content

Latest commit

 

History

History
125 lines (89 loc) · 9.53 KB

week-2.adoc

File metadata and controls

125 lines (89 loc) · 9.53 KB

Week 2 - The Board Game Geek data set

In this week’s episode, Lju and Alexander explore a scraped data set from Board Game Geek, from Sean Beck’s Github repository. Catch-up on the episode now!

maxresdefault
Week 1 catch-up video

The data set

This week we explored a scraped dataset from Board Game Geek. The data consisted of games, ratings, complexity ratings, number of players, playing time, and so forth.

As we’re keen to continue with the theme of 'no download', Alexander managed to locate a cut of data on Sean Beck’s public Github repository, which we use in this session.

img\bb csv

The questions

The following questions for the data set came up during the stream:

  • What are the top rated games and what they are called?

  • What games that are most popular?

  • What games are available for x players with complexity of y?

  • What games can I play with x friends which require at least y minutes?

The data model

You can view and play with the data model in Arrows App here.

The Board Game Geek model we used in the stream

img\bbgmodel

Based on the questions we’re looking to answer, we’re interested in the following fields in the CSV:

  • Game name and id

  • Rating

  • Min and max playeers

  • Game playing time

  • Game complexity/weight

  • Number of users who rated the game (we forgot to add this in the stream! We will revisit this data set and add it accordingly)

This was a bumper session for modelling decisions! Based on the questions, we could have easily just had a single Game node, with all the mentioned fields as properties. There are a couple of issues with this:

  • Firstly, it wouldn’t be a very graphy graph! We’d have a lot of lonely nodes :)

  • Secondly, we need to think about what questions we are asking - we’ve got questions that ask about games based on number of players - and as a general rule of thumb when thinking if something is a node/relationship/property - if we use something a lot, it’s probably going to be a node

In the end, as we are interested in games around player count, as well as identifying that player count are discrete values, this is what lead to the chosen model. This is by no means a complete job! We still need to think about playing time, weight and rating, and we may well decide that those shouldn’t live on Game. But for brevity, we ran with this model during the session.

Loading the data

To load the data into the database, we use the following query:

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/ThaWeatherman/scrapers/b8757711296703356891f8f4e7757de6698a6b5b/boardgamegeek/games.csv" AS ROW
WITH ROW WHERE tointeger(row.playingtime) > 9
    AND tointeger(row.playingtime) <61
CREATE (g:Game {name:row.name, weight:tofloat(row.average_weight),
    rating:tofloat(row.average_rating),
    playingTime:tointeger(row.playingtime), id:row.id})
MERGE (pmin:PlayerCount {value:tointeger(row.minplayers)})
MERGE (pmax:PlayerCount {value:tointeger(row.maxplayers)})
WITH g, pmin, pmax
CREATE (g)-[:HAS_MIN_PLAYERS]->(pmin)
CREATE (g)-[:HAS_MAX_PLAYERS]->(pmax)

Bearing in mind the 50k node limit in Aura Free, we decided to limit the games loaded based on playing time of between 10 and 60 minutes.

Querying the data

Whereas last week we cast strings to integers and floats after we loaded the data, this week, we did it whilst loading the data. You can see what we did in the load query.

Question 1 - What are the top rated games and what they are called?

MATCH (g:Game)
RETURN g.name, g.rating
    ORDER BY g.rating DESC LIMIT 10

Question 2 - What games that are most popular?

We forgot to load the data for this, so we’ll come back to this in a future episode! You are, of course, welcome to import the data (field name: users_rated) and have a go yourself!

Question 3 - What games are available for x players with complexity of y?

WITH 4 AS complexity, 3 AS min, 5 AS max
MATCH (g:Game)-[:HAS_MIN_PLAYERS]->(pcmin:PlayerCount),
(g)-[:HAS_MAX_PLAYERS]->(pcmax:PlayerCount)
WHERE g.weight >= complexity AND pcmin.value <= max
    AND pcmax.value >= min
RETURN g.name AS name, g.weight AS complexity,
    [pcmin.value, pcmax.value] AS `player range`
    ORDER BY complexity
Note
We’ve corrected the query as it was flagged that the query put together on the stream would miss certain games!

Question 4 - What games can I play with x friends which require at least y minutes?

WITH 30 AS playingTime, 2 AS min, 2 AS max
MATCH (g:Game)-[:HAS_MIN_PLAYERS]->(pcmin:PlayerCount),
(g)-[:HAS_MAX_PLAYERS]->(pcmax:PlayerCount)
WHERE g.playingTime >= playingTime AND pcmin.value <= max
    AND pcmax.value >= min
RETURN g.name AS name, g.playingTime AS `playing time`,
    [pcmin.value, pcmax.value] AS `player range`
    ORDER BY `playing time`