This week we want to explore the GDELT Project, which collects and analyses global news from hundreds of sources and makes it available in many formats.
For our needs the API is good enough, but if you want to go deeper there is much more to be found.
We’ll be using the CSV endpoint to get news for a given topic, including geolocation and country.
We could also use the JSON result, but it lacks the geo-information.
Unfortunately, the REST API is a bit too limited for real use, as it does not return:
- language
- timestamp
- entities (people, locations, organizations)
So to use GDELT properly, you have to query the BigQuery tables to extract all that detail.
David Allen has explored that in this repository: https://github.com/moxious/gdelt
// Preview the first rows the CSV endpoint returns for a topic
WITH "elections" AS topic
LOAD CSV WITH HEADERS FROM "https://api.gdeltproject.org/api/v2/geo/geo?query="+topic+"&MODE=country&format=csv&TIMESPAN=7d&GEORES=0" AS row
RETURN row LIMIT 5
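The topic is concatenated into the URL as-is, which works for single words. For a topic containing spaces or special characters, a minimal sketch could URL-encode it first. This assumes APOC is installed (we use it later for splitting titles anyway), and "climate change" is just a hypothetical example topic.

```cypher
// Assumption: APOC is available; "climate change" is a made-up example topic
WITH "climate change" AS topic
// apoc.text.urlencode escapes spaces and special characters for the query string
LOAD CSV WITH HEADERS FROM "https://api.gdeltproject.org/api/v2/geo/geo?query="+apoc.text.urlencode(topic)+"&MODE=country&format=csv&TIMESPAN=7d&GEORES=0" AS row
RETURN row LIMIT 5
```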
// Import the news for a topic and connect each item to its topic and country
WITH "elections" AS topic
LOAD CSV WITH HEADERS FROM "https://api.gdeltproject.org/api/v2/geo/geo?query="+topic+"&MODE=country&format=csv&TIMESPAN=7d&GEORES=0" AS row
MERGE (t:Topic {name:topic})
MERGE (n:News {url:row.URL})
ON CREATE SET n.title=row.Title, n.location=point({latitude:toFloat(row.Latitude), longitude:toFloat(row.Longitude)}), n.image=row.ImageURL
MERGE (n)-[:TOPIC]->(t)
MERGE (c:Country {name:row.Location}) ON CREATE SET c.location = n.location
MERGE (n)-[:COUNTRY]->(c)
MERGE (t)-[r:RESULTS]->(c) SET r.count=toInteger(row.LocationResultCount)
Feel free to query other topics; we used these additional ones in the session:
- climate
- china
- usa
- taiwan
- corona
- vaccines
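Instead of editing the topic and re-running the import once per topic, one option is to UNWIND the whole list and let LOAD CSV run once for each entry. This is only a sketch that reuses the import statement from above with the topic list swapped in:

```cypher
// Run the same import once per topic in the list
UNWIND ["climate", "china", "usa", "taiwan", "corona", "vaccines"] AS topic
LOAD CSV WITH HEADERS FROM "https://api.gdeltproject.org/api/v2/geo/geo?query="+topic+"&MODE=country&format=csv&TIMESPAN=7d&GEORES=0" AS row
MERGE (t:Topic {name:topic})
MERGE (n:News {url:row.URL})
ON CREATE SET n.title=row.Title, n.location=point({latitude:toFloat(row.Latitude), longitude:toFloat(row.Longitude)}), n.image=row.ImageURL
MERGE (n)-[:TOPIC]->(t)
MERGE (c:Country {name:row.Location}) ON CREATE SET c.location = n.location
MERGE (n)-[:COUNTRY]->(c)
MERGE (t)-[r:RESULTS]->(c) SET r.count=toInteger(row.LocationResultCount)
```

Because everything is written with MERGE, a news item that shows up for several topics is stored once and simply gets one TOPIC relationship per topic.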
// Show news items together with their countries and topics
MATCH (c:Country)<-[:COUNTRY]-(n:News)-[:TOPIC]->(t:Topic)
RETURN * LIMIT 20
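// The five countries with the most news items
// (the pattern comprehension also works on Neo4j versions that no longer accept a bare pattern in size())
MATCH (c:Country)
RETURN c, size([(c)<-[:COUNTRY]-() | 1]) AS newsCount
ORDER BY newsCount DESC LIMIT 5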
// News items that connect two topics
MATCH path=(t1:Topic)<-[:TOPIC]-(n:News)-[:TOPIC]->(t2:Topic)
RETURN path LIMIT 20
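The import also stores the result count reported by the API on the RESULTS relationship between a topic and a country. As a small sketch of how that property could be used, the following query lists the strongest topic and country combinations:

```cypher
// Rank topic/country pairs by the result count stored during the import
MATCH (t:Topic)-[r:RESULTS]->(c:Country)
RETURN t.name AS topic, c.name AS country, r.count AS results
ORDER BY results DESC LIMIT 10
```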
We also looked at how to expand our news graph by extracting new topics from titles.
// Count how often each word longer than 3 characters appears in the news titles
MATCH (n:News)
UNWIND apoc.text.split(toLower(n.title), "\\W+") AS word
WITH word WHERE size(word) > 3
RETURN word, count(*) ORDER BY count(*) DESC LIMIT 100
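To actually turn frequent words into topics, one possible next step is to MERGE a Topic node for each word and link it to the news items whose titles contain it. This is a hedged sketch rather than a finished approach: the threshold of 5 is an arbitrary assumption, and common filler words longer than three characters would still need to be filtered out.

```cypher
// Sketch: create topics from title words that occur in at least 5 news items
MATCH (n:News)
UNWIND apoc.text.split(toLower(n.title), "\\W+") AS word
WITH word, collect(DISTINCT n) AS news
WHERE size(word) > 3 AND size(news) >= 5   // 5 is an arbitrary, hypothetical threshold
MERGE (t:Topic {name: word})
WITH t, news
UNWIND news AS n
MERGE (n)-[:TOPIC]->(t)
```

These new topics can then be fed back into the import query to pull in more news around them.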