A fork of steam-scraper

This is a fork of [steam-scraper](https://github.com/prncc/steam-scraper). Key differences:
- Updated to a later version
- Output is now stored in an SQLite database
- Additional fields are scraped (e.g. the product description)
- Added a script for fetching news via the Steam API
- Added a script for minimizing the SQLite dataset
This repository contains Scrapy spiders for crawling products and scraping all user-submitted reviews from the Steam game store, along with a few scripts for more easily managing and deploying the spiders. It also contains the code accompanying the *Scraping the Steam Game Store* article published on the Scrapinghub and Intoli blogs.
After cloning the repository with

```bash
git clone git@github.com:lkrsnik/steam-scraper.git
```

start and activate a Python 3.6+ virtualenv with

```bash
cd steam-scraper
virtualenv -p python3 venv
. venv/bin/activate
```
Install Python requirements via:

```bash
pip install -r requirements.txt
```
By the way, on macOS you can install Python 3.6 via Homebrew:

```bash
brew install python3
```

On Ubuntu you can follow the instructions posted on askubuntu.com.
The purpose of `ProductSpider` is to discover product pages on the Steam product listing and extract useful metadata from them. A neat feature of this spider is that it automatically navigates through Steam's age verification checkpoints.
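For a sense of how such age-gate handling works in Scrapy, here is a minimal illustrative sketch; the form fields, selector, and spider name are assumptions for illustration, not necessarily what `ProductSpider` actually does.

```python
import scrapy
from scrapy.http import FormRequest

class AgeGateSketchSpider(scrapy.Spider):
    # Illustrative only -- the fork's ProductSpider may handle this differently.
    name = "age_gate_sketch"
    start_urls = ["https://store.steampowered.com/app/271590/"]  # an age-gated title

    def parse(self, response):
        if "/agecheck/" in response.url:
            # Steam redirected us to an age verification page: submit the
            # date-of-birth form with a birthday safely in the past, then
            # re-parse the real product page in the callback.
            yield FormRequest.from_response(
                response,
                formdata={"ageDay": "1", "ageMonth": "January", "ageYear": "1980"},
                callback=self.parse,
            )
        else:
            yield {"url": response.url, "title": response.css("title::text").get()}
```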
You can initiate the multi-hour crawl with

```bash
mkdir output
scrapy crawl products --logfile=output/products_all.log --loglevel=INFO -s JOBDIR=output/products_all_job -s HTTPCACHE_ENABLED=False -a sqlite_path=output/db.sqlite3
```

When it completes you should have metadata for all games (products) on Steam stored in `output/db.sqlite3`.
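Once the crawl finishes, you can sanity-check the result directly from Python. The `products` table name below is an assumption; inspect the actual schema with `.schema` in the `sqlite3` shell if the names differ.

```python
import sqlite3

# Hypothetical sanity check -- the table name is an assumption.
conn = sqlite3.connect("output/db.sqlite3")
try:
    (n_products,) = conn.execute("SELECT COUNT(*) FROM products").fetchone()
    print(f"{n_products} products in the database")
finally:
    conn.close()
```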
The purpose of `ReviewSpider` is to scrape all user-submitted reviews of a particular product from the Steam community portal. By default, it scrapes reviews of products where the column `reviews_scraped` is empty (`NULL`) and `n_reviews` is greater than 10.
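In SQL terms, that default filter corresponds to a query like the one below; the column names come from the description above, while the `products` table name is an assumption.

```python
import sqlite3

conn = sqlite3.connect("output/db.sqlite3")
# Products still awaiting review scraping, mirroring ReviewSpider's
# default filter (the table name is an assumption).
pending = conn.execute(
    "SELECT COUNT(*) FROM products "
    "WHERE reviews_scraped IS NULL AND n_reviews > 10"
).fetchone()[0]
print(f"{pending} products still need their reviews scraped")
conn.close()
```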
Start the crawl with

```bash
scrapy crawl reviews --logfile=output/reviews_all.log --loglevel=INFO -s JOBDIR=output/reviews -s HTTPCACHE_ENABLED=False -a sqlite_path=output/db.sqlite3
```
If you want to scrape all reviews, the whole job takes a few days even with Steam's generous rate limits.
The repository also includes a script that adds news for all products to the database. The news is fetched through the Steam Web API rather than scraped.

```bash
python -m scripts.get_news_api --sqlite_path output/db.sqlite3
```
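For reference, Steam exposes news through the public `GetNewsForApp` endpoint of the Web API; a minimal standalone sketch (not the fork's actual code) might look like this:

```python
import requests

# Minimal sketch of fetching news for one app via the public Steam Web API.
# The fork's scripts/get_news_api module may structure this differently.
APP_ID = 440  # Team Fortress 2, used purely as an example
resp = requests.get(
    "https://api.steampowered.com/ISteamNews/GetNewsForApp/v0002/",
    params={"appid": APP_ID, "count": 5, "maxlength": 300, "format": "json"},
    timeout=30,
)
resp.raise_for_status()
for item in resp.json()["appnews"]["newsitems"]:
    print(item["date"], item["title"])
```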
If you have a complete database but would like to produce a smaller sample from it, you can use the `minimize_dataset.py` script.

```bash
python -m scripts.minimize_dataset --sqlite_path output/db.sqlite3 --minimized_sqlite_path output/db_mini.sqlite3 --size 1000
```
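The sampling itself can be expressed compactly in SQLite. The sketch below illustrates the idea; the `products` table name is an assumption, and the real script may also carry over related rows (reviews, news) for the sampled products.

```python
import sqlite3

# Illustrative sketch: copy a random sample of 1000 products into a new
# database file. ATTACH creates output/db_mini.sqlite3 if it does not exist.
conn = sqlite3.connect("output/db.sqlite3")
conn.execute("ATTACH DATABASE 'output/db_mini.sqlite3' AS mini")
conn.execute(
    "CREATE TABLE mini.products AS "
    "SELECT * FROM products ORDER BY RANDOM() LIMIT 1000"
)
conn.commit()
conn.close()
```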