A web scraper that reads information from the public UW-Madison course guide and saves it to a SQLite database. It uses Scrapy, an open-source web scraping framework written in Python. To learn more about it, check out the Scrapy documentation.
Before you start, you'll need to install a few system libraries:
- Python 2.7
- libxml2 / libxslt
- libffi
- OpenSSL
- SQLite 3
You'll also need a C compiler. The Scrapy installation notes will help you get started.
Then, clone the repo and install the project's dependencies with pip:
pip install -r requirements.txt
Currently the only way to run the crawler is with the scrapy command-line tool. It must be run inside the UWMadCrawler directory, like so:
cd UWMadCrawler
scrapy crawl UWMad
This will crawl the course guide and print out the courses it finds.
It will create a SQLite database inside this directory named classes.db. Courses and their sections are saved into different tables, along with (almost) all of the information available about them in the course guide. Sections are related to courses by sharing the same department, course number, and course title.
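For example, once the crawler has run you can reconstruct the course/section relationship with an ordinary SQL join. The table and column names below (courses, sections, department, course_number, course_title, section_number) are assumptions based on the description above rather than the project's actual schema, so treat this as a sketch and check the real layout with `.schema` in the sqlite3 shell first:

    import sqlite3

    # Open the database produced by the crawler (run this from the UWMadCrawler directory).
    conn = sqlite3.connect("classes.db")
    cursor = conn.cursor()

    # Join sections back to their parent courses on the shared identifying fields.
    # NOTE: table and column names here are illustrative assumptions.
    cursor.execute("""
        SELECT c.department, c.course_number, c.course_title, s.section_number
        FROM sections AS s
        JOIN courses AS c
          ON s.department    = c.department
         AND s.course_number = c.course_number
         AND s.course_title  = c.course_title
        LIMIT 10
    """)

    for row in cursor.fetchall():
        print(row)

    conn.close()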
Courses are not expected to change very often, so the scraper will avoid re-adding them if they are already in the database. However, sections are saved along with the time of last modification found on the course guide itself. This means that you can scrape multiple times and (hopefully!) have the right thing happen.
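The idea behind that behavior is roughly the sketch below. This is not the project's actual pipeline code; the table layout and the last_modified column are assumed purely for illustration:

    import sqlite3

    def save_course(conn, department, number, title):
        # Courses rarely change, so skip any course that is already in the table.
        exists = conn.execute(
            "SELECT 1 FROM courses "
            "WHERE department = ? AND course_number = ? AND course_title = ?",
            (department, number, title),
        ).fetchone()
        if exists is None:
            conn.execute(
                "INSERT INTO courses (department, course_number, course_title) "
                "VALUES (?, ?, ?)",
                (department, number, title),
            )
            conn.commit()

    def save_section(conn, department, number, title, section_number, last_modified):
        # Sections keep the course guide's own last-modified timestamp, so a
        # repeated scrape refreshes the row instead of duplicating it.
        conn.execute(
            "DELETE FROM sections WHERE department = ? AND course_number = ? "
            "AND course_title = ? AND section_number = ?",
            (department, number, title, section_number),
        )
        conn.execute(
            "INSERT INTO sections "
            "(department, course_number, course_title, section_number, last_modified) "
            "VALUES (?, ?, ?, ?, ?)",
            (department, number, title, section_number, last_modified),
        )
        conn.commit()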
This program is released under the MIT license. Copyright (c) 2014 The Badger Herald, Inc.
This program is a fork of the UWMadCrawler by Joe Kelley, originally released under the MIT license (see NOTICE).