
Running the scripts


There are two main ways of using the code: you can scrape the data first and process it later, or you can scrape and process at the same time. Either way, there is a little setup to do first.

Dependencies

  • Python 3
  • the Python shapely module
  • PostGIS
  • a running osrm-backend server

Set up the database

You need a PostGIS-enabled PostgreSQL database set up. To create the necessary tables, run create-agency-tables.sql. You will likely need to change the projection on each of the tables: make sure it is meter-based and appropriate to your agency's locale. You may also set a prefix for the table names to distinguish multiple agencies.
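As a rough sketch, assuming a local PostgreSQL install (the database name here is a placeholder):

    # create a database and enable PostGIS
    createdb transit_data
    psql -d transit_data -c "CREATE EXTENSION postgis;"

    # create the tables used by the scripts, after editing the SRIDs
    # in the file to a meter-based projection suited to your area
    psql -d transit_data -f create-agency-tables.sql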

Set up OSRM

Have an OSRM server running somewhere with data loaded for your area of interest. OSRM is configured with Lua profiles, several of which ship with OSRM as defaults. We provide transit.lua, a modified profile which works well in cities with both streetcars and buses (tested on Toronto and San Francisco) or with buses only. You'll likely want to fiddle with it to get it optimized for your city.
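As a minimal sketch, assuming osrm-backend is installed locally and you have an OSM extract covering your region (file names are placeholders):

    # preprocess the extract with the transit profile
    osrm-extract -p transit.lua region.osm.pbf
    osrm-contract region.osrm

    # serve routing and map-matching requests (port 5000 by default)
    osrm-routed region.osrm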

Create the configuration file

sample_conf.py is a configuration file template. Rename it to conf.py and fill in the details of your setup. Be sure to use the same meter-based projection that you used in the PostGIS setup in the previous step, and fill in the names of the (prefixed) PostGIS tables if you changed them.

The agency field should contain the agency tag from the NextBus API.
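A quick sketch of that step, run from the repository root (copying rather than renaming, if you'd like to keep the template intact):

    cp sample_conf.py conf.py
    # then edit conf.py: the database connection details, the
    # meter-based projection (EPSG code), the table names, and
    # the NextBus agency tag (e.g. 'ttc' for Toronto)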

Running store.py

store.py is responsible for scraping the API and storing the data. If you would like to process trips into GTFS in real time, add the doMatching parameter, i.e. python3 path/to/store.py doMatching. If this is your first time running the script (in a while), you should also add the getRoutes parameter, which pulls down all the schedule information at the beginning. If this last parameter isn't specified, then each ending trip has a 0.1 probability of initiating a new request for that route's schedule information. In this way, we keep abreast of any changes to the schedule without unduly burdening the API.

There is one further option: running store.py with the truncateData flag. This eliminates the results of any prior trip processing from the database, while retaining recorded trips and schedule information.
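Putting those options together as a sketch (the path is a placeholder):

    # match trips in real time, fetching all schedule data up front
    python3 path/to/store.py doMatching getRoutes

    # separately: clear the results of earlier trip processing,
    # keeping recorded trips and schedules
    python3 path/to/store.py truncateData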

Trips are stored in the database as they end; until then, they are held in memory, so interrupting the script will lose some data. It is designed to run for days or weeks at a time without interruption, so it's best to run it somewhere its internet connection won't be disturbed in any way.
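For example, one way to keep it running unattended on a server (a sketch; a session or process manager such as tmux or systemd would also work):

    # run in the background, immune to hangups, logging to a file
    nohup python3 path/to/store.py doMatching >store.log 2>&1 &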

Running process.py

process.py is for processing trips which are already stored in the database. There are four modes to choose from:

  1. single: Do one trip at a time, identified by the trip_id. This is useful for debugging.
  2. all: Process all trips in the DB.
  3. route: Process all trips belonging to a given route_id.
  4. unfinished: Process all trips that have not been processed yet.

The last three modes can run in parallel and will ask you how many processes you would like to allocate. If OSRM is running locally, you can go up to the number of processor cores on your machine; if OSRM is running remotely, the bottleneck may be either local or on the remote server, so choose accordingly.
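As a rough sketch of a run (the path is a placeholder, and the prompts are paraphrased; the exact interaction may differ):

    python3 path/to/process.py
    # choose one of the four modes (single / all / route / unfinished),
    # then, for the parallel modes, enter the number of worker processes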

Debugging the Results

More documentation will be added here eventually.

You may find that some quirk in the OSM data yields weird map-matching results for a whole route. Fix the data problem, restart OSRM, and rerun just that one route to see if better matches are produced. Running process.py in route mode will tell you how many trips the selected route has and ask for a range of them to process.