An API to provide predictions of bike availability for Washington D.C.'s Capital Bikeshare.
This API uses Capital Bikeshare data and weather data to predict the chances a bikeshare station will have at least one bike available and at least one empty bike dock available, given a date and time. For each bikeshare station, models are built using random forests based on historical station status data and historical weather data. The API then accepts a datetime string and a location string (such as a street address or neighborhood), and returns the closest stations and the availability predictions.
This project uses Python 3. Data analysis and machine learning were done using Pandas and scikit-learn. The API was built using Falcon.
I built this project during the summer of 2016 while I was an intern at Mission Data. I greatly appreciate their permission to open source this project and post it here on GitHub, as well as their help and guidance on this project.
Mission Data is currently hosting this API (not currently for public use), with some minor enhancements, as well as a web application where you can get predictions. The original version of the web application was written by me in Angular 2 with TypeScript, but significant UI and style improvements have since been made by others at Mission Data.
A series of three lab notes has been published detailing the progression of this project. The first two explain the machine learning used to make the predictions, and the third discusses a conversational Slack bot (which I did not have a part in building) that uses the API. I also gave a brief presentation about the project at the sixth CaBi Hack Night meetup in November 2016; slides are available here.
This project was my first time using Python and doing any machine learning. Given that, I'm pretty happy with how it turned out. As with any complex project (but perhaps this one in particular due to my inexperience and limited timeframe), there are many potential improvements and expansions. This code is licensed under the open source MIT License, and you are more than welcome to build on it or adapt it for your own uses.
The API accepts a POST request and expects a JSON body of the following structure:

```
{
    "datetime": string (date and time),
    "location": string[,
    "station_count": integer]
}
```
The `datetime` string is passed to the Pandas `to_datetime` method, so any reasonable format (including Unix timestamps) should be fine.
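For instance, `to_datetime` is flexible about input formats; a quick illustration of the parsing itself (not the API's internal code):

```python
import pandas as pd

# Both of these parse to the same point in time:
pd.to_datetime("November 08, 2016 5:00 pm")  # Timestamp('2016-11-08 17:00:00')
pd.to_datetime("2016-11-08 17:00")           # Timestamp('2016-11-08 17:00:00')
```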
The `location` string is passed to the Google Maps API to get coordinates, so specific addresses or general neighborhoods (anything that works with Google Maps, actually) will work.
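As a rough sketch of the kind of lookup involved, the Geocoding web service can be called directly as below; the project's own code may use a different client, and `GMAPS_KEY` is the API key described in the Setup section.

```python
import os
import requests

# Sketch only: resolve a location string to coordinates with the Google Maps Geocoding API.
resp = requests.get(
    "https://maps.googleapis.com/maps/api/geocode/json",
    params={"address": "Capitol Hill, Washington, DC", "key": os.environ["GMAPS_KEY"]},
)
result = resp.json()["results"][0]
lat = result["geometry"]["location"]["lat"]
lng = result["geometry"]["location"]["lng"]
print(result["formatted_address"], lat, lng)
```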
An optional `station_count` parameter can be passed to specify the number of closest stations that should be returned. The default is five.
For example:
```
{
    "datetime": "November 08, 2016 5:00 pm",
    "location": "Capitol Hill",
    "station_count": 5
}
```
The response may look like this (some stations removed for brevity):
```
{
    "address": "Capitol Hill, Washington, DC, USA",
    "date": "Tuesday November 08, 2016",
    "time": "05:00 PM",
    "forecast": {
        "tempC": 17,
        "tempF": 62,
        "condition": "Partly Cloudy"
    },
    "status": "success",
    "stations": [
        {
            "name": "Eastern Market \/ 7th & North Carolina Ave SE",
            "distance": 256.5717285669,
            "prediction": {
                "empty": 0.20066666666667,
                "full": 0
            }
        },
        ...
    ]
}
```
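A request can be made with Python's `requests` library along these lines. The URL is a placeholder for wherever the API is deployed, and HTTP basic auth with the `API_USER`/`API_PASS` credentials from the Setup section is an assumption; check the API code for the actual authentication scheme.

```python
import requests

response = requests.post(
    "https://example.com/predict",  # placeholder URL for your deployment
    json={
        "datetime": "November 08, 2016 5:00 pm",
        "location": "Capitol Hill",
        "station_count": 5,
    },
    auth=("your_api_user", "your_api_pass"),  # assumed HTTP basic auth
)
for station in response.json()["stations"]:
    print(station["name"], station["prediction"])
```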
For each bikeshare station, models are built using random forests based on historical station status data and historical weather data. Specifically, the features used are:
- Seven binary features, one for each day of the week
- Minute of the day
- Month
- Temperature
- Hourly Precipitation
Values are roughly scaled to be between 0 and 1, inclusive. (The one exception is temperature, which is scaled to be between -1 and 1.)
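As a rough sketch of what the per-station feature construction and model fitting might look like (the scaling constants, target definition, and model parameters below are illustrative assumptions, not the repository's exact values):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def make_features(ts, temp_c, hourly_precip_in):
    """Illustrative feature vector; the exact scaling in the repo may differ."""
    ts = pd.to_datetime(ts)
    day_flags = [1 if ts.dayofweek == d else 0 for d in range(7)]  # seven binary day-of-week features
    minute_of_day = (ts.hour * 60 + ts.minute) / 1440.0            # roughly in [0, 1]
    month = ts.month / 12.0                                        # roughly in (0, 1]
    temperature = temp_c / 40.0                                    # roughly in [-1, 1] (assumed divisor)
    precipitation = min(hourly_precip_in, 1.0)                     # roughly in [0, 1] (assumed cap)
    return day_flags + [minute_of_day, month, temperature, precipitation]

# One model per station; here y marks historical times when the station had no bikes available.
X = [make_features("2016-11-01 08:00", 10, 0.0),
     make_features("2016-11-01 17:30", 14, 0.1)]  # many more rows in practice
y = [0, 1]
model = RandomForestClassifier(n_estimators=100).fit(X, y)
model.predict_proba([make_features("2016-11-08 17:00", 17, 0.0)])
```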
All dates in the database are stored in US/Eastern time. The included data import scripts convert dates and times to US/Eastern (if needed) before storing the data.
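For example, with pandas, one way to convert a UTC timestamp (as in the NOAA data described below) to US/Eastern is:

```python
import pandas as pd

ts_utc = pd.to_datetime("2016-11-08 22:00").tz_localize("UTC")
ts_eastern = ts_utc.tz_convert("US/Eastern")  # Timestamp('2016-11-08 17:00:00-0500', tz='US/Eastern')
```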
Scripts are provided to store current bikeshare and weather data for future use in building models. See the Setup section of this file for setup instructions.
The data sources are the official Capital Bikeshare station status feed and the Weather Underground API.
Data Source: Capital Bikeshare Tracker
- Times are in US/Eastern.
- Data can be exported into CSV files.
Import with `tools/import_data.py`.
Data Source: NOAA's ISD Lite Datasets
- Times are in UTC.
Import with `tools/import_weather_data.py`.
- `bike_count`: Bike station data in the format of counts of bikes and empty spaces. This is the format saved by the `cabi_recorder` script.
- `outage`: Bike station data in the format of outages (instances of empty or full stations). This is the format imported by the `import_data` tool.
- `weather`: Weather data saved by the `weather_recorder` script.
- `weather_isd`: Weather data imported by the `import_weather_data` tool.
- `forecast`: Upcoming hourly forecasts.
- `station_info`: Stores metadata about stations.
Python 3 must be installed.
A Postgres database is required. The environment variable `CABI_DB` should be set to a connection string of the format `username:password@host:port/db_name`.
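As an illustration, application code could build a connection from this variable along these lines (a sketch assuming psycopg2; the project may construct its connection differently):

```python
import os
import psycopg2

# CABI_DB has the form username:password@host:port/db_name
conn = psycopg2.connect("postgresql://" + os.environ["CABI_DB"])
```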
The environment variable `WUNDERGROUND_KEY` must be set to a valid API key for the Weather Underground API.
The environment variable `GMAPS_KEY` must be set to a valid API key for Google Maps Web Services.
The environment variables `API_USER` and `API_PASS` must be set to the username and password that are required to access the prediction API.
Dependencies can be installed with pip:
pip install -r requirements.txt
The API is built using Falcon, and can be run on any WSGI server.
Review the start of the `passenger_wsgi.py` file and modify it to fit your setup. `INTERP` allows the correct Python executable to be found if you are using virtualenv, and must be set correctly (or removed along with the corresponding code if not needed).
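The `INTERP` block at the top of such a file typically looks something like the following; the path is a placeholder for your own virtualenv's interpreter.

```python
import sys
import os

# Re-execute under the virtualenv's interpreter if we are not already running it.
INTERP = "/home/username/.virtualenvs/cabi/bin/python3"  # placeholder path
if sys.executable != INTERP:
    os.execl(INTERP, INTERP, *sys.argv)
```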
The API is set up to run easily on Passenger using `passenger_wsgi.py`.
Alternatively, to run on Gunicorn:
pip install gunicorn
gunicorn passenger_wsgi:app
A number of scripts should be run regularly to keep the data and models updated. Shell scripts are provided in the `tools` folder to run the proper Python code and keep logs in the `log` folder. These scripts should be kept in the `tools` folder, otherwise the paths of the files they refer to will have to be changed.
The included scripts are:
- `cabi_recorder.sh`: Stores current bikeshare station data, for future use in building models.
- `weather_recorder.sh`: Stores current weather data, for future use in building models.
- `get_station_info.sh`: Stores the current list of bikeshare stations.
- `build_models.sh`: Builds models for all bikeshare stations.
- `get_forecast.sh`: Stores upcoming hourly forecasts.
Technically, the only script that must be run regularly is `get_forecast.sh`. Hourly weather forecasts are only available for ten days, so the API will only be able to make predictions for ten days from the last time the script was run.

`get_station_info.sh` must be run at least once, but should be run more often, as Capital Bikeshare adds and removes stations somewhat often.

Running `cabi_recorder.sh` and `weather_recorder.sh` regularly, and then running `build_models.sh` occasionally, will allow new models to be created automatically that include recent data. Alternatively, you can just import historical data once with `import_data.py` and `import_weather_data.py`, and then run `build_models.sh` once. However, your models will likely get less accurate as time progresses.

All scripts are quick except for `build_models.sh`, which may take several hours to run. We recommend running this weekly or every other week.
Running the scripts with cron can be done as follows:
```
# Feel free to change the frequencies as needed.
@weekly /full/path/CapitalBikeshareML/tools/build_models.sh
*/5 * * * * /full/path/CapitalBikeshareML/tools/cabi_recorder.sh
*/20 * * * * /full/path/CapitalBikeshareML/tools/get_forecast.sh
@daily /full/path/CapitalBikeshareML/tools/get_station_info.sh
*/30 * * * * /full/path/CapitalBikeshareML/tools/weather_recorder.sh
```
cron must be set up so the scripts can find the `python` command and can access the `CABI_DB` environment variable. `weather_recorder.sh` also needs to access the `WUNDERGROUND_KEY` variable. The environment variables (and the PATH, if needed) can be set in the crontab with some versions of cron. Other options can be found here.
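For example, with cron implementations that allow variable assignments in the crontab, the top of the crontab might look like this (values are placeholders):

```
# Placeholder values; substitute your real connection string and API key.
CABI_DB=username:password@host:port/db_name
WUNDERGROUND_KEY=your_wunderground_key
PATH=/usr/local/bin:/usr/bin:/bin
```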