This project is no longer maintained by Two Sigma. We continue to encourage independent development.
Accessing, processing, and documenting wildlife and environmental datasets for the National Wildlife Federation
The National Wildlife Federation (NWF) is a large non-profit dedicated to conservation and wildlife advocacy. NWF is working with another organization to create an interactive mapping tool that shows the intersection of potential carbon management projects with wildlife and environmental considerations in the state of Wyoming. Data Clinic is tasked with finding and processing these wildlife and environmental datasets and providing them to NWF. NWF has given us a spreadsheet outlining the desired datasets, which we have augmented with additional metadata.
The data pipeline we build will contain a few distinct steps, with each step depending on the previous. Roughly, these steps are:
- Access data and upload to s3
- The metadata spreadsheet contains links to APIs and hosted files matching the requested datasets. The code in download.py should iterate through the datasets with links and download each locally before uploading it to the nwf-dataclinic s3 bucket.
- Simple data processing
- The raw data on s3 will have different file formats, projections, and extents. We want to provide NWF with data that has been minimally processed to ensure compatibility. The code in process.py should traverse the raw datasets, apply these basic processing steps, and save the results to s3.
- Documenting processed data
- The final step is to create simple documentation for each dataset: one PDF file per processed dataset. These documents will contain information from the metadata, such as dataset description, licence, and years covered, as well as some additional information derived from the data itself, such as column names/types and number of rows. The code in document.py will iterate through the processed datasets and create the documentation for each.
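As a rough illustration of the first and last steps above, the sketch below iterates over metadata rows, skips rows without a link, and derives the column and row summary that the documentation step describes. It is not the code in download.py or document.py: the Dataset fields, the raw/ key layout, and the function names are all assumptions made for illustration, and the network and S3 calls are passed in as plain callables so the example runs without AWS credentials.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Dataset:
    """One row of the metadata spreadsheet (field names are hypothetical)."""
    name: str
    link: Optional[str]  # API endpoint or hosted file; may be missing


def raw_key(dataset: Dataset) -> str:
    """Build a hypothetical S3 key for the raw copy of a dataset."""
    filename = dataset.link.rstrip("/").split("/")[-1]
    return f"raw/{dataset.name}/{filename}"


def download_all(
    datasets: Iterable[Dataset],
    fetch: Callable[[str], bytes],
    upload: Callable[[str, bytes], None],
) -> list[str]:
    """Step 1: fetch every dataset that has a link and upload its bytes."""
    keys = []
    for dataset in datasets:
        if not dataset.link:
            continue  # nothing to download for this row
        key = raw_key(dataset)
        upload(key, fetch(dataset.link))
        keys.append(key)
    return keys


def summarise(rows: list[dict]) -> dict:
    """Step 3 helper: column names/types and row count for one dataset."""
    columns: dict[str, str] = {}
    for row in rows:
        for column, value in row.items():
            # Record the type of the first value seen in each column.
            columns.setdefault(column, type(value).__name__)
    return {"n_rows": len(rows), "columns": columns}


if __name__ == "__main__":
    store: dict[str, bytes] = {}  # in-memory stand-in for the S3 bucket
    datasets = [
        Dataset("elk_range", "https://example.org/data/elk_range.zip"),
        Dataset("no_link_yet", None),
    ]
    keys = download_all(datasets, fetch=lambda url: b"fake bytes", upload=store.__setitem__)
    print(keys)
    print(summarise([{"id": 1, "species": "elk"}, {"id": 2, "species": "mule deer"}]))
```

In a real run, fetch and upload would presumably wrap an HTTP client and an S3 client, and the summary would be computed from the processed geospatial files rather than from plain dictionaries.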
These steps are composed in run.py, which also exports the full contents of the repository to a specified local folder. You can run the full pipeline by executing
poetry run python3 src/run.py --s3bucket <your_bucket>
(or by running the script from the environment of your choice). Flags exist to export the data locally, skip the s3 upload, or overwrite data which has already been processed. To see the full list, run
poetry run python3 src/run.py --help
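The way a script like run.py might wire its flags can be sketched with argparse. Only --s3bucket and --help come from this README; the other flag names (--export-dir, --skip-upload, --overwrite) are hypothetical stand-ins for the options described above, so check --help for the real names. The step functions are placeholders as well.

```python
import argparse
from typing import Callable, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Run the download, process, and document steps in order."
    )
    parser.add_argument("--s3bucket", help="S3 bucket to use (flag named in the README)")
    # The three flags below are hypothetical names, not taken from run.py.
    parser.add_argument("--export-dir", default=None, help="local folder to export data to")
    parser.add_argument("--skip-upload", action="store_true", help="do not upload to S3")
    parser.add_argument("--overwrite", action="store_true", help="reprocess existing data")
    return parser


def run_pipeline(
    args: argparse.Namespace,
    steps: Sequence[Callable[[argparse.Namespace], None]],
) -> None:
    """Run each step in order; every step receives the same parsed options."""
    for step in steps:
        step(args)


if __name__ == "__main__":
    # An explicit argument list keeps the demo independent of the real command line.
    args = build_parser().parse_args(["--s3bucket", "example-bucket", "--overwrite"])
    run_pipeline(args, [lambda a: print("would run with", vars(a))])
```

Passing the same parsed namespace to every step is one simple way to let a single flag, such as an overwrite switch, affect all three stages consistently.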
This project uses Poetry to provide an easy way to manage dependencies. You can set it up by following these steps:
- Ensure you have a python 3.9 or higher installation on your external machine
- Install poetry following the instructions here
- From the root project directory, install the dependencies with
poetry install
- Ensure the environment has been installed by running
poetry shell
You should see something like (nwf-process-geodata-py3.9) in your terminal.
We encourage people to follow the git feature branch workflow which you can read more about here: How to use git as a Data Scientist
For each feature you are adding to the code:
- Switch to the main branch and pull the most recent changes
git checkout main
git pull
- Make a new branch for your addition
git checkout -b cleaning_script
- Write your awesome code.
- Once it's done, add it to git
git status
git add {files that have changed}
git commit -m {some descriptive commit message}
- Push the branch to GitHub
git push -u origin cleaning_script
- Go to GitHub and create a pull request.
- Either merge the branch yourself if you're confident it's good, or request that someone else review the changes and merge it in.
- Repeat
- ...
- Profit.
Project based on the cookiecutter data science project template. #cookiecutterdatascience