Developer Setup
Welcome to Data Engineering at DCP! This guide is intended to help you get set up to contribute to our codebase.
This repository is our primary location for code, issues, and automated workflows.
- If you don't already have a GitHub account, create one, and have a team member add you to the NYCPlanning organization. You can either use a personal account to link to the organization or make one for DCP purposes (some of us on the team do each).
- Generate SSH keys and add your key to your GitHub account (see the SSH sketch after this list).
- Create a `.env` file in the local `data-engineering` directory. Add environment variables to the `.env` file; they will be used when creating a Docker `dev` container (see docker). A few others are included, but the basic ones needed for most of our pipelines are `BUILD_ENGINE`, `AWS_S3_ENDPOINT`, `AWS_SECRET_ACCESS_KEY`, and `AWS_ACCESS_KEY_ID` (see the example `.env` after this list). Most of the relevant secrets can be found in 1Password.
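For the SSH key step, here is a hedged sketch for macOS (the key type, filename, and flags are typical defaults, not team requirements; follow GitHub's current docs if in doubt):

```bash
# Generate a new key (ed25519 is GitHub's current recommendation)
ssh-keygen -t ed25519 -C "your_email@example.com"

# Start the agent and add the key, storing the passphrase in the macOS keychain
eval "$(ssh-agent -s)"
ssh-add --apple-use-keychain ~/.ssh/id_ed25519

# Then copy the contents of ~/.ssh/id_ed25519.pub into GitHub -> Settings -> SSH and GPG keys
```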
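For the `.env` step, a minimal sketch (all values below are placeholders; real values come from 1Password, and `BUILD_ENGINE` is assumed here to be a Postgres connection string):

```sh
# .env — placeholder values only; copy real secrets from 1Password
BUILD_ENGINE=postgresql://username:password@hostname:5432/dbname
AWS_S3_ENDPOINT=https://s3.example.com
AWS_ACCESS_KEY_ID=replace-me
AWS_SECRET_ACCESS_KEY=replace-me
```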
Most developers at DCP use VSCode.
Definite extensions to install (see the CLI sketch after these lists):
- Python
- Dev Containers
- Pylance
- Docker
Other potentially useful ones:
- Jupyter
- GitLens
- CodeRunner
- Rainbow CSV
- Data Wrangler
- Power User for dbt
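If you prefer the command line, the "definite" extensions can be installed with VSCode's `code` CLI (the IDs below are the Marketplace IDs for the extensions named above):

```bash
code --install-extension ms-python.python                     # Python
code --install-extension ms-vscode-remote.remote-containers   # Dev Containers
code --install-extension ms-python.vscode-pylance             # Pylance
code --install-extension ms-azuretools.vscode-docker          # Docker
```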
We store secrets, credentials, etc. in 1Password. Talk to a teammate to get set up.
- Homebrew
- IPython
  - If running notebooks in VSCode, extensions can take care of install/setup
- Docker
- Postgres or Postgres.app
- DBeaver
- Cyberduck
For Spatial Analysis:
- R
- Poetry (Python package manager)
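Many of the tools above can be installed with Homebrew; a sketch (the formula and cask names are assumptions — double-check them before running):

```bash
# GUI tools as casks, CLI tools as formulae; exact names may differ
brew install --cask docker dbeaver-community cyberduck
brew install ipython poetry pyenv r
# Postgres.app is usually downloaded directly from postgresapp.com
```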
This section describes the general workflow for running code (QA app, data pipeline, etc.) locally.
The simplest way to develop and run pipelines locally is using a dev container. This is a dockerized environment that VSCode can connect to. While it's an effective way to simply set up a complete, production-ready environment (and ensure that code runs the same locally as it does on the cloud), it's also often less performant than running locally outside of a container. For now though, it's certainly still the best place to start (and generally we try to avoid running computationally expensive jobs on our own machines anyway).
All files needed for the container are stored in the `data-engineering/.devcontainer/` directory:
- `Dockerfile` describes how to build the "initial" image for our container. It largely sets variables that VSCode expects in order to run in the container properly.
- `docker-compose.yml` describes how to set up our `dev` container. It also specifies that we need to build from the `Dockerfile` prior to initiating the container. We used to specify a postgres service as well, but have moved away from that in favor of using a lighter-weight container and connecting to our persisted cloud dbs even when running locally. Now this mainly exposes a port for running streamlit from inside the container and makes sure volumes are properly mounted.
- `devcontainer.json` is specifically used to create our `dev` container in VSCode. We don't need this file if we create the `dev` container from a terminal. It handles things like expected extensions for VSCode while running in the container, commands that should be run before or after starting the container, etc.
There are (at least) two ways to spin up the container:
- From VSCode:
  - Open VSCode.
  - Open the cloned `data-engineering` directory.
  - VSCode will detect an existing container config file; click on "Reopen in Container".
  - VSCode may ask for the passphrase associated with your GitHub SSH key. If you don't remember the passphrase but saved it in the Keychain Access app upon creation, you can find it there.
  - If the container was started successfully, VSCode will show that it is connected to the dev container.
- From the terminal:
  - Navigate to the `data-engineering/.devcontainer/` directory.
  - Run `docker-compose up`. This command will use the existing `.yml` file to set up the container (see the snippet below).
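Put together, the terminal route looks like this:

```bash
cd data-engineering/.devcontainer
docker-compose up
```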
In the `docker-compose.yml` file, we specify which directories (Docker refers to them as "volumes") we need in the container. In our case, it's `..:/workspace:cache`, which is our entire `data-engineering` directory.
We can interact with git/GitHub as usual while working from a Docker container: i.e. pulling and pushing code.
Running `docker compose up -d && docker exec -ti de bash` will set up the container and open a terminal prompt in it.
Outside of a dev container, we can use Python virtual environments. There are many ways to do this; one of our favorites is `pyenv` (repo, usage, tutorial).
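A minimal sketch of a pyenv-managed environment for this repo (the Python version and environment name are illustrative assumptions, and `pyenv virtualenv` requires the pyenv-virtualenv plugin):

```bash
# Assumes pyenv and the pyenv-virtualenv plugin are installed (e.g. via Homebrew).
# The Python version and environment name below are illustrative.
pyenv install 3.11.9
pyenv virtualenv 3.11.9 data-engineering
pyenv local data-engineering   # writes .python-version so the env activates in this directory
```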
To install this repo's python packages and the `dcpy` package:
```bash
python3 -m pip install --requirement ./admin/run_environment/requirements.txt
python3 -m pip install --editable . --constraint ./admin/run_environment/constraints.txt
```