Developer Setup

Damon McCullough edited this page Jan 14, 2025 · 6 revisions

Welcome to Data Engineering at DCP! This guide is intended to help you get set up to contribute to our codebase.

Code

This repository is our primary location for code, issues, and automated workflows.

  1. If you don't already have a GitHub account, create one and have a team member add you to the NYCPlanning organization. You can either link a personal account to the organization or make one specifically for DCP purposes (some of us on the team do each).

  2. Generate SSH keys and add your public key to your GitHub account.

  3. Clone the repo

  4. Create a .env file in the local data-engineering directory and add environment variables to it; they will be used when creating a Docker dev container (see docker). A few other variables are sometimes needed, but the basic ones required for most of our pipelines are

    BUILD_ENGINE
    AWS_S3_ENDPOINT
    AWS_SECRET_ACCESS_KEY
    AWS_ACCESS_KEY_ID
    

    Most of the relevant secrets can be found in 1Password.
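Step 4 above might look like this in a terminal; all the values below are placeholders (the real ones live in 1password), and the BUILD_ENGINE format shown is an assumption:

```shell
# Create a .env file in the data-engineering directory.
# Every value here is a placeholder; copy real values from 1password.
cat > .env <<'EOF'
BUILD_ENGINE=postgresql://user:password@host:5432/db
AWS_S3_ENDPOINT=https://example-endpoint
AWS_SECRET_ACCESS_KEY=replace-me
AWS_ACCESS_KEY_ID=replace-me
EOF
```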

Tools

VSCode

Most developers at DCP use VSCode.

Extensions you should definitely install:

  • Python
  • Dev Containers
  • Pylance
  • Docker

Other potentially useful ones:

  • Jupyter
  • GitLens
  • CodeRunner
  • Rainbow CSV
  • Data Wrangler
  • Power User for dbt

1Password

We store secrets, credentials, etc. in 1Password. Talk to a teammate to get set up.

Other Tools

For Spatial Analysis:

  • QGIS
  • Carto (request a login from a team member)

Tools that are less used or being phased out

  • R
  • Poetry (Python package manager)

Environment

This section describes the general workflow for running code (QA app, data pipeline, etc.) locally.

The simplest way to develop and run pipelines locally is using a dev container. This is a dockerized environment that VSCode can connect to. While it's an effective way to easily set up a complete, production-ready environment (and ensure that code runs the same locally as it does in the cloud), it's also often less performant than running directly on your machine outside of a container. For now, though, it's certainly still the best place to start (and generally we try to avoid running computationally expensive jobs on our own machines anyway).

All files needed for the container are stored in the data-engineering/.devcontainer/ directory:

  • Dockerfile describes how to build the initial image for our container. It largely sets variables that VSCode expects in order to run in the container properly.
  • docker-compose.yml describes how to set up our dev container. It also specifies that we need to build from the Dockerfile prior to initiating the container. We used to specify a postgres service as well, but have moved away from that in favor of using a lighter-weight container and connecting to our persisted cloud dbs even when running locally. Now this file mainly exposes a port for running streamlit from inside the container and makes sure volumes are properly mounted.
  • devcontainer.json is used specifically to create our dev container in VSCode; we don't need this file if we create the dev container from a terminal. It handles things like expected VSCode extensions while running in the container, commands that should be run before or after starting the container, etc.

There are (at least) 2 ways to spin up the container:

  • From VSCode:

    • Open VSCode

    • Open the cloned data-engineering directory

    • VSCode will detect an existing container config file. Click on "Reopen in Container".

    • VSCode may ask for the passphrase associated with your GitHub SSH key. If you don't remember the passphrase but saved it in the Keychain Access app upon creation, you can find it there.

    • If the container was started successfully, VSCode's window will show that it's connected to the dev container.

  • From terminal:

    • Navigate to the data-engineering/.devcontainer/ directory
    • Run docker-compose up. This command will use the existing docker-compose.yml file to set up the container.

In the docker-compose.yml file, we specify which directories (Docker refers to them as 'volumes') we need in the container. In our case, it's ..:/workspace:cache, which mounts our entire data-engineering directory.
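As an illustrative sketch only, the shape of a compose file like ours might look something like the following; the service name `de` is inferred from the docker exec command below, the streamlit port number is an assumption, and the real file in .devcontainer/ is the authoritative version:

```yaml
# Hypothetical sketch of .devcontainer/docker-compose.yml; see the repo for the real file.
services:
  de:                        # service name assumed from `docker exec -ti de bash`
    build: .                 # build the image from the Dockerfile in this directory
    ports:
      - "8501:8501"          # streamlit's default port; the actual mapping may differ
    volumes:
      - ..:/workspace:cache  # mount the whole data-engineering directory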

We can interact with git/GitHub as usual while working from a Docker container, e.g. pulling and pushing code.

Running docker compose up -d && docker exec -ti de bash will set up the container and open a terminal prompt in it.

Outside of a dev container, we can use python virtual environments. There are many ways to do this and one of our favorites is pyenv (repo, usage, tutorial).
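As a sketch, assuming pyenv and its pyenv-virtualenv plugin are installed, setting up a local environment for this repo might look like the following; the Python version and environment name are examples, not a documented requirement:

```shell
# Install a Python version and create a named virtualenv for this repo
pyenv install 3.11                        # pick whichever version the repo targets
pyenv virtualenv 3.11 data-engineering    # requires the pyenv-virtualenv plugin
cd data-engineering
pyenv local data-engineering              # writes .python-version so the env auto-activates here
```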

To install this repo's python packages and the dcpy package:

python3 -m pip install --requirement ./admin/run_environment/requirements.txt
python3 -m pip install --editable . --constraint ./admin/run_environment/constraints.txt
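After installing, a quick sanity check is to import the dcpy package; printing its file path (shown here) is just one way to confirm the editable install resolved:

```shell
# Confirm dcpy is importable and see where it was installed from
python3 -c "import dcpy; print(dcpy.__file__)"
```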