The main goal of this project is to deliver the tools and practice required to pass the dbt Analytics Engineering Exam.
- dbt Certification Project
- Scenario
- 1. Development Setup
- 2. Clone the dbt Certification Project Repository
- Docker Version (Optional)
- 3. Setup Your Python Environment
- 4. Activate Your Virtual Environment
- 5. Install Python packages required
- 6. Setup dbt Project
- 7. Install dbt Power Extension for VSCode
- 8. How to use Git on this Project
- 9. Project Challenges
- 10. Additional Content
- You have been hired by a new company that wants to publish the next big hit of boardgame and needs your help to identify the factors to create this new boardgame, like:
- Countries with highest demand (based on number of reviews and positive reviews, what are the kind of games that users rank the best).
- Define metrics to calculate the average rating for each dimension (designers, categories, artists, mechanics)
To do that the company relies on your expertise to perform insights on the datasets available and build recommendations from it.
Components:
- Snowflake L&D Sandbox
- Sign in using OKTA SSO
- GitHub Project Repository
- Data Sources:
The following steps will help you to configure your mac for Python development, automation, and command line use.
You can also check this recorded session that goes through project setup step-by-step. (Passcode: 7%#DU9#b)
You should install some command line tools, and configure your terminal and editor:
- Setup Hombebrew
- Setup your Mac terminal and Zsh
- Setup your Mac OS with Visual Studio Code
-
Confirm
git
is installed:git --version
-
Sign up on GitHub using your Hakkoda email
-
Sign in to GitHub using your Hakkoda account
-
Click on the top left button (alongside the GitHub logo) to expand the left panel and press the Single sign-on button like shown below
-
Finally you have the main page with top panel with the links to access repositories, projects, teams and so on.
git config --global user.name "[github_username]"
git config --global user.email "[github_email]"
git config --global init.defaultBranch main
git config --global --add push.default current
git config --global push.autoSetupRemote true
git config --global pull.rebase false
Note: make sure to replace [github_username] and [github_email] in the path above with your own
To confirm your git settings run the following command:
git config --global --list
Like all Hakkoda repos, the dbt Certification Project repo can only be access locally using SSH keys. If you haven't already setup SSH follow these steps or skip straight to the git clone
command below.
Open a Terminal. If you haven't already, setup your terminal. Then create an SSH key. Ensure that you save your key passphrase in Bitwarden.
# Generate the SSH key
# NOTE:
# * Update your email address before running the keygen command
ssh-keygen -t ed25519 -C "[email protected]"
Check the example below on how to adjust the above command to your e-mail account.
When asked to enter file in which to save the key just hit Enter on your keyboard to accept to save on the suggested file. If it already exists, it will ask if you want to overwrite.
When asked to Enter passphrase you can use Bitwarden to generate the passphrase and save as something like LaptopSSHKey
. You will need to provide this passphrase every time you need to unlock your SSH key
# Add your SSH key to the ssh-agent
eval "$(ssh-agent -s)"
# Update your SSH config to automatically load keys into the ssh-agent and
# store passphrases in your keychain.
cat <<- EOF >> ~/.ssh/config
Host github.com
AddKeysToAgent yes
UseKeychain yes
IdentityFile ~/.ssh/id_ed25519
EOF
# For MacOS, you can also store your passphrase in the Apple keychain. You may
# need to provide your mac password to access your keychain.
ssh-add --apple-use-keychain ~/.ssh/id_ed25519
Follow instructions to add your public key to your GitHub account.
# Copy public key to clipboard
pbcopy < ~/.ssh/id_ed25519.pub
Go to the GitHub Keys page and click the New SSH key
button. Then:
- Add a description title:
LaptopSSHKey
- Paste the key: cmd + v
If requested, enter your GitHub password to save.
Then to enable SSO for your key, click Configure SSO
and then Authorize
. Click continue
to enable SSO. Go through Okta authentication steps. Click continue
to complete.
When finalized it should look like this.
Before cloning your repo, you should open your terminal to a development folder to store our project code in.
For example:
# Ensure that you are in your "home" directory
cd ~
# Create a new development directory, if it doesn't already exist
mkdir Develop
# Change to the development directory
cd Develop
# Clone project repository (NOTE: uses SSH key)
git clone [email protected]:Hakkoda1/dbt-certification-project.git
There is a docker version available on this repository that you can use instead of setting up a python environment on your local machine.
Check this video tutorial on how to setup the docker-compose.
To install docker desktop on Mac you can use this url.
Note: If you choose this path, you can jump directly to chapter 6 of the tutorial.
Setup Python for dbt use with Snowflake. Because dbt-core is currently compatible with Python versions from 3.8 - 3.11, we have setup the project to use Python 3.10, but feel free to install any version that is supported by dbt (source).
brew install [email protected] virtualenv
Create your Python virtual environment. (If you have used another Python version, please change the code below accordingly)
# Change to dbt-certification-project repository root folder
cd dbt-certification-project/
# Setup a project python virtual environement
virtualenv .dbt-env --python=python3.10
Every time you open a new terminal, you will need to source your Project Python environment:
# Source your project environment
source .dbt-env/bin/activate
Run the following command to install the required Python packages for the project:
pip install -r requirements.txt
When you clone the repository the dbt project was already initialized, so you can skip the command dbt init
.
After that, you need to use the sample-profiles.yml
template file to create your own profiles.yml
by running the following command inside dbt folder:
# Change to dbt project folder
cd dbt/
# Rename the file sample-profiles.yml to profiles.yml
cp sample-profiles.yml profiles.yml
dbt Power User extension is a great tool to bring a similar experience to dbt Cloud version using dbt-core on VSCode.
Below are listed the features that this extension provides:
- Setting up the dbt Power User extension
- Column Level Lineage
- dbt query data results preview with file export and analysis
- Generate dbt models from source files or SQL queries
- Generate model and column descriptions
- Click to execute parent/child models and tests
- Query explanations for complex dbt models
Follow the video below on a step-by-step tutorial on how to install.
Below there are a couple of additional videos with useful information on how to setup and take advantage of the features from dbt Power User extension.
Note: Those are from the dbt Zero to Hero Udemy course that is part of the learning path for the dbt Analytics Engineering Certification Exam.
- How to get an API Key for the Advanced Features
- Use AI to Generate Documenation
- Generate dbt Model from Source Definition or SQL
- Working with Column-Level Lineage
- Find Problems in your dbt Project with Health Check
- Use AI to Interpret Queries via Query Explanations
Git is an integral part of version control and CI/CD. It makes contributing to a project with multiple contributors extremely easy. Version control is essential and allows any project to move to any snapshot of time in the development history.
dbt Certification Project uses the Simple Flow branching convention with a few types of branches:
-
Short running branches (features): temporary dedicated branches for work-in-progress features
-
start from
develop
, to develop changes that are then integrated (merged) back todevelop
-
enables parallel development with minimal conflicts with other features
-
For example, to create a new feature branch we need to ensure we are basing it off of the most recent
develop
branch and then creating the new branch:git checkout develop git pull # In this scenario we should use the [feature_first-initial-last-name] structure to identify each member developing branch (ex: feature_fbalseiro) git checkout -b feature_fbalseiro
-
-
Long running branches (protected): branches that require pull requests to merge in changes
develop
: changes staged for the next production release, should be merged to main when a release version is ready. This is is part of our CI/CD process.main
: current version in production, each merge on this branch should be tag and deployed.
graph BT;
feature_1 --> develop;
feature_2 --> develop;
feature_x --> develop;
develop --> main;
As you're going through the project and working on your branch ex: feature_fbalseiro
there can be new updates on main
branch due to new features or challenges that have been added to the repository.
In that scenario, if you want to to get access to that on your local machine, you can perform the following steps:
- Check first in which branch are you working with the command
git branch
and the branch should be identified with a*
before the name, like this:
❯ git branch
add-pr-template
add-snapshot
develop
* feature_fbalseiro
github-cronjob
main
- Check if you have all your files staged and pushed to the remote repository by running the command
git status
2.1. If you get an output like the following you should first rungit push
to push your development to the remote repository branch:
❯ git status
On branch develop
Your branch is ahead of 'origin/develop' by 1 commit.
(use "git push" to publish your local commits)
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
modified: ../README.md
2.2. If you get an output that your branch is up to date and nothing to commit, then you can move to the next step
-
Checkout of your branch to the main branch by running the command
git checkout main
to switch to the main branch -
Finally run
git pull
to pull the latest version of main branch on remote repository to your local machine -
If you want to go back again to your working branch just run
git checkout feature_fbalseiro
(adjust to your branch name)
- Customize Profile Directory
- Setup Environment Variables
- Add Sources
- Testing and Documenting Sources
- Add Snapshot
- Add Staging Models
- Add Source Freshness
- Add Singular Test
- Add Generic Test
- Configure Test Severity
- Update Staging Models
- Customize Data Test Name
- Add Intermediate Models
- Add Variables
- Add Tags
- Add Marts Core Models
- Generate Documentation
- Add Macros
- dbt run-operation
- Hooks
- Understanding DAGs
- Add dbt Expectations
- Add CI Workflow
- Add Schedule Daily Run Workflow
- Add Deploy to Production (Post-Merge) Workflow
- dbt Project Checklist
- dbt Sytle Guide
- GitHub Workflows for GitHub Actions:
- Continuous Integration: Runs on Pull Requests to main branch
- Daily dbt job: Runs daily on a schedule using cron to build and test source freshness, models, snapshots and seeds
- Upload to Snowflake: Runs daily on a schedule using cron to execute a python script that loads a
.csv
file to a stage on Snowflake
- Pull Request Template: Template that is applied each time a Pull Request is triggered
- Snowflake Scripts:
- Snowflake Setup
- Stored Procedure on Snowflake to load data from internal stage to table Rankings
- Task to automate Stored Procedure to load data from stage