Add Github workflow to populate the persistent source schema #715

Merged (1 commit) on Sep 19, 2023
110 changes: 110 additions & 0 deletions .github/workflows/cd-sql-engine-populate-persistent-source-schema.yaml
@@ -0,0 +1,110 @@
# See [Persistent Source Schema](/GLOSSARY.md#persistent-source-schema)
# Populating the source schema via this workflow ensures that it's done with the same settings as the tests.

name: Reload Test Data in SQL Engines

# We don't want multiple workflows trying to create the same table.
Contributor:
Yeah I guess if this ever happens in different schemas the last one in will be wrong anyway.

concurrency:
  group: POPULATE_PERSISTENT_SOURCE_SCHEMA
  cancel-in-progress: true
Comment on lines +8 to +9
Contributor:
I am very curious to see how this works.
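
One way to picture the `concurrency` setting is the toy Python model below. It is not GitHub's implementation; it only assumes the documented behaviour that, with `cancel-in-progress: true`, starting a run in a group cancels the run already in progress in that group. The class name and run IDs are hypothetical.

```python
from typing import Dict, List, Optional


class ConcurrencyGroups:
    """Toy model of a workflow `concurrency` group with cancel-in-progress enabled."""

    def __init__(self) -> None:
        self.in_progress: Dict[str, str] = {}  # group name -> id of the running workflow
        self.cancelled: List[str] = []  # ids of runs that were cancelled

    def start(self, group: str, run_id: str) -> Optional[str]:
        """Start a run; return the id of the run it displaced, if any."""
        previous = self.in_progress.get(group)
        if previous is not None:
            # A newer run in the same group cancels the one in progress.
            self.cancelled.append(previous)
        self.in_progress[group] = run_id
        return previous

    def finish(self, group: str, run_id: str) -> None:
        """Mark a run as finished so it no longer occupies the group."""
        if self.in_progress.get(group) == run_id:
            del self.in_progress[group]


groups = ConcurrencyGroups()
first = groups.start("POPULATE_PERSISTENT_SOURCE_SCHEMA", "run-1")
second = groups.start("POPULATE_PERSISTENT_SOURCE_SCHEMA", "run-2")
```

In this model only one populate run occupies the group at a time, which is the property the workflow relies on to avoid two runs creating the same tables.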


on:
  pull_request:
    types: [labeled]
Comment on lines +11 to +13
Contributor:
Please add workflow_dispatch so we don't have to label random PRs if something gets borked off main.

Contributor Author:
Added, but iterating on this PR has been a huge pain as it can't be tested locally.
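
Since the workflow cannot be run locally, the trigger condition can at least be reasoned about offline. The sketch below is a hypothetical Python mirror of the job-level `if`, not part of the PR. It assumes the manual trigger is identified by the event name (`github.event_name`), because a `workflow_dispatch` payload carries no `action` field, while the label check reads `action` and `label.name` from the `pull_request` payload.

```python
from typing import Any, Dict

RELOAD_LABEL = "Reload Test Data in SQL Engines"


def should_populate(event_name: str, event: Dict[str, Any]) -> bool:
    """Return True when a populate job should run for the given trigger."""
    # Manual runs are recognised by the event name, not by a payload field.
    if event_name == "workflow_dispatch":
        return True
    # Label-triggered runs must be a `labeled` action with the exact label name.
    is_labeled = event.get("action") == "labeled"
    label_name = event.get("label", {}).get("name")
    return event_name == "pull_request" and is_labeled and label_name == RELOAD_LABEL


manual = should_populate("workflow_dispatch", {})
labeled = should_populate("pull_request", {"action": "labeled", "label": {"name": RELOAD_LABEL}})
other_label = should_populate("pull_request", {"action": "labeled", "label": {"name": "bug"}})
```

Feeding the function sample payloads like these is a cheap way to check the condition before pushing another iteration.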

  workflow_dispatch:

env:
  # Unclear on how to make 'Reload Test Data in SQL Engines' a constant here as it does not work here.
  PYTHON_VERSION: "3.8"

jobs:
  snowflake-populate:
    environment: DW_INTEGRATION_TESTS
    if: >
      github.event_name == 'workflow_dispatch'
      || (github.event.action == 'labeled' && github.event.label.name == 'Reload Test Data in SQL Engines')
    name: Snowflake
    runs-on: ubuntu-latest
    steps:
      - name: Check-out the repo
        uses: actions/checkout@v3

      - name: Populate w/Python ${{ env.PYTHON_VERSION }}
        uses: ./.github/actions/run-mf-tests
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          mf_sql_engine_url: ${{ secrets.MF_SNOWFLAKE_URL }}
          mf_sql_engine_password: ${{ secrets.MF_SNOWFLAKE_PWD }}
          parallelism: 1
          make-target: "populate-persistent-source-schema-snowflake"

  redshift-populate:
    environment: DW_INTEGRATION_TESTS
    name: Redshift
    if: >
      github.event_name == 'workflow_dispatch'
      || (github.event.action == 'labeled' && github.event.label.name == 'Reload Test Data in SQL Engines')
    runs-on: ubuntu-latest
    steps:
      - name: Check-out the repo
        uses: actions/checkout@v3

      - name: Populate w/Python ${{ env.PYTHON_VERSION }}
        uses: ./.github/actions/run-mf-tests
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          mf_sql_engine_url: ${{ secrets.MF_REDSHIFT_URL }}
          mf_sql_engine_password: ${{ secrets.MF_REDSHIFT_PWD }}
          parallelism: 1
          make-target: "populate-persistent-source-schema-redshift"

  bigquery-populate:
    environment: DW_INTEGRATION_TESTS
    name: BigQuery
    if: >
      github.event_name == 'workflow_dispatch'
      || (github.event.action == 'labeled' && github.event.label.name == 'Reload Test Data in SQL Engines')
    runs-on: ubuntu-latest
    steps:
      - name: Check-out the repo
        uses: actions/checkout@v3

      - name: Populate w/Python ${{ env.PYTHON_VERSION }}
        uses: ./.github/actions/run-mf-tests
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          MF_SQL_ENGINE_URL: ${{ secrets.MF_BIGQUERY_URL }}
          MF_SQL_ENGINE_PASSWORD: ${{ secrets.MF_BIGQUERY_PWD }}
          parallelism: 1
          make-target: "populate-persistent-source-schema-bigquery"

  databricks-populate:
    environment: DW_INTEGRATION_TESTS
    name: Databricks SQL Warehouse
    if: >
      github.event_name == 'workflow_dispatch'
      || (github.event.action == 'labeled' && github.event.label.name == 'Reload Test Data in SQL Engines')
    runs-on: ubuntu-latest
    steps:
      - name: Check-out the repo
        uses: actions/checkout@v3

      - name: Populate w/Python ${{ env.PYTHON_VERSION }}
        uses: ./.github/actions/run-mf-tests
        with:
          python-version: ${{ env.PYTHON_VERSION }}
          mf_sql_engine_url: ${{ secrets.MF_DATABRICKS_SQL_WAREHOUSE_URL }}
          mf_sql_engine_password: ${{ secrets.MF_DATABRICKS_PWD }}
          parallelism: 1
          make-target: "populate-persistent-source-schema-databricks"

  remove-label:
Contributor:
Oh if this works we should TOTALLY add it to the sql engine tests...... I've got some updates I want to make over there so I can do that once this is in.

    name: Remove Label After Populating Test Data
    runs-on: ubuntu-latest
    needs: [ snowflake-populate, redshift-populate, bigquery-populate, databricks-populate]
    if: github.event.action == 'labeled' && github.event.label.name == 'Reload Test Data in SQL Engines'
    steps:
      - name: Remove Label
        uses: actions-ecosystem/action-remove-labels@v1
        with:
          labels: 'Reload Test Data in SQL Engines'
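
For context on what the label-removal step amounts to, here is a minimal Python sketch of the GitHub REST request that removes a label from an issue or pull request (`DELETE /repos/{owner}/{repo}/issues/{issue_number}/labels/{name}`). How the `action-remove-labels` action does this internally is not shown in this PR, so treat the sketch as an illustration of the endpoint only. The owner, repo, and PR number are hypothetical placeholders, and no request is actually sent.

```python
from typing import Tuple
from urllib.parse import quote


def remove_label_request(owner: str, repo: str, issue_number: int, label: str) -> Tuple[str, str]:
    """Build the HTTP method and URL for removing one label from an issue or PR."""
    # The label contains spaces, so it must be percent-encoded in the URL path.
    encoded_label = quote(label, safe="")
    path = f"/repos/{owner}/{repo}/issues/{issue_number}/labels/{encoded_label}"
    return "DELETE", f"https://api.github.com{path}"


# Hypothetical repository coordinates, for illustration only.
method, url = remove_label_request("example-org", "example-repo", 1, "Reload Test Data in SQL Engines")
```

Removing the label after a successful run means the same PR can be labeled again later to trigger another reload.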
13 changes: 13 additions & 0 deletions GLOSSARY.md
@@ -0,0 +1,13 @@
# Glossary

## Persistent source schema
Many tests generate and execute SQL that depends on tables containing test data. By default, a
pytest fixture creates a temporary schema and populates it with the tables that are required by
the tests. This schema is referred to as the source schema. Creating the source schema (and
the associated tables) can be a slow process for some SQL engines. Since these tables generally
do not change often, functionality was added to use a source schema that is assumed to already
exist when running tests and persists between runs (a persistent source schema). In addition,
functionality was added to create the persistent source schema based on table definitions in the
repo. Because the name of the source schema is generated from a hash of the data that's
supposed to be in the schema, concurrent runs target the same schema. Creating and populating
the persistent source schema should therefore not be done concurrently, as there are race
conditions when creating tables and inserting data.
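
A minimal Python sketch of a hash-derived schema name follows, to make the race condition concrete. It assumes a SHA-256 digest over a canonical JSON serialization of the table data. The function name `persistent_schema_name`, the `mf_source_` prefix, the 10-character truncation, and the table layout are all hypothetical and are not taken from the MetricFlow code.

```python
import hashlib
import json
from typing import Dict, List, Sequence


def persistent_schema_name(tables: Dict[str, List[Sequence[object]]], prefix: str = "mf_source_") -> str:
    """Return a schema name that changes whenever the table contents change."""
    # Sort keys so the same data always serializes to the same string.
    canonical = json.dumps(tables, sort_keys=True, default=str)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # Truncating to 10 hex characters is an arbitrary choice for this sketch.
    return f"{prefix}{digest[:10]}"


tables_v1 = {"bookings": [[1, "2020-01-01"], [2, "2020-01-02"]]}
tables_v2 = {"bookings": [[1, "2020-01-01"], [3, "2020-01-02"]]}

name_a = persistent_schema_name(tables_v1)
name_b = persistent_schema_name(tables_v1)  # same data, same name
name_c = persistent_schema_name(tables_v2)  # changed data, different name
```

Two runs over identical data resolve to the same schema name, so two populate jobs running at once would create and fill the same tables.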
25 changes: 23 additions & 2 deletions Makefile
@@ -7,6 +7,10 @@ PARALLELISM = "auto"
# Additional command line options to pass to pytest.
ADDITIONAL_PYTEST_OPTIONS = ""

# Pytest that can populate the persistent source schema
USE_PERSISTENT_SOURCE_SCHEMA = "--use-persistent-source-schema"
POPULATE_PERSISTENT_SOURCE_SCHEMA = "metricflow/test/source_schema_tools.py::populate_source_schema"

# Install Hatch package / project manager
.PHONY: install-hatch
install-hatch:
@@ -21,24 +25,41 @@ test:
test-postgresql:
	hatch -v run postgres-env:pytest -vv -n $(PARALLELISM) $(ADDITIONAL_PYTEST_OPTIONS) metricflow/test/

# Engine-specific test environments. In most cases you should run these with
# `make -e ADDITIONAL_PYTEST_OPTIONS="--use-persistent-source-schema" test-<engine_type>`
# Engine-specific test environments.
.PHONY: test-bigquery
test-bigquery:
	hatch -v run bigquery-env:pytest -vv -n $(PARALLELISM) $(ADDITIONAL_PYTEST_OPTIONS) metricflow/test/

.PHONY: populate-persistent-source-schema-bigquery
populate-persistent-source-schema-bigquery:
	hatch -v run bigquery-env:pytest -vv $(ADDITIONAL_PYTEST_OPTIONS) $(USE_PERSISTENT_SOURCE_SCHEMA) $(POPULATE_PERSISTENT_SOURCE_SCHEMA)

.PHONY: test-databricks
test-databricks:
	hatch -v run databricks-env:pytest -vv -n $(PARALLELISM) $(ADDITIONAL_PYTEST_OPTIONS) metricflow/test/

.PHONY: populate-persistent-source-schema-databricks
populate-persistent-source-schema-databricks:
	hatch -v run databricks-env:pytest -vv $(ADDITIONAL_PYTEST_OPTIONS) $(USE_PERSISTENT_SOURCE_SCHEMA) $(POPULATE_PERSISTENT_SOURCE_SCHEMA)

.PHONY: test-redshift
test-redshift:
	hatch -v run redshift-env:pytest -vv -n $(PARALLELISM) $(ADDITIONAL_PYTEST_OPTIONS) metricflow/test/

.PHONY: populate-persistent-source-schema-redshift
populate-persistent-source-schema-redshift:
	hatch -v run redshift-env:pytest -vv $(ADDITIONAL_PYTEST_OPTIONS) $(USE_PERSISTENT_SOURCE_SCHEMA) $(POPULATE_PERSISTENT_SOURCE_SCHEMA)


.PHONY: test-snowflake
test-snowflake:
	hatch -v run snowflake-env:pytest -vv -n $(PARALLELISM) $(ADDITIONAL_PYTEST_OPTIONS) metricflow/test/

.PHONY: populate-persistent-source-schema-snowflake
populate-persistent-source-schema-snowflake:
	hatch -v run snowflake-env:pytest -vv $(ADDITIONAL_PYTEST_OPTIONS) $(USE_PERSISTENT_SOURCE_SCHEMA) $(POPULATE_PERSISTENT_SOURCE_SCHEMA)


.PHONY: lint
lint:
	hatch -v run dev-env:pre-commit run --all-files
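
The four `populate-persistent-source-schema-<engine>` targets above expand to the same command shape and differ only in the Hatch environment. The Python sketch below spells out that expansion; the helper `populate_command` is hypothetical and only builds the argument list, it does not run anything. The two constants are copied from the Makefile variables.

```python
from typing import List, Sequence

# Values copied from the Makefile variables of the same names.
USE_PERSISTENT_SOURCE_SCHEMA = "--use-persistent-source-schema"
POPULATE_PERSISTENT_SOURCE_SCHEMA = "metricflow/test/source_schema_tools.py::populate_source_schema"


def populate_command(engine: str, additional_pytest_options: Sequence[str] = ()) -> List[str]:
    """Build the argv that a populate target expands to for one engine."""
    return [
        "hatch", "-v", "run", f"{engine}-env:pytest", "-vv",
        *additional_pytest_options,
        USE_PERSISTENT_SOURCE_SCHEMA,
        POPULATE_PERSISTENT_SOURCE_SCHEMA,
    ]


snowflake_cmd = populate_command("snowflake")
```

Unlike the `test-<engine>` targets, the populate targets pass no `-n $(PARALLELISM)` option to pytest, in line with the glossary's warning that populating the persistent source schema should not be done concurrently.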