Merge pull request #102 from bqbooster/fix-project-list

Fix project list

Kayrnt authored Jan 13, 2025
2 parents 5fbdb0a + ed9e7b6 commit f1c89bd
Showing 80 changed files with 446 additions and 1,883 deletions.
6 changes: 6 additions & 0 deletions .changes/unreleased/Features-20250111-142613.yaml
@@ -0,0 +1,6 @@
kind: Features
body: Improve the scalability of the project approach by factoring per-project queries into a table
time: 2025-01-11T14:26:13.652366+01:00
custom:
Author: Kayrnt
Issue: ""
2 changes: 1 addition & 1 deletion .github/workflows/pr_lint_models.yml
@@ -41,7 +41,7 @@ jobs:
python-version: '3.11'

- name: Install Python packages
-run: python -m pip install dbt-bigquery~=1.8.2 sqlfluff-templater-dbt
+run: python -m pip install dbt-bigquery~=1.9.1 sqlfluff-templater-dbt

- name: Write keyfile if secret is defined
run: |
2 changes: 1 addition & 1 deletion .github/workflows/pr_run_models.yml
@@ -41,7 +41,7 @@ jobs:
python-version: '3.11'

- name: Install Python packages
-run: python -m pip install dbt-bigquery~=1.8.2 sqlfluff-templater-dbt
+run: python -m pip install dbt-bigquery~=1.9.1 sqlfluff-templater-dbt

- name: Write keyfile if secret is defined
run: |
96 changes: 2 additions & 94 deletions CONTRIBUTING.md
@@ -1,96 +1,4 @@
# Contributing to dbt-bigquery-monitoring

## Install setup

You're free to use the environment management tools you prefer, but if you're familiar with them, you can use the following:

- pipx (to isolate the global tools from your local environment)
- tox (to run the tests)
- pre-commit (to run the linter)
- SQLFluff (to lint SQL)
- changie (to generate CHANGELOG entries)

### Tool setup guide

To install pipx:

```bash
pip install pipx
pipx ensurepath
```

Then you'll be able to install tox, pre-commit and sqlfluff with pipx:

```bash
pipx install tox
pipx install pre-commit
pipx install sqlfluff
```

To install changie, there are a few options depending on your OS.
See the [installation guide](https://changie.dev/guide/installation/) for more details.

To configure pre-commit hooks:

```bash
pre-commit install
```

To configure your dbt profile, run the following command and follow the prompts:

```bash
dbt init
```

## Development workflow

- Fork the repo
- Create a branch from `main`
- Make your changes
- Run `tox` to run the tests
- Create your changelog entry with `changie new` (don't edit the CHANGELOG.md directly)
- Commit your changes (it will run the linter through pre-commit)
- Push your branch and open a PR on the repository

## Adding a CHANGELOG Entry

We use changie to generate CHANGELOG entries. Note: Do not edit the CHANGELOG.md directly. Your modifications will be lost.

Follow the steps to [install changie](https://changie.dev/guide/installation/) for your system.

Once changie is installed and your PR is created, simply run `changie new` and changie will walk you through the process of creating a changelog entry. Commit the file that's created and your changelog entry is complete!

### SQLFluff

We use SQLFluff to keep SQL style consistent. By installing `pre-commit` per the initial setup guide above, SQLFluff will run automatically when you make a commit locally. A GitHub action automatically tests pull requests and adds annotations where there are failures.

Lint all models in the /models directory:
```bash
tox -e lint_all
```

Fix all models in the /models directory:
```bash
tox -e fix_all
```

Lint (or substitute `lint` with `fix`) a specific model:
```bash
tox -e lint -- models/path/to/model.sql
```

Lint (or substitute `lint` with `fix`) a specific directory:
```bash
tox -e lint -- models/path/to/directory
```

#### Rules

Enforced rules are defined within `tox.ini`. To view the full list of available rules and their configuration, see the [SQLFluff documentation](https://docs.sqlfluff.com/en/stable/rules.html).

## Generation of dbt base google models

dbt base google models are generated in a dedicated project hosted at:
https://github.com/bqbooster/dbt-bigquery-monitoring-parser

It was separated to ensure that users don't install the parser (and tests) when they install the dbt package.
See the related documentation page:
https://bqbooster.github.io/dbt-bigquery-monitoring/contributing
7 changes: 4 additions & 3 deletions dbt_project.yml
@@ -25,7 +25,7 @@ models:
- "dbt-bigquery-monitoring-storage"
base:
google:
-materialized: "ephemeral"
+materialized: "{{ var('google_information_schema_model_materialization', 'placeholder') if var('google_information_schema_model_materialization', 'placeholder') != 'placeholder' else 'ephemeral' }}"

vars:
# Environment configuration
@@ -51,15 +51,16 @@ vars:
# Project input configuration
# The number of days to look back for regular tables; usually up to 180 days.
# Expiration on intermediate tables is aligned so that they can store data as old as your maximum lookback window, since they're partitioned by time.
-lookback_window_days: "{{ env_var('DBT_BQ_MONITORING_LOOKBACK_WINDOW_DAYS', 7) }}"
+lookback_window_days: "{{ env_var('DBT_BQ_MONITORING_LOOKBACK_WINDOW_DAYS', 7) }}"
# Billing data can arrive late; refreshing the past 3 days is a safe window, but you can increase it for exceptional cases
lookback_incremental_billing_window_days: "{{ env_var('DBT_BQ_MONITORING_LOOKBACK_INCREMENTAL_BILLING_WINDOW_DAYS', 3) }}"
# Project output configuration
output_materialization: "{{ env_var('DBT_BQ_MONITORING_OUTPUT_MATERIALIZATION', 'table') }}"
output_limit_size: "{{ env_var('DBT_BQ_MONITORING_OUTPUT_LIMIT_SIZE', 1000) }}"
output_partition_expiration_days: "{{ env_var('DBT_BQ_MONITORING_TABLE_EXPIRATION_DAYS', 365) }}"
use_copy_partitions: "{{ env_var('DBT_BQ_MONITORING_USE_COPY_PARTITIONS', true) }}"

google_information_schema_model_materialization: "{{ env_var('DBT_BQ_MONITORING_GOOGLE_INFORMATION_SCHEMA_MODELS_MATERIALIZATION', 'placeholder') }}"

# GCP Billing export (required for storage cost monitoring over time)
# The values are configured during the export setup in the GCP Console
enable_gcp_billing_export: "{{ env_var('DBT_BQ_MONITORING_ENABLE_GCP_BILLING_EXPORT', false) }}"
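A note on the `placeholder` sentinel in the `materialized:` line added above: `var`/`env_var` defaults can't distinguish "not set" from an explicit value, so the package defaults the variable to a sentinel string and falls back to `ephemeral` while the sentinel is still present. A minimal sketch of the resolution (illustrative, not package code):

```sql
{%- set choice = var('google_information_schema_model_materialization', 'placeholder') -%}
{#- 'placeholder' means neither the var nor its env var was set by the user -#}
{{ 'ephemeral' if choice == 'placeholder' else choice }}
```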
8 changes: 8 additions & 0 deletions docs/configuration/configuration.md
@@ -45,6 +45,7 @@ vars:

### Project mode

Project mode is useful when you have multiple GCP projects or you want to store the dbt-bigquery-monitoring models in a project different from the one used for execution.
To enable the "project mode", you'll need to define explicitly one mandatory setting to set in the `dbt_project.yml` file:

```yml
@@ -53,6 +54,13 @@ vars:
input_gcp_projects: [ 'my-gcp-project', 'my-gcp-project-2' ]
```

:::warning

When using the "project mode", the package will create intermediate tables to avoid issues from BigQuery when too many projects are used.
That process is done only on tables that are project related. The package leverages a custom materialiation (`project_by_project_table`) designed specifically for that need that can found in the `macros` folder.

:::
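If you prefer to force a single materialization for the information schema models (subject to the warning above), the package exposes a dedicated variable; a minimal sketch in `dbt_project.yml`, using `table` as an example value:

```yml
vars:
  google_information_schema_model_materialization: "table"
```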

## Add metadata to queries (Recommended but optional)

To enhance your query metadata with dbt model information, the package provides a dedicated macro that leverages dbt "query comments" (the header set at the top of each query)
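For reference, wiring a package-provided query-comment macro into `dbt_project.yml` typically looks like the sketch below; the exact macro name (`get_query_comment`) is an assumption here, not confirmed by this diff, so check it against the package documentation:

```yml
query-comment:
  comment: '{{ dbt_bigquery_monitoring.get_query_comment(node) }}' # assumed macro name
  job-label: true # optionally attach the same metadata as BigQuery job labels
```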
1 change: 1 addition & 0 deletions docs/configuration/package-settings.md
@@ -41,6 +41,7 @@ These settings are used to configure how dbt will run and materialize the models
| `output_limit_size` | `DBT_BQ_MONITORING_OUTPUT_LIMIT_SIZE` | Limit size to use for the models | `1000` |
| `output_partition_expiration_days` | `DBT_BQ_MONITORING_TABLE_EXPIRATION_DAYS` | Default table expiration in days for incremental models | `365` days |
| `use_copy_partitions` | `DBT_BQ_MONITORING_USE_COPY_PARTITIONS` | Whether to use copy partitions or not | `true` |
| `google_information_schema_model_materialization` | `DBT_BQ_MONITORING_GOOGLE_INFORMATION_SCHEMA_MODELS_MATERIALIZATION` | Materialization to use for the Google information schema models. Note that this setting doesn't apply in project mode, which materializes intermediate tables to avoid BigQuery issues when too many projects are used. | `ephemeral` |

### GCP Billing export configuration

12 changes: 12 additions & 0 deletions macros/materalization_information_schema.sql
@@ -0,0 +1,12 @@
{% macro dbt_bigquery_monitoring_materialization() %}
{% set projects = project_list() %}
{#- If the user has set a materialization in the config that differs from the default -#}
{% if var('google_information_schema_model_materialization') != 'placeholder' %}
{% set materialization = var('google_information_schema_model_materialization') %}
{% elif projects|length == 0 %}
{% set materialization = 'ephemeral' %}
{% else %}
{% set materialization = 'project_by_project_table' %}
{% endif %}
{{ return(materialization) }}
{% endmacro %}
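Models opt into this selection logic by calling the macro from their `config` block, as the information schema models below do:

```sql
{{ config(materialized=dbt_bigquery_monitoring_materialization()) }}
```

Depending on the variable and the project list, this resolves to the user's explicit choice, to `ephemeral` (no input projects), or to the custom `project_by_project_table` materialization.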
88 changes: 88 additions & 0 deletions macros/project_by_project_table.sql
@@ -0,0 +1,88 @@
{%- materialization project_by_project_table, adapter='bigquery' -%}

{% set target_relation = this %}
{% set existing_relation = load_relation(this) %}
{% set projects = project_list() %}
{%- set raw_partition_by = config.get('partition_by', none) -%}
{%- set partition_config = adapter.parse_partition_by(raw_partition_by) -%}
{%- set full_refresh_mode = (should_full_refresh()) -%}

{{ run_hooks(pre_hooks) }}

{%- set sql_no_data = sql + " LIMIT 0" %}

-- Create the table if it doesn't exist or if we're in full-refresh mode
{% if existing_relation is none or full_refresh_mode %}
{% call statement('main') -%}
{#- create_table_as picks up partitioning/clustering from the model config, so no branch on partition_config is needed here -#}
{% set build_sql = create_table_as(False, target_relation, sql_no_data) %}
{{ build_sql }}
{%- endcall %}
{% else %}
{% call statement('main') -%}
SELECT 1
{%- endcall %}
{% if partition_config is not none %}
-- Get the maximum partition value
{% set max_partition_sql %}
SELECT FORMAT_TIMESTAMP("%F %T", MAX({{ partition_config.field }})) as max_partition
FROM {{ target_relation }}
WHERE {{ partition_config.field }} IS NOT NULL
{% endset %}
{% else %}
-- Truncate the table if partition_by is not defined
{% set truncate_sql %}
TRUNCATE TABLE {{ target_relation }}
{% endset %}
{% do run_query(truncate_sql) %}
{% endif %}
{% if partition_config is not none %}
{% set max_partition_result = run_query(max_partition_sql) %}
{#- default to none so the later checks don't hit an undefined variable when the table is empty -#}
{% set max_partition_value = max_partition_result.columns[0].values()[0] if max_partition_result|length > 0 else none %}
{% endif %}
{% endif %}

-- If we have projects, process them one by one
{% if projects|length > 0 %}
{% set all_insert_sql = [] %}
{% for project in projects %}
{% set project_sql = sql | replace('`region-', '`' ~ project | trim ~ '`.`region-') %}
{% if existing_relation is not none and partition_config is not none and max_partition_value is not none and max_partition_value | length > 0 %}
{% set where_condition = 'WHERE ' ~ partition_config.field ~ ' >= TIMESTAMP_TRUNC("' ~ max_partition_value ~ '", HOUR)' %}
{% set insert_sql %}
DELETE FROM {{ target_relation }}
{{ where_condition }};

INSERT INTO {{ target_relation }}
{{ project_sql }}
{{ where_condition }}
{% endset %}
{% else %}
{% if partition_config is not none %}
{#- BigQuery doesn't allow more than 4000 partitions per insert; for hourly partitioned tables that's 4000 / 24 ≈ 166 days -#}
{% set project_sql = project_sql + ' WHERE ' ~ partition_config.field ~ ' >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 166 DAY)' %}
{% endif %}
{% set insert_sql %}
INSERT INTO {{ target_relation }}
{{ project_sql }}
{% endset %}
{% endif %}
{% do all_insert_sql.append(insert_sql) %}
{% endfor %}
{% call statement('insert') -%}
{{ all_insert_sql | join(';\n') }}
{%- endcall %}
{% endif %}

{{ run_hooks(post_hooks) }}
{% set grant_config = config.get('grants') %}
{% set should_revoke = should_revoke(existing_relation, full_refresh_mode=True) %}
{% do apply_grants(target_relation, grant_config, should_revoke) %}
{% do persist_docs(target_relation, model) %}

{{ return({'relations': [target_relation]}) }}

{%- endmaterialization -%}
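To make the per-project rewrite above concrete: the materialization takes the model's compiled SQL, which targets a region-qualified `INFORMATION_SCHEMA` view, and prefixes the region qualifier with each project id. A minimal sketch, assuming `bq_region` is `us` and an input project `my-gcp-project` (both illustrative):

```sql
-- Compiled model SQL (single-project form):
SELECT job_id
FROM `region-us`.`INFORMATION_SCHEMA`.`JOBS_BY_PROJECT`;

-- After the replace('`region-', '`my-gcp-project`.`region-') rewrite:
SELECT job_id
FROM `my-gcp-project`.`region-us`.`INFORMATION_SCHEMA`.`JOBS_BY_PROJECT`;
```

One such `INSERT INTO ... SELECT` is built per project, and they are executed together as a single multi-statement job, joined with `;`.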
2 changes: 1 addition & 1 deletion models/base/combined_jobs_inputs.sql
@@ -7,7 +7,7 @@

SELECT
COALESCE(TIMESTAMP_TRUNC(a.timestamp, HOUR), TIMESTAMP_TRUNC(j.creation_time, HOUR)) AS hour,
-COALESCE(a.bi_engine_statistics, j.bi_engine_statistics) AS bi_engine_statistics,
+j.bi_engine_statistics AS bi_engine_statistics, -- this field is only available in the information schema
COALESCE(a.cache_hit, j.cache_hit) AS cache_hit,
a.caller_supplied_user_agent AS caller_supplied_user_agent, -- this field is only available in the audit logs
COALESCE(a.creation_time, j.creation_time) AS creation_time,
@@ -1,3 +1,4 @@
{{ config(materialized=dbt_bigquery_monitoring_materialization()) }}
{# More details about base table in https://cloud.google.com/bigquery/docs/information-schema-object-privileges -#}
{# Required role/permissions: To query the INFORMATION_SCHEMA.OBJECT_PRIVILEGES view, you need the following
Identity and Access Management (IAM) permissions:
@@ -6,31 +7,11 @@ bigquery.tables.getIamPolicy for tables and views.
For more information about BigQuery permissions, see
Access control with IAM. -#}

WITH base AS (
{% if project_list()|length > 0 -%}
{% for project in project_list() -%}
SELECT object_catalog, object_schema, object_name, object_type, privilege_type, grantee
FROM `{{ project | trim }}`.`region-{{ var('bq_region') }}`.`INFORMATION_SCHEMA`.`OBJECT_PRIVILEGES`
{% if not loop.last %}UNION ALL{% endif %}
{% endfor %}
{%- else %}
SELECT
SELECT
object_catalog,
object_schema,
object_name,
object_type,
privilege_type,
grantee
FROM `region-{{ var('bq_region') }}`.`INFORMATION_SCHEMA`.`OBJECT_PRIVILEGES`
{%- endif %}
)

SELECT
object_catalog,
object_schema,
object_name,
object_type,
privilege_type,
grantee,
FROM
base
22 changes: 2 additions & 20 deletions models/base/google/bi_engine/information_schema_bi_capacities.sql
@@ -1,28 +1,10 @@
{{ config(materialized=dbt_bigquery_monitoring_materialization()) }}
{# More details about base table in https://cloud.google.com/bigquery/docs/information-schema-bi-capacities -#}

WITH base AS (
{% if project_list()|length > 0 -%}
{% for project in project_list() -%}
SELECT project_id, project_number, bi_capacity_name, size, preferred_tables
FROM `{{ project | trim }}`.`region-{{ var('bq_region') }}`.`INFORMATION_SCHEMA`.`BI_CAPACITIES`
{% if not loop.last %}UNION ALL{% endif %}
{% endfor %}
{%- else %}
SELECT
SELECT
project_id,
project_number,
bi_capacity_name,
size,
preferred_tables
FROM `region-{{ var('bq_region') }}`.`INFORMATION_SCHEMA`.`BI_CAPACITIES`
{%- endif %}
)

SELECT
project_id,
project_number,
bi_capacity_name,
size,
preferred_tables,
FROM
base
@@ -1,14 +1,7 @@
{{ config(materialized=dbt_bigquery_monitoring_materialization()) }}
{# More details about base table in https://cloud.google.com/bigquery/docs/information-schema-bi-capacity-changes -#}

WITH base AS (
{% if project_list()|length > 0 -%}
{% for project in project_list() -%}
SELECT change_timestamp, project_id, project_number, bi_capacity_name, size, user_email, preferred_tables
FROM `{{ project | trim }}`.`region-{{ var('bq_region') }}`.`INFORMATION_SCHEMA`.`BI_CAPACITY_CHANGES`
{% if not loop.last %}UNION ALL{% endif %}
{% endfor %}
{%- else %}
SELECT
SELECT
change_timestamp,
project_id,
project_number,
@@ -17,16 +10,3 @@ size,
user_email,
preferred_tables
FROM `region-{{ var('bq_region') }}`.`INFORMATION_SCHEMA`.`BI_CAPACITY_CHANGES`
{%- endif %}
)

SELECT
change_timestamp,
project_id,
project_number,
bi_capacity_name,
size,
user_email,
preferred_tables,
FROM
base