Merge branch 'current' into patch-1
mirnawong1 authored Jan 15, 2024
2 parents 9826f7b + 5f39650 commit d617462
Showing 88 changed files with 276 additions and 421 deletions.
@@ -0,0 +1,160 @@
---
title: Serverless, free-tier data stack with dlt + dbt core.
description: "In this article, Euan shares his personal project to fetch property price data during his and his partner's house-hunting process, and how he created a serverless free-tier data stack by using Google Cloud Functions to run data ingestion tool dlt alongside dbt for transformation."
slug: serverless-dlt-dbt-stack

authors: [euan_johnston]

hide_table_of_contents: false

date: 2024-01-15
is_featured: false
---



## The problem, the builder and tooling

**The problem**: My partner and I are considering buying a property in Portugal. There is no reference data for the real estate market here - how many houses are being sold, for what price? Nobody knows except the property office and maybe the banks, and they don’t readily divulge this information. The only data source we have is Idealista, which is a portal where real estate agencies post ads.

Unfortunately, there are significantly fewer properties than ads - it seems many real estate agencies re-post the same ads that others do, with intentionally different data and often misleading bits of info. The agencies do this so that interested parties reach out to them for clarification, and from there they can start a sales process. At the same time, the website hosting the ads is incentivised to allow this to continue, as it gets paid per ad, not per property.

**The builder:** I’m a data freelancer who deploys end-to-end solutions, so when I have a data problem, I cannot just let it go.

**The tools:** I want to be able to run my project on [Google Cloud Functions](https://cloud.google.com/functions) due to the generous free tier. [dlt](https://dlthub.com/) is a new Python library for declarative data ingestion which I have wanted to test for some time. Finally, I will use dbt Core for transformation.

## The starting point

If I want to have reliable information on the state of the market, I will need to:

- Grab the messy data from Idealista and historize it.
- Deduplicate existing listings.
- Try to infer what listings sold for how much.

Once I have deduplicated listings with some online history, I can get an idea of:

- How expensive different properties are.
- How fast they get sold, hopefully a signal of whether they are “worth it” or not.

## Towards a solution

The solution has pretty standard components:

- An EtL pipeline. The little t stands for normalisation, such as transforming strings to dates or unpacking nested structures. This is handled by dlt functions written in Python.
- A transformation layer taking the source data loaded by my dlt functions and creating the tables necessary, handled by dbt.
- Due to the complexity of deduplication, I needed to add a human-in-the-loop step in Google Sheets to confirm the matches.

These elements are reflected in the diagram below and further clarified in greater detail later in the article:

<Lightbox src="/img/blog/serverless-free-tier-data-stack-with-dlt-and-dbt-core/architecture_diagram.png" width="70%" title="Project architecture" />

### Ingesting the data

For ingestion, I use a couple of sources:

First, I ingest home listings from the Idealista API, accessed through [API Dojo's freemium wrapper](https://rapidapi.com/apidojo/api/idealista2). The dlt pipeline I created for ingestion is in [this repo](https://github.com/euanjohnston-dev/Idealista_pipeline).
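
In rough terms, a dlt resource for this kind of API looks like the sketch below. The RapidAPI host, endpoint path, `locationId`, and the `idealista_listings` resource name are illustrative assumptions rather than the exact code in the repo:

```python
import dlt
import requests


@dlt.resource(name="listings", write_disposition="append")
def idealista_listings(api_key: str = dlt.secrets.value):
    # Illustrative endpoint and parameters for API Dojo's Idealista wrapper;
    # the real pipeline in the linked repo defines its own resources and paging.
    url = "https://idealista2.p.rapidapi.com/properties/list"
    headers = {
        "X-RapidAPI-Key": api_key,
        "X-RapidAPI-Host": "idealista2.p.rapidapi.com",
    }
    params = {"operation": "sale", "locationId": "0-EU-PT-11", "numPage": "1"}
    response = requests.get(url, headers=headers, params=params, timeout=30)
    response.raise_for_status()
    # dlt handles the little "t": typing and unpacking the nested JSON into child tables.
    yield from response.json().get("elementList", [])


if __name__ == "__main__":
    pipeline = dlt.pipeline(
        pipeline_name="idealista_pipeline",
        destination="bigquery",  # BigQuery credentials come from dlt secrets / env vars
        dataset_name="idealista_raw",
    )
    print(pipeline.run(idealista_listings()))
```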

After an initial round of transformation (described in the next section), the deduplicated data is loaded into BigQuery where I can query it from the Google Sheets client and manually review the deduplication.

When I'm happy with the results, I use the [ready-made dlt Sheets source connector](https://dlthub.com/docs/dlt-ecosystem/verified-sources/google_sheets) to pull the data back into BigQuery, [as defined here](https://github.com/euanjohnston-dev/gsheets_check_pipeline).
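
A minimal sketch of that pull-back step, assuming the verified source exposes a `google_spreadsheet` function as described in dlt's docs; the spreadsheet ID, range name and dataset name are placeholders:

```python
import dlt
from google_sheets import google_spreadsheet  # dlt verified source copied into the project

pipeline = dlt.pipeline(
    pipeline_name="gsheets_check_pipeline",
    destination="bigquery",
    dataset_name="dedup_reviewed",
)

# Placeholder spreadsheet ID and range name for the manually reviewed sheet.
reviewed_listings = google_spreadsheet(
    spreadsheet_url_or_id="<spreadsheet-id>",
    range_names=["reviewed_listings"],
)
print(pipeline.run(reviewed_listings))
```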

### Transforming the data

For transformation I use my favorite solution, dbt Core. For running and orchestrating dbt on Cloud Functions, I use dlt’s dbt Core runner. The benefit of the runner in this context is that I can re-use the same credential setup instead of creating a separate profiles.yml file.

This is the package I created: <https://github.com/euanjohnston-dev/idealista_dbt_pipeline>

### Production-readying the pipeline

To make the pipeline more “production ready”, I made some improvements:

- Using a credential store instead of hard-coding passwords, in this case Google Secret Manager.
- Being notified when the pipeline runs and what the outcome is. For this I send data to Slack via a decorator that posts the error on failure and the load metadata on success.

```python
from dlt.common.runtime.slack import send_slack_message


# Decorator factory: wraps a pipeline function and posts a Slack message
# with the load info on success or the error on failure.
def notify_on_completion(hook):
    def decorator(func):
        def wrapper(*args, **kwargs):
            try:
                load_info = func(*args, **kwargs)
                message = f"Function {func.__name__} completed successfully. Load info: {load_info}"
                send_slack_message(hook, message)
                return load_info
            except Exception as e:
                message = f"Function {func.__name__} failed. Error: {str(e)}"
                send_slack_message(hook, message)
                raise
        return wrapper
    return decorator
```
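
Applied to an ingestion entry point, usage looks roughly like this. The webhook URL and the `load_listings` function are placeholders, and the wrapped pipeline reuses the `idealista_listings` resource from the earlier sketch:

```python
SLACK_HOOK = "https://hooks.slack.com/services/..."  # stored in Secret Manager in practice

@notify_on_completion(SLACK_HOOK)
def load_listings():
    # Placeholder ingestion entry point; it just needs to return dlt's load info.
    pipeline = dlt.pipeline(
        pipeline_name="idealista_pipeline",
        destination="bigquery",
        dataset_name="idealista_raw",
    )
    return pipeline.run(idealista_listings())
```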

## The outcome

The outcome was first and foremost a visualisation highlighting the unique properties available in my specific area of search. The map on the left of the page gives a live overview of location, number of duplicates (bubble size) and price (bubble colour), which can, amongst other features, be filtered using the sliders on the right. This is a much less cluttered view from which to observe the actual inventory available.

<Lightbox src="/img/blog/serverless-free-tier-data-stack-with-dlt-and-dbt-core/map_screenshot.png" width="70%" title="Dashboard mapping overview" />

Further charts highlight additional metrics that can now be measured accurately after deduplication, most importantly the development over time of “average price/square metre” and the properties inferred to have been sold.

### Next steps

This version was very much about getting a base from which to analyze the properties for my own personal use case.

In terms of further development, I have had interest from people in running the solution on their own specific target areas.

For this to work at scale I would need a more robust method to deal with duplicate attribution, which is a difficult problem as real estate agencies intentionally change details like number of rooms or surface area.

Perhaps this is a problem ML or GPT could solve as well as a human, given the limited options available.

## Learnings and conclusion

The data problem itself was an eye opener into the real-estate market. It’s a messy market full of unknowns and noise, which adds significant purchase risk for first-time buyers.

Tooling wise, it was surprising how quick it was to set everything up. dlt integrates well with dbt and enables fast and simple data ingestion, making this project simpler than I thought it would be.

### dlt

Good:

- As a big fan of dbt, I love how seamlessly the two solutions complement one another. dlt handles the data cleaning and normalisation automatically so I can focus on curating and modelling it in dbt. While the automatic unpacking leaves some small adjustments for the analytics engineer, it’s much better than cleaning and typing JSON in the database or in custom Python code.
- When creating my first dummy pipeline I used DuckDB. It was a great introduction to how simple it is to get started and provided a solid starting point before developing something for the cloud.

Bad:

- I did have a small hiccup with the Google Sheets connector assuming OAuth authentication over my desired service-account credentials, but this was relatively easy to rectify by explicitly stating GcpServiceAccountCredentials in the source's `__init__.py` file (sketched below).
- Using a verified source for the Google Sheets connector and building my own from RapidAPI endpoints felt equally intuitive. However, I would have liked more documentation on how to run these two pipelines in the same script alongside the dbt pipeline.
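
For reference, the fix boils down to typing the source's credentials argument explicitly, roughly as below. This is a sketch of the relevant signature, not the verified source's actual code:

```python
import dlt
from dlt.sources.credentials import GcpServiceAccountCredentials


@dlt.source
def google_spreadsheet(
    spreadsheet_url_or_id: str = dlt.config.value,
    # Pinning the credential type to the service-account class stops dlt
    # from falling back to the interactive OAuth flow.
    credentials: GcpServiceAccountCredentials = dlt.secrets.value,
):
    ...
```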

### dbt

No surprises there. I developed the project locally, and to deploy to Cloud Functions I injected credentials into dbt via the dlt runner. This meant I could re-use the setup I had done for the other dlt pipelines.

```python
import dlt


def dbt_run():
    # make an authenticated connection with dlt to the dwh
    pipeline = dlt.pipeline(
        pipeline_name='dbt_pipeline',
        destination='bigquery',  # credentials read from env
        dataset_name='dbt'
    )
    # make a venv in case we have lib conflicts between dlt and current env
    venv = dlt.dbt.get_venv(pipeline)
    # package the pipeline, dbt package and env
    dbt = dlt.dbt.package(pipeline, "dbt/property_analytics", venv=venv)
    # and run it
    models = dbt.run_all()
    # show outcome
    for m in models:
        print(f"Model {m.model_name} materialized in {m.time} with status {m.status} and message {m.message}")
```

### Cloud Functions

While I had used Cloud Functions before, I had never set them up for dbt, and I was able to easily follow dlt’s docs to run the pipelines there. Cloud Functions are a great solution for cheaply running small-scale pipelines, and the running cost of this project is a few cents a month. If the insights drawn from the project help us save even 1% of a house price, the project will have been a success.
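
For completeness, the Cloud Function entry point for a setup like this can stay very small. A sketch assuming the Python functions-framework HTTP signature; `run_pipelines` and `load_listings` are placeholder names, while `dbt_run` is the runner function shown above:

```python
import functions_framework


@functions_framework.http
def run_pipelines(request):
    # Placeholder orchestration: ingest, then transform with the dlt dbt runner.
    load_info = load_listings()  # the Slack-decorated ingestion sketch from earlier
    dbt_run()                    # the dbt runner function shown above
    return f"Pipelines completed: {load_info}", 200
```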

### To sum up

dlt feels like the perfect solution for anyone who has scratched the surface of Python development. To be able to have schemas ready for transformation in such a short space of time is truly… transformational. As a freelancer, being able to accelerate pipeline development is a huge benefit when working with companies that are often frustrated by how long it takes to start ‘showing value’.

I’d welcome the chance to discuss what’s been built to date or collaborate on any potential further development in the comments below.
2 changes: 1 addition & 1 deletion website/blog/2024-01-09-defer-in-development.md
@@ -12,7 +12,7 @@ date: 2024-01-09
is_featured: true
---

Picture this — you’ve got a massive dbt project, thousands of models chugging along, creating actionable insights for your stakeholders. A ticket comes your way &mdash; a model needs to be refactored! "No problem," you think to yourself, "I will simply make that change and test it locally!" You look at you lineage, and realize this model is many layers deep, buried underneath a long chain of tables and views.
Picture this — you’ve got a massive dbt project, thousands of models chugging along, creating actionable insights for your stakeholders. A ticket comes your way &mdash; a model needs to be refactored! "No problem," you think to yourself, "I will simply make that change and test it locally!" You look at your lineage, and realize this model is many layers deep, buried underneath a long chain of tables and views.

“OK,” you think further, “I’ll just run a `dbt build -s +my_changed_model` to make sure I have everything I need built into my dev schema and I can test my changes”. You run the command. You wait. You wait some more. You get some coffee, and completely take yourself out of your dbt development flow state. A lot of time and money down the drain to get to a point where you can *start* your work. That’s no good!

10 changes: 10 additions & 0 deletions website/blog/authors.yml
@@ -187,6 +187,16 @@ emily_riederer:
- icon: fa-readme
url: https://emilyriederer.com

euan_johnston:
image_url: /img/blog/authors/ejohnston.png
job_title: Freelance Business Intelligence manager
name: Euan Johnston
links:
- icon: fa-linkedin
url: https://www.linkedin.com/in/euan-johnston-610a05a8/
- icon: fa-github
url: https://github.com/euanjohnston-dev

grace_goheen:
image_url: /img/blog/authors/grace-goheen.jpeg
job_title: Analytics Engineer
2 changes: 1 addition & 1 deletion website/docs/best-practices/best-practice-workflows.md
@@ -39,7 +39,7 @@ Your dbt project will depend on raw data stored in your database. Since this dat

:::info Using sources for raw data references

As of v0.13.0, we recommend defining your raw data as [sources](/docs/build/sources), and selecting from the source rather than using the direct relation reference. Our dbt projects no longer contain any direct relation references in any models.
We recommend defining your raw data as [sources](/docs/build/sources), and selecting from the source rather than using the direct relation reference. Our dbt projects don't contain any direct relation references in any models.

:::

18 changes: 6 additions & 12 deletions website/docs/docs/build/groups.md
@@ -7,18 +7,6 @@ keywords:
- groups access mesh
---

:::info New functionality
This functionality is new in v1.5.
:::

## Related docs

* [Model Access](/docs/collaborate/govern/model-access#groups)
* [Group configuration](/reference/resource-configs/group)
* [Group selection](/reference/node-selection/methods#the-group-method)

## About groups

A group is a collection of nodes within a dbt DAG. Groups are named, and every group has an `owner`. They enable intentional collaboration within and across teams by restricting [access to private](/reference/resource-configs/access) models.

Group members may include models, tests, seeds, snapshots, analyses, and metrics. (Not included: sources and exposures.) Each node may belong to only one group.
@@ -126,3 +114,9 @@
Node model.jaffle_shop.marketing_model attempted to reference node model.jaffle_shop.finance_model,
which is not allowed because the referenced node is private to the finance group.
```

## Related docs

* [Model Access](/docs/collaborate/govern/model-access#groups)
* [Group configuration](/reference/resource-configs/group)
* [Group selection](/reference/node-selection/methods#the-group-method)
2 changes: 1 addition & 1 deletion website/docs/docs/build/materializations.md
@@ -120,7 +120,7 @@ required with incremental materializations
* `dbt run` on materialized views corresponds to a code deployment, just like views
* **Cons:**
* Due to the fact that materialized views are more complex database objects, database platforms tend to have
less configuration options available, see your database platform's docs for more details
fewer configuration options available; see your database platform's docs for more details
* Materialized views may not be supported by every database platform
* **Advice:**
* Consider materialized views for use cases where incremental models are sufficient, but you would like the data platform to manage the incremental logic and refresh.
7 changes: 0 additions & 7 deletions website/docs/docs/build/project-variables.md
@@ -25,13 +25,6 @@ Jinja is not supported within the `vars` config, and all values will be interpre

:::

:::info New in v0.17.0

The syntax for specifying vars in the `dbt_project.yml` file has changed in
dbt v0.17.0. See the [migration guide](/docs/dbt-versions/core-upgrade)
for more information on these changes.

:::

To define variables in a dbt project, add a `vars` config to your `dbt_project.yml` file.
These `vars` can be scoped globally, or to a specific package imported in your
4 changes: 2 additions & 2 deletions website/docs/docs/cloud/configure-cloud-cli.md
@@ -66,9 +66,8 @@ Once you install the dbt Cloud CLI, you need to configure it to connect to a dbt
```yaml
# dbt_project.yml
name:
version:
...
# Your project configs...
dbt-cloud:
project-id: PROJECT_ID
@@ -86,6 +85,7 @@ To set environment variables in the dbt Cloud CLI for your dbt project:
2. Then select **Profile Settings**, then **Credentials**.
3. Click on your project and scroll to the **Environment Variables** section.
4. Click **Edit** on the lower right and then set the user-level environment variables.
- Note, when setting up the [dbt Semantic Layer](/docs/use-dbt-semantic-layer/dbt-sl), using [environment variables](/docs/build/environment-variables) like `{{env_var('DBT_WAREHOUSE')}}` is not supported. You should use the actual credentials instead.

## Use the dbt Cloud CLI

@@ -77,4 +77,5 @@ Select **Allow**. This redirects you back to dbt Cloud. You should now be an aut

## FAQs

<FAQ path="Warehouse/bg-oauth-drive-scope"/>
<FAQ path="Warehouse/bq-oauth-drive-scope" />

@@ -83,11 +83,8 @@ To set up your profile, copy the correct sample profile for your warehouse into

You can find more information on which values to use in your targets below.

:::info Validating your warehouse credentials
Use the [debug](/reference/dbt-jinja-functions/debug-method) command to validate your warehouse connection. Run `dbt debug` from within a dbt project to test your connection.

Use the [debug](/reference/dbt-jinja-functions/debug-method) command to check whether you can successfully connect to your warehouse. Simply run `dbt debug` from within a dbt project to test your connection.

:::

## Understanding targets in profiles

@@ -18,13 +18,6 @@ When querying for `environment`, you can use the following arguments.

<QueryArgsTable queryName="environment" useBetaAPI />

:::caution

dbt Labs is making changes to the Discovery API. These changes will take effect on August 15, 2023.

The data type `Int` for `id` is being deprecated and will be replaced with `BigInt`. When the time comes, you will need to update your API call accordingly to avoid errors.
:::

### Example queries

You can use your production environment's `id`:
@@ -5,10 +5,6 @@ description: New features and changes in dbt Core v1.7
displayed_sidebar: "docs"
---

import UpgradeMove from '/snippets/_upgrade-move.md';

<UpgradeMove />

## Resources

- [Changelog](https://github.com/dbt-labs/dbt-core/blob/8aaed0e29f9560bc53d9d3e88325a9597318e375/CHANGELOG.md)
@@ -5,10 +5,6 @@ id: "upgrading-to-v1.6"
displayed_sidebar: "docs"
---

import UpgradeMove from '/snippets/_upgrade-move.md';

<UpgradeMove />

dbt Core v1.6 has three significant areas of focus:
1. Next milestone of [multi-project deployments](https://github.com/dbt-labs/dbt-core/discussions/6725): improvements to contracts, groups/access, versions; and building blocks for cross-project `ref`
1. Semantic layer re-launch: dbt Core and [MetricFlow](https://docs.getdbt.com/docs/build/about-metricflow) integration
@@ -5,10 +5,6 @@ id: "upgrading-to-v1.5"
displayed_sidebar: "docs"
---

import UpgradeMove from '/snippets/_upgrade-move.md';

<UpgradeMove />

dbt Core v1.5 is a feature release, with two significant additions:
1. [**Model governance**](/docs/collaborate/govern/about-model-governance) — access, contracts, versions — the first phase of [multi-project deployments](https://github.com/dbt-labs/dbt-core/discussions/6725)
2. A Python entry point for [**programmatic invocations**](/reference/programmatic-invocations), at parity with the CLI
@@ -3,10 +3,6 @@ title: "Upgrading to dbt utils v1.0"
description: New features and breaking changes to consider as you upgrade to dbt utils v1.0.
---

import UpgradeMove from '/snippets/_upgrade-move.md';

<UpgradeMove />

# Upgrading to dbt utils v1.0

For the first time, [dbt utils](https://hub.getdbt.com/dbt-labs/dbt_utils/latest/) is crossing the major version boundary. From [last month’s blog post](https://www.getdbt.com/blog/announcing-dbt-v1.3-and-utils/):
@@ -5,10 +5,6 @@ id: "upgrading-to-v1.4"
displayed_sidebar: "docs"
---

import UpgradeMove from '/snippets/_upgrade-move.md';

<UpgradeMove />

### Resources

- [Changelog](https://github.com/dbt-labs/dbt-core/blob/1.4.latest/CHANGELOG.md)
@@ -5,10 +5,6 @@ id: "upgrading-to-v1.3"
displayed_sidebar: "docs"
---

import UpgradeMove from '/snippets/_upgrade-move.md';

<UpgradeMove />

### Resources

- [Changelog](https://github.com/dbt-labs/dbt-core/blob/1.3.latest/CHANGELOG.md)
@@ -5,10 +5,6 @@ id: "upgrading-to-v1.2"
displayed_sidebar: "docs"
---

import UpgradeMove from '/snippets/_upgrade-move.md';

<UpgradeMove />

### Resources

- [Changelog](https://github.com/dbt-labs/dbt-core/blob/1.2.latest/CHANGELOG.md)
@@ -5,10 +5,6 @@ id: "upgrading-to-v1.1"
displayed_sidebar: "docs"
---

import UpgradeMove from '/snippets/_upgrade-move.md';

<UpgradeMove />

### Resources

- [Changelog](https://github.com/dbt-labs/dbt-core/blob/1.1.latest/CHANGELOG.md)