remove width attribute and add max width
mirnawong1 committed Jan 15, 2024
1 parent 4e5bac4 commit 788976d
Showing 93 changed files with 333 additions and 333 deletions.
10 changes: 5 additions & 5 deletions website/blog/2022-11-22-move-spreadsheets-to-your-dwh.md
@@ -70,9 +70,9 @@ An obvious choice if you have data to load into your warehouse would be your exi

[Fivetran’s browser uploader](https://fivetran.com/docs/files/browser-upload) does exactly what it says on the tin: you upload a file to their web portal and it creates a table containing that data in a predefined schema in your warehouse. With a visual interface to modify data types, it’s easy for anyone to use. And with an account type with the permission to only upload files, you don’t need to worry about your stakeholders accidentally breaking anything either.

-<Lightbox src="/img/blog/2022-11-22-move-spreadsheets-to-your-dwh/fivetran-uploader-1.png" width="65%" title="Converting data types from text to dates and numbers is easy in the visual editor" />
+<Lightbox src="/img/blog/2022-11-22-move-spreadsheets-to-your-dwh/fivetran-uploader-1.png" title="Converting data types from text to dates and numbers is easy in the visual editor" />

-<Lightbox src="/img/blog/2022-11-22-move-spreadsheets-to-your-dwh/fivetran-uploader-2.png" width="65%" title="Picking the matching date format from a list of options to convert them to a standardized format" />
+<Lightbox src="/img/blog/2022-11-22-move-spreadsheets-to-your-dwh/fivetran-uploader-2.png" title="Picking the matching date format from a list of options to convert them to a standardized format" />

A nice benefit of the uploader is support for updating data in the table over time. If a file with the same name and same columns is uploaded, any new records will be added, and existing records (per the <Term id="primary-key"/>) will be updated.

@@ -100,7 +100,7 @@ The main benefit of connecting to Google Sheets instead of a static spreadsheet

Instead of syncing all cells in a sheet, you create a [named range](https://fivetran.com/docs/files/google-sheets/google-sheets-setup-guide) and connect Fivetran to that range. Each Fivetran connector can only read a single range—if you have multiple tabs then you’ll need to create multiple connectors, each with its own schema and table in the target warehouse. When a sync takes place, it will [truncate](https://docs.getdbt.com/terms/ddl#truncate) and reload the table from scratch as there is no primary key to use for matching.

-<Lightbox src="/img/blog/2022-11-22-move-spreadsheets-to-your-dwh/google-sheets-uploader.png" width="65%" title="Creating a named range in Google Sheets to sync via the Fivetran Google Sheets Connector" />
+<Lightbox src="/img/blog/2022-11-22-move-spreadsheets-to-your-dwh/google-sheets-uploader.png" title="Creating a named range in Google Sheets to sync via the Fivetran Google Sheets Connector" />

Beware of inconsistent data types though—if someone types text into a column that was originally numeric, Fivetran will automatically convert the column to a string type which might cause issues in your downstream transformations. [The recommended workaround](https://fivetran.com/docs/files/google-sheets#typetransformationsandmapping) is to explicitly cast your types in [staging models](https://docs.getdbt.com/best-practices/how-we-structure/2-staging) to ensure that any undesirable records are converted to null.
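
A sketch of that workaround (the model, source, and column names are hypothetical, and `try_cast` assumes a warehouse such as Snowflake that supports it):

```sql
-- models/staging/stg_google_sheets__orders.sql (hypothetical names)
-- Cast Fivetran's auto-inferred string columns back to the types you
-- expect; values that can't be cast become null instead of breaking
-- downstream transformations.
select
    try_cast(order_id as integer)  as order_id,
    try_cast(amount as numeric)    as amount,
    try_cast(ordered_at as date)   as ordered_at
from {{ source('google_sheets', 'orders') }}
```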

@@ -119,7 +119,7 @@ Beware of inconsistent data types though—if someone types text into a column t

I’m a big fan of [Fivetran’s Google Drive connector](https://fivetran.com/docs/files/google-drive); in the past I’ve used it to streamline a lot of weekly reporting. It allows stakeholders to use a tool they’re already familiar with (Google Drive) instead of dealing with another set of credentials. Every file uploaded into a specific folder on Drive (or [Box, or consumer Dropbox](https://fivetran.com/docs/files/magic-folder)) turns into a table in your warehouse.

-<Lightbox src="/img/blog/2022-11-22-move-spreadsheets-to-your-dwh/google-drive-uploader.png" width="65%" title="Fivetran will add each of these csv files to a single schema in your warehouse, making it ideal for regular uploads" />
+<Lightbox src="/img/blog/2022-11-22-move-spreadsheets-to-your-dwh/google-drive-uploader.png" title="Fivetran will add each of these csv files to a single schema in your warehouse, making it ideal for regular uploads" />

Like the Google Sheets connector, the data types of the columns are determined automatically. Dates, in particular, are finicky though—if you can control your input data, try to get it into [ISO 8601 format](https://xkcd.com/1179/) to minimize the amount of cleanup you have to do on the other side.
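
When you can't control the inputs, a hedged staging-layer fix (hypothetical names; `try_to_date` is Snowflake's format-aware safe parser) looks like:

```sql
-- Parse US-style date strings ("03/14/2022") uploaded via Drive;
-- anything that doesn't match the format comes through as null.
select
    try_to_date(report_date, 'MM/DD/YYYY') as report_date
from {{ source('google_drive', 'weekly_report') }}
```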

@@ -174,7 +174,7 @@ Each of the major data warehouses also has native integrations to import spreads

Snowflake’s options are robust and user-friendly, offering both a [web-based loader](https://docs.snowflake.com/en/user-guide/data-load-web-ui.html) as well as [a bulk importer](https://docs.snowflake.com/en/user-guide/data-load-bulk.html). The web loader is suitable for small to medium files (up to 50MB) and can be used for specific files, all files in a folder, or files in a folder that match a given pattern. It’s also the most provider-agnostic, with support for Amazon S3, Google Cloud Storage, Azure and the local file system.

-<Lightbox src="/img/blog/2022-11-22-move-spreadsheets-to-your-dwh/snowflake-uploader.png" width="65%" title="Snowflake’s web-based Load Data Wizard via the Snowflake Blog https://www.snowflake.com/blog/tech-tip-getting-data-snowflake/" />
+<Lightbox src="/img/blog/2022-11-22-move-spreadsheets-to-your-dwh/snowflake-uploader.png" title="Snowflake’s web-based Load Data Wizard via the Snowflake Blog https://www.snowflake.com/blog/tech-tip-getting-data-snowflake/" />
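
For files beyond the web loader's limits, the bulk path is a stage plus `COPY INTO`. A minimal sketch, with the file format, stage, and table names all placeholders:

```sql
create or replace file format csv_format
    type = 'csv'
    skip_header = 1;

create or replace stage spreadsheet_stage
    file_format = csv_format;

-- Upload the local file via SnowSQL (PUT gzips it by default):
-- PUT file:///tmp/budget_2022.csv @spreadsheet_stage;

copy into raw.finance.budget_2022
    from @spreadsheet_stage/budget_2022.csv.gz
    on_error = 'abort_statement';
```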

### BigQuery

4 changes: 2 additions & 2 deletions website/blog/2022-11-30-dbt-project-evaluator.md
@@ -20,7 +20,7 @@ If you attended [Coalesce 2022](https://www.youtube.com/watch?v=smbRwmcM1Ok), yo

Don’t believe me??? Here’s photographic proof.

-<Lightbox src="/img/blog/2022-11-30-dbt-project-evaluator/proserv_aliens.png" width="65%" title="Rare photographic evidence of the dbt Labs Professional Services team" />
+<Lightbox src="/img/blog/2022-11-30-dbt-project-evaluator/proserv_aliens.png" title="Rare photographic evidence of the dbt Labs Professional Services team" />

Since the inception of dbt Labs, our team has been embedded with a variety of different data teams — from an over-stretched-data-team-of-one to a data-mesh-multiverse.

@@ -120,4 +120,4 @@ If something isn’t working quite right or you have ideas for future functional

Together, we can ensure that dbt projects across the galaxy are set up for success as they grow to infinity and beyond.

-<Lightbox src="/img/blog/2022-11-30-dbt-project-evaluator/grace_at_coalesce.png" width="65%" title="Alien Graceline beams back to dbt Labs’ mission control center…for now" />
+<Lightbox src="/img/blog/2022-11-30-dbt-project-evaluator/grace_at_coalesce.png" title="Alien Graceline beams back to dbt Labs’ mission control center…for now" />
4 changes: 2 additions & 2 deletions website/blog/2023-01-17-grouping-data-tests.md
@@ -43,11 +43,11 @@ So what do we discover when we validate our data by group?

Testing for monotonicity, we find many poorly behaved turnstiles. Unlike the well-behaved dark blue line, other turnstiles seem to _decrement_ versus _increment_ with each rotation while still others cyclically increase and plummet to zero – perhaps due to maintenance events, replacements, or glitches in communication with the central server.

-<Lightbox src="/img/blog/2023-01-17-grouping-data-tests/1-monotonicity.png" width="65%" title="Cumulative Entries by Turnstile for 3 Turnstiles" alt="A chart with three lines: one in dark blue trending up and to the right, one in light blue trending down and to the right, and one in very light blue which tracks up and then suddenly drops, repeating in a sawtooth pattern."/>
+<Lightbox src="/img/blog/2023-01-17-grouping-data-tests/1-monotonicity.png" title="Cumulative Entries by Turnstile for 3 Turnstiles" alt="A chart with three lines: one in dark blue trending up and to the right, one in light blue trending down and to the right, and one in very light blue which tracks up and then suddenly drops, repeating in a sawtooth pattern."/>
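
This kind of grouped monotonicity check drops neatly into a singular dbt test: select the offending rows, and any result fails the test. A sketch under assumed model and column names:

```sql
-- tests/assert_entries_increase_by_turnstile.sql (hypothetical names)
with ordered as (
    select
        turnstile_id,
        observed_at,
        cumulative_entries,
        lag(cumulative_entries) over (
            partition by turnstile_id
            order by observed_at
        ) as previous_entries
    from {{ ref('stg_turnstile_entries') }}
)

-- Rows where the counter went backwards within a turnstile
select *
from ordered
where cumulative_entries < previous_entries
```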

Similarly, while no expected timestamp is missing from the data altogether, a more rigorous test of timestamps _by turnstile_ reveals roughly 50-100 missing observations for any given period.

-<Lightbox src="/img/blog/2023-01-17-grouping-data-tests/2-missing.png" width="65%" title="Number of Missing Turnstiles by Recording Time Period" alt="A dot plot showing 50-100 turnstiles are missing entries for each period between January and May, the range shown on the x axis."/>
+<Lightbox src="/img/blog/2023-01-17-grouping-data-tests/2-missing.png" title="Number of Missing Turnstiles by Recording Time Period" alt="A dot plot showing 50-100 turnstiles are missing entries for each period between January and May, the range shown on the x axis."/>

_Check out this [GitHub gist](https://gist.github.com/emilyriederer/4dcc6a05ea53c82db175e15f698a1fb6) to replicate these views locally._

@@ -125,7 +125,7 @@ In both cases, the operation can be done on a single partition at a time so it r

On a 192 GB partition, here is how the different methods compare:

-<Lightbox src="/img/blog/2023-02-01-ingestion-time-partitioning-bigquery/merge-vs-select.png" width="65%" />
+<Lightbox src="/img/blog/2023-02-01-ingestion-time-partitioning-bigquery/merge-vs-select.png" />

Also, the `SELECT` statement consumed more than 10 hours of slot time, while the `MERGE` statement took days of slot time.

16 changes: 8 additions & 8 deletions website/blog/2023-03-23-audit-helper.md
@@ -19,7 +19,7 @@ It is common for analytics engineers (AE) and data analysts to have to refactor

Not only is that approach time-consuming, but it is also prone to naive assumptions that values match based on aggregate measures (such as counts or sums). To provide a better, more accurate approach to auditing, dbt Labs has created the `audit_helper` package. `audit_helper` is a package for dbt whose main purpose is to audit data by comparing two tables (the original one versus a refactored model). It uses a simple and intuitive query structure that enables quickly comparing tables based on the column values, row counts, and even column types (for example, to make sure that a given column is numeric in both your table and the original one). Figure 1 graphically displays the workflow and where `audit_helper` is positioned in the refactoring process.

-<Lightbox src="/img/blog/2023-03-23-audit-helper/image1.png" width="65%" title="Figure 1 — Workflow of auditing process using audit_helper" />
+<Lightbox src="/img/blog/2023-03-23-audit-helper/image1.png" title="Figure 1 — Workflow of auditing process using audit_helper" />

Now that it is clear where the `audit_helper` package is positioned in the refactoring process, it is important to highlight the benefits of using `audit_helper` (and ultimately, of auditing refactored models). Among the benefits, we can mention:
- **Quality assurance**: Assert that a refactored model is reaching the same output as the original model that is being refactored.
@@ -57,12 +57,12 @@ According to the `audit_helper` package documentation, this macro comes in handy
### How it works
When you run the dbt audit model, it compares all columns, row by row. For a row to count as a match, every column value in that row from one source must exactly match the corresponding row from the other source, as illustrated in the example in Figure 2 below:

-<Lightbox src="/img/blog/2023-03-23-audit-helper/image5.png" width="65%" title="Figure 2 — Workflow of auditing rows (compare_queries) using audit_helper" />
+<Lightbox src="/img/blog/2023-03-23-audit-helper/image5.png" title="Figure 2 — Workflow of auditing rows (compare_queries) using audit_helper" />


As shown in the example, the models are compared row by row, and in this case, all rows in both models are equivalent, so the result should be 100%. Figure 3 below depicts a case in which two of the three columns are equal and only the last column of row 1 has divergent values. In this case, despite the fact that most of row 1 is identical, that row will not be counted towards the final result. In this example, only rows 2 and 3 are valid, yielding a 66.6% match across the analyzed rows.

-<Lightbox src="/img/blog/2023-03-23-audit-helper/image4.png" width="65%" title="Figure 3 — Example of different values" />
+<Lightbox src="/img/blog/2023-03-23-audit-helper/image4.png" title="Figure 3 — Example of different values" />

As previously stated, for the match to be valid, all column values of a model’s row must be equal to the other model. This is why we sometimes need to exclude columns from the comparison (such as date columns, which can have a time zone difference from the original model to the refactored — we will discuss tips like these below).
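
For orientation, a minimal `compare_queries` model along the lines of the package README (the table and column names here are placeholders):

```sql
-- models/audit/compare_orders.sql
{% set old_query %}
    select order_id, customer_id, amount
    from old_database.old_schema.orders  -- legacy table, full warehouse path
{% endset %}

{% set new_query %}
    select order_id, customer_id, amount
    from {{ ref('orders') }}             -- refactored dbt model
{% endset %}

{{ audit_helper.compare_queries(
    a_query=old_query,
    b_query=new_query,
    summarize=true
) }}
```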

@@ -103,12 +103,12 @@ Let’s understand the arguments used in the `compare_queries` macro:
- `summarize` (optional): This argument lets you switch between a summary and a detailed (verbose) view of the compared data. It accepts true or false values (its default is true).

3. Replace the sources from the example with your own
-<Lightbox src="/img/blog/2023-03-23-audit-helper/image8.png" width="65%" title="Figure 4 — Replace sources path" />
+<Lightbox src="/img/blog/2023-03-23-audit-helper/image8.png" title="Figure 4 — Replace sources path" />

As illustrated in Figure 4, using the `ref` statements allows you to easily refer to your development model, and using the full <Term id="data-warehouse" /> path makes it easy to refer to the original table (which will be useful when you are refactoring a SQL Server Stored Procedure or Alteryx Workflow that is already being materialized in the data warehouse).

4. Specify your comparison columns
-<Lightbox src="/img/blog/2023-03-23-audit-helper/image6.png" width="65%" title="Figure 5 — Delete and write columns name" />
+<Lightbox src="/img/blog/2023-03-23-audit-helper/image6.png" title="Figure 5 — Delete and write columns name" />

Delete the example columns and replace them with the columns of your models, exactly as they are written in each model. You should rename/alias the columns to match and ensure they are in the same order within the `select` clauses.

@@ -129,7 +129,7 @@ Let’s understand the arguments used in the `compare_queries` macro:
The output will be similar to the one shown in Figure 6 below:

-<Lightbox src="/img/blog/2023-03-23-audit-helper/image2.png" width="65%" title="Figure 6 — Output example of compare queries audit model" />
+<Lightbox src="/img/blog/2023-03-23-audit-helper/image2.png" title="Figure 6 — Output example of compare queries audit model" />
<br />
The output is presented in table format, with each column explained below:
<br />
@@ -155,7 +155,7 @@ While we can surely rely on that overview to validate the final refactored model

A really useful way to identify which specific columns are driving down the match percentage between tables is the `compare_column_values` macro, which allows us to audit column values. This macro requires a <Term id="primary-key" /> column to be set, so it can be used as an anchor to compare entries between the refactored dbt model column and the legacy table column. Figure 7 illustrates how the `compare_column_values` macro works.

-<Lightbox src="/img/blog/2023-03-23-audit-helper/image7.png" width="65%" title="Figure 7 — Workflow of auditing rows (compare_column_values) using audit_helper" />
+<Lightbox src="/img/blog/2023-03-23-audit-helper/image7.png" title="Figure 7 — Workflow of auditing rows (compare_column_values) using audit_helper" />


The macro’s output summarizes the status of column compatibility, breaking it down into different categories: perfect match, both are null, values do not match, value is null in A only, value is null in B only, missing from A, and missing from B. This level of detail makes it simpler for the AE or data analyst to figure out what is causing incompatibility issues between the models. While refactoring a model, it is common for the keys used to join models to be inconsistent, introducing unwanted null values into the final model; that alone would make the row-level audit query fail without giving much more detail.
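
A sketch of the invocation, following the same pattern as the `compare_queries` example above (the primary key and column names are placeholders):

```sql
{% set old_query %}
    select * from old_database.old_schema.orders
{% endset %}

{% set new_query %}
    select * from {{ ref('orders') }}
{% endset %}

{{ audit_helper.compare_column_values(
    a_query=old_query,
    b_query=new_query,
    primary_key="order_id",
    column_to_compare="status"
) }}
```
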
@@ -224,7 +224,7 @@ Also, we can see that the example code includes a table printing option enabled

But unlike with the `compare_queries` macro, if you have kept the printing function enabled, you should expect a table to be printed in the command line when you run the model, as shown in Figure 8. Otherwise, it will be materialized in your data warehouse like this:

-<Lightbox src="/img/blog/2023-03-23-audit-helper/image3.png" width="65%" title="Figure 8 — Example of table printed in command line" />
+<Lightbox src="/img/blog/2023-03-23-audit-helper/image3.png" title="Figure 8 — Example of table printed in command line" />

The `compare_column_values` macro separates column auditing results into seven different labels:
- **Perfect match**: count of rows (and relative percentage) where the column values compared between both tables are equal and not null;
@@ -39,7 +39,7 @@ Dimensional modeling is a technique introduced by Ralph Kimball in 1996 with his

The goal of dimensional modeling is to take raw data and transform it into Fact and Dimension tables that represent the business.

-<Lightbox src="/img/blog/2023-04-18-building-a-kimball-dimensional-model-with-dbt/3nf-to-dimensional-model.png" width="65%" title="Raw 3NF data to dimensional model"/>
+<Lightbox src="/img/blog/2023-04-18-building-a-kimball-dimensional-model-with-dbt/3nf-to-dimensional-model.png" title="Raw 3NF data to dimensional model"/>

The benefits of dimensional modeling are:

@@ -185,7 +185,7 @@ Now that you’ve set up the dbt project, database, and have taken a peek at the

Identifying the business process is done in collaboration with the business user. The business user has context around the business objectives and business processes, and can provide you with that information.

-<Lightbox src="/img/blog/2023-04-18-building-a-kimball-dimensional-model-with-dbt/conversation.png" width="65%" title="Conversation between business user and analytics engineer"/>
+<Lightbox src="/img/blog/2023-04-18-building-a-kimball-dimensional-model-with-dbt/conversation.png" title="Conversation between business user and analytics engineer"/>

Upon speaking with the CEO of AdventureWorks, you learn the following information:

