From 4026bb6cbca9f945aff7e3bf6998022ee326adfc Mon Sep 17 00:00:00 2001
From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com>
Date: Mon, 3 Oct 2022 11:45:28 -0400
Subject: [PATCH 1/3] Changes to pandas to match branding

---
 .../building-models/python-models.md          | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/website/docs/docs/building-a-dbt-project/building-models/python-models.md b/website/docs/docs/building-a-dbt-project/building-models/python-models.md
index 1499af9a9b3..0031101d34f 100644
--- a/website/docs/docs/building-a-dbt-project/building-models/python-models.md
+++ b/website/docs/docs/building-a-dbt-project/building-models/python-models.md
@@ -522,7 +522,7 @@ def model(dbt, session):
 
 #### Code reuse
 Currently, Python functions defined in one dbt model cannot be imported and reused in other models. This is something we'd like dbt to support. There are two patterns we're considering:
-1. Creating and registering **"named" UDFs**. This process is different across data platforms and has some performance limitations. (Snowpark does support ["vectorized" UDFs](https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-batch.html): Pandas-like functions that can be executed in parallel.)
+1. Creating and registering **"named" UDFs**. This process is different across data platforms and has some performance limitations. (Snowpark does support ["vectorized" UDFs](https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-batch.html): pandas-like functions that can be executed in parallel.)
 2. Using **private Python packages**. In addition to importing reusable functions from public PyPI packages, many data platforms support uploading custom Python assets and registering them as packages. The upload process looks different across platforms, but your code’s actual `import` looks the same.
 
 :::note ❓ Our questions
@@ -544,16 +544,16 @@ That's about where the agreement ends. There are numerous frameworks with their
 
 When developing a Python model, you will find yourself asking these questions:
 
-**Why Pandas?** It's the most common API for DataFrames. It makes it easy to explore sampled data and develop transformations locally. You can “promote” your code as-is into dbt models and run it in production for small datasets.
+**Why pandas?** It's the most common API for DataFrames. It makes it easy to explore sampled data and develop transformations locally. You can “promote” your code as-is into dbt models and run it in production for small datasets.
 
-**Why _not_ Pandas?** Performance. Pandas runs "single-node" transformations, which cannot benefit from the parallelism and distributed computing offered by modern data warehouses. This quickly becomes a problem as you operate on larger datasets. Some data platforms support optimizations for code written using Pandas' DataFrame API, preventing the need for major refactors. For example, ["Pandas on PySpark"](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_ps.html) offers support for 95% of Pandas functionality, using the same API while still leveraging parallel processing.
+**Why _not_ pandas?** Performance. pandas runs "single-node" transformations, which cannot benefit from the parallelism and distributed computing offered by modern data warehouses. This quickly becomes a problem as you operate on larger datasets. Some data platforms support optimizations for code written using pandas' DataFrame API, preventing the need for major refactors. For example, ["pandas on PySpark"](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_ps.html) offers support for 95% of pandas functionality, using the same API while still leveraging parallel processing.
 
 :::note ❓ Our questions
-- When developing a new dbt Python model, should we recommend Pandas-style syntax for rapid iteration and then refactor?
+- When developing a new dbt Python model, should we recommend pandas-style syntax for rapid iteration and then refactor?
 - Which open source libraries provide compelling abstractions across different data engines and vendor-specific APIs?
 - Should dbt attempt to play a longer-term role in standardizing across them?
 
-💬 Discussion: ["Python models: the Pandas problem (and a possible solution)"](https://github.com/dbt-labs/dbt-core/discussions/5738)
+💬 Discussion: ["Python models: the pandas problem (and a possible solution)"](https://github.com/dbt-labs/dbt-core/discussions/5738)
 :::
 
 ### Limitations
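To illustrate the "pandas on PySpark" claim the second hunk above rewords: below is a minimal sketch of a dbt Python model whose transformation body is ordinary pandas-style code that nonetheless runs as distributed Spark jobs. It assumes a Spark-based adapter (such as dbt-spark or dbt-databricks), Spark 3.2+, and a hypothetical upstream model `orders` with `order_date` (timestamp) and `amount` columns; none of those names come from the patch.

```python
# Hedged sketch only: `orders`, `order_date`, and `amount` are invented names.


def model(dbt, session):
    dbt.config(materialized="table")

    # On Spark adapters, dbt.ref() returns a PySpark DataFrame; pandas_api()
    # wraps it as a pandas-on-Spark (pyspark.pandas) DataFrame without
    # collecting rows to the driver.
    orders = dbt.ref("orders").pandas_api()

    # Ordinary pandas-style transformations, executed in parallel on Spark.
    orders["order_month"] = orders["order_date"].dt.month
    summary = orders.groupby("order_month", as_index=False)["amount"].sum()

    # Hand a PySpark DataFrame back to dbt for materialization.
    return summary.to_spark()
```

Only the wrapper at the boundaries differs from a single-node pandas version; the groupby itself is unchanged, which is the "promote your code as-is" point made in the hunk.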
For example, ["pandas on PySpark"](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_ps.html) offers support for 95% of pandas functionality, using the same API while still leveraging parallel processing. :::note ❓ Our questions -- When developing a new dbt Python model, should we recommend Pandas-style syntax for rapid iteration and then refactor? +- When developing a new dbt Python model, should we recommend pandas-style syntax for rapid iteration and then refactor? - Which open source libraries provide compelling abstractions across different data engines and vendor-specific APIs? - Should dbt attempt to play a longer-term role in standardizing across them? -💬 Discussion: ["Python models: the Pandas problem (and a possible solution)"](https://github.com/dbt-labs/dbt-core/discussions/5738) +💬 Discussion: ["Python models: the pandas problem (and a possible solution)"](https://github.com/dbt-labs/dbt-core/discussions/5738) ::: ### Limitations From 0cecfb3cdaa875b7495bad0f3d405d7ce1255649 Mon Sep 17 00:00:00 2001 From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com> Date: Mon, 3 Oct 2022 14:49:20 -0400 Subject: [PATCH 2/3] Update website/docs/docs/building-a-dbt-project/building-models/python-models.md Co-authored-by: mirnawong1 <89008547+mirnawong1@users.noreply.github.com> --- .../building-a-dbt-project/building-models/python-models.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/docs/docs/building-a-dbt-project/building-models/python-models.md b/website/docs/docs/building-a-dbt-project/building-models/python-models.md index 0031101d34f..15575f5db79 100644 --- a/website/docs/docs/building-a-dbt-project/building-models/python-models.md +++ b/website/docs/docs/building-a-dbt-project/building-models/python-models.md @@ -521,7 +521,7 @@ def model(dbt, session): #### Code reuse -Currently, Python functions defined in one dbt model cannot be imported and reused in other models. This is something we'd like dbt to support. There are two patterns we're considering: +Currently, you cannot import or reuse Python functions defined in one dbt model, in other models. This is something we'd like dbt to support. There are two patterns we're considering: 1. Creating and registering **"named" UDFs**. This process is different across data platforms and has some performance limitations. (Snowpark does support ["vectorized" UDFs](https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-batch.html): pandas-like functions that can be executed in parallel.) 2. Using **private Python packages**. In addition to importing reusable functions from public PyPI packages, many data platforms support uploading custom Python assets and registering them as packages. The upload process looks different across platforms, but your code’s actual `import` looks the same. 
From 6ddc61b79088bc2461cc461a5dd59de1f9ec4317 Mon Sep 17 00:00:00 2001
From: Matt Shaver <60105315+matthewshaver@users.noreply.github.com>
Date: Mon, 3 Oct 2022 14:49:29 -0400
Subject: [PATCH 3/3] Update
 website/docs/docs/building-a-dbt-project/building-models/python-models.md

Co-authored-by: mirnawong1 <89008547+mirnawong1@users.noreply.github.com>
---
 .../building-a-dbt-project/building-models/python-models.md  | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/website/docs/docs/building-a-dbt-project/building-models/python-models.md b/website/docs/docs/building-a-dbt-project/building-models/python-models.md
index 15575f5db79..56d9ae09413 100644
--- a/website/docs/docs/building-a-dbt-project/building-models/python-models.md
+++ b/website/docs/docs/building-a-dbt-project/building-models/python-models.md
@@ -522,7 +522,7 @@ def model(dbt, session):
 
 #### Code reuse
 Currently, you cannot import or reuse Python functions defined in one dbt model, in other models. This is something we'd like dbt to support. There are two patterns we're considering:
-1. Creating and registering **"named" UDFs**. This process is different across data platforms and has some performance limitations. (Snowpark does support ["vectorized" UDFs](https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-batch.html): pandas-like functions that can be executed in parallel.)
+1. Creating and registering **"named" UDFs**. This process is different across data platforms and has some performance limitations. (Snowpark does support ["vectorized" UDFs](https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-batch.html): pandas-like functions that you can execute in parallel.)
 2. Using **private Python packages**. In addition to importing reusable functions from public PyPI packages, many data platforms support uploading custom Python assets and registering them as packages. The upload process looks different across platforms, but your code’s actual `import` looks the same.
 
 :::note ❓ Our questions
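Pattern 2 (packages) already works today for public PyPI packages, and a platform-registered private package would be imported the same way. A minimal sketch, assuming dbt-snowflake and a hypothetical `stg_orders` model with an `ORDER_DATE` column:

```python
import holidays  # public PyPI package; a private package imports the same way


def is_holiday(order_date):
    # Today, a reusable helper like this must live in the model file itself.
    # With private packages, it could be imported from shared code instead.
    return order_date in holidays.US()


def model(dbt, session):
    # Ask the platform to provision the third-party package for this model.
    dbt.config(materialized="table", packages=["holidays"])

    # Small dataset, so pulling into pandas is acceptable here.
    df = dbt.ref("stg_orders").to_pandas()
    df["IS_HOLIDAY"] = df["ORDER_DATE"].apply(is_holiday)
    return df
```

The `dbt.config(packages=...)` call is what tells the platform to make the package available at runtime; the `import` line at the top stays identical whether the package is public or private.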