From 2ae14649b6c76805c1dad4064593eeec0f78af62 Mon Sep 17 00:00:00 2001
From: Zi Wang
Date: Wed, 27 Sep 2023 21:45:28 -0700
Subject: [PATCH 1/3] modifying dbt-bigquery python model submission content
 to be easier and more readable

---
 website/docs/docs/build/python-models.md | 50 ++++++------------------
 1 file changed, 12 insertions(+), 38 deletions(-)

diff --git a/website/docs/docs/build/python-models.md b/website/docs/docs/build/python-models.md
index 12825648501..c489b4e3333 100644
--- a/website/docs/docs/build/python-models.md
+++ b/website/docs/docs/build/python-models.md
@@ -643,66 +643,40 @@ If not configured, `dbt-spark` will use the built-in defaults: the all-purpose c
+
-The `dbt-bigquery` adapter uses a service called Dataproc to submit your Python models as PySpark jobs. That Python/PySpark code will read from your tables and views in BigQuery, perform all computation in Dataproc, and write the final result back to BigQuery.
+**Submission methods:** The `dbt-bigquery` adapter uses [Dataproc](https://cloud.google.com/dataproc) to submit your Python models as PySpark jobs. Dataproc supports two submission methods: `cluster` and `serverless`.
 
-**Submission methods.** Dataproc supports two submission methods: `serverless` and `cluster`. Dataproc Serverless does not require a ready cluster, which saves on hassle and cost—but it is slower to start up, and much more limited in terms of available configuration. For example, Dataproc Serverless supports only a small set of Python packages, though it does include `pandas`, `numpy`, and `scikit-learn`. (See the full list [here](https://cloud.google.com/dataproc-serverless/docs/guides/custom-containers#example_custom_container_image_build), under "The following packages are installed in the default image"). Whereas, by creating a Dataproc Cluster in advance, you can fine-tune the cluster's configuration, install any PyPI packages you want, and benefit from faster, more responsive runtimes.
-
-Use the `cluster` submission method with dedicated Dataproc clusters you or your organization manage. Use the `serverless` submission method to avoid managing a Spark cluster. The latter may be quicker for getting started, but both are valid for production.
-
-**Additional setup:**
-- Create or use an existing [Cloud Storage bucket](https://cloud.google.com/storage/docs/creating-buckets)
-- Enable Dataproc APIs for your project + region
-- If using the `cluster` submission method: Create or use an existing [Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) with the [Spark BigQuery connector initialization action](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/connectors#bigquery-connectors). (Google recommends copying the action into your own Cloud Storage bucket, rather than using the example version shown in the screenshot)
+- **Cluster submission method:** Create or use an existing [Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster). See the example below, which sets the configuration in the model's `.py` file using the `dbt.config()` method.
 
-The following configurations are needed to run Python models on Dataproc. You can add these to your [BigQuery profile](/docs/core/connect-data-platform/bigquery-setup#running-python-models-on-dataproc) or configure them on specific Python models:
-- `gcs_bucket`: Storage bucket to which dbt will upload your model's compiled PySpark code.
-- `dataproc_region`: GCP region in which you have enabled Dataproc (for example `us-central1`).
-- `dataproc_cluster_name`: Name of Dataproc cluster to use for running Python model (executing PySpark job). Only required if `submission_method: cluster`.
 
 ```python
 def model(dbt, session):
   dbt.config(
     submission_method="cluster",
-    dataproc_cluster_name="my-favorite-cluster"
+    dataproc_cluster_name="my-favorite-cluster",
+    dataproc_region="us-central1",
+    gcs_bucket="my-favorite-bucket"
   )
   ...
 ```
+
+- **Serverless submission method:** Dataproc Serverless does not require a ready cluster, which makes it simpler to set up, but jobs can be slower to start. See the example below, which sets the configuration in a dedicated `.yml` file within the `models/` directory. (A profile-level alternative is sketched after these examples.)
+
 ```yml
 version: 2
 models:
   - name: my_python_model
     config:
       submission_method: serverless
+      dataproc_region: us-central1
+      gcs_bucket: my-favorite-bucket
 ```
-
-Python models running on Dataproc Serverless can be further configured in your [BigQuery profile](/docs/core/connect-data-platform/bigquery-setup#running-python-models-on-dataproc).
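+
+These Dataproc settings can also live in your [BigQuery profile](/docs/core/connect-data-platform/bigquery-setup#running-python-models-on-dataproc) instead of being repeated in every model. A minimal sketch of the profile-level alternative, where the profile, project, and dataset names are placeholders:
+
+```yml
+my-profile:
+  target: dev
+  outputs:
+    dev:
+      type: bigquery
+      method: oauth
+      project: my-gcp-project
+      dataset: my_dataset
+      threads: 1
+      # Dataproc settings shared by all Python models in this project
+      gcs_bucket: my-favorite-bucket
+      dataproc_region: us-central1
+      dataproc_cluster_name: my-favorite-cluster # only needed for `submission_method: cluster`
+```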
-
-Any user or service account that runs dbt Python models will need the following permissions(in addition to the required BigQuery permissions) ([docs](https://cloud.google.com/dataproc/docs/concepts/iam/iam)):
-```
-dataproc.batches.create
-dataproc.clusters.use
-dataproc.jobs.create
-dataproc.jobs.get
-dataproc.operations.get
-dataproc.operations.list
-storage.buckets.get
-storage.objects.create
-storage.objects.delete
-```
-
-**Installing packages:** If you are using a Dataproc Cluster (as opposed to Dataproc Serverless), you can add third-party packages while creating the cluster.
-
-Google recommends installing Python packages on Dataproc clusters via initialization actions:
-- [How initialization actions are used](https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/README.md#how-initialization-actions-are-used)
-- [Actions for installing via `pip` or `conda`](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/python)
-
-You can also install packages at cluster creation time by [defining cluster properties](https://cloud.google.com/dataproc/docs/tutorials/python-configuration#image_version_20): `dataproc:pip.packages` or `dataproc:conda.packages`.
-
-
+**Installing packages:** If you are using a Dataproc cluster (as opposed to Dataproc Serverless), you can add third-party packages through [initialization actions](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/python) when you create the cluster; the cluster also needs the [Spark BigQuery connector initialization action](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/connectors#bigquery-connectors). If you are using Dataproc Serverless, you can build a [custom container image](https://cloud.google.com/dataproc-serverless/docs/guides/custom-containers#python_packages) that includes the packages you need.
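+
+As a sketch of what this enables: the default Dataproc Serverless image already ships with `pandas`, `numpy`, and `scikit-learn`, so a model can import those directly (here, `my_upstream_model` is a hypothetical upstream model):
+
+```python
+import pandas as pd
+
+def model(dbt, session):
+    dbt.config(submission_method="serverless")
+    # On Dataproc, dbt.ref() returns a PySpark DataFrame; convert it to
+    # pandas to work with the preinstalled packages
+    df = dbt.ref("my_upstream_model").toPandas()
+    df["loaded_at"] = pd.Timestamp.now(tz="UTC")
+    # A pandas DataFrame is a valid return value; dbt writes it back to BigQuery
+    return df
+```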
 
 **Docs:**
 - [Dataproc overview](https://cloud.google.com/dataproc/docs/concepts/overview)

From b5c307f306ae91833cc721f85fe066b9eecd8965 Mon Sep 17 00:00:00 2001
From: Zi Wang
Date: Wed, 27 Sep 2023 21:48:44 -0700
Subject: [PATCH 2/3] adding additional setup

---
 website/docs/docs/build/python-models.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/website/docs/docs/build/python-models.md b/website/docs/docs/build/python-models.md
index c489b4e3333..a2bfcd9dd00 100644
--- a/website/docs/docs/build/python-models.md
+++ b/website/docs/docs/build/python-models.md
@@ -678,6 +678,8 @@
 **Installing packages:** If you are using a Dataproc cluster (as opposed to Dataproc Serverless), you can add third-party packages through [initialization actions](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/python) when you create the cluster; the cluster also needs the [Spark BigQuery connector initialization action](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/connectors#bigquery-connectors). If you are using Dataproc Serverless, you can build a [custom container image](https://cloud.google.com/dataproc-serverless/docs/guides/custom-containers#python_packages) that includes the packages you need.
 
+**Additional setup:** Any user or service account that runs dbt Python models needs adequate [IAM permissions](https://cloud.google.com/dataproc/docs/concepts/iam/iam), in addition to the required BigQuery permissions, to trigger jobs on a Dataproc cluster or through Dataproc Serverless.
+
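+A reasonable baseline, drawn from the permissions this workflow touches in Dataproc and Cloud Storage (your organization may grant these through predefined roles instead):
+
+```
+dataproc.batches.create
+dataproc.clusters.use
+dataproc.jobs.create
+dataproc.jobs.get
+dataproc.operations.get
+dataproc.operations.list
+storage.buckets.get
+storage.objects.create
+storage.objects.delete
+```
+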
 **Docs:**
 - [Dataproc overview](https://cloud.google.com/dataproc/docs/concepts/overview)
 - [PySpark DataFrame syntax](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html)

From 02c7a0c07da8ea68f0ec2999bf419f092b170046 Mon Sep 17 00:00:00 2001
From: Zi Wang
Date: Thu, 28 Sep 2023 13:11:26 -0700
Subject: [PATCH 3/3] removing extra comment

---
 website/docs/docs/build/python-models.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/website/docs/docs/build/python-models.md b/website/docs/docs/build/python-models.md
index 5be4d68c520..65fb5fc3aeb 100644
--- a/website/docs/docs/build/python-models.md
+++ b/website/docs/docs/build/python-models.md
@@ -647,7 +647,6 @@ If not configured, `dbt-spark` will use the built-in defaults: the all-purpose c
 
-
**Submission methods:** The `dbt-bigquery` adapter uses [Dataproc](https://cloud.google.com/dataproc) to submit your Python models as PySpark jobs. Dataproc supports two submission methods: `cluster` and `serverless`.