Python models Dataproc Serverless setup with packages (#5920)
Add description on how to setup python models with Dataproc Serverless
using a custom image in order to use third-party packages.

## What are you changing in this pull request and why?
In the context of running Python models in Spark using Dataproc, the
documentation
([python-models.md](https://github.com/dbt-labs/docs.getdbt.com/blob/current/website/docs/docs/build/python-models.md))
says:
> Installing packages: If you are using a Dataproc Cluster (as opposed
to Dataproc Serverless), you can add third-party packages while creating
the cluster.

I dug in and found it is possible to run Python models with third-party
packages on Dataproc Serverless. It requires using a custom Docker
image, which is well documented on GCP's end. We currently run this
in production without any issues. I added this to the documentation. Let me
know if you need more details on how to set this up.

## Checklist
- [x] Review the [Content style
guide](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/content-style-guide.md)
so my content adheres to these guidelines.
- [x] For [docs
versioning](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#about-versioning),
review how to [version a whole
page](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#adding-a-new-version)
and [version a block of
content](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#versioning-blocks-of-content).
- [x] Add a checklist item for anything that needs to happen before this
PR is merged, such as "needs technical review" or "change base branch."

Adding or removing pages (delete if not applicable):
N/A

---------

Co-authored-by: Matt Shaver <[email protected]>
Co-authored-by: Leona B. Campbell <[email protected]>
Co-authored-by: Mirna Wong <[email protected]>
4 people authored Jan 31, 2025
1 parent 40266a9 commit 96f64a8
Showing 1 changed file with 32 additions and 5 deletions.
37 changes: 32 additions & 5 deletions website/docs/docs/build/python-models.md
@@ -815,13 +815,40 @@ storage.objects.create
storage.objects.delete
```

**Installing packages:** If you are using a Dataproc Cluster (as opposed to Dataproc Serverless), you can add third-party packages while creating the cluster.
**Installing packages:**

Google recommends installing Python packages on Dataproc clusters via initialization actions:
- [How initialization actions are used](https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/README.md#how-initialization-actions-are-used)
- [Actions for installing via `pip` or `conda`](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/python)
Installation of third-party packages on Dataproc varies depending on whether you use a [Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) or [Dataproc Serverless](https://cloud.google.com/dataproc-serverless/docs).

You can also install packages at cluster creation time by [defining cluster properties](https://cloud.google.com/dataproc/docs/tutorials/python-configuration#image_version_20): `dataproc:pip.packages` or `dataproc:conda.packages`.
- **Dataproc Cluster** &mdash; Google recommends installing Python packages while creating the cluster via initialization actions:
- [How initialization actions are used](https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/README.md#how-initialization-actions-are-used)
- [Actions for installing via `pip` or `conda`](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/python)

You can also install packages at cluster creation time by [defining cluster properties](https://cloud.google.com/dataproc/docs/tutorials/python-configuration#image_version_20): `dataproc:pip.packages` or `dataproc:conda.packages`.

- **Dataproc Serverless** &mdash; Google recommends using a [custom Docker image](https://cloud.google.com/dataproc-serverless/docs/guides/custom-containers) to install third-party packages. The image needs to be hosted in [Google Artifact Registry](https://cloud.google.com/artifact-registry/docs). It can then be used by providing the image path in dbt profiles:

<File name='profiles.yml'>
```yml
my-profile:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: abc-123
      dataset: my_dataset
      # for dbt Python models to be run on Dataproc Serverless
      gcs_bucket: dbt-python
      dataproc_region: us-central1
      submission_method: serverless
      dataproc_batch:
        runtime_config:
          container_image: {HOSTNAME}/{PROJECT_ID}/{IMAGE}:{TAG}
```

</File>
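
For reference, a minimal custom image for the serverless case might look like the following. This is an illustrative sketch only &mdash; the base image, OS packages, and Python packages are assumptions, not part of this docs change; see GCP's custom-containers guide for the authoritative image requirements:

```dockerfile
# Illustrative sketch only -- consult GCP's Dataproc Serverless
# custom-containers guide for the authoritative image requirements.
FROM debian:12-slim

# Install a Python runtime plus OS utilities the Spark runtime expects
# (for example, procps -- check the GCP guide for the current list).
RUN apt-get update && \
    apt-get install -y --no-install-recommends procps python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

# Install the third-party packages your Python models need (examples).
RUN pip3 install --no-cache-dir --break-system-packages pandas scikit-learn

# Point PySpark at this Python installation.
ENV PYSPARK_PYTHON=/usr/bin/python3
```

The image is then built and pushed to an Artifact Registry repository (for example, with `docker build` and `docker push`), and the resulting path is what you supply as the `container_image` value in `profiles.yml`.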

<Lightbox src="/img/docs/building-a-dbt-project/building-models/python-models/dataproc-pip-packages.png" title="Adding packages to install via pip at cluster startup"/>
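
The cluster-properties approach can be sketched as a single `gcloud` invocation. The cluster name, region, and package pins below are illustrative assumptions; the `dataproc:pip.packages` property is the one described in the linked GCP tutorial:

```shell
# Create a Dataproc cluster with third-party packages preinstalled via
# cluster properties (package names and versions here are examples).
gcloud dataproc clusters create my-dbt-cluster \
  --region=us-central1 \
  --properties='dataproc:pip.packages=pandas==2.1.4,scikit-learn==1.3.2'
```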

