Data Catalog is a fully managed, scalable metadata management service within Dataplex. It automatically catalogs the following Google Cloud assets: BigQuery datasets, tables, and views; Pub/Sub topics; Dataplex lakes, zones, tables, and filesets; Analytics Hub linked datasets; and, in Public Preview, Dataproc Metastore services, databases, and tables. Data Catalog handles two types of metadata: technical metadata and business metadata.
We can leverage Dataplex's Catalog to register data products by adding business context through tags, for example Data Ownership, Data Classification, Data Quality, and so on. This provides a single data discovery experience that lets end users find high-quality, secured data products.
Successful completion of Lab 1, Lab 2, and Lab 4
~45 mins
Data stewards are associated with a data entry. A data steward for a data entry can be contacted to request more information about the entry. A data steward does not require a specific IAM role, and you can add a user with a non-Google email account as a data steward for a data entry. A data steward cannot perform any project-related activity in the console unless the user is granted explicit IAM permissions.
Rich text overview of a data entry that can include images, tables, links, and so on.
An example of technical metadata for a data entry, such as a BigQuery table, can include the following attributes: project information, such as name and ID; asset name and description; Google Cloud resource labels; and schema name and description for BigQuery tables and views.
Business metadata for a data entry can be of the following types: Tags, Data Stewards, Rich text overview
Tags are a type of business metadata. Adding tags to a data entry helps provide meaningful context to anyone who needs to use the asset. For example, a tag can tell you information such as who is responsible for a particular data entry, whether it contains personally identifiable information (PII), the data retention policy for the asset, and a data quality score.
To start tagging data, you first need to create one or more tag templates. A tag template can be a public or private tag template. When you create a tag template, the option to create a public tag template is the default and recommended option in the Google Cloud console. A tag template is a group of metadata key-value pairs called fields. Having a set of templates is similar to having a database schema for your metadata.
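For context, a tag template can also be created directly from the command line (in this lab the four templates are created for you by automation). Below is a minimal sketch; the template ID data_owner_demo and its single owner_email field are hypothetical and not part of this lab:

```
# Create a hypothetical tag template with one string field (for illustration only)
gcloud data-catalog tag-templates create data_owner_demo \
  --location=us-central1 \
  --project=${PROJECT_ID} \
  --display-name="Data Owner (demo)" \
  --field=id=owner_email,display-name=OwnerEmail,type=string,required=TRUE
```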
In this lab, you will learn how to create bulk tags on the Dataplex Data Product entities across domains using Composer, after the Data Products have been created in the previous lab. You will use a custom Dataplex task, based on a custom metadata tag library, to create 4 predefined tag templates - Data Classification, Data Quality, Data Exchange and Data Product Info (Ownership)
You can also look at Tag Engine for tagging automation
If you have just run the previous lab (Building your Data Product), make sure the entities are visible in Dataplex before proceeding with the below steps.
To verify, go to Dataplex → Manage tab → Click on Customer - Source Domain Lake → Click on Customer Data Product Zone
Do this for all the other domain data product zones as well.
Make sure the tag templates have been created in ${PROJECT_ID}. Go to Dataplex → Manage Catalog → Tag templates. You should see the below 4 Tag Templates (Data Classification, Data Quality, Data Exchange, and Data Product Info). Open each one and look at its schema.
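If you prefer the command line, you can also inspect a template's schema with gcloud; a minimal sketch, assuming your gcloud version includes the data-catalog commands and that the templates live in us-central1 as in the rest of this lab:

```
# Show the fields defined on the data_product_information tag template
gcloud data-catalog tag-templates describe data_product_information \
  --location=us-central1 \
  --project=${PROJECT_ID}
```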
Make sure data has been populated by the DLP job profiler into central_dlp_data.dlp_data_profiles by running the below query:
```
bq query --use_legacy_sql=false \
"SELECT
  COUNT(*)
FROM
  ${PROJECT_ID}.central_dlp_data.dlp_data_profiles"
```
- Open Cloud Shell and create a new file called “customer-tag.yaml”, then copy and paste the below YAML into it.
cd ~
vim customer-tag.yaml
Enter the below text
data_product_id: derived
data_product_name: ""
data_product_type: ""
data_product_description: ""
data_product_icon: ""
data_product_category: ""
data_product_geo_region: ""
data_product_owner: ""
data_product_documentation: ""
domain: derived
domain_owner: ""
domain_type: ""
last_modified_by: ""
last_modify_date: ""
Make sure you always leave last_modified_by and last_modify_date blank
- Upload the file to the temp GCS bucket
gsutil cp ~/customer-tag.yaml gs://${PROJECT_ID}_dataplex_temp/
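Optionally, you can confirm the upload landed before creating the tagging task:

```
# Verify the YAML file is present in the temp bucket
gsutil ls gs://${PROJECT_ID}_dataplex_temp/customer-tag.yaml
```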
- Run the below command to create the tag for the customer_data product entity
gcloud dataplex tasks create customer-tag-job \
--project=${PROJECT_ID} \
--location=us-central1 \
--vpc-sub-network-name=projects/${PROJECT_ID}/regions/us-central1/subnetworks/dataplex-default \
--lake='consumer-banking--customer--domain' \
--trigger-type=ON_DEMAND \
--execution-service-account=customer-sa@${PROJECT_ID}.iam.gserviceaccount.com \
--spark-main-class="com.google.cloud.dataplex.templates.dataproductinformation.DataProductInfo" \
--spark-file-uris="gs://${PROJECT_ID}_dataplex_temp/customer-tag.yaml" \
--container-image-java-jars="gs://${PROJECT_ID}_dataplex_process/common/tagmanager-1.0-SNAPSHOT.jar" \
--execution-args=^::^TASK_ARGS="--tag_template_id=projects/${PROJECT_ID}/locations/us-central1/tagTemplates/data_product_information, --project_id=${PROJECT_ID},--location=us-central1,--lake_id=consumer-banking--customer--domain,--zone_id=customer-data-product-zone,--entity_id=customer_data,--input_file=customer-tag.yaml"
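Besides monitoring in the Dataplex UI (next step), you can check on the task from Cloud Shell; a minimal sketch, assuming your gcloud version includes the dataplex tasks commands (the output should include the task configuration and its execution status):

```
# Confirm the task was created and inspect its configuration and execution status
gcloud dataplex tasks describe customer-tag-job \
  --project=${PROJECT_ID} \
  --location=us-central1 \
  --lake='consumer-banking--customer--domain'
```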
- Go to Dataplex UI → Process → Custom Spark tab → monitor the job; you will find a job named “customer-tag-job”
- Once the job is successful, go to the Dataplex Search tab and type this into the search bar: tag:data_product_information
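The same search can also be run from Cloud Shell; a minimal sketch, assuming the gcloud data-catalog search command is available in your gcloud version and scoping the search to this project:

```
# Find entries that carry the data_product_information tag
gcloud data-catalog search 'tag:data_product_information' \
  --include-project-ids=${PROJECT_ID} \
  --limit=10
```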
- Click on customer_data → go to the Tags section and make sure the data product information tag has been created.
As you can see, the automation utility was able to derive most of the information. But at times, certain values may need to be overridden.
- Now let's update the input file
cd ~
vim customer-tag.yaml
Enter the below text
data_product_id: derived
data_product_name: ""
data_product_type: ""
data_product_description: "Adding Custom Description as part of demo"
data_product_icon: ""
data_product_category: ""
data_product_geo_region: ""
data_product_owner: "[email protected]"
data_product_documentation: ""
domain: derived
domain_owner: "[email protected]"
domain_type: ""
last_modified_by: ""
last_modify_date: ""
- Upload the updated file to the temp GCS bucket
gsutil cp ~/customer-tag.yaml gs://${PROJECT_ID}_dataplex_temp
- Trigger another tagging job
gcloud dataplex tasks create customer-tag-job-2 \
--project=${PROJECT_ID} \
--location=us-central1 \
--vpc-sub-network-name=projects/${PROJECT_ID}/regions/us-central1/subnetworks/dataplex-default \
--lake='consumer-banking--customer--domain' \
--trigger-type=ON_DEMAND \
--execution-service-account=customer-sa@${PROJECT_ID}.iam.gserviceaccount.com \
--spark-main-class="com.google.cloud.dataplex.templates.dataproductinformation.DataProductInfo" \
--spark-file-uris="gs://${PROJECT_ID}_dataplex_temp/customer-tag.yaml" \
--container-image-java-jars="gs://${PROJECT_ID}_dataplex_process/common/tagmanager-1.0-SNAPSHOT.jar" \
--execution-args=^::^TASK_ARGS="--tag_template_id=projects/${PROJECT_ID}/locations/us-central1/tagTemplates/data_product_information, --project_id=${PROJECT_ID},--location=us-central1,--lake_id=consumer-banking--customer--domain,--zone_id=customer-data-product-zone,--entity_id=customer_data,--input_file=customer-tag.yaml"
- Go to Dataplex UI → Process → Custom Spark tab → monitor the job and wait until it completes. You will find a job named “customer-tag-job-2”
- Once the job is successful, go to Dataplex → Search tab → refresh the tag and verify that the updates have been propagated
Above, we learned how to create and update tags manually; now let's see how we can use Composer to automate the tagging process end-to-end. You can then easily add tagging jobs as a downstream dependency in your data pipeline.
By default there is a Spark Serverless capacity limit (8 vCores), so make sure you trigger the tagging jobs sequentially to avoid failures due to a resource crunch.
- Go to Composer → Go to Airflow UI → Click on DAGs
- Search for or scroll to the “data_governance_customer_dp_info_tag” DAG and click on it
- Trigger the DAG manually by clicking on the Run button
- Monitor and wait for the jobs to complete. You can also go to Dataplex UI to monitor the jobs
- Trigger the “master_dag_customer_dq” and wait for its completion. This first runs a data quality job and then publishes the data quality score tag.
- Trigger the “data_governance_customer_exchange_tag” DAG and wait for its completion
- Trigger the “data_governance_customer_classification_tag” DAG and wait for its completion
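If you prefer the command line over the Airflow UI, the same DAGs can also be triggered with gcloud; a minimal sketch, assuming a Composer environment named composer-env in us-central1 (a placeholder, substitute your environment's actual name and region). Remember to trigger one DAG at a time, per the capacity note above:

```
# Trigger a tagging DAG through the Airflow CLI wrapper (environment name is a placeholder)
gcloud composer environments run composer-env \
  --location us-central1 \
  dags trigger -- data_governance_customer_dp_info_tag
```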
- Step 1: Follow the same steps you used above to trigger the below set of DAGs in Airflow, and wait for each one to complete:
- master_dag_merchant_dq
- data_governance_merchant_classification_tag
- data_governance_merchant_dp_info_tag
- data_governance_merchant_exchange_tag
This task is optional; complete it only if you have time left in your lab.
- Step 1: Follow the same steps you used above to trigger the below set of DAGs in Airflow, and wait for each one to complete:
- data_governance_transactions_classification_tag
- data_governance_transactions_dp_info_tag
- data_governance_transactions_exchange_tag
- data_governance_transactions_quality_tag
This concludes the lab module. Either proceed to the main menu, or continue to the next module, where you will learn how to discover data products and look at data lineage.