Dataplex provides templates, powered by Dataflow, to perform common data processing tasks such as data ingestion, processing, and data lifecycle management. In this lab, you will discover how to leverage common Dataplex templates to curate raw data and convert it into standardized formats like Parquet and Avro in the Data Curation lane. This demonstrates how domain teams can quickly process data in a serverless manner and begin consuming it.
Lab 2 (data security) successfully completed.
~20 mins
tbd
As part of this lab, we will curate the Customer data using Dataplex's curation task.
This lab is optional
The Customer raw data has two feeds coming into the GCS raw bucket. We will convert both feeds from CSV to Parquet while preserving the data and the partitions.
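The heavy lifting is done by the serverless Dataflow template, but conceptually the conversion is equivalent to the minimal local sketch below (using pandas, with hypothetical ./raw and ./curated folders standing in for the GCS buckets and Hive-style dt=YYYY-MM-DD partition folders):

```python
from pathlib import Path

import pandas as pd  # requires pandas and pyarrow installed

# Hypothetical local stand-ins for the raw and curated GCS buckets.
RAW_ROOT = Path("raw/customers_data")
CURATED_ROOT = Path("curated/customers_data")

# Walk the Hive-style partition folders (e.g. dt=2022-12-01) and mirror them,
# converting each CSV file to Parquet while keeping the data and partition layout.
for csv_path in RAW_ROOT.glob("dt=*/*.csv"):
    df = pd.read_csv(csv_path)
    out_dir = CURATED_ROOT / csv_path.parent.name  # preserve the dt=... partition folder
    out_dir.mkdir(parents=True, exist_ok=True)
    df.to_parquet(out_dir / f"{csv_path.stem}.parquet", index=False)
```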
-
Step1: Create the Dataflow pipeline
-
Step2: Enter the Base Parameters
-
Step3: Enter the Required Parameters
DON'T FORGET TO REPLACE THE PROJECT_ID WITH YOURS
-
Dataplex asset name or Dataplex entity names for the files to be converted: projects/${PROJECT_ID}/locations/us-central1/lakes/consumer-banking--customer--domain/zones/customer-raw-zone/assets/customer-raw-data
-
Output file format in GCS: PARQUET
-
Dataplex asset name for the destination GCS bucket: projects/${PROJECT_ID}/locations/us-central1/lakes/consumer-banking--customer--domain/zones/customer-curated-zone/assets/customer-curated-data
-
-
Step4: Enter the Optional Parameters (very critical for job success)
DON'T FORGET TO REPLACE THE PROJECT_ID WITH YOURS
Open "Show Optional Parameters" and add the following-
-
Service Account Email: customer-sa@${PROJECT_ID}.iam.gserviceaccount.com
-
Sample Screenshot:
-
-
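If you prefer launching the same job programmatically instead of through the console, the sketch below shows a Dataflow Flex Template launch with the same values. It is only a sketch: the template GCS path is a placeholder you need to look up for your region, and the parameter names are assumptions that mirror the console labels, so verify them against the template's documentation before using it.

```python
from googleapiclient.discovery import build  # pip install google-api-python-client

PROJECT_ID = "your-project-id"  # replace with your project id
REGION = "us-central1"
LAKE = "consumer-banking--customer--domain"

dataflow = build("dataflow", "v1b3")
launch = dataflow.projects().locations().flexTemplates().launch(
    projectId=PROJECT_ID,
    location=REGION,
    body={
        "launchParameter": {
            "jobName": "customer-raw-to-curated",
            # Placeholder -- substitute the actual GCS path of the Dataplex
            # file format conversion Flex Template for your region.
            "containerSpecGcsPath": f"gs://dataflow-templates-{REGION}/latest/flex/<conversion-template>",
            "parameters": {
                # Parameter names are assumptions mirroring the console labels; verify them.
                "inputAssetOrEntitiesList": (
                    f"projects/{PROJECT_ID}/locations/{REGION}/lakes/{LAKE}"
                    "/zones/customer-raw-zone/assets/customer-raw-data"
                ),
                "outputFileFormat": "PARQUET",
                "outputAsset": (
                    f"projects/{PROJECT_ID}/locations/{REGION}/lakes/{LAKE}"
                    "/zones/customer-curated-zone/assets/customer-curated-data"
                ),
            },
            "environment": {
                "serviceAccountEmail": f"customer-sa@{PROJECT_ID}.iam.gserviceaccount.com"
            },
        }
    },
)
print(launch.execute())
```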
Step5: Set Schedule
- Choose "custom" as Repeats adn set Custom schedule to "0 0 1 1 *"
- Click the Create button.
-
Step6: Click the Run button and then Run Pipeline
-
Step7: Give it a couple of seconds, then click the refresh button. You should see the job that was just scheduled, in pending/running status.
-
Step8: Monitor the job. Click on the refresh button again after a few minutes.
-
Step9: Validate the output
- Navigate to Dataplex --> Manage --> "Consumer Banking - Customer Domain" Lake
- Select "Customer Curated Zone"
- Select "Customer Curated Data" Asset
- Scroll down to Resource details and click the External URL link
- Open customers_data folder
- Open the dt=2022-12-01 folder
- You should see the "customer.parquet" file created here (a programmatic check is sketched below)
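If you want to check the output without clicking through the console, here is a minimal sketch using the google-cloud-storage client; the bucket name is a placeholder, so use whichever bucket backs the "Customer Curated Data" asset in your project:

```python
from google.cloud import storage  # pip install google-cloud-storage

PROJECT_ID = "your-project-id"  # replace with your project id
# Placeholder -- use the bucket that backs the "Customer Curated Data" asset.
CURATED_BUCKET = f"{PROJECT_ID}_customers_curated_data"

client = storage.Client(project=PROJECT_ID)
for blob in client.list_blobs(CURATED_BUCKET, prefix="customers_data/dt=2022-12-01/"):
    # Expect to see something like customers_data/dt=2022-12-01/customer.parquet
    print(blob.name, blob.size)
```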
-
Step10: Validate the metadata
(Note: Metadata for curated data may not show up immediately, as metadata refreshes typically take up to an hour)
-
Go to Dataplex UI
-
Navigate to the Discover menu and choose the Search option.
-
Open 'Consumer banking - Customer Domain' and select 'Customer Curated Zone' to filter the assets in the curated layer.
-
Select customer_data from the list of assets shown in the right panel to see the entity details. Validate the entry details to confirm the Parquet format file information (a programmatic check via the Dataplex API is sketched below).
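You can also inspect the discovered metadata through the Dataplex API instead of the UI. The sketch below (assuming the google-cloud-dataplex Python client) lists the table entities registered in the curated zone; once discovery has run, the customer_data entity should point at the Parquet output:

```python
from google.cloud import dataplex_v1  # pip install google-cloud-dataplex

PROJECT_ID = "your-project-id"  # replace with your project id
ZONE = (
    f"projects/{PROJECT_ID}/locations/us-central1/lakes/"
    "consumer-banking--customer--domain/zones/customer-curated-zone"
)

client = dataplex_v1.MetadataServiceClient()
request = dataplex_v1.ListEntitiesRequest(
    parent=ZONE,
    view=dataplex_v1.ListEntitiesRequest.EntityView.TABLES,
)
for entity in client.list_entities(request=request):
    # Each discovered entity points at its data location in the curated bucket.
    print(entity.id, entity.data_path)
```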
-
Suggestion: As homework, try to curate the merchant data.
In this lab, you learned how to use the built-in, one-click, templatized Dataplex task to quickly standardize your data. This is a common data management task that can be executed without needing to understand the underlying data. You can also leverage the built-in scheduler to execute the workflow either on demand or on a schedule.
This concludes the lab module. Either return to the main menu or proceed to the next module, where you will build the data products, i.e., move from the refined layer to the data product layer.
-
Issue#1: Error creating a Dataflow task with error message "Create task failed: The principal (user or service account) lacks IAM permission 'cloudscheduler.jobs.create' for the resource 'project/..." (or the resource may not exist)
Solution: Re-running the job usually fixes the issue.
-
Issue#2: Nothing happens or the screen freezes when you hit Run
Resolution: Use the refresh button
-
Issue#3: Failed to start the VM, launcher-20220903094028585565812401460001, used for launching because of status code: INVALID_ARGUMENT, reason: Error: Message: Invalid value for field 'resource.networkInterfaces[0]': '{ "network": "global/networks/default", "accessConfig": [{ "type": "ONE_TO_ONE_NAT", "name":...'. Subnetwork should be specified for custom subnetmode network HTTP Code: 400.
Resolution: Make sure you have set the correct subnet.
-
Issue#4: After trying to clone a job and re-submit, the Create button disappears and the page keeps buffering
Resolution: UI issue. Just click a radio button, e.g., the Pipeline option: switch to Streaming and back to Batch.
-
Issue#5: "message": "[email protected] does not have storage.objects.list access to the Google Cloud Storage bucket.",
Resolution: Make sure the customer-sa has the right privileges on the bucket.
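If Issue#5 shows up, one way to grant the access is through the storage client, as in the minimal sketch below; the bucket name and role are placeholders (the bucket is the one named in the error message, and granting the role via gcloud or the console works just as well):

```python
from google.cloud import storage  # pip install google-cloud-storage

PROJECT_ID = "your-project-id"  # replace with your project id
BUCKET_NAME = f"{PROJECT_ID}_customers_raw_data"  # placeholder: the bucket named in the error
SA_EMAIL = f"customer-sa@{PROJECT_ID}.iam.gserviceaccount.com"

client = storage.Client(project=PROJECT_ID)
bucket = client.bucket(BUCKET_NAME)

# Add an object-read binding for the task service account; adjust the role
# (e.g. roles/storage.objectAdmin on the curated bucket) to what the task needs.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {"role": "roles/storage.objectViewer", "members": {f"serviceAccount:{SA_EMAIL}"}}
)
bucket.set_iam_policy(policy)
```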