Documentation updates for ingestion changes (#77)
* Update README.md

* doc updates

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* README updates for new pathing
mservidio authored and salsferrazza committed Dec 1, 2019
1 parent bf8d373 commit b70cb16
Showing 2 changed files with 52 additions and 70 deletions.
3 changes: 3 additions & 0 deletions README.md

_This is not an officially supported Google product._

## **Breaking Changes in v0.2.0**
**_There are breaking changes introduced in v0.2.0 related to the storage bucket directory structure used by the ingestion function. For further information, see [ingestion](/ingestion/README.md)._**

The ```BigQuery Datashare Toolkit (BQDS)``` is a solution for data publishers to easily manage datasets residing within [BigQuery](https://cloud.google.com/bigquery/). The toolkit includes functionality to ingest and entitle data, relieving consumers from much of the toil involved in onboarding datasets from a variety of providers. Publishers upload data files to a storage bucket and allocate permissioned datasets using BigQuery [authorized views](https://cloud.google.com/bigquery/docs/authorized-views). This results in less friction for both publishers and consumers.

While ```BQDS``` is used for data management and entitlement, it does *not* manage any commercial aspects of delivery. Hence, it is assumed that publishers already have licensing arrangements and that consumers have furnished the publisher with the GCP account IDs corresponding to their entitled users. These are required for the creation of the authorized views within BigQuery.
119 changes: 49 additions & 70 deletions ingestion/README.md

# Ingestion Cloud Function for batch data uploads

## **Breaking Changes in v0.2.0**
The dataset and table names were previously inferred from the name of the file placed into the `/bqds` directory, e.g. `mydataset.mytable.upload.1.csv`. They are now inferred from the file path, e.g. `/bqds/mydataset/mytable/data/upload.1.csv`. The ingestion function only processes files under paths starting with `/bqds`, where the next two path components are the dataset and table names respectively, followed by `data`. Data files to be processed should always be delivered into the `data` directory for their respective dataset and table.
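The dataset and table are therefore carried entirely by the object path. The shell sketch below illustrates the path convention only, using hypothetical bucket, dataset, and table names; it is not the function's actual implementation:

```
# Hypothetical object following the v0.2.0 path convention
OBJECT_PATH="bqds/mydataset/mytable/data/upload.1.csv"

# The second and third path components name the dataset and table
DATASET=$(echo "${OBJECT_PATH}" | cut -d '/' -f 2)   # mydataset
TABLE=$(echo "${OBJECT_PATH}" | cut -d '/' -f 3)     # mytable
echo "Destination: ${DATASET}.${TABLE}"
```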

## Synopsis

`BQDS`'s entry point is a [Cloud Function](https://cloud.google.com/functions/) that listens for
the [finalize event](https://cloud.google.com/functions/docs/calling/storage) in a [Cloud Storage](https://cloud.google.com/storage/) bucket. This function can be deployed to
a bucket of your choice using the deployment script in [bin/deploy.sh](bin/deploy.sh).

An ingestion begins when a supported file type (```csv```, ```csv.gz```, ```txt```, ```avro``` or ```json```) is discovered by the function through Google Cloud Functions'
[bucket trigger mechanism](https://cloud.google.com/functions/docs/calling/storage). When the function is invoked, it looks in the same bucket (within the ```../config``` directory relative to the ```data``` directory) for [schema and transform](../examples/mlb/config/ingestion) configurations corresponding to the uploaded file's dataset and table. It then executes a series of BigQuery actions to transform and load the data into the specified destination BigQuery dataset and table for that file.

A summary of the logic within the function is:

1. If the file extension is of a recognized file type, continue; otherwise exit.
2. Extract the dataset and table names from the bucket's inbound file
`data` path, determined by the second and third path components of the file name, e.g. `/bqds/mydataset/mytable/data/upload.1.csv`.
3. Determine whether the dataset exists and, if not, create it.
4. Look for `schema.json` under the bucket's `/bqds/mydataset/mytable/config/` directory to get the delimiter, field definitions, and write disposition for the upload. If these do not exist, instruct the BigQuery job to [auto-detect](https://cloud.google.com/bigquery/docs/schema-detect) the schema and delimiter, and apply `WRITE_APPEND` as the write disposition.
5. Execute a BigQuery job to load the file's contents into a temporary table.
6. Execute SQL that uses the `SELECT` clause specified
in `/bqds/mydataset/mytable/config/transform.sql` (or the null transform `*`), and save the results (creating or appending, depending on the write disposition chosen) into the specified destination table.
7. Delete the temporary table after a successful transformation stage (temporary tables otherwise expire in 2 days).
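Taken together, the function expects each table's configuration and data to live under a single `/bqds/<dataset>/<table>/` prefix. The commands below sketch one possible layout, using hypothetical bucket, dataset, and table names:

```
BUCKET=gs://example-bucket   # hypothetical bucket attached to the ingestion function

# Optional configuration consulted in steps 4 and 6
gsutil cp schema.json    ${BUCKET}/bqds/mydataset/mytable/config/schema.json
gsutil cp transform.sql  ${BUCKET}/bqds/mydataset/mytable/config/transform.sql

# Data file whose arrival triggers the sequence above
gsutil cp upload.1.csv   ${BUCKET}/bqds/mydataset/mytable/data/upload.1.csv
```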

## Ingestion architecture

![Ingestion architecture](ingestion-flow.png "Ingestion flow")

## Installation

After cloning the repository, this command will attach the Cloud Function to a storage bucket of your choosing:

```
cd bq-datashare-toolkit/ingestion/bin
./deploy.sh --trigger-bucket=gs://<mybucket>
```
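To confirm the deployment, or to watch an ingestion run, the standard `gcloud` commands below can be used; the function name `processUpload` (referenced later in this document) and your configured default region are assumptions:

```
# Inspect the deployed function and its trigger bucket
gcloud functions describe processUpload

# Review recent invocations after uploading a test file
gcloud functions logs read processUpload --limit 50
```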
At this point, any files uploaded to that bucket will trigger the
Cloud Function, but only files of a recognized type will be processed. Unrecognized file types, and files placed anywhere other than under the bucket's `/bqds` prefix, will be ignored.
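As a concrete illustration of that behavior, with hypothetical names:

```
# Processed: supported extension under /bqds/<dataset>/<table>/data/
gsutil cp upload.1.csv gs://example-bucket/bqds/mydataset/mytable/data/upload.1.csv

# Ignored: unsupported file type
gsutil cp notes.bak gs://example-bucket/bqds/mydataset/mytable/data/notes.bak

# Ignored: root-level directory other than /bqds
gsutil cp upload.1.csv gs://example-bucket/staging/upload.1.csv
```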

## Specifications for uploaded files

The ingestion Cloud Function ```processUpload``` will be triggered upon
all changes to the bucket, but will exit if the file extension is not
supported or if validation fails. When it detects a supported file extension,
it parses the file path to determine the destination BigQuery
dataset and table in which to load the data. For example, if you wish
to target dataset ```shareddataset``` and table ```EXAMPLE```, files uploaded to
your storage bucket (```gs://example-bucket/```) are placed into the following paths:

- ```gs://example-bucket/bqds/shareddataset/EXAMPLE/data/data.csv``` (if uncompressed).
- ```gs://example-bucket/bqds/shareddataset/EXAMPLE/data/data.csv.gz``` (if compressed).

The Cloud Function will time out after *540* seconds of execution. Depending on the size of your files, it may not be possible to ingest very large files completely before this timeout threshold is crossed. If you encounter this condition, consider splitting up large files into smaller
ones (each no larger than 1-1.5G) to upload and process individually.
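One way to perform that split with standard tools, assuming a CSV with no header row and the hypothetical paths used earlier:

```
# Split by lines so rows stay intact; tune the count so each part stays under ~1 GB
split -l 5000000 large_upload.csv part_

for f in part_*; do
  gzip "${f}"   # produces part_*.gz
  gsutil cp "${f}.gz" "gs://example-bucket/bqds/mydataset/mytable/data/${f}.csv.gz"
done
```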

## Configuration
Ingestion is configured in two stages, using files stored in the same Cloud Storage bucket as the data being
uploaded.

The first stage uses a file named according to the convention
`gs://bucket/bqds/<dataset-name>/<table-name>/config/schema.json`. This file specifies the field definitions corresponding to the file
being uploaded, as well as the [write disposition](https://cloud.google.com/bigquery/docs/reference/auditlogs/rest/Shared.Types/WriteDisposition)
to be used in ingesting the file.

If the target BigQuery dataset (as inferred from the file path) does not exist, it will be created.

Tables not found will also be created and seeded with the initial
transformed file contents. Subsequent updates to the same target will
have their rows appended if the table is found, unless you have specified
`truncate: true` in your `schema.json`.

If there is no ```transform.sql``` file in the `/config` directory,
the function will default to ```*```, which signifies *no*
transformation from the originally specified schema. If the
```schema.json``` file is omitted, BigQuery will attempt to autodetect the schema from the file contents.

The content within ```schema.json``` is a JSON object representation. The
```metadata``` property is identical in format to BigQuery's JSON-based
[JobConfigurationLoad](https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationLoad). The ```fieldDelimiter``` property specifies the single character
used to delimit columns in each row of the CSV file. An example
```/bqds/shareddataset/EXAMPLE/config/schema.json``` might resemble:

```
{
    "metadata": {
        ...
    },
    "truncate": "true"
}
```

For a `WRITE_APPEND` disposition, simply omit `truncate: "true"` from
the configuration.

A file using this schema, and being uploaded into dataset `shareddataset` and
table `EXAMPLE` as ```/bqds/shareddataset/EXAMPLE/data/20201102.csv```, might resemble:


|ts_ms|object|weight|unit_of_measurement|
|---|---|---|---|
|1563543871232|liquid oxygen (1 gallon)|4.32|kg|


You may find that you need a ```schema.json``` configuration if the autodetection by BigQuery does not yield expected results. For example, a column representing a date as ```YYYYMMDD``` is interpreted by BigQuery as an
integer. However, a ```YYYYMMDD``` value, if defined as a DATE in BigQuery,
will fail to process, since BigQuery only interprets string DATEs in the format `YYYY-MM-DD`.
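You can try the conversion this implies directly with the `bq` CLI before wiring it into a transform; the one-off query below is purely illustrative:

```
# '20211031' cannot be read as a DATE directly, but SUBSTR can rebuild it as YYYY-MM-DD
bq query --nouse_legacy_sql \
  "SELECT CAST(CONCAT(SUBSTR('20211031', 1, 4), '-', SUBSTR('20211031', 5, 2), '-', SUBSTR('20211031', 7, 2)) AS DATE) AS example_date"
```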

Provided that you ultimately want the data represented as a DATE type,
the source data destined for ```shareddataset.EXAMPLE``` requires transformation from
the original data file schemas. These per-column transformations are
specified in ```transform.sql``` (stored within the source
bucket's ```/bqds/shareddataset/EXAMPLE/config/``` subdirectory).

The format of ```transform.sql``` is simply a SQL fragment
that queries the original schema, either auto-detected or defined by
```schema.json```. In ```transform.sql```, the SQL
fragment is essentially the ```SELECT``` clause (including aggregate
or synthetic columns) of a statement that queries the temporary table,
but omitting the literal ```SELECT``` and everything following and inclusive of the ```WHERE``` clause of the query.

For example, if we wanted ```shareddataset.EXAMPLE``` to query the timestamp as a
SQL ```TIMESTAMP``` instead of the ```INTEGER``` being
auto-detected into the temporary table, yet leave all other columns the same,
the SQL statement would be:
```
SELECT
  TIMESTAMP_MILLIS(ts_ms) AS measurement_time,
  object,
  weight,
  unit_of_measurement
FROM TMP_TABLE_76893023444
```

The associated ```transform.sql``` would be:

```
TIMESTAMP_MILLIS(ts_ms) AS measurement_time,
object,
weight,
unit_of_measurement
```
```

Hence, ```shareddataset.EXAMPLE```'s ultimate schema is inferred at runtime by the
contents of ```transform.sql``` and not explicitly as it is
for the temporary table staging the CSV file.

Configuration files are placed in the
```/bqds/<dataset-name>/<table-name>/config/``` subdirectory of the source
bucket. They are recognized by the Cloud Function as special, so it
won't treat them as normal data files to process. They can be copied to the source bucket with this command:

```gsutil cp schema.json transform.sql gs://${BUCKET}/bqds/<dataset-name>/<table-name>/config/```
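A quick way to confirm everything landed where the function expects it, again with hypothetical names:

```
# List the table's config/ objects alongside any data/ files
gsutil ls -r gs://example-bucket/bqds/shareddataset/EXAMPLE/
```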

## Transformation options

There are generally two ways of transforming data from the
representation in your data file to the target representation that is
available to consumers. If you know the schema in advance, you can author
a ```schema.json``` file that embeds the schema in BigQuery
format.

If the schema is not available in advance, the
data can be autodetected and, in the worst case scenario, be staged
with all columns representing ```STRING```s. In this case, you can
author a ```transform.sql``` with SQL-based transformations
into the target state.

Assume a file with the following format:
|date|symbol|closing_price|
|---|---|---|
|20211031|ZVZZT|4.19|


The ```fields``` array of ```schema.json``` for this might be:

```
{"name": "date", "type": "string"},
{"name": "symbol", "type": "string"},
{"name": "closing_price", "type": "numeric"}
```

Autodetection would instead interpret ```date``` as an ```INTEGER``` type, but the integer representation requires a casting to
```STRING``` before running the ```SUBSTR``` function to further
transform it into a
```DATE``` in the ultimate destination table's associated column. Two different
```transform.sql``` configurations illustrate this:

```
-- as a STRING
CAST(CONCAT(SUBSTR(date, 1, 4), '-', SUBSTR(date, 5, 2), '-', SUBSTR(date, 7, 2)) AS DATE) AS date,
symbol,
closing_price

-- as an INTEGER (autodetected): cast to STRING before SUBSTR
CAST(CONCAT(SUBSTR(CAST(date AS STRING), 1, 4), '-', SUBSTR(CAST(date AS STRING), 5, 2), '-', SUBSTR(CAST(date AS STRING), 7, 2)) AS DATE) AS date,
symbol,
closing_price
```
