An online shop wants to migrate its database to Cloud Spanner. As a first step we generate some data that simulates the backup of a traditional RDBMS. For this example we assume the database didn't use advanced features like triggers, stored procedures, or user-defined functions. Traditional referential constraints aren't supported in Cloud Spanner. To replace a subset of the referential constraints possible in a traditional RDBMS, one can use interleaving, i.e. hierarchically nested parent-child tables and rows. However, for simplicity, interleaving is not used in this example.
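For illustration only, here is a minimal sketch of what an interleaved parent-child schema could look like when created through the Go database admin client. The `Orders`/`OrderLineItems` schema and the database path are hypothetical placeholders, not part of this sample:

```go
package main

import (
	"context"
	"log"

	database "cloud.google.com/go/spanner/admin/database/apiv1"
	adminpb "google.golang.org/genproto/googleapis/spanner/admin/database/v1"
)

func main() {
	ctx := context.Background()
	admin, err := database.NewDatabaseAdminClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer admin.Close()

	// OrderLineItems rows are stored physically together with their parent
	// Orders row, which stands in for a foreign-key relationship.
	op, err := admin.UpdateDatabaseDdl(ctx, &adminpb.UpdateDatabaseDdlRequest{
		Database: "projects/your-project-id/instances/til-about-cloudspanner/databases/your-database",
		Statements: []string{
			`CREATE TABLE Orders (
			    OrderId INT64 NOT NULL,
			) PRIMARY KEY (OrderId)`,
			`CREATE TABLE OrderLineItems (
			    OrderId    INT64 NOT NULL,
			    LineItemId INT64 NOT NULL,
			    Quantity   INT64,
			) PRIMARY KEY (OrderId, LineItemId),
			INTERLEAVE IN PARENT Orders ON DELETE CASCADE`,
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	if err := op.Wait(ctx); err != nil {
		log.Fatal(err)
	}
}
```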
In folder version01 you find sample code that loads data in batches of unordered rows. Each batch spans multiple table splits, which leads to contention during the data load.
In folder version02 you find sample code that loads data in batches of distinct ranges of rows ordered by the primary key columns. Each transaction spans fewer table splits, resulting in less contention during the data load and thus higher data load throughput.
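To make the difference concrete, here is a condensed sketch of the version01 approach (not the repo's actual code; the function and channel names are illustrative). Mutations are committed in arrival order, so each transaction may touch rows spread across many splits; the ordered counterpart is sketched in the version02 section below:

```go
package bulkload

import (
	"context"
	"log"
	"sync"

	"cloud.google.com/go/spanner"
)

// loadUnordered commits batches of mutations in arrival order. Because the
// rows are unordered, a single commit can touch rows scattered across many
// table splits, which is exactly the contention version01 demonstrates.
func loadUnordered(ctx context.Context, client *spanner.Client, rows <-chan *spanner.Mutation, batchSize, parallelTxn int) {
	sem := make(chan struct{}, parallelTxn) // cap concurrent commits (PARALLELTXN)
	var wg sync.WaitGroup
	flush := func(ms []*spanner.Mutation) {
		defer wg.Done()
		defer func() { <-sem }()
		if _, err := client.Apply(ctx, ms); err != nil {
			log.Printf("apply failed: %v", err)
		}
	}
	batch := make([]*spanner.Mutation, 0, batchSize)
	for m := range rows {
		batch = append(batch, m)
		if len(batch) == batchSize { // commit every BATCHSIZE rows
			sem <- struct{}{}
			wg.Add(1)
			go flush(batch)
			batch = make([]*spanner.Mutation, 0, batchSize)
		}
	}
	if len(batch) > 0 { // commit the remainder
		sem <- struct{}{}
		wg.Add(1)
		go flush(batch)
	}
	wg.Wait()
}
```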
You can find details about best practices for bulk loading data into Cloud Spanner here.
To simulate loading data with unordered batches of rows, follow these steps.
See the root README.md for how to create the instance.
gcloud compute ssh til-about-cloudspanner --zone europe-west1-c
git clone https://github.com/hostirosti/til-about-cloudspanner
cd 03-maximizing-data-load-throughput/version01/golang
export CSTIL_PROJECT=`gcloud config list --format 'value(core.project)'`
gcloud spanner instances create til-about-cloudspanner --config regional-europe-west1 --description "TIL about Cloud Spanner" --nodes 3
docker build -t gcr.io/$CSTIL_PROJECT/til-about-cloudspanner-03:v1 --build-arg version=v1 .
If you build the container locally but want to run it from your GCE instance, push it to the project's private container registry first:
gcloud docker -- push gcr.io/$CSTIL_PROJECT/til-about-cloudspanner-03:v1
Copy config-sample.env to config.env and adjust the placeholders. If you want to test a larger or smaller dataset, adjust the cardinalities as well.
Make sure the Google Cloud Storage bucket exists before running the sample.
For BATCHSIZE use 1000, and for PARALLELTXN start with 50 on a 3-node regional instance.
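As an illustration, a filled-in config.env might look like the following. The authoritative variable names are those in config-sample.env; INSTANCE in particular is an assumption here, and all values are placeholders:

```
# Hypothetical values; see config-sample.env for the exact variables.
PROJECT=your-project-id
INSTANCE=til-about-cloudspanner
DATABASE=your-database
BUCKET=your-bucket-id
BATCHSIZE=1000
PARALLELTXN=50
```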
The demo application supports 4 commands:

- create: creates the database and schema
- generate: generates sample data and writes it as CSV files to Google Cloud Storage
- load: reads the CSV files from Google Cloud Storage and persists the data to your Google Cloud Spanner database
- reset: deletes all rows from all tables in your demo database
If you run the code locally and use a service account, you need to add -v {PATH_TO_SERVICE_ACCOUNT}:/service-account.json to your docker command.
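For example (the local path is hypothetical):

docker run --env-file config.env -v /home/you/service-account.json:/service-account.json -it gcr.io/$CSTIL_PROJECT/til-about-cloudspanner-03:v1 create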
First create the database and schema:
docker run --env-file config.env -it gcr.io/$CSTIL_PROJECT/til-about-cloudspanner-03:v1 create
Next generate the sample data:
docker run --env-file config.env -it gcr.io/$CSTIL_PROJECT/til-about-cloudspanner-03:v1 generate
Then load the sample data:
docker run --env-file config.env -it gcr.io/$CSTIL_PROJECT/til-about-cloudspanner-03:v1 load
To reset the database run:
docker run --env-file config.env -it gcr.io/$CSTIL_PROJECT/til-about-cloudspanner-03:v1 reset
In folder version02 you find the sample code from version01, adjusted to buffer rows in memory, order them by the primary key columns, and split them into distinct partitions of rows to minimize contention and maximize data load throughput.
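A condensed sketch of that idea (not the repo's actual code; the row type and OrderLineItems schema are hypothetical):

```go
package bulkload

import (
	"context"
	"sort"

	"cloud.google.com/go/spanner"
)

// row models one OrderLineItems record; the schema is hypothetical.
type row struct {
	OrderId, LineItemId, Quantity int64
}

// loadOrdered sorts the buffered rows by the primary key columns and commits
// contiguous chunks, so each transaction lands on as few splits as possible.
func loadOrdered(ctx context.Context, client *spanner.Client, rows []row, batchSize int) error {
	// Order by the primary key columns (OrderId, LineItemId).
	sort.Slice(rows, func(i, j int) bool {
		if rows[i].OrderId != rows[j].OrderId {
			return rows[i].OrderId < rows[j].OrderId
		}
		return rows[i].LineItemId < rows[j].LineItemId
	})
	// Commit each contiguous range of keys in its own transaction.
	for start := 0; start < len(rows); start += batchSize {
		end := start + batchSize
		if end > len(rows) {
			end = len(rows)
		}
		ms := make([]*spanner.Mutation, 0, end-start)
		for _, r := range rows[start:end] {
			ms = append(ms, spanner.Insert("OrderLineItems",
				[]string{"OrderId", "LineItemId", "Quantity"},
				[]interface{}{r.OrderId, r.LineItemId, r.Quantity}))
		}
		if _, err := client.Apply(ctx, ms); err != nil {
			return err
		}
	}
	return nil
}
```

Because the resulting chunks cover distinct, non-overlapping key ranges, they can still be committed by up to PARALLELTXN parallel workers without contending with each other.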
To run the improved sample, follow these steps:
gcloud compute ssh til-about-cloudspanner --zone europe-west1-c
git clone https://github.com/hostirosti/til-about-cloudspanner
cd 03-maximizing-data-load-throughput/version02/golang
export CSTIL_PROJECT=`gcloud config list --format 'value(core.project)'`
gcloud spanner instances create til-about-cloudspanner --config regional-europe-west1 --description "TIL about Cloud Spanner" --nodes 3
docker build -t gcr.io/$CSTIL_PROJECT/til-about-cloudspanner-03:v2 --build-arg version=v2 .
If you build the container locally but want to run it from your GCE instance, first push it to the project's private container registry:
gcloud docker -- push gcr.io/$CSTIL_PROJECT/til-about-cloudspanner-03:v2
Copy config-sample.env to config.env and adjust the placeholders. If you want to test a larger or smaller dataset, adjust the cardinalities as well.
Make sure the Google Cloud Storage bucket exists before running the sample.
For BATCHSIZE use 1000, and for PARALLELTXN start with 50 on a 3-node regional instance.
The demo application supports 4 commands:

- create: creates the database and schema
- generate: generates sample data and writes it as CSV files to Google Cloud Storage
- load: reads the CSV files from Google Cloud Storage and persists the data to your Google Cloud Spanner database
- reset: deletes all rows from all tables in your demo database
If you run the code locally and use a service account, you need to add -v {PATH_TO_SERVICE_ACCOUNT}:/service-account.json to your docker command (see the version01 example above).
First create the database and schema:
docker run --env-file config.env -it gcr.io/$CSTIL_PROJECT/til-about-cloudspanner-03:v2 create
Next generate the sample data:
docker run --env-file config.env -it gcr.io/$CSTIL_PROJECT/til-about-cloudspanner-03:v2 generate
Then load the sample data:
docker run --env-file config.env -it gcr.io/$CSTIL_PROJECT/til-about-cloudspanner-03:v2 load
To reset the database run:
docker run --env-file config.env -it gcr.io/$CSTIL_PROJECT/til-about-cloudspanner-03:v2 reset
In folder dataflow-connector you find sample code for using the Apache Beam Cloud Spanner Dataflow connector. The sample demonstrates how to load OrderLineItems into your Cloud Spanner database.
Follow the steps from version02 to create the schema and sample data. You can use the same database as before or create a new one dedicated to your testing of the Cloud Spanner Dataflow connector. To create a dedicated database, simply change the DATABASE value in your config.env before running
docker run --env-file config.env -it gcr.io/$CSTIL_PROJECT/til-about-cloudspanner-03:v2 create
To run the pipeline, go into the dataflow-connector folder and execute the following statement (you need Maven and a JDK installed; see setup.sh for reference). Make sure to replace the placeholders PROJECT, BUCKET, and BUCKET_DATA_PATH before executing:
mvn compile exec:java \
-Dexec.mainClass="com.google.cloud.tilaboutcloudspanner.BatchImportPipeline" \
-Dexec.args="--runner=DataflowRunner --project=$CSTIL_PROJECT --gcpTempLocation=gs://{BUCKET}/tmp --region=europe-west1 --bucketDataPath={BUCKET_DATA_PATH} --experiments=shuffle_mode=service --jobName=batchimport-from-csv" -Dexec.cleanupDaemonThreads=false
Example:
mvn compile exec:java \
-Dexec.mainClass="com.google.cloud.tilaboutcloudspanner.BatchImportPipeline" \
-Dexec.args="--runner=DataflowRunner --project=your-project-id --gcpTempLocation=gs://your-bucket-id/tmp --region=europe-west1 --bucketDataPath=gs://your-bucket-id/generated/orderlineitems-0*.csv --experiments=shuffle_mode=service --jobName=batchimport-from-csv" -Dexec.cleanupDaemonThreads=false
We use govendor (go get -u github.com/kardianos/govendor) as the vendor package manager.
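If the vendored dependencies are missing from a fresh checkout, running govendor sync inside the Go package directory should restore them from vendor/vendor.json.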
This is not an official Google product