BigQuery as data destination (#500)
* BigQuery as data destination

* BigQuery Docs

---------

Co-authored-by: godlin <[email protected]>
jamesbayly and bgodlin authored Mar 21, 2024
1 parent 2f34361 commit ed8e07f
Showing 5 changed files with 90 additions and 1 deletion.
5 changes: 4 additions & 1 deletion docs/.vuepress/sidebar.ts
@@ -471,7 +471,10 @@ export const getSidebar = (locale: string) =>
},
{
text: "Other Tools",
children: [`${locale}/run_publish/query/other_tools/metabase.md`],
children: [
`${locale}/run_publish/query/other_tools/metabase.md`,
`${locale}/run_publish/query/other_tools/bigquery.md`,
],
},
],
},
85 changes: 85 additions & 0 deletions docs/run_publish/query/other_tools/bigquery.md
@@ -0,0 +1,85 @@
# Querying Data with BigQuery

Google BigQuery is a fully managed, serverless data warehouse provided by Google Cloud. It lets you run fast SQL queries against large datasets. BigQuery is particularly well-suited for analyzing large volumes of data, including blockchain data, due to its scalability, speed, and ease of use. You might use BigQuery to analyse indexed SubQuery data for the following reasons:

1. Scalability: BigQuery is designed to handle massive datasets, making it suitable for analyzing the vast amounts of data generated by blockchain networks.
2. Speed: BigQuery can process queries on large datasets quickly, allowing you to get insights from your blockchain data in near real-time.
3. Standard SQL: BigQuery supports standard SQL queries, making it easy for analysts and developers familiar with SQL to analyze blockchain data without having to learn a new query language.
4. Serverless: With BigQuery, you don't need to manage any infrastructure. Google handles the infrastructure, so you can focus on analyzing your data.
5. Integration: BigQuery integrates with other Google Cloud services, such as Google Cloud Storage and Looker Studio (formerly Google Data Studio), making it easy to ingest, store, and visualize blockchain data.

SubQuery can be integrated with BigQuery in only a few steps, which means that you can export indexed blockchain data directly from SubQuery to BigQuery.

## Integrating SubQuery with BigQuery

At a high level, the integration of SubQuery with BigQuery works in three steps, each of which can be automated:

1. Index data using the SubQuery Indexing SDK
2. Export data using SubQuery's CSV export
3. Automate the loading of your CSV exports into BigQuery

### Export data using SubQuery's CSV export

Ensure that the indexed data is set to save in CSV files by enabling the [relevant CSV flag](../../references.md#csv-out-dir). Upon successful configuration, CSV files will be automatically created and populated as the indexing process runs.

We suggest running your project in GCP for ease of automated integration, although you can run your SubQuery project anywhere. Running in GCP means you can export your CSVs to Google Cloud Storage for automated integration.

To save the data from your Docker container to Google Cloud Storage (GCS) instead of the local disk, you can use the `gsutil` command-line tool within your Docker container. Here's a general approach:

Install `gsutil` in your Docker container. You can use the following commands in your Dockerfile to install `gsutil`:

```Dockerfile
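# Install curl and gnupg, then run the Google Cloud SDK installer without prompts
RUN apt-get update && apt-get install -y \
    curl \
    gnupg \
    && curl -sSL https://sdk.cloud.google.com | bash -s -- --disable-prompts
# Put gcloud and gsutil on PATH (assumes the default install location for the root user)
ENV PATH="${PATH}:/root/google-cloud-sdk/bin"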
```

This will install the Google Cloud SDK, which includes `gsutil`.

Authenticate `gsutil`: You need to authenticate `gsutil` with your Google Cloud account. You can do this by running the following command and following the instructions it prints:

```sh
gcloud auth login
```
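
`gcloud auth login` starts an interactive flow, which can be awkward inside a container. As a minimal sketch, assuming you have created a service account and mounted its JSON key into the container, you could authenticate non-interactively instead. The key path `/secrets/gcp-key.json` and the `GCLOUD` override variable are hypothetical names used only for illustration:

```sh
# Sketch: authenticate non-interactively with a service account key.
# KEY_FILE is a hypothetical path; mount the real key into the container as a secret.
KEY_FILE="${KEY_FILE:-/secrets/gcp-key.json}"
# GCLOUD can be overridden (for example with a stub) to preview the command.
GCLOUD="${GCLOUD:-gcloud}"

authenticate() {
  # Activates the service account so that gcloud and gsutil use its credentials
  $GCLOUD auth activate-service-account --key-file="$KEY_FILE"
}
```

Call `authenticate` once when the container starts, before any `gsutil` command runs.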

Use `gsutil` to copy your CSV file to GCS. Once authenticated, you can use the `gsutil cp` command to upload the file. For example, if your CSV file is `/path/to/your/local/file.csv` and you want to upload it to a bucket named `your-bucket`:

```sh
gsutil cp /path/to/your/local/file.csv gs://your-bucket/
```

Replace `your-bucket` with your actual bucket name.

Make sure to handle any permissions and access issues based on your GCP setup.
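
Because the indexer keeps writing to its CSV files, copying a single file once is rarely enough. The following is a rough sketch of mirroring the whole CSV output directory with `gsutil rsync`. The directory `/app/csv-out`, the bucket name, the `subquery-csv` prefix, and the `GSUTIL` override variable are all placeholder assumptions, not values defined by SubQuery:

```sh
# Sketch: mirror the SubQuery CSV output directory to a GCS bucket.
# CSV_DIR and BUCKET are placeholder values; set them for your deployment.
CSV_DIR="${CSV_DIR:-/app/csv-out}"
BUCKET="${BUCKET:-your-bucket}"
# GSUTIL can be overridden (for example with a stub) to preview the command.
GSUTIL="${GSUTIL:-gsutil}"

sync_csvs() {
  # -m runs transfers in parallel; rsync -r copies new or changed files recursively
  $GSUTIL -m rsync -r "$CSV_DIR" "gs://$BUCKET/subquery-csv"
}
```

Calling `sync_csvs` repeatedly is safe, since `rsync` only transfers files that are new or have changed.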

### Adding CSV Data to BigQuery

Once a sufficient amount of data is indexed for analysis, it's time to load it into BigQuery. Begin by creating an account on [Google Cloud](https://cloud.google.com) if you haven't already. Follow the steps outlined in [Enable the BigQuery sandbox](https://cloud.google.com/bigquery/docs/sandbox) to set up your account.

Once your account is created, you can proceed to batch load the data. Depending on your deployment setup, the commands for loading data to BigQuery may vary slightly. Refer to the following guides for more details:

- [Loading data from local files](https://cloud.google.com/bigquery/docs/batch-loading-data#loading_data_from_local_files)
- [Loading CSV data from Cloud Storage](https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-csv)

Alternatively, you can use the `bq` command-line tool to load the CSV file into BigQuery. Here's an example command:

`bq load --autodetect --source_format=CSV your_dataset.your_table gs://your-bucket/your-file.csv`

Replace `your_dataset` with your dataset name, `your_table` with your table name, and `gs://your-bucket/your-file.csv` with the path to your CSV file in GCS.

Make sure you have the necessary permissions to create tables in BigQuery and read from GCS.
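
A load fails if the target dataset does not exist yet. As a hedged sketch, the function below creates the dataset when `bq show` cannot find it and then runs the same kind of load command as above. The dataset, table, and URI values are the placeholders used in this guide, and the `BQ` override variable is an assumption added for illustration:

```sh
# Sketch: make sure the dataset exists, then load one CSV file from GCS.
DATASET="${DATASET:-your_dataset}"
TABLE="${TABLE:-your_table}"
CSV_URI="${CSV_URI:-gs://your-bucket/your-file.csv}"
# BQ can be overridden (for example with a stub) to preview the commands.
BQ="${BQ:-bq}"

ensure_dataset_and_load() {
  # Create the dataset only if "bq show" cannot find it
  $BQ show "$DATASET" >/dev/null 2>&1 || $BQ mk --dataset "$DATASET"
  # Let BigQuery infer the schema from the CSV contents
  $BQ load --autodetect --source_format=CSV "$DATASET.$TABLE" "$CSV_URI"
}
```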

## Query your data in BigQuery

After loading the data, you can query it. The screenshot below, taken from the Google Cloud console, shows a successful `SELECT *` query on one of the loaded CSV files:

![](/assets/img/run_publish/bigquery/consoleBigquery.png)
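
A similar query can also be run from the command line with `bq query`. This is only a sketch: the dataset and table names are the placeholders used earlier in this guide, the `LIMIT 10` is added to keep the preview small, and the `BQ` override variable is an assumption for illustration:

```sh
# Sketch: preview the first rows of a loaded table from the command line.
DATASET="${DATASET:-your_dataset}"
TABLE="${TABLE:-your_table}"
# BQ can be overridden (for example with a stub) to preview the command.
BQ="${BQ:-bq}"

preview_table() {
  # --use_legacy_sql=false selects standard SQL (GoogleSQL) instead of legacy SQL
  $BQ query --use_legacy_sql=false \
    "SELECT * FROM \`$DATASET.$TABLE\` LIMIT 10"
}
```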

Loading your data into BigQuery gives you a scalable, serverless platform that integrates with other Google Cloud services, so you can focus on analytics rather than infrastructure management.

## Synchronise Updates Automatically

Loading a CSV file from Google Cloud Storage or a local disk into BigQuery is a one-off operation; it does not set up automatic synchronisation. If you or the SubQuery indexer change the CSV file in GCS, you must reload the updated file into BigQuery by repeating the same steps.

To automate this process, consider setting up a recurring job through Google Cloud services or a cron job that runs the commands described above. Alternatively, you can build this automation directly into your mapping code. For example, create a [block handler](../../../build/manifest/ethereum.md#mapping-handlers-and-filters) with a specific `modulo` filter to load your data in batches at fixed block intervals. Either approach starts a load job in BigQuery so that your data stays synchronised.
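
As one possible shape for the cron approach, the sketch below reloads a table from its CSV file in GCS. It uses the `--replace` flag of `bq load` so that reloading a growing CSV overwrites the table rather than appending duplicate rows. The script path in the crontab comment, the 15-minute interval, the placeholder names, and the `BQ` override variable are all assumptions for illustration:

```sh
#!/bin/sh
# Sketch of a script that a cron job could run to refresh BigQuery.
# Hypothetical crontab entry, every 15 minutes:
#   */15 * * * * /usr/local/bin/refresh_bigquery.sh
DATASET="${DATASET:-your_dataset}"
TABLE="${TABLE:-your_table}"
CSV_URI="${CSV_URI:-gs://your-bucket/your-file.csv}"
# BQ can be overridden (for example with a stub) to preview the command.
BQ="${BQ:-bq}"

refresh_table() {
  # --replace overwrites the table, so repeated loads do not duplicate rows
  $BQ load --replace --autodetect --source_format=CSV "$DATASET.$TABLE" "$CSV_URI"
}
```

End the script with a call to `refresh_table`. Replacing the whole table is simple but rereads the entire file on every run, so for very large exports an incremental strategy may be a better fit.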
1 change: 1 addition & 0 deletions docs/run_publish/query/query.md
@@ -19,5 +19,6 @@ We now have guides to expose SubQuery data to the following locations.
- [Direct Postgres Access](../run.md#connect-to-database) - you can directly connect to the Postgres data from any other tool or service.
- [Metabase](./other_tools/metabase.md) - an industry leading open-source and free data visualisation and business intelligence tool.
- [CSV Export](../references.md#csv-out-dir) - export indexed datasets to CSV files easily.
- [BigQuery](./other_tools/bigquery.md) - a fully managed, serverless data warehouse provided by Google Cloud, well-suited for analyzing large volumes of data.

![Integration Ecosystem](/assets/img/run_publish/integration_ecosystem.png)
