cloudquery-sync.yaml

id: cloudquery-sync
namespace: company.team

tasks:
  - id: hn_to_duckdb
    type: io.kestra.plugin.cloudquery.Sync
    env:
      CLOUDQUERY_API_KEY: "{{ secret('CLOUDQUERY_API_KEY') }}"
    incremental: false
    configs:
      - kind: source
        spec:
          name: hackernews
          path: cloudquery/hackernews
          version: v3.0.13
          tables:
            - "*"
          destinations:
            - duckdb
          spec:
            item_concurrency: 100
            start_time: "{{ trigger.date ?? execution.startDate | dateAdd(-1, 'DAYS') }}"
      - kind: destination
        spec:
          name: duckdb
          path: cloudquery/duckdb
          version: v4.2.10
          write_mode: overwrite-delete-stale
          spec:
            connection_string: hn.db

triggers:
  - id: schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "@daily"
    timezone: US/Eastern

extend:
  title: Schedule a CloudQuery data ingestion sync with kestra
  description: >-
    This flow will start a batch job to sync data from HackerNews to DuckDB with
    CloudQuery. The sync process relies on CloudQuery source and destination
    plugins. The source plugin will extract data from HackerNews and the
    destination plugin will load it to DuckDB. 

    This flow uses Kestra templating to dynamically configure the start time of
    the sync process to be 1 day before the scheduled date, allowing you to also
    backfill data for past intervals in small batches. 

    Alternatively, you can toggle the `incremental` flag to `true` to enable
    incremental sync. During incremental syncs, Kestra will store the CloudQuery
    cursor in the Kestra backend, and will use it to always start the sync
    process from the last cursor position.

    To get started configuring CloudQuery sources and destinations, go to
    [CloudQuery Integrations](https://www.cloudquery.io/integrations) page. From
    here, you can select your desired source and destination, and you'll see the
    YAML configuration that you can copy and paste into your own file, or into
    Kestra flow. This page will also provide more detailed instructions and
    references about tables available in each plugin.

    Note that you can [generate an API
    key](https://docs.cloudquery.io/docs/deployment/generate-api-key) to use
    premium plugins. You can add the API key as an environment variable:

    ```yaml
      - id: hn_to_duckdb
        type: io.kestra.plugin.cloudquery.Sync
        env:
          CLOUDQUERY_API_KEY: "{{ secret('CLOUDQUERY_API_KEY') }}"
    ```      
  tags:
    - Ingest
    - CLI
  ee: false
  demo: true
  meta_description: This flow will start a batch job to sync data from HackerNews
    to DuckDB with CloudQuery.