Update data ingestion overview about batch (#55)
* direct correct link

* add batching strategy for file sink

* availability

* availability for encode

* remove implementation detail

* hide the concept of chunk

* remove another implementation detail

* highlight something

* Update delivery/overview.mdx

Co-authored-by: congyi wang <[email protected]>
Signed-off-by: IrisWan <[email protected]>

* Update delivery/overview.mdx

Co-authored-by: congyi wang <[email protected]>
Signed-off-by: IrisWan <[email protected]>

* default batching strategy

* Update .wordlist.txt

* add partition by and example

* rename into `path_partition_prefix`

---------

Signed-off-by: IrisWan <[email protected]>
Co-authored-by: congyi wang <[email protected]>
WanYixian and wcy-fdu authored Nov 25, 2024
1 parent f00d7b6 commit 4471321
Showing 2 changed files with 47 additions and 0 deletions.
46 changes: 46 additions & 0 deletions delivery/overview.mdx
@@ -124,3 +124,49 @@ WITH (
<Note>
File sink currently supports only append-only mode, so please change the query to `append-only` and specify this explicitly after the `FORMAT ... ENCODE ...` statement.
</Note>

## Batching strategy for file sink

RisingWave batches writes to file sinks so that it does not generate large numbers of small files. Batching is available for Parquet, JSON, and CSV encoding.

### Category

- **Batching based on row numbers**:
RisingWave monitors the number of rows written and completes the file once the maximum row count threshold is reached.

- **Batching based on rollover interval**:
RisingWave checks the rollover interval each time a chunk is about to be written and each time a barrier is encountered, and completes the file once the interval has elapsed.

- If no batching strategy is specified, RisingWave defaults to writing a new file every 10 seconds.

<Note>The batching conditions are relatively coarse-grained. The actual number of rows in a file, or the exact time at which it is completed, may differ from the specified thresholds, because batching is intentionally flexible in favor of efficient file management.</Note>
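
The two conditions and the default can be illustrated with a small Python sketch. This is not RisingWave's implementation: the function name `should_finish_file` and its arguments are hypothetical, and the comparison operators are an assumption, since the thresholds are described as approximate.

```python
from typing import Optional

# Default stated above: a new file every 10 seconds when nothing is configured.
DEFAULT_ROLLOVER_SECONDS = 10


def should_finish_file(
    rows_written: int,
    seconds_open: float,
    max_row_count: Optional[int] = None,
    rollover_seconds: Optional[float] = None,
) -> bool:
    """Return True when the current file should be completed.

    Hypothetical helper for illustration only. A file is completed when
    either configured threshold is reached, whichever comes first.
    """
    # With no strategy configured, fall back to the time-based default.
    if max_row_count is None and rollover_seconds is None:
        rollover_seconds = DEFAULT_ROLLOVER_SECONDS

    # Batching based on row numbers.
    if max_row_count is not None and rows_written >= max_row_count:
        return True

    # Batching based on rollover interval.
    if rollover_seconds is not None and seconds_open >= rollover_seconds:
        return True

    return False
```

In a real sink this check would run at the points described above (before a chunk is written and at a barrier), which is one reason the observed row counts and timings can overshoot the thresholds.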

### File organization

You can use `path_partition_prefix` to organize files into subdirectories based on their creation time. The available options are `month`, `day`, and `hour`. If the option is not specified, files are stored directly in the root directory without any time-based subdirectories.

Files currently follow the naming pattern `/Option<path_partition_prefix>/executor_id + timestamp.suffix`, where the `path_partition_prefix` subdirectory is present only when that option is set. The timestamp differentiates files batched by the rollover interval.

The output files look like this:

```
path/2024-09-20/47244640257_1727072046.parquet
path/2024-09-20/47244640257_1727072055.parquet
```
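
As a rough illustration of the naming pattern, the Python sketch below assembles a path from its parts. Everything in it is an assumption made for illustration: the helper `build_file_path` is hypothetical, the `_` separator and the `day` directory format are inferred from the sample output above, the `month` and `hour` formats are guesses, and the subdirectory is derived from the file's own timestamp in UTC.

```python
from datetime import datetime, timezone
from typing import Optional

# Directory formats per option. Only "day" is inferred from the sample output;
# "month" and "hour" are assumed formats, not confirmed by the documentation.
PARTITION_FORMATS = {
    "month": "%Y-%m",
    "day": "%Y-%m-%d",
    "hour": "%Y-%m-%d-%H",
}


def build_file_path(
    root: str,
    executor_id: int,
    unix_ts: int,
    suffix: str,
    path_partition_prefix: Optional[str] = None,
) -> str:
    """Hypothetical helper: join root, optional time subdirectory, and file name."""
    file_name = f"{executor_id}_{unix_ts}.{suffix}"

    # Without a partition prefix, files go directly under the root directory.
    if path_partition_prefix is None:
        return f"{root}/{file_name}"

    # Assumption: the subdirectory reflects the file's timestamp in UTC.
    created = datetime.fromtimestamp(unix_ts, tz=timezone.utc)
    subdir = created.strftime(PARTITION_FORMATS[path_partition_prefix])
    return f"{root}/{subdir}/{file_name}"
```

For example, `build_file_path("path", 47244640257, 1726791400, "parquet", "day")` yields a path of the same shape as the sample output, while omitting the last argument drops the date subdirectory.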

### Example

```sql
CREATE SINK s1
FROM t
WITH (
    connector = 's3',
    max_row_count = '100',
    rollover_seconds = '10',
    type = 'append-only',
    path_partition_prefix = 'day'
) FORMAT PLAIN ENCODE PARQUET (force_append_only=true);
```

In this example, RisingWave completes a file when it contains more than 100 rows or when writing to it has continued for more than 10 seconds.
Once completed, the file becomes visible in the downstream sink system.
1 change: 1 addition & 0 deletions mint.json
@@ -167,6 +167,7 @@
{"source": "/docs/current/architecture", "destination": "/reference/architecture"},
{"source": "/docs/current/fault-tolerance", "destination": "/reference/fault-tolerance"},
{"source": "/docs/current/limitations", "destination": "/reference/limitations"},
{"source": "/docs/current/sources", "destination": "/integrations/sources/overview"},
{"source": "/docs/current/sql-alter-connection", "destination": "/sql/commands/sql-alter-connection"},
{"source": "/docs/current/sql-alter-database", "destination": "/sql/commands/sql-alter-database"},
{"source": "/docs/current/sql-alter-function", "destination": "/sql/commands/sql-alter-function"},