[docs](load) improve broker load structure (#1368)
kaijchen authored Nov 18, 2024
1 parent 1cb0fc1 commit edb5412
Showing 2 changed files with 251 additions and 106 deletions.
72 changes: 72 additions & 0 deletions docs/data-operate/import/import-way/broker-load-manual.md
@@ -60,6 +60,23 @@ WITH [HDFS|S3|BROKER broker_name]

For the specific syntax for usage, please refer to [BROKER LOAD](../../../sql-manual/sql-statements/Data-Manipulation-Statements/Load/BROKER-LOAD) in the SQL manual.

## Load Properties

| Property Name | Type | Default Value | Description |
| --- | --- | --- | --- |
| "timeout" | Long | 14400 | Specifies the timeout of the load job in seconds. The configurable range is 1 to 259200 seconds. |
| "max_filter_ratio" | Float | 0.0 | Specifies the maximum tolerable ratio of filterable (irregular or otherwise problematic) data; the default is zero tolerance. The value range is 0 to 1. If the error rate of the imported data exceeds this value, the load fails. Irregular data does not include rows filtered out by the WHERE condition. |
| "exec_mem_limit" | Long | 2147483648 | The memory limit of the load task in bytes (2 GB by default). |
| "strict_mode" | Boolean | false | Specifies whether to enable strict mode for this load. |
| "partial_columns" | Boolean | false | Specifies whether to enable partial column update. This parameter is only available for Unique Key + Merge-on-Write tables. |
| "timezone" | String | "Asia/Shanghai" | Specifies the timezone used for this load. This parameter affects the results of all timezone-related functions involved in the load. |
| "load_parallelism" | Integer | 8 | Limits the maximum number of parallel instances on each backend. |
| "send_batch_parallelism" | Integer | 1 | The parallelism used by the sink node to send data when memtable_on_sink_node is disabled. |
| "load_to_single_tablet" | Boolean | false | Specifies whether to load data only to a single tablet of the corresponding partition. This parameter is only available when loading into an OLAP table with random bucketing. |
| "skip_lines" | Integer | 0 | The number of lines to skip at the beginning of a CSV file. This parameter is ignored when the format is csv_with_names or csv_with_names_and_types. |
| "trim_double_quotes" | Boolean | false | Specifies whether to trim the outermost double quotes of each field in the source files. |
| "priority" | One of "HIGH", "NORMAL", "LOW" | "NORMAL" | The priority of the load job. |

## Checking import status

Broker Load is an asynchronous import method, and the specific import results can be viewed through the [SHOW LOAD](../../../sql-manual/sql-statements/Show-Statements/SHOW-LOAD) command.
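
For example, a minimal sketch of checking a job by its label (the database and label names are placeholders):

```sql
-- Check the status of a Broker Load job by its label
SHOW LOAD FROM example_db WHERE LABEL = "broker_load_demo";
```
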
@@ -514,6 +531,7 @@ Doris supports importing data directly from object storage systems that support
)
WITH S3
(
"provider" = "S3",
"AWS_ENDPOINT" = "AWS_ENDPOINT",
"AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
"AWS_SECRET_KEY"="AWS_SECRET_KEY",
@@ -525,6 +543,17 @@ Doris supports importing data directly from object storage systems that support
);
```

The `provider` parameter specifies the vendor of the S3 service (a non-AWS sketch follows this list). The supported providers are:

- "S3" (Amazon Web Services)
- "AZURE" (Microsoft Azure)
- "GCP" (Google Cloud Platform)
- "OSS" (Alibaba Cloud)
- "COS" (Tencent Cloud)
- "OBS" (Huawei Cloud)
- "BOS" (Baidu Cloud)

### Common Issues

- The S3 SDK uses the virtual-hosted style access method by default. However, some object storage systems may not have enabled or may not support virtual-hosted style access. In such cases, you can add the `use_path_style` parameter to force path-style access, as in the sketch below:
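
A minimal sketch, assuming an S3-compatible endpoint; the endpoint, keys, bucket, and table names are placeholders:

```sql
LOAD LABEL example_db.path_style_demo
(
    DATA INFILE("s3://your_bucket/path/to/file.csv")
    INTO TABLE example_tbl
    COLUMNS TERMINATED BY ","
)
WITH S3
(
    "provider" = "S3",
    "AWS_ENDPOINT" = "AWS_ENDPOINT",
    "AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
    "AWS_SECRET_KEY" = "AWS_SECRET_KEY",
    "AWS_REGION" = "AWS_REGION",
    -- force path-style access when virtual-hosted style is unavailable
    "use_path_style" = "true"
);
```
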
@@ -640,6 +669,8 @@ Different Broker types and access methods require different authentication information

## Related Configurations

### fe.conf

The following configurations are system-level settings for Broker Load and affect all Broker Load jobs. They can be adjusted by modifying the `fe.conf` file.

**min_bytes_per_broker_scanner**
@@ -669,6 +700,47 @@ Import Concurrency = Math.min(Source File Size / min_bytes_per_broker_scanner, m
Processing Volume per BE for this Import = Source File Size / Import Concurrency
```

**default_load_parallelism**

- Default: 8.

- Limits the maximum number of parallel instances on each backend.

- The minimum data volume processed per BE (`min_bytes_per_broker_scanner`), the maximum concurrency (`max_broker_concurrency`), the source file size, and the current number of BE nodes jointly determine the concurrency of this load:

```Plain
Import Concurrency = Math.min(Source File Size / min_bytes_per_broker_scanner, max_broker_concurrency, Current Number of BE Nodes * load_parallelism)
Processing Volume per BE for this Import = Source File Size / Import Concurrency
```

**broker_load_default_timeout_second**

- Default: 14400.

- The default broker load timeout in seconds.
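
As a reference, a hedged sketch of how these settings might appear in `fe.conf`; the values for `min_bytes_per_broker_scanner` and `max_broker_concurrency` are illustrative assumptions, not values taken from this page:

```Plain
# Broker Load related FE configurations (illustrative values)
min_bytes_per_broker_scanner = 67108864
max_broker_concurrency = 10
default_load_parallelism = 8
broker_load_default_timeout_second = 14400
```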

### Session variables

**exec_mem_limit**

- Default: 2147483648.

- The memory limit of the load task, in bytes.

**time_zone**

- Default: "Asia/Shanghai".

- The default timezone, which affects the results of time-related functions in the load.

**send_batch_parallelism**

- Default: 1.

- The parallelism used by the sink node to send data when memtable_on_sink_node is disabled.

Set the session variable `enable_memtable_on_sink_node` (default: true) to false to disable the memtable-on-sink-node feature; otherwise this parameter has no effect.
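
A hedged sketch of tuning these session variables before submitting a Broker Load job; the values are examples only:

```sql
-- Adjust load-related session variables for the current session (example values)
SET exec_mem_limit = 4294967296;           -- raise the load memory limit to 4 GB
SET time_zone = "Asia/Shanghai";           -- timezone for time-related functions in the load
SET enable_memtable_on_sink_node = false;  -- disable memtable on sink node ...
SET send_batch_parallelism = 4;            -- ... so that this parallelism takes effect
```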

## Common Issues

**1. Import Error: `Scan bytes per broker scanner exceed limit:xxx`**