Merge remote-tracking branch 'upstream/master' into arrays-are-not-mvds
clintropolis committed Oct 30, 2023
2 parents b729e86 + e6b7c36 commit 1b0ccbb
Showing 108 changed files with 4,355 additions and 640 deletions.
2 changes: 2 additions & 0 deletions distribution/bin/check-licenses.py
@@ -266,6 +266,8 @@ def build_compatible_license_names():
compatible_licenses['Eclipse Public License - Version 1.0'] = 'Eclipse Public License 1.0'
compatible_licenses['Eclipse Public License, Version 1.0'] = 'Eclipse Public License 1.0'
compatible_licenses['Eclipse Public License v1.0'] = 'Eclipse Public License 1.0'
compatible_licenses['Eclipse Public License - v1.0'] = 'Eclipse Public License 1.0'
compatible_licenses['Eclipse Public License - v 1.0'] = 'Eclipse Public License 1.0'
compatible_licenses['EPL 1.0'] = 'Eclipse Public License 1.0'

compatible_licenses['Eclipse Public License 2.0'] = 'Eclipse Public License 2.0'
5 changes: 5 additions & 0 deletions docs/api-reference/sql-ingestion-api.md
@@ -603,6 +603,11 @@ The following table describes the response fields when you retrieve a report for
| `multiStageQuery.payload.status.status` | RUNNING, SUCCESS, or FAILED. |
| `multiStageQuery.payload.status.startTime` | Start time of the query in ISO format. Only present if the query has started running. |
| `multiStageQuery.payload.status.durationMs` | Milliseconds elapsed after the query has started running. -1 denotes that the query hasn't started running yet. |
| `multiStageQuery.payload.status.workers` | Workers for the controller task. |
| `multiStageQuery.payload.status.workers.<workerNumber>` | Array of worker tasks, including retries. |
| `multiStageQuery.payload.status.workers.<workerNumber>[].workerId` | ID of the worker task. |
| `multiStageQuery.payload.status.workers.<workerNumber>[].status` | RUNNING, SUCCESS, or FAILED. |
| `multiStageQuery.payload.status.workers.<workerNumber>[].durationMs` | Milliseconds elapsed after the worker task started running. It is -1 for worker tasks with status RUNNING. |
| `multiStageQuery.payload.status.pendingTasks` | Number of tasks that are not fully started. -1 denotes that the number is currently unknown. |
| `multiStageQuery.payload.status.runningTasks` | Number of currently running tasks. Should be at least 1 since the controller is included. |
| `multiStageQuery.payload.status.segmentLoadStatus` | Segment loading container. Only present after the segments have been published. |
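
For illustration, the `workers` object for a worker that was retried once might look like the following sketch (the task IDs and durations are invented):

```json
"workers": {
  "0": [
    {
      "workerId": "query-812d2d8e-worker0_0",
      "status": "FAILED",
      "durationMs": 2310
    },
    {
      "workerId": "query-812d2d8e-worker0_1",
      "status": "SUCCESS",
      "durationMs": 17490
    }
  ]
}
```
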
2 changes: 1 addition & 1 deletion docs/configuration/index.md
@@ -757,7 +757,7 @@ In current Druid, multiple data segments may be announced under the same Znode.
|Property|Description|Default|
|--------|-----------|-------|
|`druid.announcer.segmentsPerNode`|Each Znode contains info for up to this many segments.|50|
-|`druid.announcer.maxBytesPerNode`|Max byte size for Znode.|524288|
+|`druid.announcer.maxBytesPerNode`|Max byte size for Znode. Allowed range is [1024, 1048576].|524288|
|`druid.announcer.skipDimensionsAndMetrics`|Skip Dimensions and Metrics list from segment announcements. NOTE: Enabling this will also remove the dimensions and metrics list from Coordinator and Broker endpoints.|false|
|`druid.announcer.skipLoadSpec`|Skip segment LoadSpec from segment announcements. NOTE: Enabling this will also remove the loadspec from Coordinator and Broker endpoints.|false|
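
As a sketch, the corresponding `runtime.properties` entries using the default values from the table above might look like this (shown for illustration only):

```properties
# Announce up to 50 segments per Znode, capped at 512 KiB (the default) per Znode.
druid.announcer.segmentsPerNode=50
druid.announcer.maxBytesPerNode=524288
```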

102 changes: 101 additions & 1 deletion docs/data-management/automatic-compaction.md
@@ -162,7 +162,7 @@ To get statistics by API, send a [`GET` request](../api-reference/automatic-comp

## Examples

-The following examples demonstrate potential use cases in which auto-compaction may improve your Druid performance. See more details in [Compaction strategies](../data-management/compaction.md#compaction-strategies). The examples in this section do not change the underlying data.
+The following examples demonstrate potential use cases in which auto-compaction may improve your Druid performance. See more details in [Compaction strategies](../data-management/compaction.md#compaction-guidelines). The examples in this section do not change the underlying data.

### Change segment granularity

@@ -203,6 +203,106 @@ The following auto-compaction configuration compacts updates the `wikipedia` seg
}
```

## Concurrent append and replace

:::info
Concurrent append and replace is an [experimental feature](../development/experimental.md) and is not currently available for SQL-based ingestion.
:::

This feature allows you to safely replace the existing data in an interval of a datasource while new data is being appended to that interval. One of the most common applications is appending new data (for example, via streaming ingestion) to an interval while compaction of that interval is already in progress.

To set up concurrent append and replace, ensure that your ingestion jobs use the appropriate lock types and meet the following conditions:

- The append task (with `appendToExisting` set to `true`) has `taskLockType` set to `APPEND` in the task context.
- The replace task (with `appendToExisting` set to `false`) has `taskLockType` set to `REPLACE` in the task context.
- The segment granularity of the append task is equal to or finer than the segment granularity of the replace task.

:::info

When using concurrent append and replace, keep the following in mind:

- Concurrent append and replace fails if the task holding the `APPEND` lock uses a coarser segment granularity than the task holding the `REPLACE` lock. For example, if the `APPEND` task uses a segment granularity of YEAR and the `REPLACE` task uses a segment granularity of MONTH, you should not use concurrent append and replace.

- Only a single task can hold a `REPLACE` lock on a given interval of a datasource.

- Multiple tasks can hold `APPEND` locks on a given interval of a datasource and append data to that interval simultaneously.

:::


### Configure concurrent append and replace

#### Update the compaction settings

##### Update the compaction settings with the API

Prepare your datasource for concurrent append and replace by setting its task lock type to `REPLACE`.
Add the `taskContext` like you would any other automatic compaction setting through the API:

```shell
curl --location --request POST 'http://localhost:8081/druid/coordinator/v1/config/compaction' \
--header 'Content-Type: application/json' \
--data-raw '{
"dataSource": "YOUR_DATASOURCE",
"taskContext": {
"taskLockType": "REPLACE"
}
}'
```
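
To confirm the setting took effect, you can read the compaction config back, assuming the standard Coordinator endpoint (substitute your datasource name):

```shell
curl --location --request GET 'http://localhost:8081/druid/coordinator/v1/config/compaction/YOUR_DATASOURCE'
```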

##### Update the compaction settings with the UI

In the **Compaction config** for a datasource, set **Allow concurrent compactions (experimental)** to **True**.

#### Add a task lock type to your ingestion job

Next, you need to configure the task lock type for your ingestion job:

- For streaming jobs, the context parameter goes in your supervisor spec, and the lock type is always `APPEND`.
- For legacy JSON-based batch ingestion, the context parameter goes in your ingestion spec, and the lock type can be either `APPEND` or `REPLACE`.

You can provide the context parameter through the API, like any other parameter for the ingestion job, or through the UI.

##### Add the task lock type through the API

Add the following JSON snippet to your supervisor or ingestion spec if you're using the API:

```json
"context": {
"taskLockType": LOCK_TYPE
}
```

The `LOCK_TYPE` depends on what you're trying to accomplish.

Set `taskLockType` to `APPEND` if either of the following is true:

- Dynamic partitioning with append to existing is set to `true`
- The ingestion job is a streaming ingestion job

If you have multiple append jobs targeting the same datasource and you want them to run simultaneously, you also need to include the following context parameter:

```json
"useSharedLock": "true"
```

Keep in mind that `taskLockType` takes precedence over `useSharedLock`. Do not use `useSharedLock` with `REPLACE` task locks.
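
Putting the two parameters together, a minimal sketch of the context for one of several concurrent append jobs would be:

```json
"context": {
  "taskLockType": "APPEND",
  "useSharedLock": "true"
}
```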


Set `taskLockType` to `REPLACE` if you're replacing data. For example, if you use any of the following partitioning types, use `REPLACE`:

- hash partitioning
- range partitioning
- dynamic partitioning with append to existing set to `false`
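
For instance, a native batch job that replaces data using range partitioning might pair its `tuningConfig` and context as in the following sketch (`channel` and the row target are hypothetical values, and the rest of the spec is omitted):

```json
{
  "tuningConfig": {
    "type": "index_parallel",
    "partitionsSpec": {
      "type": "range",
      "partitionDimensions": ["channel"],
      "targetRowsPerSegment": 5000000
    }
  },
  "context": {
    "taskLockType": "REPLACE"
  }
}
```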


##### Add a task lock using the Druid console

As part of the **Load data** wizard for classic batch (JSON-based ingestion) and streaming ingestion, you can configure the task lock type for the ingestion during the **Publish** step:

- If you set **Append to existing** to **True**, you can then set **Allow concurrent append tasks (experimental)** to **True**.
- If you set **Append to existing** to **False**, you can then set **Allow concurrent replace tasks (experimental)** to **True**.


## Learn more

See the following topics for more information:
