Merge remote-tracking branch 'upstream/master' into arrays-are-not-mvds
clintropolis committed Oct 30, 2023
2 parents b729e86 + e6b7c36 commit 1b0ccbb
Showing 108 changed files with 4,355 additions and 640 deletions.
2 changes: 2 additions & 0 deletions distribution/bin/check-licenses.py
@@ -266,6 +266,8 @@ def build_compatible_license_names():
compatible_licenses['Eclipse Public License - Version 1.0'] = 'Eclipse Public License 1.0'
compatible_licenses['Eclipse Public License, Version 1.0'] = 'Eclipse Public License 1.0'
compatible_licenses['Eclipse Public License v1.0'] = 'Eclipse Public License 1.0'
compatible_licenses['Eclipse Public License - v1.0'] = 'Eclipse Public License 1.0'
compatible_licenses['Eclipse Public License - v 1.0'] = 'Eclipse Public License 1.0'
compatible_licenses['EPL 1.0'] = 'Eclipse Public License 1.0'

compatible_licenses['Eclipse Public License 2.0'] = 'Eclipse Public License 2.0'
5 changes: 5 additions & 0 deletions docs/api-reference/sql-ingestion-api.md
@@ -603,6 +603,11 @@ The following table describes the response fields when you retrieve a report for
| `multiStageQuery.payload.status.status` | RUNNING, SUCCESS, or FAILED. |
| `multiStageQuery.payload.status.startTime` | Start time of the query in ISO format. Only present if the query has started running. |
| `multiStageQuery.payload.status.durationMs` | Milliseconds elapsed after the query has started running. -1 denotes that the query hasn't started running yet. |
| `multiStageQuery.payload.status.workers` | Workers for the controller task. |
| `multiStageQuery.payload.status.workers.<workerNumber>` | Array of worker tasks, including retries. |
| `multiStageQuery.payload.status.workers.<workerNumber>[].workerId` | ID of the worker task. |
| `multiStageQuery.payload.status.workers.<workerNumber>[].status` | RUNNING, SUCCESS, or FAILED. |
| `multiStageQuery.payload.status.workers.<workerNumber>[].durationMs` | Milliseconds elapsed after the worker task started running. It is -1 for worker tasks with status RUNNING. |
| `multiStageQuery.payload.status.pendingTasks` | Number of tasks that are not fully started. -1 denotes that the number is currently unknown. |
| `multiStageQuery.payload.status.runningTasks` | Number of currently running tasks. Should be at least 1 since the controller is included. |
| `multiStageQuery.payload.status.segmentLoadStatus` | Segment loading container. Only present after the segments have been published. |
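
For illustration, the `workers` object for a worker that was retried once might look like the following sketch (the task IDs and durations are invented):

```json
"workers": {
  "0": [
    {
      "workerId": "query-812d2d8e-worker0_0",
      "status": "FAILED",
      "durationMs": 2310
    },
    {
      "workerId": "query-812d2d8e-worker0_1",
      "status": "SUCCESS",
      "durationMs": 17490
    }
  ]
}
```
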
2 changes: 1 addition & 1 deletion docs/configuration/index.md
@@ -757,7 +757,7 @@ In current Druid, multiple data segments may be announced under the same Znode.
|Property|Description|Default|
|--------|-----------|-------|
|`druid.announcer.segmentsPerNode`|Each Znode contains info for up to this many segments.|50|
-|`druid.announcer.maxBytesPerNode`|Max byte size for Znode.|524288|
+|`druid.announcer.maxBytesPerNode`|Max byte size for Znode. Allowed range is [1024, 1048576].|524288|
|`druid.announcer.skipDimensionsAndMetrics`|Skip Dimensions and Metrics list from segment announcements. NOTE: Enabling this will also remove the dimensions and metrics list from Coordinator and Broker endpoints.|false|
|`druid.announcer.skipLoadSpec`|Skip segment LoadSpec from segment announcements. NOTE: Enabling this will also remove the loadspec from Coordinator and Broker endpoints.|false|
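
As a sketch, the corresponding `runtime.properties` entries using the default values from the table above might look like this (shown for illustration only):

```properties
# Announce up to 50 segments per Znode, capped at 512 KiB (the default) per Znode.
druid.announcer.segmentsPerNode=50
druid.announcer.maxBytesPerNode=524288
```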

102 changes: 101 additions & 1 deletion docs/data-management/automatic-compaction.md
@@ -162,7 +162,7 @@ To get statistics by API, send a [`GET` request](../api-reference/automatic-comp

## Examples

-The following examples demonstrate potential use cases in which auto-compaction may improve your Druid performance. See more details in [Compaction strategies](../data-management/compaction.md#compaction-strategies). The examples in this section do not change the underlying data.
+The following examples demonstrate potential use cases in which auto-compaction may improve your Druid performance. See more details in [Compaction strategies](../data-management/compaction.md#compaction-guidelines). The examples in this section do not change the underlying data.

### Change segment granularity

@@ -203,6 +203,106 @@ The following auto-compaction configuration compacts updates the `wikipedia` seg
}
```

## Concurrent append and replace

:::info
Concurrent append and replace is an [experimental feature](../development/experimental.md) and is not currently available for SQL-based ingestion.
:::

This feature allows you to safely replace the existing data in an interval of a datasource while new data is being appended to that interval. One of the most common applications is appending new data (for example, via streaming ingestion) to an interval while compaction of that interval is already in progress.

To set up concurrent append and replace, ensure that your ingestion jobs use the appropriate lock types and meet the following conditions:

- The append task (with `appendToExisting` set to `true`) has `taskLockType` set to `APPEND` in the task context.
- The replace task (with `appendToExisting` set to `false`) has `taskLockType` set to `REPLACE` in the task context.
- The segment granularity of the append task is equal to or finer than the segment granularity of the replace task.

:::info

When using concurrent append and replace, keep the following in mind:

- Concurrent append and replace fails if the task holding the `APPEND` lock uses a coarser segment granularity than the task holding the `REPLACE` lock. For example, if the `APPEND` task uses a segment granularity of YEAR and the `REPLACE` task uses a segment granularity of MONTH, you should not use concurrent append and replace.

- Only a single task can hold a `REPLACE` lock on a given interval of a datasource.

- Multiple tasks can hold `APPEND` locks on a given interval of a datasource and append data to that interval simultaneously.

:::


### Configure concurrent append and replace

#### Update the compaction settings

##### Update the compaction settings with the API

Prepare your datasource for concurrent append and replace by setting its task lock type to `REPLACE`.
Add the `taskContext` like you would any other automatic compaction setting through the API:

```shell
curl --location --request POST 'http://localhost:8081/druid/coordinator/v1/config/compaction' \
--header 'Content-Type: application/json' \
--data-raw '{
"dataSource": "YOUR_DATASOURCE",
"taskContext": {
"taskLockType": "REPLACE"
}
}'
```
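
To confirm the setting took effect, you can read the compaction config back, assuming the standard Coordinator endpoint (substitute your datasource name):

```shell
curl --location --request GET 'http://localhost:8081/druid/coordinator/v1/config/compaction/YOUR_DATASOURCE'
```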

##### Update the compaction settings with the UI

In the **Compaction config** for a datasource, set **Allow concurrent compactions (experimental)** to **True**.

#### Add a task lock type to your ingestion job

Next, you need to configure the task lock type for your ingestion job:

- For streaming jobs, the context parameter goes in your supervisor spec, and the lock type is always `APPEND`.
- For legacy JSON-based batch ingestion, the context parameter goes in your ingestion spec, and the lock type can be either `APPEND` or `REPLACE`.

You can provide the context parameter through the API, like any other parameter for the ingestion job, or through the UI.

##### Add the task lock type through the API

Add the following JSON snippet to your supervisor or ingestion spec if you're using the API:

```json
"context": {
"taskLockType": LOCK_TYPE
}
```

The `LOCK_TYPE` depends on what you're trying to accomplish.

Set `taskLockType` to `APPEND` if either of the following is true:

- Dynamic partitioning with append to existing is set to `true`
- The ingestion job is a streaming ingestion job

If you have multiple append jobs targeting the same datasource and you want them to run simultaneously, you also need to include the following context parameter:

```json
"useSharedLock": "true"
```

Keep in mind that `taskLockType` takes precedence over `useSharedLock`. Do not use `useSharedLock` with `REPLACE` task locks.
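
Putting the two parameters together, a minimal sketch of the context for one of several concurrent append jobs would be:

```json
"context": {
  "taskLockType": "APPEND",
  "useSharedLock": "true"
}
```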


Set `taskLockType` to `REPLACE` if you're replacing data. For example, if you use any of the following partitioning types, use `REPLACE`:

- hash partitioning
- range partitioning
- dynamic partitioning with append to existing set to `false`
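
For instance, a native batch job that replaces data using range partitioning might pair its `tuningConfig` and context as in the following sketch (`channel` and the row target are hypothetical values, and the rest of the spec is omitted):

```json
{
  "tuningConfig": {
    "type": "index_parallel",
    "partitionsSpec": {
      "type": "range",
      "partitionDimensions": ["channel"],
      "targetRowsPerSegment": 5000000
    }
  },
  "context": {
    "taskLockType": "REPLACE"
  }
}
```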


##### Add a task lock using the Druid console

As part of the **Load data** wizard for classic batch (JSON-based ingestion) and streaming ingestion, you can configure the task lock type for the ingestion during the **Publish** step:

- If you set **Append to existing** to **True**, you can then set **Allow concurrent append tasks (experimental)** to **True**.
- If you set **Append to existing** to **False**, you can then set **Allow concurrent replace tasks (experimental)** to **True**.


## Learn more

See the following topics for more information:
