Merge branch 'master' into window-subquery-frames
LakshSingla committed Jul 25, 2024
2 parents 22e01a7 + 7e3fab5 commit f6a9611
Showing 849 changed files with 11,714 additions and 17,398 deletions.
@@ -262,12 +262,8 @@ public void setup()
rowsPerSegment
);

- final ScanQueryConfig config = new ScanQueryConfig().setLegacy(false);
  factory = new ScanQueryRunnerFactory(
-     new ScanQueryQueryToolChest(
-         config,
-         DefaultGenericQueryMetricsFactory.instance()
-     ),
+     new ScanQueryQueryToolChest(DefaultGenericQueryMetricsFactory.instance()),
      new ScanQueryEngine(),
      new ScanQueryConfig()
  );
@@ -141,7 +141,7 @@ public void setup()
@Benchmark
public void measureNewestSegmentFirstPolicy(Blackhole blackhole)
{
- final CompactionSegmentIterator iterator = policy.reset(compactionConfigs, dataSources, Collections.emptyMap());
+ final CompactionSegmentIterator iterator = policy.createIterator(compactionConfigs, dataSources, Collections.emptyMap());
for (int i = 0; i < numCompactionTaskSlots && iterator.hasNext(); i++) {
blackhole.consume(iterator.next());
}
1 change: 0 additions & 1 deletion docs/api-reference/sql-ingestion-api.md
@@ -474,7 +474,6 @@ The response shows an example report for a query.
"agent_type",
"timestamp"
],
"legacy": false,
"context": {
"finalize": false,
"finalizeAggregations": false,
17 changes: 8 additions & 9 deletions docs/api-reference/tasks-api.md
@@ -914,13 +914,10 @@ Host: http://ROUTER_IP:ROUTER_PORT
### Get task segments

:::info
- This API is deprecated and will be removed in future releases.
+ This API is no longer supported and always returns a 404 response.
+ Use the metric `segment/added/bytes` instead to identify the segment IDs committed by a task.
:::
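
As a rough illustration (not part of this change), a `segment/added/bytes` event emitted through a JSON-based emitter looks roughly like the following; the service, host, task, and values shown are placeholders:

```json
{
  "feed": "metrics",
  "timestamp": "2024-07-25T00:00:00.000Z",
  "service": "druid/middleManager",
  "host": "localhost:8091",
  "metric": "segment/added/bytes",
  "value": 104857600,
  "dataSource": "wikipedia",
  "taskId": "index_parallel_wikipedia_placeholder",
  "taskType": "index_parallel"
}
```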

- Retrieves information about segments generated by the task given the task ID. To hit this endpoint, make sure to enable the audit log config on the Overlord with `druid.indexer.auditLog.enabled = true`.
-
- In addition to enabling audit logs, configure a cleanup strategy to prevent overloading the metadata store with old audit logs which may cause performance issues. To enable automated cleanup of audit logs on the Coordinator, set `druid.coordinator.kill.audit.on`. You may also manually export the audit logs to external storage. For more information, see [Audit records](../operations/clean-metadata-store.md#audit-records).

#### URL

`GET` `/druid/indexer/v1/task/{taskId}/segments`
@@ -929,12 +926,14 @@ In addition to enabling audit logs, configure a cleanup strategy to prevent over

<Tabs>

<TabItem value="27" label="200 SUCCESS">
<TabItem value="27" label="404 NOT FOUND">


<br/>

- *Successfully retrieved task segments*
+ ```json
+ {
+   "error": "Segment IDs committed by a task action are not persisted anymore. Use the metric 'segment/added/bytes' to identify the segments created by a task."
+ }
+ ```

</TabItem>
</Tabs>
2 changes: 1 addition & 1 deletion docs/configuration/extensions.md
@@ -80,7 +80,7 @@ All of these community extensions can be downloaded using [pull-deps](../operati
|aliyun-oss-extensions|Aliyun OSS deep storage |[link](../development/extensions-contrib/aliyun-oss-extensions.md)|
|ambari-metrics-emitter|Ambari Metrics Emitter |[link](../development/extensions-contrib/ambari-metrics-emitter.md)|
|druid-cassandra-storage|Apache Cassandra deep storage.|[link](../development/extensions-contrib/cassandra.md)|
- |druid-cloudfiles-extensions|Rackspace Cloudfiles deep storage and firehose.|[link](../development/extensions-contrib/cloudfiles.md)|
+ |druid-cloudfiles-extensions|Rackspace Cloudfiles deep storage.|[link](../development/extensions-contrib/cloudfiles.md)|
|druid-compressed-bigdecimal|Compressed Big Decimal Type | [link](../development/extensions-contrib/compressed-big-decimal.md)|
|druid-ddsketch|Support for DDSketch approximate quantiles based on [DDSketch](https://github.com/datadog/sketches-java) | [link](../development/extensions-contrib/ddsketch-quantiles.md)|
|druid-deltalake-extensions|Support for ingesting Delta Lake tables.|[link](../development/extensions-contrib/delta-lake.md)|
6 changes: 2 additions & 4 deletions docs/configuration/index.md
@@ -395,7 +395,6 @@ Metric monitoring is an essential part of Druid operations. The following monito
|`org.apache.druid.java.util.metrics.CgroupCpuSetMonitor`|Reports CPU core/HT and memory node allocations as per the `cpuset` cgroup.|
|`org.apache.druid.java.util.metrics.CgroupDiskMonitor`|Reports disk statistics as per the blkio cgroup.|
|`org.apache.druid.java.util.metrics.CgroupMemoryMonitor`|Reports memory statistics as per the memory cgroup.|
- |`org.apache.druid.server.metrics.EventReceiverFirehoseMonitor`|Reports how many events have been queued in the EventReceiverFirehose.|
|`org.apache.druid.server.metrics.HistoricalMetricsMonitor`|Reports statistics on Historical services. Available only on Historical services.|
|`org.apache.druid.server.metrics.SegmentStatsMonitor` | **EXPERIMENTAL** Reports statistics about segments on Historical services. Available only on Historical services. Not to be used when lazy loading is configured.|
|`org.apache.druid.server.metrics.QueryCountStatsMonitor`|Reports how many queries have been successful/failed/interrupted.|
@@ -607,7 +606,7 @@ the [HDFS input source](../ingestion/input-sources.md#hdfs-input-source).

|Property|Possible values|Description|Default|
|--------|---------------|-----------|-------|
- |`druid.ingestion.hdfs.allowedProtocols`|List of protocols|Allowed protocols for the HDFS input source and HDFS firehose.|`["hdfs"]`|
+ |`druid.ingestion.hdfs.allowedProtocols`|List of protocols|Allowed protocols for the HDFS input source.|`["hdfs"]`|

#### HTTP input source

@@ -616,7 +615,7 @@ the [HTTP input source](../ingestion/input-sources.md#http-input-source).

|Property|Possible values|Description|Default|
|--------|---------------|-----------|-------|
- |`druid.ingestion.http.allowedProtocols`|List of protocols|Allowed protocols for the HTTP input source and HTTP firehose.|`["http", "https"]`|
+ |`druid.ingestion.http.allowedProtocols`|List of protocols|Allowed protocols for the HTTP input source.|`["http", "https"]`|

### External data access security configuration

@@ -1501,7 +1500,6 @@ Additional Peon configs include:
|`druid.peon.mode`|One of `local` or `remote`. Setting this property to `local` means you intend to run the Peon as a standalone process, which is not recommended.|`remote`|
|`druid.indexer.task.baseDir`|Base temporary working directory.|`System.getProperty("java.io.tmpdir")`|
|`druid.indexer.task.baseTaskDir`|Base temporary working directory for tasks.|`${druid.indexer.task.baseDir}/persistent/task`|
- |`druid.indexer.task.batchProcessingMode`| Batch ingestion tasks have three operating modes to control construction and tracking for intermediary segments: `OPEN_SEGMENTS`, `CLOSED_SEGMENTS`, and `CLOSED_SEGMENTS_SINKS`. `OPEN_SEGMENTS` uses the streaming ingestion code path and performs a `mmap` on intermediary segments to build a timeline to make these segments available to realtime queries. Batch ingestion doesn't require intermediary segments, so the default mode, `CLOSED_SEGMENTS`, eliminates `mmap` of intermediary segments. `CLOSED_SEGMENTS` mode still tracks the entire set of segments in heap. The `CLOSED_SEGMENTS_SINKS` mode is the most aggressive configuration and should have the smallest memory footprint. It eliminates in-memory tracking and `mmap` of intermediary segments produced during segment creation. `CLOSED_SEGMENTS_SINKS` mode isn't as well tested as the other modes, so it is currently considered experimental. You can use `OPEN_SEGMENTS` mode if problems occur with the two newer modes. |`CLOSED_SEGMENTS`|
|`druid.indexer.task.defaultHadoopCoordinates`|Hadoop version to use with HadoopIndexTasks that do not request a particular version.|`org.apache.hadoop:hadoop-client-api:3.3.6`, `org.apache.hadoop:hadoop-client-runtime:3.3.6`|
|`druid.indexer.task.defaultRowFlushBoundary`|Highest row count before persisting to disk. Used for index-generating tasks.|75000|
|`druid.indexer.task.directoryLockTimeout`|Wait this long for zombie Peons to exit before giving up on their replacements.|PT10M|
56 changes: 0 additions & 56 deletions docs/development/extensions-contrib/cloudfiles.md
@@ -40,59 +40,3 @@ To use this Apache Druid extension, [include](../../configuration/extensions.md#
|`druid.cloudfiles.apiKey`||Rackspace Cloud API key.|Must be set.|
|`druid.cloudfiles.provider`|rackspace-cloudfiles-us,rackspace-cloudfiles-uk|Name of the provider depending on the region.|Must be set.|
|`druid.cloudfiles.useServiceNet`|true,false|Whether to use the internal service net.|true|

## Firehose

<a name="firehose"></a>

#### StaticCloudFilesFirehose

This firehose ingests events, similar to the StaticAzureBlobStoreFirehose, but from Rackspace's Cloud Files.

Data is newline delimited, with one JSON object per line and parsed as per the `InputRowParser` configuration.

The storage account is shared with the one used for Rackspace's Cloud Files deep storage functionality, but blobs can be in a different region and container.

As with the Azure blobstore, the blob is assumed to be gzipped if the file name ends in `.gz`.

This firehose is _splittable_ and can be used by [native parallel index tasks](../../ingestion/native-batch.md).
Since each split represents an object in this firehose, each worker task of `index_parallel` will read an object.

Sample spec:

```json
"firehose" : {
"type" : "static-cloudfiles",
"blobs": [
{
"region": "DFW"
"container": "container",
"path": "/path/to/your/file.json"
},
{
"region": "ORD"
"container": "anothercontainer",
"path": "/another/path.json"
}
]
}
```
This firehose provides caching and prefetching features. During IndexTask execution, a firehose can be read twice if intervals or
shardSpecs are not specified; in that case, caching can be useful. Prefetching is preferred when a direct scan of objects is slow.

|property|description|default|required?|
|--------|-----------|-------|---------|
|type|This should be `static-cloudfiles`.|N/A|yes|
|blobs|JSON array of Cloud Files blobs.|N/A|yes|
|maxCacheCapacityBytes|Maximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes.|1073741824|no|
|maxFetchCapacityBytes|Maximum size of the fetch space in bytes. 0 means disabling prefetch. Prefetched files are removed immediately once they are read.|1073741824|no|
|fetchTimeout|Timeout for fetching a Cloud Files object.|60000|no|
|maxFetchRetry|Maximum retry for fetching a Cloud Files object.|3|no|

Cloud Files Blobs:

|property|description|default|required?|
|--------|-----------|-------|---------|
|container|Name of the Cloud Files container|N/A|yes|
|path|The path where data is located.|N/A|yes|
2 changes: 1 addition & 1 deletion docs/development/extensions-core/postgresql.md
@@ -87,7 +87,7 @@ In most cases, the configuration options map directly to the [postgres JDBC conn
| `druid.metadata.postgres.ssl.sslPasswordCallback` | The classname of the SSL password provider. | none | no |
| `druid.metadata.postgres.dbTableSchema` | druid meta table schema | `public` | no |

- ### PostgreSQL Firehose
+ ### PostgreSQL InputSource

The PostgreSQL extension provides an implementation of an [SQL input source](../../ingestion/input-sources.md) which can be used to ingest data into Druid from a PostgreSQL database.
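
As an illustrative aside (not part of this diff), a minimal `sql` input source spec reading from PostgreSQL might look like the following sketch; the connection URI, credentials, and query are placeholders:

```json
{
  "type": "sql",
  "database": {
    "type": "postgresql",
    "connectorConfig": {
      "connectURI": "jdbc:postgresql://localhost:5432/druid",
      "user": "druid",
      "password": "password"
    }
  },
  "sqls": [
    "SELECT ts, page, added FROM wiki_edits WHERE ts >= '2024-01-01' AND ts < '2024-01-02'"
  ]
}
```

A spec like this would typically be referenced from the `ioConfig.inputSource` field of a native batch ingestion spec.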

13 changes: 11 additions & 2 deletions docs/development/overview.md
@@ -53,8 +53,17 @@ Most of the coordination logic for (real-time) ingestion is in the Druid indexin

## Real-time Ingestion

- Druid loads data through `FirehoseFactory.java` classes. Firehoses often wrap other firehoses, where, similar to the design of the
- query runners, each firehose adds a layer of logic, and the persist and hand-off logic is in `RealtimePlumber.java`.
+ Druid streaming tasks are based on the 'seekable stream' classes such as `SeekableStreamSupervisor.java`,
+ `SeekableStreamIndexTask.java`, and `SeekableStreamIndexTaskRunner.java`. The data processing happens through
+ `StreamAppenderator.java`, and the persist and hand-off logic is in `StreamAppenderatorDriver.java`.
+
+ ## Native Batch Ingestion
+
+ The main task types for Druid native batch ingestion are based on `AbstractBatchTask.java` and `AbstractBatchSubtask.java`.
+ Parallel processing uses `ParallelIndexSupervisorTask.java`, which spawns subtasks to perform various operations such
+ as data analysis and partitioning depending on the task specification. Segment generation happens in
+ `SinglePhaseSubTask.java`, `PartialHashSegmentGenerateTask.java`, or `PartialRangeSegmentGenerateTask.java` through
+ `BatchAppenderator`, and the persist and hand-off logic is in `BatchAppenderatorDriver.java`.

## Hadoop-based Batch Ingestion

