Merge branch 'master' into window-subquery-frames
LakshSingla committed Jul 25, 2024
2 parents 22e01a7 + 7e3fab5 commit f6a9611
Showing 849 changed files with 11,714 additions and 17,398 deletions.
@@ -262,12 +262,8 @@ public void setup()
rowsPerSegment
);

- final ScanQueryConfig config = new ScanQueryConfig().setLegacy(false);
  factory = new ScanQueryRunnerFactory(
-     new ScanQueryQueryToolChest(
-         config,
-         DefaultGenericQueryMetricsFactory.instance()
-     ),
+     new ScanQueryQueryToolChest(DefaultGenericQueryMetricsFactory.instance()),
      new ScanQueryEngine(),
      new ScanQueryConfig()
  );
@@ -141,7 +141,7 @@ public void setup()
@Benchmark
public void measureNewestSegmentFirstPolicy(Blackhole blackhole)
{
- final CompactionSegmentIterator iterator = policy.reset(compactionConfigs, dataSources, Collections.emptyMap());
+ final CompactionSegmentIterator iterator = policy.createIterator(compactionConfigs, dataSources, Collections.emptyMap());
for (int i = 0; i < numCompactionTaskSlots && iterator.hasNext(); i++) {
blackhole.consume(iterator.next());
}
1 change: 0 additions & 1 deletion docs/api-reference/sql-ingestion-api.md
@@ -474,7 +474,6 @@ The response shows an example report for a query.
"agent_type",
"timestamp"
],
"legacy": false,
"context": {
"finalize": false,
"finalizeAggregations": false,
17 changes: 8 additions & 9 deletions docs/api-reference/tasks-api.md
@@ -914,13 +914,10 @@ Host: http://ROUTER_IP:ROUTER_PORT
### Get task segments

:::info
- This API is deprecated and will be removed in future releases.
+ This API is no longer supported and always returns a 404 response.
+ Use the metric `segment/added/bytes` instead to identify the segment IDs committed by a task.
:::
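
As a rough illustration (not part of this change), a `segment/added/bytes` event emitted through a JSON-based emitter looks roughly like the following; the service, host, task, and values shown are placeholders:

```json
{
  "feed": "metrics",
  "timestamp": "2024-07-25T00:00:00.000Z",
  "service": "druid/middleManager",
  "host": "localhost:8091",
  "metric": "segment/added/bytes",
  "value": 104857600,
  "dataSource": "wikipedia",
  "taskId": "index_parallel_wikipedia_placeholder",
  "taskType": "index_parallel"
}
```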

- Retrieves information about segments generated by the task given the task ID. To hit this endpoint, make sure to enable the audit log config on the Overlord with `druid.indexer.auditLog.enabled = true`.
-
- In addition to enabling audit logs, configure a cleanup strategy to prevent overloading the metadata store with old audit logs which may cause performance issues. To enable automated cleanup of audit logs on the Coordinator, set `druid.coordinator.kill.audit.on`. You may also manually export the audit logs to external storage. For more information, see [Audit records](../operations/clean-metadata-store.md#audit-records).

#### URL

`GET` `/druid/indexer/v1/task/{taskId}/segments`
@@ -929,12 +926,14 @@ In addition to enabling audit logs, configure a cleanup strategy to prevent over

<Tabs>

<TabItem value="27" label="200 SUCCESS">
<TabItem value="27" label="404 NOT FOUND">


<br/>

- *Successfully retrieved task segments*
+ ```json
+ {
+   "error": "Segment IDs committed by a task action are not persisted anymore. Use the metric 'segment/added/bytes' to identify the segments created by a task."
+ }
+ ```

</TabItem>
</Tabs>
2 changes: 1 addition & 1 deletion docs/configuration/extensions.md
@@ -80,7 +80,7 @@ All of these community extensions can be downloaded using [pull-deps](../operati
|aliyun-oss-extensions|Aliyun OSS deep storage |[link](../development/extensions-contrib/aliyun-oss-extensions.md)|
|ambari-metrics-emitter|Ambari Metrics Emitter |[link](../development/extensions-contrib/ambari-metrics-emitter.md)|
|druid-cassandra-storage|Apache Cassandra deep storage.|[link](../development/extensions-contrib/cassandra.md)|
- |druid-cloudfiles-extensions|Rackspace Cloudfiles deep storage and firehose.|[link](../development/extensions-contrib/cloudfiles.md)|
+ |druid-cloudfiles-extensions|Rackspace Cloudfiles deep storage.|[link](../development/extensions-contrib/cloudfiles.md)|
|druid-compressed-bigdecimal|Compressed Big Decimal Type | [link](../development/extensions-contrib/compressed-big-decimal.md)|
|druid-ddsketch|Support for DDSketch approximate quantiles based on [DDSketch](https://github.com/datadog/sketches-java) | [link](../development/extensions-contrib/ddsketch-quantiles.md)|
|druid-deltalake-extensions|Support for ingesting Delta Lake tables.|[link](../development/extensions-contrib/delta-lake.md)|
6 changes: 2 additions & 4 deletions docs/configuration/index.md
@@ -395,7 +395,6 @@ Metric monitoring is an essential part of Druid operations. The following monito
|`org.apache.druid.java.util.metrics.CgroupCpuSetMonitor`|Reports CPU core/HT and memory node allocations as per the `cpuset` cgroup.|
|`org.apache.druid.java.util.metrics.CgroupDiskMonitor`|Reports disk statistics as per the blkio cgroup.|
|`org.apache.druid.java.util.metrics.CgroupMemoryMonitor`|Reports memory statistics as per the memory cgroup.|
- |`org.apache.druid.server.metrics.EventReceiverFirehoseMonitor`|Reports how many events have been queued in the EventReceiverFirehose.|
|`org.apache.druid.server.metrics.HistoricalMetricsMonitor`|Reports statistics on Historical services. Available only on Historical services.|
|`org.apache.druid.server.metrics.SegmentStatsMonitor` | **EXPERIMENTAL** Reports statistics about segments on Historical services. Available only on Historical services. Not to be used when lazy loading is configured.|
|`org.apache.druid.server.metrics.QueryCountStatsMonitor`|Reports how many queries have been successful/failed/interrupted.|
@@ -607,7 +606,7 @@ the [HDFS input source](../ingestion/input-sources.md#hdfs-input-source).

|Property|Possible values|Description|Default|
|--------|---------------|-----------|-------|
- |`druid.ingestion.hdfs.allowedProtocols`|List of protocols|Allowed protocols for the HDFS input source and HDFS firehose.|`["hdfs"]`|
+ |`druid.ingestion.hdfs.allowedProtocols`|List of protocols|Allowed protocols for the HDFS input source.|`["hdfs"]`|

#### HTTP input source

@@ -616,7 +615,7 @@ the [HTTP input source](../ingestion/input-sources.md#http-input-source).

|Property|Possible values|Description|Default|
|--------|---------------|-----------|-------|
- |`druid.ingestion.http.allowedProtocols`|List of protocols|Allowed protocols for the HTTP input source and HTTP firehose.|`["http", "https"]`|
+ |`druid.ingestion.http.allowedProtocols`|List of protocols|Allowed protocols for the HTTP input source.|`["http", "https"]`|

### External data access security configuration

@@ -1501,7 +1500,6 @@ Additional Peon configs include:
|`druid.peon.mode`|One of `local` or `remote`. Setting this property to `local` means you intend to run the Peon as a standalone process, which is not recommended.|`remote`|
|`druid.indexer.task.baseDir`|Base temporary working directory.|`System.getProperty("java.io.tmpdir")`|
|`druid.indexer.task.baseTaskDir`|Base temporary working directory for tasks.|`${druid.indexer.task.baseDir}/persistent/task`|
- |`druid.indexer.task.batchProcessingMode`| Batch ingestion tasks have three operating modes to control construction and tracking for intermediary segments: `OPEN_SEGMENTS`, `CLOSED_SEGMENTS`, and `CLOSED_SEGMENTS_SINKS`. `OPEN_SEGMENTS` uses the streaming ingestion code path and performs a `mmap` on intermediary segments to build a timeline to make these segments available to realtime queries. Batch ingestion doesn't require intermediary segments, so the default mode, `CLOSED_SEGMENTS`, eliminates `mmap` of intermediary segments. `CLOSED_SEGMENTS` mode still tracks the entire set of segments in heap. The `CLOSED_SEGMENTS_SINKS` mode is the most aggressive configuration and should have the smallest memory footprint. It eliminates in-memory tracking and `mmap` of intermediary segments produced during segment creation. `CLOSED_SEGMENTS_SINKS` mode isn't as well tested as the other modes, so it is currently considered experimental. You can use `OPEN_SEGMENTS` mode if problems occur with the two newer modes. |`CLOSED_SEGMENTS`|
|`druid.indexer.task.defaultHadoopCoordinates`|Hadoop version to use with HadoopIndexTasks that do not request a particular version.|`org.apache.hadoop:hadoop-client-api:3.3.6`, `org.apache.hadoop:hadoop-client-runtime:3.3.6`|
|`druid.indexer.task.defaultRowFlushBoundary`|Highest row count before persisting to disk. Used for index-generating tasks.|75000|
|`druid.indexer.task.directoryLockTimeout`|Wait this long for zombie Peons to exit before giving up on their replacements.|PT10M|
56 changes: 0 additions & 56 deletions docs/development/extensions-contrib/cloudfiles.md
@@ -40,59 +40,3 @@ To use this Apache Druid extension, [include](../../configuration/extensions.md#
|`druid.cloudfiles.apiKey`||Rackspace Cloud API key.|Must be set.|
|`druid.cloudfiles.provider`|rackspace-cloudfiles-us,rackspace-cloudfiles-uk|Name of the provider depending on the region.|Must be set.|
|`druid.cloudfiles.useServiceNet`|true,false|Whether to use the internal service net.|true|

## Firehose

<a name="firehose"></a>

#### StaticCloudFilesFirehose

This firehose ingests events, similar to the StaticAzureBlobStoreFirehose, but from Rackspace's Cloud Files.

Data is newline delimited, with one JSON object per line and parsed as per the `InputRowParser` configuration.

The storage account is shared with the one used for Rackspace's Cloud Files deep storage functionality, but blobs can be in a different region and container.

As with the Azure blobstore, the blob is assumed to be gzipped if the file name ends in `.gz`.

This firehose is _splittable_ and can be used by [native parallel index tasks](../../ingestion/native-batch.md).
Since each split represents an object in this firehose, each worker task of `index_parallel` will read an object.

Sample spec:

```json
"firehose" : {
"type" : "static-cloudfiles",
"blobs": [
{
"region": "DFW"
"container": "container",
"path": "/path/to/your/file.json"
},
{
"region": "ORD"
"container": "anothercontainer",
"path": "/another/path.json"
}
]
}
```
This firehose provides caching and prefetching features. During IndexTask execution, a firehose can be read twice if intervals or
shardSpecs are not specified; in that case, caching can be useful. Prefetching is preferred when a direct scan of objects is slow.

|property|description|default|required?|
|--------|-----------|-------|---------|
|type|This should be `static-cloudfiles`.|N/A|yes|
|blobs|JSON array of Cloud Files blobs.|N/A|yes|
|maxCacheCapacityBytes|Maximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes.|1073741824|no|
|maxFetchCapacityBytes|Maximum size of the fetch space in bytes. 0 means disabling prefetch. Prefetched files are removed immediately once they are read.|1073741824|no|
|fetchTimeout|Timeout for fetching a Cloud Files object.|60000|no|
|maxFetchRetry|Maximum retry for fetching a Cloud Files object.|3|no|

Cloud Files Blobs:

|property|description|default|required?|
|--------|-----------|-------|---------|
|container|Name of the Cloud Files container|N/A|yes|
|path|The path where data is located.|N/A|yes|
2 changes: 1 addition & 1 deletion docs/development/extensions-core/postgresql.md
@@ -87,7 +87,7 @@ In most cases, the configuration options map directly to the [postgres JDBC conn
| `druid.metadata.postgres.ssl.sslPasswordCallback` | The classname of the SSL password provider. | none | no |
| `druid.metadata.postgres.dbTableSchema` | druid meta table schema | `public` | no |

- ### PostgreSQL Firehose
+ ### PostgreSQL InputSource

The PostgreSQL extension provides an implementation of an [SQL input source](../../ingestion/input-sources.md) which can be used to ingest data into Druid from a PostgreSQL database.
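
As an illustrative aside (not part of this diff), a minimal `sql` input source spec reading from PostgreSQL might look like the following sketch; the connection URI, credentials, and query are placeholders:

```json
{
  "type": "sql",
  "database": {
    "type": "postgresql",
    "connectorConfig": {
      "connectURI": "jdbc:postgresql://localhost:5432/druid",
      "user": "druid",
      "password": "password"
    }
  },
  "sqls": [
    "SELECT ts, page, added FROM wiki_edits WHERE ts >= '2024-01-01' AND ts < '2024-01-02'"
  ]
}
```

A spec like this would typically be referenced from the `ioConfig.inputSource` field of a native batch ingestion spec.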

13 changes: 11 additions & 2 deletions docs/development/overview.md
@@ -53,8 +53,17 @@ Most of the coordination logic for (real-time) ingestion is in the Druid indexin

## Real-time Ingestion

- Druid loads data through `FirehoseFactory.java` classes. Firehoses often wrap other firehoses, where, similar to the design of the
- query runners, each firehose adds a layer of logic, and the persist and hand-off logic is in `RealtimePlumber.java`.
+ Druid streaming tasks are based on the 'seekable stream' classes such as `SeekableStreamSupervisor.java`,
+ `SeekableStreamIndexTask.java`, and `SeekableStreamIndexTaskRunner.java`. The data processing happens through
+ `StreamAppenderator.java`, and the persist and hand-off logic is in `StreamAppenderatorDriver.java`.
+
+ ## Native Batch Ingestion
+
+ The main task types for Druid native batch ingestion are based on `AbstractBatchTask.java` and `AbstractBatchSubtask.java`.
+ Parallel processing uses `ParallelIndexSupervisorTask.java`, which spawns subtasks to perform various operations such
+ as data analysis and partitioning depending on the task specification. Segment generation happens in
+ `SinglePhaseSubTask.java`, `PartialHashSegmentGenerateTask.java`, or `PartialRangeSegmentGenerateTask.java` through
+ `BatchAppenderator`, and the persist and hand-off logic is in `BatchAppenderatorDriver.java`.

## Hadoop-based Batch Ingestion

