From 418da922281db3a131ac5ebefdedf8b774d67ac0 Mon Sep 17 00:00:00 2001
From: 317brian <53799971+317brian@users.noreply.github.com>
Date: Fri, 23 Aug 2024 11:59:29 -0700
Subject: [PATCH] docs: update query from deepstorage segment requirement (#16842)

Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
Co-authored-by: Rishabh Singh <6513075+findingrish@users.noreply.github.com>
Co-authored-by: Katya Macedo <38017980+ektravel@users.noreply.github.com>
---
 docs/configuration/index.md                   |  4 +++-
 docs/querying/query-from-deep-storage.md      | 18 +++++++++++-------
 docs/tutorials/tutorial-query-deep-storage.md |  2 +-
 website/.spelling                             |  1 +
 4 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/docs/configuration/index.md b/docs/configuration/index.md
index 637ebeabc0ef..9c83946eb823 100644
--- a/docs/configuration/index.md
+++ b/docs/configuration/index.md
@@ -596,7 +596,9 @@ need arises.
 |`druid.centralizedDatasourceSchema.enabled`|Boolean flag for enabling datasource schema building in the Coordinator, this should be specified in the common runtime properties.|false|No.|
 |`druid.indexer.fork.property.druid.centralizedDatasourceSchema.enabled`| This config should be set when CentralizedDatasourceSchema feature is enabled. This should be specified in the MiddleManager runtime properties.|false|No.|

-For, stale schema cleanup configs, refer to properties with the prefix `druid.coordinator.kill.segmentSchema` in [Metadata Management](#metadata-management).
+If you enable this feature, you can query datasources that are only stored in deep storage and are not loaded on a Historical. For more information, see [Query from deep storage](../querying/query-from-deep-storage.md).
+
+For stale schema cleanup configs, refer to properties with the prefix `druid.coordinator.kill.segmentSchema` in [Metadata Management](#metadata-management).

 ### Ingestion security configuration

diff --git a/docs/querying/query-from-deep-storage.md b/docs/querying/query-from-deep-storage.md
index 1ce74818655d..e90f2fb4b024 100644
--- a/docs/querying/query-from-deep-storage.md
+++ b/docs/querying/query-from-deep-storage.md
@@ -28,13 +28,20 @@ Druid can query segments that are only stored in deep storage. Running a query f

 Query from deep storage requires the Multi-stage query (MSQ) task engine. Load the extension for it if you don't already have it enabled before you begin. See [enable MSQ](../multi-stage-query/index.md#load-the-extension) for more information.

+To be queryable, your datasource must meet one of the following conditions:
+
+- At least one segment from the datasource is loaded onto a Historical service for Druid to plan the query. This segment can be any segment from the datasource. You can verify that a datasource has at least one segment on a Historical service if it's visible in the Druid console.
+- You have the centralized datasource schema feature enabled. For more information, see [Centralized datasource schema](../configuration/index.md#centralized-datasource-schema).
+
+If you use centralized datasource schema, any datasource created before you enabled the feature needs an additional step to become queryable from deep storage. You need to load the segments from deep storage onto a Historical so that the schema can be backfilled in the metadata database. You can load some or all of the segments that are only in deep storage. If you don't load all the segments, any dimensions that are only in the segments you didn't load will not be in the queryable datasource schema and won't be queryable from deep storage. That is, only the dimensions that are present in the segment schema in the metadata database are queryable. Once that process is complete, you can unload all the segments from the Historical and only keep the data in deep storage.
+
 ## Keep segments in deep storage only

-Any data you ingest into Druid is already stored in deep storage, so you don't need to perform any additional configuration from that perspective. However, to take advantage of the cost savings that querying from deep storage provides, make sure not all your segments get loaded onto Historical processes.
+Any data you ingest into Druid is already stored in deep storage, so you don't need to perform any additional configuration from that perspective. However, to take advantage of the cost savings that querying from deep storage provides, make sure not all your segments get loaded onto Historical processes. If you use centralized datasource schema, a datasource can be kept only in deep storage but remain queryable.

-To do this, configure [load rules](../operations/rule-configuration.md#load-rules) to manage the which segments are only in deep storage and which get loaded onto Historical processes.
+To manage which segments are kept only in deep storage and which get loaded onto Historical processes, configure [load rules](../operations/rule-configuration.md#load-rules).

-The easiest way to do this is to explicitly configure the segments that don't get loaded onto Historical processes. Set `tieredReplicants` to an empty array and `useDefaultTierForNull` to `false`. For example, if you configure the following rule for a datasource:
+The easiest way to keep segments only in deep storage is to explicitly configure the segments that don't get loaded onto Historical processes. Set `tieredReplicants` to an empty array and `useDefaultTierForNull` to `false`. For example, if you configure the following rule for a datasource:

 ```json
 [
@@ -64,10 +71,7 @@ Segments with a `replication_factor` of `0` are not assigned to any Historical t

 You can also confirm this through the Druid console. On the **Segments** page, see the **Replication factor** column.

-Keep the following in mind when working with load rules to control what exists only in deep storage:
-
-- At least one of the segments in a datasource must be loaded onto a Historical process so that Druid can plan the query. The segment on the Historical process can be any segment from the datasource. It does not need to be a specific segment. One way to verify that a datasource has at least one segment on a Historical process is if it's visible in the Druid console.
-- The actual number of replicas may differ from the replication factor temporarily as Druid processes your load rules.
+Note that the actual number of replicas may differ from the replication factor temporarily as Druid processes your load rules.

 ## Run a query from deep storage

diff --git a/docs/tutorials/tutorial-query-deep-storage.md b/docs/tutorials/tutorial-query-deep-storage.md
index dfb4de22eb01..1bd2b96501f7 100644
--- a/docs/tutorials/tutorial-query-deep-storage.md
+++ b/docs/tutorials/tutorial-query-deep-storage.md
@@ -25,7 +25,7 @@ sidebar_label: "Query from deep storage"

 Query from deep storage allows you to query segments that are stored only in deep storage, which provides lower costs than if you were to load everything onto Historical processes. The tradeoff is that queries from deep storage may take longer to complete.

-This tutorial walks you through loading example data, configuring load rules so that not all the segments get loaded onto Historical processes, and querying data from deep storage.
+This tutorial walks you through loading example data, configuring load rules so that not all the segments get loaded onto Historical services, and querying data from deep storage. If you have [centralized datasource schema enabled](../configuration/index.md#centralized-datasource-schema), you can query datasources that are only in deep storage without having any segment available on a Historical service.

 To run the queries in this tutorial, replace `ROUTER:PORT` with the location of the Router process and its port number. For example, use `localhost:8888` for the quickstart deployment.

diff --git a/website/.spelling b/website/.spelling
index 3e9827e82bed..74f660c2213f 100644
--- a/website/.spelling
+++ b/website/.spelling
@@ -276,6 +276,7 @@ averager
 averagers
 backend
 backfills
+backfilled
 backpressure
 base64
 big-endian