From f8ecf9380f7b1952b48221589e859f8419adffd2 Mon Sep 17 00:00:00 2001 From: 317brian <53799971+317brian@users.noreply.github.com> Date: Wed, 1 Nov 2023 13:14:17 -0700 Subject: [PATCH] [backport]docs: durable storage azure cleanup (#15120) (#15296) Co-authored-by: Laksh Singla --- docs/multi-stage-query/reference.md | 54 ++++++++++++++++------------- docs/multi-stage-query/security.md | 25 ++++++++----- docs/operations/durable-storage.md | 17 ++++++--- 3 files changed, 57 insertions(+), 39 deletions(-) diff --git a/docs/multi-stage-query/reference.md b/docs/multi-stage-query/reference.md index 3fd2335d0525..a497afa3a71a 100644 --- a/docs/multi-stage-query/reference.md +++ b/docs/multi-stage-query/reference.md @@ -354,40 +354,44 @@ SQL-based ingestion supports using durable storage to store intermediate files t ### Durable storage configurations -Durable storage is supported on Amazon S3 storage and Microsoft's Azure storage. There are a few common configurations that controls the behavior for both the services as documented below. Apart from the common configurations, -there are a few properties specific to each storage that must be set. +Durable storage is supported on Amazon S3 storage and Microsoft's Azure Blob Storage. +There are common configurations that control the behavior regardless of which storage service you use. Apart from these common configurations, there are a few properties specific to S3 and to Azure. Common properties to configure the behavior of durable storage -|Parameter |Default | Description | -|-------------------|----------------------------------------|----------------------| -|`druid.msq.intermediate.storage.enable` | false | Whether to enable durable storage for the cluster. Set it to true to enable durable storage. For more information about enabling durable storage, see [Durable storage](../operations/durable-storage.md).| -|`druid.msq.intermediate.storage.type` | n/a | Required. The type of storage to use. 
Set it to `s3` for S3 and `azure` for Azure | -|`druid.msq.intermediate.storage.tempDir`| n/a | Required. Directory path on the local disk to store temporary files required while uploading and downloading the data | -|`druid.msq.intermediate.storage.maxRetry` | 10 | Optional. Defines the max number times to attempt S3 API calls to avoid failures due to transient errors. | -|`druid.msq.intermediate.storage.chunkSize` | 100MiB | Optional. Defines the size of each chunk to temporarily store in `druid.msq.intermediate.storage.tempDir`. The chunk size must be between 5 MiB and 5 GiB. A large chunk size reduces the API calls made to the durable storage, however it requires more disk space to store the temporary chunks. Druid uses a default of 100MiB if the value is not provided.| +|Parameter | Required | Description | Default | +|--|--|--|--| +|`druid.msq.intermediate.storage.enable` | Yes | Whether to enable durable storage for the cluster. Set it to true to enable durable storage. For more information about enabling durable storage, see [Durable storage](../operations/durable-storage.md). | false | +|`druid.msq.intermediate.storage.type` | Yes | The type of storage to use. Set it to `s3` for S3 and `azure` for Azure | n/a | +|`druid.msq.intermediate.storage.tempDir`| Yes | Directory path on the local disk to store temporary files required while uploading and downloading the data | n/a | +|`druid.msq.intermediate.storage.maxRetry` | No | Defines the max number of times to attempt S3 API calls to avoid failures due to transient errors. | 10 | +|`druid.msq.intermediate.storage.chunkSize` | No | Defines the size of each chunk to temporarily store in `druid.msq.intermediate.storage.tempDir`. The chunk size must be between 5 MiB and 5 GiB. A large chunk size reduces the API calls made to the durable storage, however it requires more disk space to store the temporary chunks. 
Druid uses a default of 100MiB if the value is not provided.| 100MiB | -Following properties need to be set in addition to the common properties to enable durable storage on S3 +To use S3 for durable storage, you also need to configure the following properties: -|Parameter |Default | Description | -|-------------------|----------------------------------------|----------------------| -|`druid.msq.intermediate.storage.bucket` | n/a | Required. The S3 bucket where the files are uploaded to and download from | -|`druid.msq.intermediate.storage.prefix` | n/a | Required. Path prepended to all the paths uploaded to the bucket to namespace the connector's files. Provide a unique value for the prefix and do not share the same prefix between different clusters. If the location includes other files or directories, then they might get cleaned up as well. | +|Parameter | Required | Description | Default | +|-------------------|----------------------------------------|----------------------| --| +|`druid.msq.intermediate.storage.bucket` | Yes | The S3 bucket where the files are uploaded to and downloaded from | n/a | +|`druid.msq.intermediate.storage.prefix` | Yes | Path prepended to all the paths uploaded to the bucket to namespace the connector's files. Provide a unique value for the prefix and do not share the same prefix between different clusters. If the location includes other files or directories, then they might get cleaned up as well. | n/a | -Following properties must be set in addition to the common properties to enable durable storage on Azure. +To use Azure for durable storage, you also need to configure the following properties: -|Parameter |Default | Description | -|-------------------|----------------------------------------|----------------------| -|`druid.msq.intermediate.storage.container` | n/a | Required. The Azure container where the files are uploaded to and downloaded from. | -|`druid.msq.intermediate.storage.prefix` | n/a | Required. 
Path prepended to all the paths uploaded to the container to namespace the connector's files. Provide a unique value for the prefix and do not share the same prefix between different clusters. If the location includes other files or directories, then they might get cleaned up as well. | +|Parameter | Required | Description | Default | +|-------------------|----------------------------------------|----------------------| - | +|`druid.msq.intermediate.storage.container` | Yes | The Azure container where the files are uploaded to and downloaded from. | n/a | +|`druid.msq.intermediate.storage.prefix` | Yes | Path prepended to all the paths uploaded to the container to namespace the connector's files. Provide a unique value for the prefix and do not share the same prefix between different clusters. If the location includes other files or directories, then they might get cleaned up as well. | n/a | -Durable storage creates files on the remote storage and is cleaned up once the job no longer requires those files. However, due to failures causing abrupt exit of the tasks, these files might not get cleaned up. -Therefore, there are certain properties that you configure on the Overlord specifically to clean up intermediate files for the tasks that have completed and would no longer require these files: +### Durable storage cleaner configurations -|Parameter |Default | Description | -|-------------------|----------------------------------------|----------------------| -|`druid.msq.intermediate.storage.cleaner.enabled`| false | Optional. Whether durable storage cleaner should be enabled for the cluster. | -|`druid.msq.intermediate.storage.cleaner.delaySeconds`| 86400 | Optional. The delay (in seconds) after the last run post which the durable storage cleaner would clean the outputs. | +Durable storage creates files on the remote storage, and these files get cleaned up once a job no longer requires those files. 
However, due to failures causing abrupt exits of tasks, these files might not get cleaned up. +You can configure the Overlord to periodically clean up these intermediate files after a task completes and the files are no longer needed. The files that get cleaned up are determined by the storage prefix you configure. Any files that match the path for the storage prefix may get cleaned up, not just intermediate files that are no longer needed. + +Use the following configurations to control the cleaner: + +|Parameter | Required | Description | Default | +|--|--|--|--| +|`druid.msq.intermediate.storage.cleaner.enabled`| No | Whether durable storage cleaner should be enabled for the cluster. | false | +|`druid.msq.intermediate.storage.cleaner.delaySeconds`| No | The delay (in seconds) after the latest run, after which the durable storage cleaner cleans up the files. | 86400 | ## Limits diff --git a/docs/multi-stage-query/security.md b/docs/multi-stage-query/security.md index 2d412f40654f..3c395e40c576 100644 --- a/docs/multi-stage-query/security.md +++ b/docs/multi-stage-query/security.md @@ -60,17 +60,24 @@ Depending on what a user is trying to do, they might also need the following per -## S3 +## Permissions for durable storage -The MSQ task engine can use S3 to store intermediate files when running queries. This can increase its reliability but requires certain permissions in S3. -These permissions are required if you configure durable storage. +The MSQ task engine can use Amazon S3 or Azure Blob Storage to store intermediate files when running queries. To upload, read, move and delete these intermediate files, the MSQ task engine requires certain permissions specific to the storage provider. 
-Permissions for pushing and fetching intermediate stage results to and from S3: +### S3 -- `s3:GetObject` -- `s3:PutObject` -- `s3:AbortMultipartUpload` +The MSQ task engine needs the following permissions for pushing, fetching, and removing intermediate stage results to and from S3: -Permissions for removing intermediate stage results: +- `s3:GetObject` to retrieve files. Note that `GetObject` also requires read permission on the object that gets retrieved. +- `s3:PutObject` to upload files. +- `s3:AbortMultipartUpload` to cancel the upload of files. +- `s3:DeleteObject` to delete files when they're no longer needed. -- `s3:DeleteObject` \ No newline at end of file +### Azure + +The MSQ task engine needs the following permissions for pushing, fetching, and removing intermediate stage results to and from Azure: + +- `Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read` to read and list files in durable storage. +- `Microsoft.Storage/storageAccounts/blobServices/containers/blobs/write` to write files in durable storage. +- `Microsoft.Storage/storageAccounts/blobServices/containers/blobs/add/action` to create files in durable storage. +- `Microsoft.Storage/storageAccounts/blobServices/containers/blobs/delete` to delete files when they're no longer needed. 
\ No newline at end of file diff --git a/docs/operations/durable-storage.md b/docs/operations/durable-storage.md index 80545f9a9b28..b7a8ad1ef905 100644 --- a/docs/operations/durable-storage.md +++ b/docs/operations/durable-storage.md @@ -39,13 +39,20 @@ To enable durable storage, you need to set the following common service properti ``` druid.msq.intermediate.storage.enable=true -druid.msq.intermediate.storage.type=s3 -druid.msq.intermediate.storage.bucket=YOUR_BUCKET -druid.msq.intermediate.storage.prefix=YOUR_PREFIX druid.msq.intermediate.storage.tempDir=/path/to/your/temp/dir + +# Include these configs if you're using S3 +# druid.msq.intermediate.storage.type=s3 +# druid.msq.intermediate.storage.bucket=YOUR_BUCKET + +# Include these configs if you're using Azure Blob Storage +# druid.msq.intermediate.storage.type=azure +# druid.msq.intermediate.storage.container=YOUR_CONTAINER + +druid.msq.intermediate.storage.prefix=YOUR_PREFIX ``` -For detailed information about the settings related to durable storage, see [Durable storage configurations](../multi-stage-query/reference.md#durable-storage-configurations). +For detailed information about these and additional settings related to durable storage, see [Durable storage configurations](../multi-stage-query/reference.md#durable-storage-configurations). ## Use durable storage for SQL-based ingestion queries @@ -80,7 +87,7 @@ cleaner can be scheduled to clean the directories corresponding to which there i the storage connector to work upon the durable storage. The durable storage location should only be utilized to store the output for the cluster's MSQ tasks. If the location contains other files or directories, then they will get cleaned up as well. -Use `druid.msq.intermediate.storage.cleaner.enabled` and `druid.msq.intermediate.storage.cleaner.delaySEconds` to configure the cleaner. For more information, see [Durable storage configurations](../multi-stage-query/reference.md#durable-storage-configurations). 
+Use `druid.msq.intermediate.storage.cleaner.enabled` and `druid.msq.intermediate.storage.cleaner.delaySeconds` to configure the cleaner. For more information, see [Durable storage configurations](../multi-stage-query/reference.md#durable-storage-configurations). Note that if you choose to write query results to durable storage,the results are cleaned up when the task is removed from the metadata store.
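Reviewer note (not part of the patch): the property tables in this change imply a small set of rules that are easy to check mechanically before rolling out a config. The following is a minimal sketch, assuming a plain `key=value` properties file. The helper names `parse_properties` and `check_durable_storage` are hypothetical and are not part of Druid. Because the exact syntax Druid accepts for `chunkSize` values is not stated in this patch, the sketch takes the chunk size as a byte count supplied by the caller instead of parsing it.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
PREFIX = "druid.msq.intermediate.storage."

# Storage-specific required property for each storage type,
# as listed in the reference.md tables in this patch.
STORAGE_KEY = {"s3": "bucket", "azure": "container"}


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_durable_storage(props, chunk_size_bytes=None):
    """Return a list of problems; an empty list means none were found."""
    problems = []

    def require(name):
        if not props.get(PREFIX + name):
            problems.append("missing required property: " + PREFIX + name)

    if props.get(PREFIX + "enable") != "true":
        problems.append(PREFIX + "enable must be set to true")

    storage_type = props.get(PREFIX + "type")
    if storage_type in STORAGE_KEY:
        require(STORAGE_KEY[storage_type])
    else:
        problems.append(PREFIX + "type must be s3 or azure")

    require("tempDir")
    require("prefix")

    # Documented range for chunkSize: between 5 MiB and 5 GiB.
    if chunk_size_bytes is not None and not 5 * MIB <= chunk_size_bytes <= 5 * GIB:
        problems.append(PREFIX + "chunkSize must be between 5 MiB and 5 GiB")
    return problems
```

A check like this reports a misspelled property name as a missing required property, which is the kind of mistake that is otherwise hard to spot in a config snippet.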