Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: durable storage azure cleanup #15120

Merged
merged 10 commits into from
Oct 31, 2023
54 changes: 29 additions & 25 deletions docs/multi-stage-query/reference.md
Original file line number Diff line number Diff line change
Expand Up @@ -354,40 +354,44 @@ SQL-based ingestion supports using durable storage to store intermediate files t

### Durable storage configurations

Durable storage is supported on Amazon S3 storage and Microsoft's Azure storage. There are a few common configurations that controls the behavior for both the services as documented below. Apart from the common configurations,
there are a few properties specific to each storage that must be set.
Durable storage is supported on Amazon S3 storage and Microsoft's Azure Blob Storage.
There are common configurations that control the behavior regardless of which storage service you use. Apart from these common configurations, there are a few properties specific to S3 and to Azure.
317brian marked this conversation as resolved.
Show resolved Hide resolved

Common properties to configure the behavior of durable storage

|Parameter |Default | Description |
|-------------------|----------------------------------------|----------------------|
|`druid.msq.intermediate.storage.enable` | false | Whether to enable durable storage for the cluster. Set it to true to enable durable storage. For more information about enabling durable storage, see [Durable storage](../operations/durable-storage.md).|
|`druid.msq.intermediate.storage.type` | n/a | Required. The type of storage to use. Set it to `s3` for S3 and `azure` for Azure |
|`druid.msq.intermediate.storage.tempDir`| n/a | Required. Directory path on the local disk to store temporary files required while uploading and downloading the data |
|`druid.msq.intermediate.storage.maxRetry` | 10 | Optional. Defines the max number times to attempt S3 API calls to avoid failures due to transient errors. |
|`druid.msq.intermediate.storage.chunkSize` | 100MiB | Optional. Defines the size of each chunk to temporarily store in `druid.msq.intermediate.storage.tempDir`. The chunk size must be between 5 MiB and 5 GiB. A large chunk size reduces the API calls made to the durable storage, however it requires more disk space to store the temporary chunks. Druid uses a default of 100MiB if the value is not provided.|
|Parameter | Required | Description | Default |
|--|--|--|
|`druid.msq.intermediate.storage.enable` | Yes | Whether to enable durable storage for the cluster. Set it to true to enable durable storage. For more information about enabling durable storage, see [Durable storage](../operations/durable-storage.md). | false |
|`druid.msq.intermediate.storage.type` | Yes | The type of storage to use. Set it to `s3` for S3 and `azure` for Azure | n/a |
|`druid.msq.intermediate.storage.tempDir`| Yes | Directory path on the local disk to store temporary files required while uploading and downloading the data | n/a |
|`druid.msq.intermediate.storage.maxRetry` | No | Defines the max number times to attempt S3 API calls to avoid failures due to transient errors. | 10 |
|`druid.msq.intermediate.storage.chunkSize` | No | Defines the size of each chunk to temporarily store in `druid.msq.intermediate.storage.tempDir`. The chunk size must be between 5 MiB and 5 GiB. A large chunk size reduces the API calls made to the durable storage, however it requires more disk space to store the temporary chunks. Druid uses a default of 100MiB if the value is not provided.| 100MiB |

Following properties need to be set in addition to the common properties to enable durable storage on S3
To use S3 for durable storage, you also need to configure the following properties:
317brian marked this conversation as resolved.
Show resolved Hide resolved

|Parameter |Default | Description |
|-------------------|----------------------------------------|----------------------|
|`druid.msq.intermediate.storage.bucket` | n/a | Required. The S3 bucket where the files are uploaded to and download from |
|`druid.msq.intermediate.storage.prefix` | n/a | Required. Path prepended to all the paths uploaded to the bucket to namespace the connector's files. Provide a unique value for the prefix and do not share the same prefix between different clusters. If the location includes other files or directories, then they might get cleaned up as well. |
|Parameter | Required | Description | Default |
|-------------------|----------------------------------------|----------------------| --|
|`druid.msq.intermediate.storage.bucket` | Yes | The S3 bucket where the files are uploaded to and download from | n/a |
|`druid.msq.intermediate.storage.prefix` | Yes | Path prepended to all the paths uploaded to the bucket to namespace the connector's files. Provide a unique value for the prefix and do not share the same prefix between different clusters. If the location includes other files or directories, then they might get cleaned up as well. | n/a |

Following properties must be set in addition to the common properties to enable durable storage on Azure.
To use Azure for durable storage, you also need to configure the following properties:

|Parameter |Default | Description |
|-------------------|----------------------------------------|----------------------|
|`druid.msq.intermediate.storage.container` | n/a | Required. The Azure container where the files are uploaded to and downloaded from. |
|`druid.msq.intermediate.storage.prefix` | n/a | Required. Path prepended to all the paths uploaded to the container to namespace the connector's files. Provide a unique value for the prefix and do not share the same prefix between different clusters. If the location includes other files or directories, then they might get cleaned up as well. |
|Parameter | Required | Description | Default |
|-------------------|----------------------------------------|----------------------| - |
|`druid.msq.intermediate.storage.container` | Yes | The Azure container where the files are uploaded to and downloaded from. | n/a |
|`druid.msq.intermediate.storage.prefix` | Yes | Path prepended to all the paths uploaded to the container to namespace the connector's files. Provide a unique value for the prefix and do not share the same prefix between different clusters. If the location includes other files or directories, then they might get cleaned up as well. | n/a |

Durable storage creates files on the remote storage and is cleaned up once the job no longer requires those files. However, due to failures causing abrupt exit of the tasks, these files might not get cleaned up.
Therefore, there are certain properties that you configure on the Overlord specifically to clean up intermediate files for the tasks that have completed and would no longer require these files:
### Durable storage cleaner configurations

|Parameter |Default | Description |
|-------------------|----------------------------------------|----------------------|
|`druid.msq.intermediate.storage.cleaner.enabled`| false | Optional. Whether durable storage cleaner should be enabled for the cluster. |
|`druid.msq.intermediate.storage.cleaner.delaySeconds`| 86400 | Optional. The delay (in seconds) after the last run post which the durable storage cleaner would clean the outputs. |
Durable storage creates files on the remote storage, and these files get cleaned up once a job no longer requires those files. However, due to failures causing abrupt exits of tasks, these files might not get cleaned up.
You can configure the Overlord to periodically clean up these intermediate files after a task completes and the files are no longer need. The files that get cleaned up are determined by the storage prefix you configure. Any files that match the path for the storage prefix may get cleaned up, not just intermediate files that are no longer needed.

Use the following configurations to control the cleaner:

|Parameter | Required | Description | Default |
|--|--|--|--|
|`druid.msq.intermediate.storage.cleaner.enabled`| No | Whether durable storage cleaner should be enabled for the cluster. | false |
|`druid.msq.intermediate.storage.cleaner.delaySeconds`| No | The delay (in seconds) after the latest run post which the durable storage cleaner cleans the up files. | 86400 |


## Limits
Expand Down
25 changes: 16 additions & 9 deletions docs/multi-stage-query/security.md
Original file line number Diff line number Diff line change
Expand Up @@ -60,17 +60,24 @@ Depending on what a user is trying to do, they might also need the following per



## S3
## Permissions for durable storage

The MSQ task engine can use S3 to store intermediate files when running queries. This can increase its reliability but requires certain permissions in S3.
These permissions are required if you configure durable storage.
The MSQ task engine can use Amazon S3 or Azure Blog Storage to store intermediate files when running queries. To upload, read, move and delete these intermediate files, the MSQ task engine requires certain permissions specific to the storage provider.

Permissions for pushing and fetching intermediate stage results to and from S3:
### S3

- `s3:GetObject`
- `s3:PutObject`
- `s3:AbortMultipartUpload`
The MSQ task engine needs the following permissions for pushing, fetching, and removing intermediate stage results to and from S3:

Permissions for removing intermediate stage results:
- `s3:GetObject` to retrieve the files. Note that `GetObject` also requires read permission on the object that gets retrieved.
- `s3:PutObject` to upload files.
- `s3:AbortMultipartUpload` to cancel the upload of files
- `s3:DeleteObject` to delete files when they're no longer needed.

- `s3:DeleteObject`
### Azure

The MSQ task engine needs the following permissions for pushing, fetching, and removing intermediate stage results to and from Azure:

- `Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read` to read and list the files in durable storage
- `Microsoft.Storage/storageAccounts/blobServices/containers/blobs/write` to write the files in durable storage.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit: "the files" is used in these two sentences, while the rest omit the use of the article. Similarly in the block above

- `Microsoft.Storage/storageAccounts/blobServices/containers/blobs/add/action` to create files in durable storage.
- `Microsoft.Storage/storageAccounts/blobServices/containers/blobs/delete` to delete files when they're no longer needed.
17 changes: 12 additions & 5 deletions docs/operations/durable-storage.md
Original file line number Diff line number Diff line change
Expand Up @@ -39,13 +39,20 @@ To enable durable storage, you need to set the following common service properti

```
druid.msq.intermediate.storage.enable=true
druid.msq.intermediate.storage.type=s3
druid.msq.intermediate.storage.bucket=YOUR_BUCKET
druid.msq.intermediate.storage.prefix=YOUR_PREFIX
druid.msq.intermediate.storage.tempDir=/path/to/your/temp/dir
# Include these configs if you're using S3
# druid.msq.intermediate.storage.type=s3
# druid.msq.intermediate.storage.bucket=YOUR_BUCKET
# Include these configs if you're using Azure Blob Storage
# druid.msq.intermediate.storage.type=azure
# druid.sq.intermediate.storage.container=YOUR_CONTAINER
druid.msq.intermediate.storage.prefix=YOUR_PREFIX
```

For detailed information about the settings related to durable storage, see [Durable storage configurations](../multi-stage-query/reference.md#durable-storage-configurations).
For detailed information about these and additional settings related to durable storage, see [Durable storage configurations](../multi-stage-query/reference.md#durable-storage-configurations).


## Use durable storage for SQL-based ingestion queries
Expand Down Expand Up @@ -80,7 +87,7 @@ cleaner can be scheduled to clean the directories corresponding to which there i
the storage connector to work upon the durable storage. The durable storage location should only be utilized to store the output
for the cluster's MSQ tasks. If the location contains other files or directories, then they will get cleaned up as well.

Use `druid.msq.intermediate.storage.cleaner.enabled` and `druid.msq.intermediate.storage.cleaner.delaySEconds` to configure the cleaner. For more information, see [Durable storage configurations](../multi-stage-query/reference.md#durable-storage-configurations).
Use `druid.msq.intermediate.storage.cleaner.enabled` and `druid.msq.intermediate.storage.cleaner.delaySeconds` to configure the cleaner. For more information, see [Durable storage configurations](../multi-stage-query/reference.md#durable-storage-configurations).

Note that if you choose to write query results to durable storage,the results are cleaned up when the task is removed from the metadata store.