docs: add concurrent compaction docs #15218
Conversation
##### Update the compaction settings with the UI
In the **Compaction config** for a datasource, set **Allow concurrent compaction append tasks** to **True**.
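For teams that manage compaction configs through the API rather than the console, a minimal sketch of what the equivalent setting might look like is shown below. The datasource name `wikipedia`, the `skipOffsetFromLatest` value, and the exact placement of the `taskContext` field are illustrative assumptions rather than content from this PR; check the automatic compaction API reference for the authoritative shape.

```json
{
  "dataSource": "wikipedia",
  "skipOffsetFromLatest": "PT1H",
  "taskContext": {
    "taskLockType": "REPLACE"
  }
}
```

Posting a config like this to the Coordinator's compaction config endpoint should have the same effect as the UI toggle: compaction tasks for the datasource acquire a `REPLACE` lock so that concurrent `APPEND` ingestion can proceed.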
Next, you need to configure the task lock type for your ingestion job. For streaming jobs, the context parameter goes in your supervisor spec. For legacy JSON-based batch ingestion, the context parameter goes in your ingestion spec. You can provide the context parameter through the API, like any other parameter for streaming or JSON-based batch ingestion, or through the UI.
##### Add the task lock type through the API
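As a rough sketch of what this section could show (not text taken from this PR), the context parameter is a small JSON fragment added to the supervisor spec or ingestion spec. The exact nesting depends on the spec type, so treat the placement as an assumption:

```json
{
  "context": {
    "taskLockType": "APPEND"
  }
}
```

Streaming supervisor specs and other appending jobs would use `APPEND`, while replacing jobs such as compaction would use `REPLACE`, as discussed in the lock-type suggestions below.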
Could we please explicitly add that a streaming supervisor spec must always have an APPEND lock when using concurrent append and replace?
Added below in the append section that talks about lock type
Thanks for the changes, @317brian! I have left some suggestions.
Given that this is an experimental feature, we can proceed with the docs that we have so far.
As the feature evolves and is hardened, we will enrich these docs to include some more technical details.
docs/data-management/compaction.md
Outdated
@@ -43,18 +44,20 @@ By default, compaction does not modify the underlying data of the segments. Howe
Compaction does not improve performance in all situations. For example, if you rewrite your data with each ingestion task, you don't need to use compaction. See [Segment optimization](../operations/segment-optimization.md) for additional guidance to determine if compaction will help in your environment.
## Types of compaction
## Choose your compaction type
I don't think this heading aligns with the rest of the headings.
Also, the type of compaction is not really much of a choice in the way that partitioning type is a choice (range, hashed, or dynamic, where we are choosing three different paths that give you three different results).
We should just call this "Ways to run compaction" or something in a similar vein.
If you enable automatic compaction, you can use concurrent append and replace to concurrently compact data as you ingest it for streaming and legacy JSON-based batch ingestion.
Setting up concurrent append and replace is a two-step process. The first is to update your datasource and the second is to update your ingestion job.
This is not exactly correct. It doesn't make a lot of sense to "update a datasource" unless you mean adding data to a datasource.
Moreover, we shouldn't even look at this as a two-step process, but rather as an opt-in behaviour: any ingestion job that wants to run concurrently with other ingestion jobs needs to use the correct lock types.
Please see the other suggestion.
Concurrent append and replace is an [experimental feature](../development/experimental.md) and is not currently available for SQL-based ingestion.
:::
If you enable automatic compaction, you can use concurrent append and replace to concurrently compact data as you ingest it for streaming and legacy JSON-based batch ingestion.
If you enable automatic compaction, you can use concurrent append and replace to concurrently compact data as you ingest it for streaming and legacy JSON-based batch ingestion.
This feature allows you to safely replace the existing data in an interval of a datasource while new data is being appended to that interval. One of the most common applications of this is appending new data (using, say, streaming ingestion) to an interval while compaction of that interval is already in progress.
Setting up concurrent append and replace is a two-step process. The first is to update your datasource and the second is to update your ingestion job.
Using concurrent append and replace in the following scenarios can be beneficial:
- If the job with an `APPEND` task and the job with a `REPLACE` task have the same segment granularity. For example, when a datasource and its streaming ingestion job have the same granularity.
- If the job with an `APPEND` task has a finer segment granularity than the replacing job.
Setting up concurrent append and replace is a two-step process. The first is to update your datasource and the second is to update your ingestion job.
Using concurrent append and replace in the following scenarios can be beneficial:
- If the job with an `APPEND` task and the job with a `REPLACE` task have the same segment granularity. For example, when a datasource and its streaming ingestion job have the same granularity.
- If the job with an `APPEND` task has a finer segment granularity than the replacing job.
You can enable concurrent append and replace by ensuring the following:
- The append task (with `appendToExisting` set to `true`) has `taskLockType` set to `APPEND` in the task context.
- The replace task (with `appendToExisting` set to `false`) has `taskLockType` set to `REPLACE` in the task context.
- The segment granularity of the append task is equal to or finer than the segment granularity of the replace task.
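To make the suggested checklist concrete, here is a hedged sketch of the append side for a native batch task. Everything other than `appendToExisting` and the `taskLockType` context value (the task type, the bare `ioConfig`, the omitted `dataSchema` and `inputSource`) is an illustrative assumption, not content from this PR:

```json
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "appendToExisting": true
    }
  },
  "context": {
    "taskLockType": "APPEND"
  }
}
```

The replacing task would mirror this with `appendToExisting` set to `false` and `"taskLockType": "REPLACE"` in its context.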
- If the job with an `APPEND` task and the job with a `REPLACE` task have the same segment granularity. For example, when a datasource and its streaming ingestion job have the same granularity.
- If the job with an `APPEND` task has a finer segment granularity than the replacing job.
We do not recommend using concurrent append and replace when the job with an `APPEND` task has a coarser granularity than the job with a `REPLACE` task. For example, if the `APPEND` job has a yearly granularity and the `REPLACE` job has a monthly granularity, the job that finishes second will fail.
This point should be in a note or warning block.
Two more points to call out are that, at any point in time:
- There can only be a single task that holds a `REPLACE` lock on a given interval of a datasource.
- There may be multiple tasks that hold `APPEND` locks on a given interval of a datasource and append data to that interval simultaneously.
- The replace task (with `appendToExisting` set to `false`) has `taskLockType` set to `REPLACE` in the task context.
- The segment granularity of the append task is equal to or finer than the segment granularity of the replace task.
:::info
@kfaraz and @AmatyaAvadhanula I think I got all your comments. I manually made the changes for the comment suggestions since I already had the page open
Minor change, otherwise LGTM.
Co-authored-by: Kashif Faraz <[email protected]>
Co-authored-by: Kashif Faraz <[email protected]> (cherry picked from commit 7379477)
Description
General compaction info as the parent w/ manual compaction and auto compaction as its children:
Build preview: https://druid-git-concurrent-compaction-317brian.vercel.app/docs/latest/data-management/compaction
This PR has: