
docs: add concurrent compaction docs #15218

Merged
16 commits, merged Oct 27, 2023

Conversation

@317brian (Contributor)

Description

  • Adds the docs for concurrent compaction
  • Refactors the page to be more structured:

General compaction info as the parent w/ manual compaction and auto compaction as its children:
[screenshot of the new page hierarchy]

Build preview: https://druid-git-concurrent-compaction-317brian.vercel.app/docs/latest/data-management/compaction


This PR has:

  • been self-reviewed.

@317brian 317brian added this to the 28.0 milestone Oct 20, 2023

##### Update the compaction settings with the UI

In the **Compaction config** for a datasource, set **Allow concurrent compaction append tasks** to **True**.
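If you manage the compaction config outside the UI, the equivalent setting lives in the datasource's compaction config JSON. A minimal sketch, assuming the `taskContext` field placement and the `wikipedia` datasource name (both illustrative, not confirmed by this PR), where the compaction side of concurrent append and replace takes a `REPLACE` lock:

```json
{
  "dataSource": "wikipedia",
  "taskContext": {
    "taskLockType": "REPLACE"
  }
}
```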
Contributor:

The UI for compaction says "Allow concurrent compactions (experimental)"

@317brian 317brian marked this pull request as ready for review October 25, 2023 03:06

Next, configure the task lock type for your ingestion job. For streaming jobs, the context parameter goes in your supervisor spec; for legacy JSON-based batch ingestion, it goes in your ingestion spec. You can provide the context parameter through the API, like any other parameter, or through the UI.

##### Add the task lock type through the API
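For example, a legacy JSON-based batch ingestion spec submitted through the task API might carry the lock type like this. This is a sketch: the `spec` contents are elided, and the `APPEND` value applies only when the task appends to existing data (a replacing task would use `REPLACE` instead):

```json
{
  "type": "index_parallel",
  "spec": { },
  "context": {
    "taskLockType": "APPEND"
  }
}
```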
@AmatyaAvadhanula (Contributor), Oct 25, 2023:

Could we please explicitly add that a streaming supervisor spec must always have an APPEND lock when using concurrent append and replace?
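To illustrate the point above, a streaming supervisor spec would pin the `APPEND` lock type in its context. A sketch only; the top-level placement of the `context` field in the supervisor spec is an assumption here, and the `spec` contents are elided:

```json
{
  "type": "kafka",
  "spec": { },
  "context": {
    "taskLockType": "APPEND"
  }
}
```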

@317brian (Contributor Author) replied:

Added below in the append section that talks about lock type

@kfaraz (Contributor) left a comment:

Thanks for the changes, @317brian ! I have left some suggestions.

Given that this is an experimental feature, we can proceed with the docs that we have so far.

As the feature evolves and is hardened, we will enrich these docs to include some more technical details.

docs/ingestion/ingestion-spec.md (outdated; resolved)
@@ -43,18 +44,20 @@ By default, compaction does not modify the underlying data of the segments. Howe

Compaction does not improve performance in all situations. For example, if you rewrite your data with each ingestion task, you don't need to use compaction. See [Segment optimization](../operations/segment-optimization.md) for additional guidance to determine if compaction will help in your environment.

## Types of compaction
## Choose your compaction type
Contributor:

I don't think this heading aligns with the rest of the headings.

Also, the type of compaction is not really a choice in the way that partitioning type is (range, hashed, or dynamic: three different paths that give you three different results).

We should just call this "Ways to run compaction" or something in a similar vein.


If you enable automatic compaction, you can use concurrent append and replace to concurrently compact data as you ingest it for streaming and legacy JSON-based batch ingestion.

Setting up concurrent append and replace is a two-step process. The first is to update your datasource and the second is to update your ingestion job.
Contributor:

This is not exactly correct. It doesn't make a lot of sense to "update a datasource" unless you mean adding data to a datasource.

Moreover, we shouldn't even look at this as a two step process, rather as an opt-in behaviour. Any ingestion job that wants to run concurrently with other ingestion jobs needs to use the correct lock types.

Please see the other suggestion.

Concurrent append and replace is an [experimental feature](../development/experimental.md) and is not currently available for SQL-based ingestion.
:::

If you enable automatic compaction, you can use concurrent append and replace to concurrently compact data as you ingest it for streaming and legacy JSON-based batch ingestion.
Contributor:

Suggested change
If you enable automatic compaction, you can use concurrent append and replace to concurrently compact data as you ingest it for streaming and legacy JSON-based batch ingestion.
This feature allows you to safely replace the existing data in an interval of a datasource while new data is being appended to that interval. One of the most common applications of this is appending new data (using say streaming ingestion) to an interval while compaction of that interval is already in progress.

Comment on lines 214 to 219
Setting up concurrent append and replace is a two-step process. The first is to update your datasource and the second is to update your ingestion job.

Using concurrent append and replace in the following scenarios can be beneficial:

- If the job with an `APPEND` task and the job with a `REPLACE` task have the same segment granularity. For example, when a datasource and its streaming ingestion job have the same granularity.
- If the job with an `APPEND` task has a finer segment granularity than the replacing job.
Contributor:

Suggested change
Setting up concurrent append and replace is a two-step process. The first is to update your datasource and the second is to update your ingestion job.
Using concurrent append and replace in the following scenarios can be beneficial:
- If the job with an `APPEND` task and the job with a `REPLACE` task have the same segment granularity. For example, when a datasource and its streaming ingestion job have the same granularity.
- If the job with an `APPEND` task has a finer segment granularity than the replacing job.
You can enable concurrent append and replace by ensuring the following:
- The append task (with `appendToExisting` set to `true`) has `taskLockType` set to `APPEND` in the task context.
- The replace task (with `appendToExisting` set to `false`) has `taskLockType` set to `REPLACE` in the task context.
- The segment granularity of the append task is equal to or finer than the segment granularity of the replace task.


We do not recommend using concurrent append and replace when the job with an `APPEND` task has a coarser granularity than the job with a `REPLACE` task. For example, if the `APPEND` job has a yearly granularity and the `REPLACE` job has a monthly granularity. The job that finishes second will fail.
Contributor:

This point should be in a note or warning block.

Two more points to call out: at any point in time,

- there can only be a single task that holds a `REPLACE` lock on a given interval of a datasource.
- there may be multiple tasks that hold `APPEND` locks on a given interval of a datasource and append data to that interval simultaneously.

- The replace task (with `appendToExisting` set to `false`) has `taskLockType` set to `REPLACE` in the task context.
- The segment granularity of the append task is equal to or finer than the segment granularity of the replace task.
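Taken together, the append task's context fragment would look like the following sketch; the replace task's context is identical except that `taskLockType` is `"REPLACE"`:

```json
{
  "context": {
    "taskLockType": "APPEND"
  }
}
```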

:::info
@317brian (Contributor Author) replied:

Renders as: [screenshot of the rendered info callout]

@317brian (Contributor Author):
@kfaraz and @AmatyaAvadhanula, I think I got all of your comments. I manually made the changes from the comment suggestions since I already had the page open.

@kfaraz (Contributor) left a comment:

Minor change, otherwise LGTM.

docs/data-management/automatic-compaction.md (outdated; resolved)
@317brian 317brian merged commit 7379477 into apache:master Oct 27, 2023
11 checks passed
317brian added a commit to 317brian/druid that referenced this pull request Oct 27, 2023
Co-authored-by: Kashif Faraz <[email protected]>
(cherry picked from commit 7379477)
abhishekagarwal87 pushed a commit that referenced this pull request Oct 30, 2023
Co-authored-by: Kashif Faraz <[email protected]>
(cherry picked from commit 7379477)
CaseyPan pushed a commit to CaseyPan/druid that referenced this pull request Nov 17, 2023