
docs: add concurrent compaction docs #15218

Merged
16 commits, merged Oct 27, 2023

Conversation

@317brian (Contributor)

Description

  • Adds the docs for concurrent compaction
  • Refactors the page to be more structured:

General compaction info as the parent w/ manual compaction and auto compaction as its children:
[screenshot of the new page hierarchy]

Build preview: https://druid-git-concurrent-compaction-317brian.vercel.app/docs/latest/data-management/compaction


This PR has:

  • been self-reviewed.

@317brian 317brian added this to the 28.0 milestone Oct 20, 2023

##### Update the compaction settings with the UI

In the **Compaction config** for a datasource, set **Allow concurrent compaction append tasks** to **True**.
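If you manage the compaction config outside the UI, the equivalent setting lives in the datasource's compaction config JSON. A minimal sketch, assuming the `taskContext` field placement and the `wikipedia` datasource name (both illustrative, not confirmed by this PR), where the compaction side of concurrent append and replace takes a `REPLACE` lock:

```json
{
  "dataSource": "wikipedia",
  "taskContext": {
    "taskLockType": "REPLACE"
  }
}
```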
Contributor:

The UI for compaction says "Allow concurrent compactions (experimental)"

@317brian 317brian marked this pull request as ready for review October 25, 2023 03:06

Next, configure the task lock type for your ingestion job. For streaming jobs, the context parameter goes in your supervisor spec; for legacy JSON-based batch ingestion, it goes in your ingestion spec. You can provide the context parameter through the API, like any other parameter, or through the UI.

##### Add the task lock type through the API
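For example, a legacy JSON-based batch ingestion spec submitted through the task API might carry the lock type like this. This is a sketch: the `spec` contents are elided, and the `APPEND` value applies only when the task appends to existing data (a replacing task would use `REPLACE` instead):

```json
{
  "type": "index_parallel",
  "spec": { },
  "context": {
    "taskLockType": "APPEND"
  }
}
```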
@AmatyaAvadhanula (Contributor), Oct 25, 2023:

Could we please explicitly add that a streaming supervisor spec must always have an APPEND lock when using concurrent append and replace?
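To illustrate the point above, a streaming supervisor spec would pin the `APPEND` lock type in its context. A sketch only; the top-level placement of the `context` field in the supervisor spec is an assumption here, and the `spec` contents are elided:

```json
{
  "type": "kafka",
  "spec": { },
  "context": {
    "taskLockType": "APPEND"
  }
}
```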

@317brian (Contributor Author) replied:

Added below in the append section that talks about lock type

@kfaraz (Contributor) left a comment:

Thanks for the changes, @317brian ! I have left some suggestions.

Given that this is an experimental feature, we can proceed with the docs that we have so far.

As the feature evolves and is hardened, we will enrich these docs to include some more technical details.

docs/ingestion/ingestion-spec.md (outdated; resolved)
@@ -43,18 +44,20 @@ By default, compaction does not modify the underlying data of the segments. Howe

Compaction does not improve performance in all situations. For example, if you rewrite your data with each ingestion task, you don't need to use compaction. See [Segment optimization](../operations/segment-optimization.md) for additional guidance to determine if compaction will help in your environment.

## Types of compaction
## Choose your compaction type
Contributor:

I don't think this heading aligns with the rest of the headings.

Also, the type of compaction is not really a choice in the way that partitioning type is (range, hashed, or dynamic: three different paths that give you three different results).

We should just call this "Ways to run compaction" or something in a similar vein.


If you enable automatic compaction, you can use concurrent append and replace to concurrently compact data as you ingest it for streaming and legacy JSON-based batch ingestion.

Setting up concurrent append and replace is a two-step process. The first is to update your datasource and the second is to update your ingestion job.
Contributor:

This is not exactly correct. It doesn't make a lot of sense to "update a datasource" unless you mean adding data to a datasource.

Moreover, we shouldn't even look at this as a two step process, rather as an opt-in behaviour. Any ingestion job that wants to run concurrently with other ingestion jobs needs to use the correct lock types.

Please see the other suggestion.

Concurrent append and replace is an [experimental feature](../development/experimental.md) and is not currently available for SQL-based ingestion.
:::

If you enable automatic compaction, you can use concurrent append and replace to concurrently compact data as you ingest it for streaming and legacy JSON-based batch ingestion.
Contributor:

Suggested change
If you enable automatic compaction, you can use concurrent append and replace to concurrently compact data as you ingest it for streaming and legacy JSON-based batch ingestion.
This feature allows you to safely replace the existing data in an interval of a datasource while new data is being appended to that interval. One of the most common applications of this is appending new data (using say streaming ingestion) to an interval while compaction of that interval is already in progress.

Comment on lines 214 to 219
Setting up concurrent append and replace is a two-step process. The first is to update your datasource and the second is to update your ingestion job.

Using concurrent append and replace in the following scenarios can be beneficial:

- If the job with an `APPEND` task and the job with a `REPLACE` task have the same segment granularity. For example, when a datasource and its streaming ingestion job have the same granularity.
- If the job with an `APPEND` task has a finer segment granularity than the replacing job.
Contributor:

Suggested change
Setting up concurrent append and replace is a two-step process. The first is to update your datasource and the second is to update your ingestion job.
Using concurrent append and replace in the following scenarios can be beneficial:
- If the job with an `APPEND` task and the job with a `REPLACE` task have the same segment granularity. For example, when a datasource and its streaming ingestion job have the same granularity.
- If the job with an `APPEND` task has a finer segment granularity than the replacing job.
You can enable concurrent append and replace by ensuring the following:
- The append task (with `appendToExisting` set to `true`) has `taskLockType` set to `APPEND` in the task context.
- The replace task (with `appendToExisting` set to `false`) has `taskLockType` set to `REPLACE` in the task context.
- The segment granularity of the append task is equal to or finer than the segment granularity of the replace task.


We do not recommend using concurrent append and replace when the job with an `APPEND` task has a coarser granularity than the job with a `REPLACE` task. For example, if the `APPEND` job has a yearly granularity and the `REPLACE` job has a monthly granularity. The job that finishes second will fail.
Contributor:

This point should be in a note or warning block.

Two more points to call out: at any point in time,

- there can only be a single task that holds a `REPLACE` lock on a given interval of a datasource.
- there may be multiple tasks that hold `APPEND` locks on a given interval of a datasource and append data to that interval simultaneously.

- The replace task (with `appendToExisting` set to `false`) has `taskLockType` set to `REPLACE` in the task context.
- The segment granularity of the append task is equal to or finer than the segment granularity of the replace task.
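Taken together, the append task's context fragment would look like the following sketch; the replace task's context is identical except that `taskLockType` is `"REPLACE"`:

```json
{
  "context": {
    "taskLockType": "APPEND"
  }
}
```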

:::info
@317brian (Contributor Author) replied:

Renders as: [screenshot of the rendered info callout]

@317brian (Contributor Author):
@kfaraz and @AmatyaAvadhanula, I think I got all of your comments. I manually made the changes from the comment suggestions since I already had the page open.

@kfaraz (Contributor) left a comment:

Minor change, otherwise LGTM.

docs/data-management/automatic-compaction.md (outdated; resolved)
@317brian 317brian merged commit 7379477 into apache:master Oct 27, 2023
11 checks passed
317brian added a commit to 317brian/druid that referenced this pull request Oct 27, 2023
Co-authored-by: Kashif Faraz <[email protected]>
(cherry picked from commit 7379477)
abhishekagarwal87 pushed a commit that referenced this pull request Oct 30, 2023
Co-authored-by: Kashif Faraz <[email protected]>
(cherry picked from commit 7379477)
CaseyPan pushed a commit to CaseyPan/druid that referenced this pull request Nov 17, 2023