
Data management API doc refactor #15087

Merged (57 commits) on Nov 20, 2023

Conversation

writer-jill
Contributor

@writer-jill commented Oct 4, 2023

Refactor the data management API documentation.

This PR has:

  • been self-reviewed.

docs/api-reference/data-management-api.md — multiple review threads, all resolved
writer-jill and others added 26 commits October 5, 2023 12:50
@vtlim (Member) left a comment

LGTM

krishnanand5 and others added 22 commits November 20, 2023 19:34
…ationWithDefaults` (apache#15317)

* + Fix for Flaky Test

* + Replacing TreeMap with LinkedHashMap

* + Changing data structure from LinkedHashMap to HashMap

* Fixed flaky test in S3DataSegmentPusherConfigTest.testSerializationValidatingMaxListingLength

* Minor Changes
…` query. (apache#15243)

* MSQ generates tombstones honoring the query's granularity.

This change applies the special handling only to infinite-interval tombstones.
For finite-interval tombstones, the MSQ query granularity is used,
which is consistent with how MSQ works.

* more tests and some cleanup.

* checkstyle

* comment edits

* Throw TooManyBuckets fault based on review; add more tests.

* Add Javadocs for both methods explaining how they are reconciled.

* review: Move testReplaceTombstonesWithTooManyBucketsThrowsException to MsqFaultsTest

* remove unused imports.

* Move TooManyBucketsException to indexing package for shared exception handling.

* lower max bucket for tests and fixup count

* Advance and count the iterator.

* checkstyle
Fixed a bug where the MSQ controller task would continue to hold its task slot even after cancel was issued.
This was due to a deadlock created on work launch: the main thread was waiting for tasks to spawn, while the cancel thread was waiting for tasks to finish.
The fix instructs the MSQWorkerTaskLauncher thread to stop creating new tasks, which lets the main thread unblock and release the slot.

Also short-circuited the taskRetriable condition. The check now runs in the MSQWorkerTaskLauncher thread rather than in the main event loop, resulting in faster task failure when a task is deemed non-retriable.
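The deadlock and its fix can be sketched in miniature. This is an illustrative stand-in, not Druid's actual `MSQWorkerTaskLauncher` API: a shared stop flag lets the cancel path unblock the main thread that is waiting for workers to spawn.

```python
import threading

class WorkerTaskLauncher:
    """Toy model of the launcher/controller interaction (names are
    illustrative, not Druid's API)."""

    def __init__(self, tasks_needed):
        self.tasks_needed = tasks_needed
        self.tasks_started = 0
        self.stop_requested = False
        self.cond = threading.Condition()

    def launch_loop(self):
        # Launcher thread: starts tasks until done or told to stop.
        with self.cond:
            while self.tasks_started < self.tasks_needed and not self.stop_requested:
                self.tasks_started += 1
                self.cond.notify_all()

    def wait_for_workers(self):
        # Main thread: blocks until all tasks spawn OR a stop was requested.
        # Without the stop_requested check, this wait would never return
        # after a cancel, deadlocking the controller and pinning the slot.
        with self.cond:
            while self.tasks_started < self.tasks_needed and not self.stop_requested:
                self.cond.wait()
            return not self.stop_requested

    def cancel(self):
        # Cancel thread: tells the launcher to stop creating tasks,
        # which unblocks wait_for_workers() so the slot can be released.
        with self.cond:
            self.stop_requested = True
            self.cond.notify_all()
```

With the flag, `cancel()` makes `wait_for_workers()` return promptly instead of waiting forever for tasks that will never spawn.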
* Document segment metadata cache behaviour

* Fix typo

* Minor update

* Minor change
…n` by changing string to key:value pair (apache#15207)

* Fix capacity response in mm-less ingestion (apache#14888)

Changes:
- Fix capacity response in mm-less ingestion.
- Add field usedClusterCapacity to the GET /totalWorkerCapacity response.
This API should be used to get the total ingestion capacity on the overlord.
- Remove method `isK8sTaskRunner` from interface `TaskRunner`

* Using Map to perform comparison

* Minor Change

---------

Co-authored-by: George Shiqi Wu <[email protected]>
There is a problem with Quantiles sketches and KLL Quantiles sketches.
Queries using the histogram post-aggregator fail if:

- the sketch contains at least one value, and
- the values in the sketch are all equal, and
- the splitPoints argument is not passed to the post-aggregator, and
- the numBins argument is greater than 2 (or not specified, which
  leads to the default of 10 being used)

In that case, the query fails and returns this error:

    {
      "error": "Unknown exception",
      "errorClass": "org.apache.datasketches.common.SketchesArgumentException",
      "host": null,
      "errorCode": "legacyQueryException",
      "persona": "OPERATOR",
      "category": "RUNTIME_FAILURE",
      "errorMessage": "Values must be unique, monotonically increasing and not NaN.",
      "context": {
        "host": null,
        "errorClass": "org.apache.datasketches.common.SketchesArgumentException",
        "legacyErrorCode": "Unknown exception"
      }
    }

This behaviour is undesirable, since the caller doesn't necessarily
know in advance whether the sketch has values that are diverse
enough. With this change, the post-aggregators return [N, 0, 0...]
instead of crashing, where N is the number of values in the sketch,
and the length of the list is equal to numBins. That is what they
already returned for numBins = 2.

Here is an example of a query that would fail:

    {"queryType":"timeseries",
     "dataSource": {
       "type": "inline",
       "columnNames": ["foo", "bar"],
       "rows": [
          ["abc", 42.0],
          ["def", 42.0]
       ]
     },
     "intervals":["0000/3000"],
     "granularity":"all",
     "aggregations":[
       {"name":"the_sketch", "fieldName":"bar", "type":"quantilesDoublesSketch"}],
     "postAggregations":[
       {"name":"the_histogram",
        "type":"quantilesDoublesSketchToHistogram",
        "field":{"type":"fieldAccess","fieldName":"the_sketch"},
        "numBins": 3}]}

I believe this also fixes issue apache#10585.
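A toy model of the fixed behavior, as an illustrative sketch rather than the actual DataSketches or Druid implementation: when every value in the sketch is identical, evenly spaced split points collapse and cannot be strictly increasing, so the post-aggregator returns `[N, 0, 0, ...]` instead of failing.

```python
def sketch_histogram(values, num_bins=10, split_points=None):
    """Illustrative model of the quantiles-sketch histogram
    post-aggregator (not Druid's actual implementation)."""
    n = len(values)
    if n == 0:
        return [0] * num_bins
    lo, hi = min(values), max(values)
    if split_points is None:
        if lo == hi:
            # All values equal: evenly spaced split points would not be
            # strictly increasing, which the DataSketches library rejects.
            # Return [N, 0, 0, ...] instead of failing, matching the fix.
            return [n] + [0] * (num_bins - 1)
        step = (hi - lo) / num_bins
        split_points = [lo + step * i for i in range(1, num_bins)]
    counts = [0] * (len(split_points) + 1)
    for v in values:
        idx = sum(1 for p in split_points if v >= p)
        counts[idx] += 1
    return counts
```

For the inline datasource above (`bar` is 42.0 in both rows), `sketch_histogram([42.0, 42.0], num_bins=3)` now yields `[2, 0, 0]` rather than an error.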
Fix an outdated query in the deep storage docs.
…pache#14995)

* Prevent a race that may cause multiple attempts to publish segments for the same sequence
The TaskQueue maintains a map of active task ids to tasks, which can be utilized to get active task payloads, before falling back to the metadata store.
Fixed the following flaky tests:

    org.apache.druid.math.expr.ParserTest#testApplyFunctions
    org.apache.druid.math.expr.ParserTest#testSimpleMultiplicativeOp1
    org.apache.druid.math.expr.ParserTest#testFunctions
    org.apache.druid.math.expr.ParserTest#testSimpleLogicalOps1
    org.apache.druid.math.expr.ParserTest#testSimpleAdditivityOp1
    org.apache.druid.math.expr.ParserTest#testSimpleAdditivityOp2

The tests above have been reported as flaky (they assume a deterministic implementation of a non-deterministic specification) when run against the NonDex tool.
The tests contain assertions that compare a List created from a HashSet via the ArrayList constructor with another List. However, HashSet does not guarantee element ordering, so when the NonDex tool shuffles the HashSet elements, the tests fail.

Co-authored-by: ythorat2 <[email protected]>
This patch introduces a param snapshotTime in the iceberg inputsource spec that allows the user to ingest data files associated with the most recent snapshot as of the given time. This helps the user ingest data based on older snapshots by specifying the associated snapshot time.
This patch also upgrades the iceberg core version to 1.4.1
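An inputsource spec using the new parameter might look roughly like the following. This is a sketch: field names other than `snapshotTime` are assumptions based on typical Druid inputsource specs, so consult the Iceberg extension docs for the exact schema.

```json
{
  "type": "iceberg",
  "namespace": "sales_db",
  "tableName": "orders",
  "snapshotTime": "2023-11-01T00:00:00.000Z"
}
```

Data files associated with the most recent snapshot as of `snapshotTime` would be ingested, rather than those of the latest snapshot.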
…d json_query (apache#15320)

* support dynamic expressions for path arguments for json_value and json_query
* reset spec before looking for tile

* improve logging

* log screenshots

* get and log jpeg

* other test tidy up
* Set numCorePartitions to 0 in the TombstoneShardSpec.

* fix up test

* Add tombstone core partition tests

* review comment

* Need to register the test shard type to make jackson happy
…g-segment retry bug. (apache#15260)

* Fix NPE caused by realtime segment closing race, fix possible missing-segment retry bug.

Fixes apache#12168, by returning empty from FireHydrant when the segment is
swapped to null. This causes the SinkQuerySegmentWalker to use
ReportTimelineMissingSegmentQueryRunner, which causes the Broker to look
for the segment somewhere else.

In addition, this patch changes SinkQuerySegmentWalker to acquire references
to all hydrants (subsegments of a sink) at once, and return a
ReportTimelineMissingSegmentQueryRunner if *any* of them could not be acquired.
I suspect, although have not confirmed, that the prior behavior could lead to
segments being reported as missing even though results from some hydrants were
still included.

* Some more test coverage.
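The all-or-nothing acquisition described above can be sketched as follows. This is a minimal illustration with hypothetical `Hydrant` and `Ref` types, not Druid's actual SinkQuerySegmentWalker code: if any sub-segment reference cannot be acquired, everything already acquired is released and the caller reports the segment missing.

```python
class Ref:
    """Hypothetical reference handle to a hydrant's segment."""
    def __init__(self):
        self.released = False

    def release(self):
        self.released = True

class Hydrant:
    """Hypothetical sub-segment of a sink; acquisition fails once the
    underlying segment has been swapped out."""
    def __init__(self, available=True):
        self.available = available

    def try_acquire(self):
        return Ref() if self.available else None

def acquire_all(hydrants):
    """All-or-nothing acquisition: either hold references to every
    hydrant of a sink, or hold none and report the segment missing."""
    acquired = []
    for h in hydrants:
        ref = h.try_acquire()      # None if the segment was swapped out
        if ref is None:
            for r in acquired:     # roll back everything acquired so far
                r.release()
            return None            # caller reports a missing segment
        acquired.append(ref)
    return acquired
```

Acquiring every reference up front avoids the prior behavior in which some hydrants had already contributed results while others were reported missing.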
Merge branch 'master' into dm-doc-refactor
@vtlim vtlim merged commit 6ed343c into apache:master Nov 20, 2023
11 checks passed
yashdeep97 added a commit to yashdeep97/druid that referenced this pull request Dec 1, 2023
Co-authored-by: Victoria Lim <[email protected]>
Co-authored-by: George Shiqi Wu <[email protected]>
Co-authored-by: 317brian <[email protected]>
Co-authored-by: ythorat2 <[email protected]>
Co-authored-by: Krishna Anandan <[email protected]>
Co-authored-by: Vadim Ogievetsky <[email protected]>
Co-authored-by: Abhishek Radhakrishnan <[email protected]>
Co-authored-by: Karan Kumar <[email protected]>
Co-authored-by: Rishabh Singh <[email protected]>
Co-authored-by: Magnus Henoch <[email protected]>
Co-authored-by: AmatyaAvadhanula <[email protected]>
Co-authored-by: Charles Smith <[email protected]>
Co-authored-by: Yashdeep Thorat <[email protected]>
Co-authored-by: Atul Mohan <[email protected]>
Co-authored-by: Clint Wylie <[email protected]>
Co-authored-by: Gian Merlino <[email protected]>
@LakshSingla LakshSingla added this to the 29.0.0 milestone Jan 29, 2024