
Data management API doc refactor #15087

Merged (57 commits) on Nov 20, 2023

Conversation

writer-jill
Contributor

@writer-jill commented Oct 4, 2023

Refactor the data management API documentation.

This PR has:

  • been self-reviewed.

docs/api-reference/data-management-api.md — multiple review threads, all resolved
writer-jill and others added 26 commits October 5, 2023 12:50
@vtlim (Member) left a comment

LGTM

krishnanand5 and others added 22 commits November 20, 2023 19:34
…ationWithDefaults` (apache#15317)

* + Fix for Flaky Test

* + Replacing TreeMap with LinkedHashMap

* + Changing data structure from LinkedHashMap to HashMap

* Fixed flaky test in S3DataSegmentPusherConfigTest.testSerializationValidatingMaxListingLength

* Minor Changes
…` query. (apache#15243)

* MSQ generates tombstones honoring the query's granularity.

This change applies the special handling only to infinite-interval tombstones.
For finite-interval tombstones, the MSQ query granularity is used,
which is consistent with how MSQ works.

* more tests and some cleanup.

* checkstyle

* comment edits

* Throw TooManyBuckets fault based on review; add more tests.

* Add Javadocs for both methods explaining how they are reconciled.

* review: Move testReplaceTombstonesWithTooManyBucketsThrowsException to MsqFaultsTest

* remove unused imports.

* Move TooManyBucketsException to indexing package for shared exception handling.

* lower max bucket for tests and fixup count

* Advance and count the iterator.

* checkstyle
Fixed a bug where the MSQ controller task would continue to hold its task slot even after cancel was issued.
This was due to a deadlock created on work launch: the main thread was waiting for tasks to spawn, while the cancel thread was waiting for tasks to finish.
The fix instructs the MSQWorkerTaskLauncher thread to stop creating new tasks, which lets the main thread unblock and release the slot.

Also short-circuited the taskRetriable condition. The check now runs in the MSQWorkerTaskLauncher thread rather than in the main event loop, resulting in faster task failure when a task is deemed non-retriable.
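The deadlock and its fix can be sketched in miniature. This is an illustrative stand-in, not Druid's actual `MSQWorkerTaskLauncher` API: a shared stop flag lets the cancel path unblock the main thread that is waiting for workers to spawn.

```python
import threading

class WorkerTaskLauncher:
    """Toy model of the launcher/controller interaction (names are
    illustrative, not Druid's API)."""

    def __init__(self, tasks_needed):
        self.tasks_needed = tasks_needed
        self.tasks_started = 0
        self.stop_requested = False
        self.cond = threading.Condition()

    def launch_loop(self):
        # Launcher thread: starts tasks until done or told to stop.
        with self.cond:
            while self.tasks_started < self.tasks_needed and not self.stop_requested:
                self.tasks_started += 1
                self.cond.notify_all()

    def wait_for_workers(self):
        # Main thread: blocks until all tasks spawn OR a stop was requested.
        # Without the stop_requested check, this wait would never return
        # after a cancel, deadlocking the controller and pinning the slot.
        with self.cond:
            while self.tasks_started < self.tasks_needed and not self.stop_requested:
                self.cond.wait()
            return not self.stop_requested

    def cancel(self):
        # Cancel thread: tells the launcher to stop creating tasks,
        # which unblocks wait_for_workers() so the slot can be released.
        with self.cond:
            self.stop_requested = True
            self.cond.notify_all()
```

With the flag, `cancel()` makes `wait_for_workers()` return promptly instead of waiting forever for tasks that will never spawn.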
* Document segment metadata cache behaviour

* Fix typo

* Minor update

* Minor change
…n` by changing string to key:value pair (apache#15207)

* Fix capacity response in mm-less ingestion (apache#14888)

Changes:
- Fix capacity response in mm-less ingestion.
- Add field usedClusterCapacity to the GET /totalWorkerCapacity response.
This API should be used to get the total ingestion capacity on the overlord.
- Remove method `isK8sTaskRunner` from interface `TaskRunner`

* Using Map to perform comparison

* Minor Change

---------

Co-authored-by: George Shiqi Wu <[email protected]>
There is a problem with Quantiles sketches and KLL Quantiles sketches.
Queries using the histogram post-aggregator fail if:

- the sketch contains at least one value, and
- the values in the sketch are all equal, and
- the splitPoints argument is not passed to the post-aggregator, and
- the numBins argument is greater than 2 (or not specified, which
  leads to the default of 10 being used)

In that case, the query fails and returns this error:

    {
      "error": "Unknown exception",
      "errorClass": "org.apache.datasketches.common.SketchesArgumentException",
      "host": null,
      "errorCode": "legacyQueryException",
      "persona": "OPERATOR",
      "category": "RUNTIME_FAILURE",
      "errorMessage": "Values must be unique, monotonically increasing and not NaN.",
      "context": {
        "host": null,
        "errorClass": "org.apache.datasketches.common.SketchesArgumentException",
        "legacyErrorCode": "Unknown exception"
      }
    }

This behaviour is undesirable, since the caller doesn't necessarily
know in advance whether the sketch has values that are diverse
enough. With this change, the post-aggregators return [N, 0, 0...]
instead of crashing, where N is the number of values in the sketch,
and the length of the list is equal to numBins. That is what they
already returned for numBins = 2.

Here is an example of a query that would fail:

    {"queryType":"timeseries",
     "dataSource": {
       "type": "inline",
       "columnNames": ["foo", "bar"],
       "rows": [
          ["abc", 42.0],
          ["def", 42.0]
       ]
     },
     "intervals":["0000/3000"],
     "granularity":"all",
     "aggregations":[
       {"name":"the_sketch", "fieldName":"bar", "type":"quantilesDoublesSketch"}],
     "postAggregations":[
       {"name":"the_histogram",
        "type":"quantilesDoublesSketchToHistogram",
        "field":{"type":"fieldAccess","fieldName":"the_sketch"},
        "numBins": 3}]}

I believe this also fixes issue apache#10585.
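A toy model of the fixed behavior, as an illustrative sketch rather than the actual DataSketches or Druid implementation: when every value in the sketch is identical, evenly spaced split points collapse and cannot be strictly increasing, so the post-aggregator returns `[N, 0, 0, ...]` instead of failing.

```python
def sketch_histogram(values, num_bins=10, split_points=None):
    """Illustrative model of the quantiles-sketch histogram
    post-aggregator (not Druid's actual implementation)."""
    n = len(values)
    if n == 0:
        return [0] * num_bins
    lo, hi = min(values), max(values)
    if split_points is None:
        if lo == hi:
            # All values equal: evenly spaced split points would not be
            # strictly increasing, which the DataSketches library rejects.
            # Return [N, 0, 0, ...] instead of failing, matching the fix.
            return [n] + [0] * (num_bins - 1)
        step = (hi - lo) / num_bins
        split_points = [lo + step * i for i in range(1, num_bins)]
    counts = [0] * (len(split_points) + 1)
    for v in values:
        idx = sum(1 for p in split_points if v >= p)
        counts[idx] += 1
    return counts
```

For the inline datasource above (`bar` is 42.0 in both rows), `sketch_histogram([42.0, 42.0], num_bins=3)` now yields `[2, 0, 0]` rather than an error.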
Fix an outdated query in the deep storage docs.
…pache#14995)

* Prevent a race that may cause multiple attempts to publish segments for the same sequence
The TaskQueue maintains a map of active task ids to tasks, which can be utilized to get active task payloads, before falling back to the metadata store.
Fixed the following flaky tests:

    org.apache.druid.math.expr.ParserTest#testApplyFunctions
    org.apache.druid.math.expr.ParserTest#testSimpleMultiplicativeOp1
    org.apache.druid.math.expr.ParserTest#testFunctions
    org.apache.druid.math.expr.ParserTest#testSimpleLogicalOps1
    org.apache.druid.math.expr.ParserTest#testSimpleAdditivityOp1
    org.apache.druid.math.expr.ParserTest#testSimpleAdditivityOp2

The tests above have been reported as flaky (they assume a deterministic implementation of a non-deterministic specification) when run against the NonDex tool.
The tests contain assertions that compare a List created from a HashSet via the ArrayList constructor with another List. However, HashSet does not guarantee element ordering, so when the NonDex tool shuffles the HashSet elements, the tests fail.

Co-authored-by: ythorat2 <[email protected]>
This patch introduces a param snapshotTime in the iceberg inputsource spec that allows the user to ingest data files associated with the most recent snapshot as of the given time. This helps the user ingest data based on older snapshots by specifying the associated snapshot time.
This patch also upgrades the iceberg core version to 1.4.1
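An inputsource spec using the new parameter might look roughly like the following. This is a sketch: field names other than `snapshotTime` are assumptions based on typical Druid inputsource specs, so consult the Iceberg extension docs for the exact schema.

```json
{
  "type": "iceberg",
  "namespace": "sales_db",
  "tableName": "orders",
  "snapshotTime": "2023-11-01T00:00:00.000Z"
}
```

Data files associated with the most recent snapshot as of `snapshotTime` would be ingested, rather than those of the latest snapshot.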
…d json_query (apache#15320)

* support dynamic expressions for path arguments for json_value and json_query
* reset spec before looking for tile

* improve logging

* log screenshots

* get and log jpeg

* other test tidy up
* Set numCorePartitions to 0 in the TombstoneShardSpec.

* fix up test

* Add tombstone core partition tests

* review comment

* Need to register the test shard type to make jackson happy
…g-segment retry bug. (apache#15260)

* Fix NPE caused by realtime segment closing race, fix possible missing-segment retry bug.

Fixes apache#12168, by returning empty from FireHydrant when the segment is
swapped to null. This causes the SinkQuerySegmentWalker to use
ReportTimelineMissingSegmentQueryRunner, which causes the Broker to look
for the segment somewhere else.

In addition, this patch changes SinkQuerySegmentWalker to acquire references
to all hydrants (subsegments of a sink) at once, and return a
ReportTimelineMissingSegmentQueryRunner if *any* of them could not be acquired.
I suspect, although have not confirmed, that the prior behavior could lead to
segments being reported as missing even though results from some hydrants were
still included.

* Some more test coverage.
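The all-or-nothing acquisition described above can be sketched as follows. This is a minimal illustration with hypothetical `Hydrant` and `Ref` types, not Druid's actual SinkQuerySegmentWalker code: if any sub-segment reference cannot be acquired, everything already acquired is released and the caller reports the segment missing.

```python
class Ref:
    """Hypothetical reference handle to a hydrant's segment."""
    def __init__(self):
        self.released = False

    def release(self):
        self.released = True

class Hydrant:
    """Hypothetical sub-segment of a sink; acquisition fails once the
    underlying segment has been swapped out."""
    def __init__(self, available=True):
        self.available = available

    def try_acquire(self):
        return Ref() if self.available else None

def acquire_all(hydrants):
    """All-or-nothing acquisition: either hold references to every
    hydrant of a sink, or hold none and report the segment missing."""
    acquired = []
    for h in hydrants:
        ref = h.try_acquire()      # None if the segment was swapped out
        if ref is None:
            for r in acquired:     # roll back everything acquired so far
                r.release()
            return None            # caller reports a missing segment
        acquired.append(ref)
    return acquired
```

Acquiring every reference up front avoids the prior behavior in which some hydrants had already contributed results while others were reported missing.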
Merge branch 'master' into dm-doc-refactor
@vtlim vtlim merged commit 6ed343c into apache:master Nov 20, 2023
11 checks passed
yashdeep97 added a commit to yashdeep97/druid that referenced this pull request Dec 1, 2023
Co-authored-by: Victoria Lim <[email protected]>
Co-authored-by: George Shiqi Wu <[email protected]>
Co-authored-by: 317brian <[email protected]>
Co-authored-by: ythorat2 <[email protected]>
Co-authored-by: Krishna Anandan <[email protected]>
Co-authored-by: Vadim Ogievetsky <[email protected]>
Co-authored-by: Abhishek Radhakrishnan <[email protected]>
Co-authored-by: Karan Kumar <[email protected]>
Co-authored-by: Rishabh Singh <[email protected]>
Co-authored-by: Magnus Henoch <[email protected]>
Co-authored-by: AmatyaAvadhanula <[email protected]>
Co-authored-by: Charles Smith <[email protected]>
Co-authored-by: Yashdeep Thorat <[email protected]>
Co-authored-by: Atul Mohan <[email protected]>
Co-authored-by: Clint Wylie <[email protected]>
Co-authored-by: Gian Merlino <[email protected]>
@LakshSingla LakshSingla added this to the 29.0.0 milestone Jan 29, 2024