Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add metric -- count of queries waiting for merge buffers #15025

Conversation

kaisun2000
Copy link
Contributor

@kaisun2000 kaisun2000 commented Sep 22, 2023

Description

Currently there is no metric to indicate the contention level for the shared resource -- merge buffer pools. Thus, it is hard to know that the query performance suffers from the contention for shared merge buffer pools. We did meet such performance degradation issue in the production environment.

This PR exposed the count of waiting queries (threads) blocking at in the merge buffers pools.

Summary of the changes for a quick reference:

  • BlockingPool interface was updated with a new method getPendingQueries to return the count of pending queries.
  • DefaultBlockingPool class was updated with the implementation of gathering the number of pending queries with an AtomicLong counter.
  • MergeBufferPoolMonitor class was created with the shared merge buffer pool (annotated with @Merging) injected. User can add this monitor to the configuration file such as druid.monitoring.monitors=["org.apache.druid.java.util.metrics.MergeBufferPoolMonitor", ...]
  • Unit test is added. This unit test validated the MergeBufferPoolMonitor would emit correct count of pending queries with a query blocking at the DefaultBlockingPool by taking more than the number of available buffers in the DefaultBlockingPool

Release note

NEW: You can now add a metric mergeBuffer\pendingRequests to monitor the contention level of the merge buffer pool. This metric is exposed through the QueryCountStatsMonitor


Key changed/added classes in this PR
  • You can monitor the number of pending queries at the merge buffer pool with the metric mergeBuffer/pendingRequest
  • Add the MergeBufferPoolMonitor to the configuration files of Broker, Historical or Router as druid.monitoring.monitors=["org.apache.druid.java.util.metrics.MergeBufferPoolMonitor", ...]

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Copy link
Contributor

@suneet-s suneet-s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your first contribution!

I took a quick pass of the code and have some suggestions - specifically in the MergeBufferPoolMonitor that should make it easier to deploy.

Since this monitor runs only once on every emission period, the metrics only provide a snapshot in time of how many queries were waiting on merge buffers at that instant. Have you considered reporting metrics like mean, P98 and max time spent waiting on merge buffers instead?

That should provide visibility into how long queries are taking to acquire merge buffers on all the different services.

Another metric that may be interesting is the number of queries that timed out while waiting to acquire merge buffers.

Copy link
Contributor

@LakshSingla LakshSingla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the first PR! Left a few review comments.
Also, we should add the monitor to docs/configuration/index.md along with the other monitors.


@Inject
public MergeBufferPoolMonitor(
@Merging BlockingPool<ByteBuffer> mergeBufferPoolIn
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The monitor itself shouldn't care about the actual blocking pool. It should have a "stats provider" (interface) which exposes the method getPendingQueries.

You can take a look at the example: QueryCountStatsProvider and QueryResource for the pattern. You will need to wire it up in Guice as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LakshSingla is there a reason you want to do it that way? what structure do you have in mind exactly?

@abhishekagarwal87
Copy link
Contributor

We don't need to add another monitor. The new metrics can be emitted from the QueryCountStatsMonitor.

@kaisun2000
Copy link
Contributor Author

kaisun2000 commented Sep 27, 2023

We don't need to add another monitor. The new metrics can be emitted from the QueryCountStatsMonitor.

@abhishekagarwal87, just want to pan out the overall structure of add this metric from the QueryCountStatsMonitor and confirm the design before I make the changes.

QueryCountStatsMonitor takes a QueryCountStatsProvider in its constructor to provide the QueryCountStats (aggregate stats instead of query specific stats). The QueryCountStatsProvider is implemented as QueryResource (which implements the logic for query REST endpoints).

So here, I can add a parameter in QueryResource constructor as @Merging BlockingPool<ByteBuffer> mergeBufferPoolIn. And let Guice to inject the shared merge buffer pool. This way, we can reuse the same monitor.

In the interface of QueryCountStatsProvider, we need to add a method to report the pending query count such as public long getPendingQueriesCount()

We will also need to keep the same changes in this PR to add a method in BlockingPool to report the pending queries in the pool, as public long getPendingQueries()

Let me know if I missed anything. Or next I will update the PR to reflect this changes discussed here.

@abhishekagarwal87
Copy link
Contributor

@kaisun2000 - How about injecting BlockingPool directly into the QueryCountStatsMonitor? Though seems like Laksh has objections to injecting pool into a monitor. I will let him chime in too.

@kaisun2000
Copy link
Contributor Author

kaisun2000 commented Oct 2, 2023

@kaisun2000 - How about injecting BlockingPool directly into the QueryCountStatsMonitor? Though seems like Laksh has objections to injecting pool into a monitor. I will let him chime in too.

@abhishekagarwal87, thx for coordinating this. @LakshSingla , let me know your take here.

In this case, I will inject the BlockinPool to QueryCountStatsMonitor and not touching the QueryCountStatsProvider interface. Will send the revised PR next.

@LakshSingla
Copy link
Contributor

Though seems like Laksh has objections to injecting pool into a monitor. I will let him chime in too.

I had concerns with the monitor directly interacting with the merge buffer pool, instead of the stats provider that it was supposed to work with. However, I don't have a strong stance on the same, since its mostly internal implementation. The changes.

Copy link
Contributor

@LakshSingla LakshSingla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM left a few comments. TY for taking this up!

Copy link
Contributor

@LakshSingla LakshSingla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code coverage issues, however, the coverage looks good to me. The insufficiency is due to the DummyBlockingPool not being tested anywhere, which is okay to me.

LGTM 🚀

@LakshSingla
Copy link
Contributor

ty for the contribution, I am merging the PR. Can you please add a release note to the description @kaisun2000? (a few line user-facing summary of the change) It helps in prepping up the release notes and calling out user-facing changes.

@LakshSingla LakshSingla merged commit e2cc1c4 into apache:master Oct 9, 2023
63 of 65 checks passed
@kaisun2000
Copy link
Contributor Author

@LakshSingla, just added the release notes section. Thx for help to get this PR reviewed and checked in.

@abhishekagarwal87
Copy link
Contributor

thank you for your contribution @kaisun2000

@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023
ektravel pushed a commit to ektravel/druid that referenced this pull request Oct 16, 2023
Add 'mergeBuffer/pendingRequests' metric that exposes the count of waiting queries (threads) blocking in the merge buffers pools.
CaseyPan pushed a commit to CaseyPan/druid that referenced this pull request Nov 17, 2023
Add 'mergeBuffer/pendingRequests' metric that exposes the count of waiting queries (threads) blocking in the merge buffers pools.
LakshSingla pushed a commit that referenced this pull request Sep 3, 2024
…des (#16992)

#15025 adds mergeBuffer/pendingRequests metric in QueryCountStatsMonitor. Since real-time nodes also use the same merge buffers for queries and have QueryCountStatsMonitor , the documentation is being updated to include this metric.
airlock-confluentinc bot pushed a commit to confluentinc/druid that referenced this pull request Sep 4, 2024
)

* Add metric -- count of queries waiting for merge buffers (apache#15025)

Add 'mergeBuffer/pendingRequests' metric that exposes the count of waiting queries (threads) blocking in the merge buffers pools.

* Update addMetric to build

---------

Co-authored-by: kaisun2000 <[email protected]>
hardikbajaj added a commit to confluentinc/druid that referenced this pull request Sep 4, 2024
) (#223)

* Add metric -- count of queries waiting for merge buffers (apache#15025)

Add 'mergeBuffer/pendingRequests' metric that exposes the count of waiting queries (threads) blocking in the merge buffers pools.

* Update addMetric to build

---------

Co-authored-by: kaisun2000 <[email protected]>
edgar2020 pushed a commit to edgar2020/druid that referenced this pull request Sep 5, 2024
…des (apache#16992)

apache#15025 adds mergeBuffer/pendingRequests metric in QueryCountStatsMonitor. Since real-time nodes also use the same merge buffers for queries and have QueryCountStatsMonitor , the documentation is being updated to include this metric.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants