Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support for federated clusters #15181

Closed
wants to merge 4 commits into from

Conversation

599166320
Copy link
Contributor

@599166320 599166320 commented Oct 17, 2023

Fixes #14535.

Description

  1. Druid provides a friendly and unified query gateway for users across multiple data centers and clusters through cluster federation queries.

  2. As the number of nodes and metadata increases, a single Druid cluster can become excessively large, leading to mutual interference among tasks and deteriorating scheduling performance. Implementing federated queries can help avoid such issues. It allows for breaking down large clusters into relatively independent ones as needed, making scheduling more agile and lightweight.


Key changed/added classes in this PR
  • QueryContexts
  • BrokerServerView
  • CachingClusteredClient
  • TimelineServerView

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@599166320 599166320 changed the title Feature federated cluster support for federated clusters Oct 17, 2023
@cryptoe
Copy link
Contributor

cryptoe commented Oct 19, 2023

@599166320
How big is the deployment you are working witch is causing things like

 leading to mutual interference among tasks and deteriorating scheduling performance.

I really donot understand what this means. Are you talking about ingestion tasks or query scheduling.

The change in the current form looks hackish .
I think its a better design pattern that one cluster owns one data source if you really want to break things up since then you can configure load/rules compaction etc only on one cluster for a data source. Ingestion also gets simpler.

@599166320
Copy link
Contributor Author

@cryptoe
Thank you for your response. For the second point, I also hesitated whether to include it. However, I decided to include it to see what everyone thinks.

Usually, we deploy a Druid cluster in one data center, with approximately a hundred servers in each data center. The entire cluster is quite stable. However, we've noticed that the master node operates in a master-slave configuration, storing a significant amount of metadata and handling heavy scheduling tasks. In theory, there might be bottlenecks, so we wanted to bring it up for discussion.

Of course, what we are more concerned about is the cost of dedicated network traffic.

@599166320
Copy link
Contributor Author

@cryptoe
Is there a better way to implement the practical application of federated queries? @abhishekagarwal87 mentioned that it might potentially break certain protocols. In my opinion, the native query for Historical and Broker currently shares the code using QueryResource as the entry point, and the parameter structures are almost identical.

However, if strict constraints are necessary, Query.queryContext might require some improvements. What are your thoughts on this?

@cryptoe
Copy link
Contributor

cryptoe commented Oct 27, 2023

Things like lookups, post aggregators stuff with the current approach needs to be thought through.

I think the correct way to do it would be to use something like https://github.com/lyft/presto-gateway and pass a query context to select the correct cluster you want or make some mapping to data source -> cluster on this gateway nodes.

@cryptoe
Copy link
Contributor

cryptoe commented Nov 6, 2023

Another thing I was thinking is how order by's would look. The broker expects things to be sorted by the grouping key and then sorts stuff on the order by key IIRC.
In this case the cluster 2 broker will return the rows already sorted on the order by key, which will break the merging logic of the grouping keys on the broker for cluster 1.

@599166320
Copy link
Contributor Author

@cryptoe According to your description, are you concerned about a sorting like the one below?

SELECT
  COUNT(*) c,
  regionName
FROM wikipedia
GROUP BY regionName
ORDER BY regionName, c DESC
LIMIT 10

In fact, when a query like this is forwarded from one broker to another broker in a cluster, the LIMIT 10 part is removed. It will be transformed into a native query similar to the one below:

{
    "queryType": "groupBy",
    "dataSource":
    {
        "type": "table",
        "name": "wikipedia"
    },
    "intervals":
    {
        "type": "intervals",
        "intervals":
        [
            "-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"
        ]
    },
    "granularity":
    {
        "type": "all"
    },
    "dimensions":
    [
        {
            "type": "default",
            "dimension": "regionName",
            "outputName": "d0",
            "outputType": "STRING"
        }
    ],
    "aggregations":
    [
        {
            "type": "count",
            "name": "a0"
        }
    ],
    "limitSpec":
    {
        "type": "NoopLimitSpec"
    },
    "context":
    {
        "applyLimitPushDown": false,
        "defaultTimeout": 300000,
        "federatedClusterBrokers": "",
        "finalize": false,
        "fudgeTimestamp": "-4611686018427387904",
        "groupByOutermost": false,
        "groupByStrategy": "v2",
        "maxQueuedBytes": 5242880,
        "maxScatterGatherBytes": 9223372036854775807,
        "queryFailTime": 1699369214285,
        "queryId": "605a751b-f0ee-43fe-a754-b702035622df",
        "resultAsArray": true,
        "sqlQueryId": "e8bed46a-f2de-4fc3-89c6-febfe048debc",
        "timeout": 29981
    }
}

So, you don't need to worry about this sorting issue.

Copy link

github-actions bot commented Mar 9, 2024

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the [email protected] list.
Thank you for your contributions.

@github-actions github-actions bot added the stale label Mar 9, 2024
Copy link

github-actions bot commented Apr 7, 2024

This pull request/issue has been closed due to lack of activity. If you think that
is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Apr 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

support for federated clusters
3 participants