Add CHYT #224
Thanks for the PR.
It would be great if the benchmark script could be more "automatic", i.e. download the database, configure it, start it, import the data, and run the scripts without user intervention. It will make it much easier for "outsiders" like me to reproduce the results.
@@ -0,0 +1,14 @@
#### CHYT powered by ClickHouse

1. Install YTsaurus cluster. Visit [YTsaurus Getting started webpage](https://ytsaurus.tech/docs/en/overview/try-yt)
As far as I understand, the code is open-source, right? Ideally, the script benchmark.sh does as much setup as possible automatically, i.e. without user intervention. For examples of how to do that, please see clickhouse/benchmark.sh, postgresql/benchmark.sh and duckdb/benchmark.sh.
Yes, YTsaurus is an open-source system, but CHYT is only a small part of it. benchmark.sh runs the benchmark against a pre-installed cluster with the default clique.
A YTsaurus cluster can be installed, for example, using the k8s operator.
All possible variants are described in the documentation.
As I mentioned, we need to reduce the variability here ... As someone who wants to verify the benchmark results, I would like to run benchmark.sh and have it install everything by itself. The only thing that I should be able to choose is the hardware the system runs on.
We can create a demo cluster for anyone who wants to try YTsaurus. Would it be OK to use such a cluster to verify the results?
One installation option is Docker. That seems the alternative with the least amount of complexity and the best reproducibility (compared to k8s and the demo cluster).
My preference would be if benchmark.sh sets up the docker container, does other preparations, and then runs the measurements.
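The flow described above (set up the container, do other preparations, run the measurements) can be sketched as a fail-fast list of steps. This is only an illustration of the structure, not the PR's actual script: the step names and the placeholder commands are hypothetical, and the real Docker and YTsaurus commands would have to come from the YTsaurus documentation.

```python
import subprocess
import sys


def placeholder(message):
    """Build a harmless stand-in command that only prints a message."""
    return [sys.executable, "-c", f"print({message!r})"]


# Hypothetical steps: every command here is a placeholder, not a real
# Docker or YTsaurus invocation.
STEPS = [
    ("start container", placeholder("start container (placeholder)")),
    ("load data", placeholder("load data (placeholder)")),
    ("run queries", placeholder("run queries (placeholder)")),
]


def run_steps(steps):
    """Run each step in order and stop at the first failing command."""
    completed = []
    for name, command in steps:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps never run against a half-prepared system.
        subprocess.run(command, check=True)
        completed.append(name)
    return completed


if __name__ == "__main__":
    run_steps(STEPS)
```

Stopping at the first failure keeps a broken setup from silently producing measurements, which is the property that matters for reproducibility.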
I guess I still don't understand what is really being measured here.
If the results json files in this PR refer to measurements for a locally set up cluster and different "clique" sizes: In that case, please add deterministic setup instructions (ideally using Docker) to benchmark.sh. Also, the term "serverless" is confusing as it is used in ClickBench for (commercial) database-as-a-service offerings - please remove this term. Please specify the exact machine specs instead (CPU, RAM). Instead of five different measurement sets that were seemingly created using five different "clique" sizes, it would be good to keep it simpler, e.g. two sets of measurements.
If the results json files in this PR refer to measurements for a commercial DBaaS offering with different t-shirt sizes, then please describe the steps needed to set up such a cluster in README.md.
Thanks.
Deploying a cluster with 360 vCPUs and 720 GB of RAM using a single VM can be quite challenging. In our case, we use a Kubernetes cluster with nodes of the type c6a.8xlarge and network SSDs that perform similarly to gp2 volumes.
We also aim to demonstrate various cluster sizes, not just the smallest one. If necessary, we can remove the "serverless" tag and instead specify the number of CHYT instances in the configuration.
Given the large size of our cluster, Docker deployment was not utilized for benchmarking, as it may yield different results. The easiest way to reproduce our results is by booking a demo cluster through our website.
Additionally, I can include a step-by-step guide in the README.md file to assist with the setup.
Okay, let's add a step-by-step guide to the README. Afterwards I'll try my best to reproduce the results, and then I will merge.
{
    "system": "CHYT",
    "date": "2024-09-16",
    "machine": "192GB",
I am confused. L. 6 says "serverless" which typically means the results were measured in a database-as-a-service offering (such as ClickHouse Cloud). Was that the case?
If not, it would be good to specify the exact machine specs for reproducibility, see e.g. duckdb/results/c5.4xlarge.json.
For CHYT with 48, 96 and 192 GB, we use 1, 2 and 4 instances, each with 12 vCPUs and 48 GB of RAM.
For CHYT with 360 and 720 GB, we use 9 and 18 instances, each with 10 vCPUs and 40 GB of RAM.
You only configure the number and size of the instances; YTsaurus then schedules them across the compute nodes of the cluster.
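As a quick check of the arithmetic behind these labels, the totals can be derived from the instance counts and per-instance sizes quoted above. The sketch below uses only those figures; the class and variable names are made up for illustration and are not part of CHYT or ClickBench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CliqueConfig:
    """One benchmark configuration (illustrative name, not a CHYT type)."""
    instances: int
    vcpu_per_instance: int
    ram_gb_per_instance: int

    @property
    def total_vcpu(self) -> int:
        return self.instances * self.vcpu_per_instance

    @property
    def total_ram_gb(self) -> int:
        return self.instances * self.ram_gb_per_instance


# Figures taken from the comment above: instance count, vCPUs and GB of RAM
# per instance.
CONFIGS = [
    CliqueConfig(1, 12, 48),
    CliqueConfig(2, 12, 48),
    CliqueConfig(4, 12, 48),
    CliqueConfig(9, 10, 40),
    CliqueConfig(18, 10, 40),
]

for cfg in CONFIGS:
    print(
        f"{cfg.instances:>2} instances x "
        f"({cfg.vcpu_per_instance} vCPU, {cfg.ram_gb_per_instance} GB) = "
        f"{cfg.total_vcpu} vCPU, {cfg.total_ram_gb} GB RAM"
    )
```

The RAM totals match the five result labels (48, 96, 192, 360 and 720 GB), and the vCPU totals are what the exact machine specs requested in the review would need to state for the CHYT instances.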
Resolves: #119
This PR adds CHYT (ClickHouse over YTsaurus) results and benchmark scripts for ClickBench.