
[clickhouse] Clickana monitoring dashboard tool #7207

Merged: 41 commits merged into oxidecomputer:main, Dec 19, 2024

Conversation

@karencfv (Contributor) commented Dec 5, 2024

Overview

As part of Stage 1 of RFD 468, we'll be observing how a ClickHouse cluster behaves in comparison with a single-node server. This commit introduces a basic tool that lets us visualize internal ClickHouse metric information.

As a starting point, Clickana has only four charts, and the user cannot choose which ones they are. Additionally, it can only render data by making API calls. I'd like to make the tool more flexible; other capabilities will be added in follow-up PRs.

Usage

clickana --help                                    
Usage: clickana [OPTIONS] --clickhouse-addr <CLICKHOUSE_ADDR>

Options:
  -l, --log-path <LOG_PATH>                    Path to the log file [env: CLICKANA_LOG_PATH=] [default: /tmp/clickana.log]
  -a, --clickhouse-addr <CLICKHOUSE_ADDR>      Address where a clickhouse admin server is listening on
  -s, --sampling-interval <SAMPLING_INTERVAL>  The interval to collect monitoring data in seconds [default: 60]
  -t, --time-range <TIME_RANGE>                Range of time to collect monitoring data in seconds [default: 3600]
  -r, --refresh-interval <REFRESH_INTERVAL>    The interval at which the dashboards will refresh [default: 60]
  -h, --help                                   Print help

Manual Testing

root@oxz_clickhouse_015f9c34:~# /opt/oxide/clickana/bin/clickana -a [fd00:1122:3344:101::e]:8888
[Screenshot: Clickana dashboard, 2024-12-12 4:11 PM]

Next Steps

  • Let the user choose which metrics they would like to visualise in each chart, perhaps through a TOML file. We could also let them choose which unit to represent each metric in.
  • Have more metrics available.
  • It'd be nice to also be able to take the timeseries as JSON instead of calling the API. This could be useful in the future for gaining insight into our customers' racks for debugging purposes: we could include ClickHouse internal metric timeseries as part of the support bundles, and they could be visualised via Clickana. WDYT @smklein ?

Related: #6953

@@ -1200,7 +1200,7 @@ pub enum Timestamp {
 #[derive(Debug, Serialize, Deserialize, JsonSchema, PartialEq)]
 #[serde(rename_all = "snake_case")]
 pub struct SystemTimeSeries {
-    pub time: Timestamp,
+    pub time: String,

@karencfv (author) commented

This is seriously doing my head in. Since Timestamp is an untagged enum, serde is having a hard time deserializing it. My custom deserializer didn't work, but I'll see if I can find a way.
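
(For context, a minimal sketch of why an untagged enum is painful to deserialize here; the field and variant names below are illustrative assumptions, not the actual omicron definitions.)

use serde::Deserialize;

// Illustrative only: not the real Timestamp/SystemTimeSeries definitions.
#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum Timestamp {
    Unix(u64),
    Iso(String),
}

#[derive(Debug, Deserialize)]
struct SystemTimeSeries {
    time: Timestamp,
    value: f64,
}

fn main() {
    // serde tries the variants in order and keeps the first that matches,
    // so this picks Unix(1734505320)...
    let ok: SystemTimeSeries =
        serde_json::from_str(r#"{"time": 1734505320, "value": 0.5}"#).unwrap();
    println!("{ok:?}");

    // ...but when nothing matches, the error is only "data did not match any
    // variant of untagged enum Timestamp", which makes the real payload
    // format hard to debug.
    let err =
        serde_json::from_str::<SystemTimeSeries>(r#"{"time": null, "value": 0.5}"#);
    println!("{err:?}");
}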

@karencfv (author) commented

Almost there...

[Screenshot, 2024-12-10 9:00 PM]

Comment on lines +1601 to +1614

    // The ClickHouse client connects via the TCP port
    let ch_address = {
        let mut addr = *address;
        addr.set_port(CLICKHOUSE_TCP_PORT);
        addr.to_string()
    };

    let clickhouse_admin_config =
        PropertyGroupBuilder::new("config")
            .add_property("http_address", "astring", admin_address)
            .add_property(
                "ch_address",
                "astring",
-               address.to_string(),
+               ch_address.to_string(),

@karencfv (author) commented

Not really sure when this broke, but it wouldn't have been caught as nothing was calling clickhouse_cli (a wrapper around the clickhouse client command) yet.

@karencfv marked this pull request as ready for review December 12, 2024 03:40
@karencfv commented Dec 18, 2024

Just want to make a clarification: the timeseries rendered in each dashboard are per ClickHouse server, not the whole cluster aggregated into a single view. I misremembered the system tables' engines; they're MergeTree (not ReplicatedMergeTree as I thought), so they're unique to each server.
Frankly, it's probably more useful this way. If there are unexpected discrepancies between the nodes, we'll be able to identify them more easily.

@karencfv commented

Ok, so the docs aren't super great, but I've been playing around with this to confirm. They are definitely per server.

oximeter_cluster-1 :) SELECT toStartOfInterval(event_time, INTERVAL 60 SECOND) AS t, avg(ProfileEvent_Query)
FROM system.metric_log
WHERE event_date >= toDate(now() - 86400) AND event_time >= now() - 86400
GROUP BY t
ORDER BY t WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'

SELECT
    toStartOfInterval(event_time, toIntervalSecond(60)) AS t,
    avg(ProfileEvent_Query)
FROM system.metric_log
WHERE (event_date >= toDate(now() - 86400)) AND (event_time >= (now() - 86400))
GROUP BY t
ORDER BY t ASC WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'

Query id: a9eed161-d54c-4d11-b23a-21aeca60ef28

┌────────────────────t─┬─avg(ProfileEvent_Query)─┐
│ 2024-12-18T07:02:00Z │                       0 │
│ 2024-12-18T07:03:00Z │                       0 │
│ 2024-12-18T07:04:00Z │      0.2833333333333333 │
│ 2024-12-18T07:05:00Z │      0.5666666666666667 │
│ 2024-12-18T07:06:00Z │      0.6666666666666666 │
│ 2024-12-18T07:07:00Z │                    0.45 │
│ 2024-12-18T07:08:00Z │      0.2833333333333333 │
│ 2024-12-18T07:09:00Z │                       0 │
│ 2024-12-18T07:10:00Z │                       0 │
│ 2024-12-18T07:11:00Z │                       0 │
│ 2024-12-18T07:12:00Z │     0.03333333333333333 │
│ 2024-12-18T07:13:00Z │    0.016666666666666666 │
│ 2024-12-18T07:14:00Z │                       0 │
│ 2024-12-18T07:15:00Z │                       0 │
│ 2024-12-18T07:16:00Z │     0.18333333333333332 │
│ 2024-12-18T07:17:00Z │    0.016666666666666666 │
│ 2024-12-18T07:18:00Z │     0.43333333333333335 │
│ 2024-12-18T07:19:00Z │    0.016666666666666666 │
│ 2024-12-18T07:20:00Z │                       0 │
│ 2024-12-18T07:21:00Z │                       0 │
│ 2024-12-18T07:22:00Z │                       0 │
│ 2024-12-18T07:23:00Z │                       0 │
│ 2024-12-18T07:24:00Z │    0.016666666666666666 │
│ 2024-12-18T07:25:00Z │    0.016666666666666666 │
│ 2024-12-18T07:26:00Z │      0.6333333333333333 │
│ 2024-12-18T07:27:00Z │                       0 │
│ 2024-12-18T07:28:00Z │                       0 │
└──────────────────────┴─────────────────────────┘

27 rows in set. Elapsed: 0.009 sec. Processed 1.57 thousand rows, 10.60 KB (176.72 thousand rows/s., 1.19 MB/s.)
Peak memory usage: 51.12 KiB.
oximeter_cluster-2 :) SELECT toStartOfInterval(event_time, INTERVAL 60 SECOND) AS t, avg(ProfileEvent_Query)
FROM system.metric_log
WHERE event_date >= toDate(now() - 86400) AND event_time >= now() - 86400
GROUP BY t
ORDER BY t WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'

SELECT
    toStartOfInterval(event_time, toIntervalSecond(60)) AS t,
    avg(ProfileEvent_Query)
FROM system.metric_log
WHERE (event_date >= toDate(now() - 86400)) AND (event_time >= (now() - 86400))
GROUP BY t
ORDER BY t ASC WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'

Query id: 37a3b84f-8844-40b5-ac3a-df4b60ba7b1a

┌────────────────────t─┬─avg(ProfileEvent_Query)─┐
│ 2024-12-18T07:02:00Z │                       0 │
│ 2024-12-18T07:03:00Z │                       0 │
│ 2024-12-18T07:04:00Z │      0.2833333333333333 │
│ 2024-12-18T07:05:00Z │      0.5666666666666667 │
│ 2024-12-18T07:06:00Z │      0.5666666666666667 │
│ 2024-12-18T07:07:00Z │      0.2833333333333333 │
│ 2024-12-18T07:08:00Z │      0.2833333333333333 │
│ 2024-12-18T07:09:00Z │                       0 │
│ 2024-12-18T07:10:00Z │                       0 │
│ 2024-12-18T07:11:00Z │                       0 │
│ 2024-12-18T07:12:00Z │     0.03333333333333333 │
│ 2024-12-18T07:13:00Z │    0.016666666666666666 │
│ 2024-12-18T07:14:00Z │                       0 │
│ 2024-12-18T07:15:00Z │                       0 │
│ 2024-12-18T07:16:00Z │    0.016666666666666666 │
│ 2024-12-18T07:17:00Z │                       0 │
│ 2024-12-18T07:18:00Z │     0.03333333333333333 │
│ 2024-12-18T07:19:00Z │    0.016666666666666666 │
│ 2024-12-18T07:20:00Z │                       0 │
│ 2024-12-18T07:21:00Z │                       0 │
│ 2024-12-18T07:22:00Z │                       0 │
│ 2024-12-18T07:23:00Z │                       0 │
│ 2024-12-18T07:24:00Z │    0.016666666666666666 │
│ 2024-12-18T07:25:00Z │    0.016666666666666666 │
│ 2024-12-18T07:26:00Z │                       0 │
│ 2024-12-18T07:27:00Z │                       0 │
│ 2024-12-18T07:28:00Z │                       0 │
└──────────────────────┴─────────────────────────┘

27 rows in set. Elapsed: 0.021 sec. Processed 2.75 thousand rows, 16.81 KB (134.15 thousand rows/s., 818.93 KB/s.)
Peak memory usage: 64.65 KiB.

@andrewjstone (Contributor) left a comment

Looks good @karencfv.

Just some relatively minor suggestions from me.

results.len()
);
}
// TODO: Eventually we may want to not have a set amount of charts and make the
@andrewjstone commented

For future PRs: I think it would be cool to have a little menu of charts on the side of the pane, so you can scroll and select which ones to show without having to restart the app or mess with a TOML file.

You could also allow toggling between a set of predefined layouts to make it always look nice. So you could show 1, 2, 4, 6, or 8 charts, say, and allow selecting which to show in each view. You could even remember which charts to show in each layout, so you could toggle back and forth between different layouts and see all the charts, some with more detail.

@karencfv (author) replied

oooooohhhhh nice!! I like that idea. I'll add your comment in the TODO

let s = self.clone();
let c = client.clone();

let task = tokio::spawn(async move {
@andrewjstone commented

While this works, it seems somewhat heavy-handed to spawn a task to get API data in parallel for each chart and then immediately join to wait for them all. Spawn is typically used for longer-running tasks that stay around.

A more common way to do this, when you want concurrency but don't need to leave the current thread, is to use FuturesUnordered. https://betterprogramming.pub/futuresunordered-an-efficient-way-to-manage-multiple-futures-in-rust-a24520abc3f6 has a pretty good overview.

Using FuturesUnordered would also remove the need to clone self and client, as they can just be borrowed immutably.
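
(A minimal sketch of that pattern, assuming a placeholder fetch_timeseries function; the signatures and metric names below are examples, not the actual Clickana/clickhouse-admin API.)

use futures::executor::block_on;
use futures::stream::{FuturesUnordered, StreamExt};

// Placeholder for whatever call fetches the data behind one chart.
async fn fetch_timeseries(metric: &str) -> Result<Vec<f64>, String> {
    let _ = metric;
    Ok(Vec::new())
}

async fn fetch_all(metrics: &[&str]) -> Vec<Result<Vec<f64>, String>> {
    // The futures run concurrently on the current task; no tokio::spawn,
    // so nothing needs to be cloned into a 'static closure.
    let mut in_flight: FuturesUnordered<_> =
        metrics.iter().map(|m| fetch_timeseries(m)).collect();

    let mut results = Vec::with_capacity(metrics.len());
    while let Some(res) = in_flight.next().await {
        results.push(res);
    }
    results
}

fn main() {
    let results =
        block_on(fetch_all(&["ProfileEvent_Query", "CurrentMetric_MemoryTracking"]));
    println!("{results:?}");
}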

@karencfv (author) replied

Nice! Thanks for the tip

let log = self.new_logger()?;
let client = ClickhouseServerClient::new(&admin_url, log.clone());

let tick_rate = Duration::from_secs(self.refresh_interval);
@andrewjstone commented

This is not a "rate" but a duration. I'd suggest renaming it to tick_interval.

};
use std::fmt::Display;

const GIBIBYTE_F64: f64 = 1073741824.0;
@andrewjstone commented

It seems really odd to me to represent a number of bytes with floats, as byte counts are always whole numbers.

I realize that ClickHouse returns floats for timeseries, but for types where it makes sense I think we should normalize those to integers rather than normalizing our data and computations to fit the raw data.

@karencfv (author) replied

In a Dataset, Ratatui requires the data points to be f64, so I think we're stuck with f64, sadly.
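
(For reference, a rough sketch of that constraint; the metric name is just an example.)

use ratatui::symbols;
use ratatui::widgets::{Dataset, GraphType};

// Dataset::data takes &[(f64, f64)], so both the x (timestamp) and y (value)
// coordinates have to be f64 no matter what their original types were.
fn query_dataset(points: &[(f64, f64)]) -> Dataset<'_> {
    Dataset::default()
        .name("ProfileEvent_Query")
        .marker(symbols::Marker::Braille)
        .graph_type(GraphType::Line)
        .data(points)
}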

@andrewjstone replied

Ah, got it. That makes sense. Feel free to ignore these comments then :)

let mid_label_as_unit =
    values.avg(lower_label_as_unit, upper_label_as_unit);

// To nicely display the mid value label for the Y axis, we do the following:
@andrewjstone commented

I think you can get rid of this parsing if you just convert all the values to integers at ingestion time.

The only reason I guess you wouldn't want to do this is if there are metrics where there are fractions we actually care about.

@karencfv (author) replied

I get that this looks super weird 😄, but I added it for the cases where there is very little variance between points and the bounds end up very close to each other. Rounding made the mid point not be mid at all, and the data didn't really match up with the labels anymore.
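
(A made-up example of the failure mode:)

fn main() {
    // Hypothetical Y-axis bounds when a series barely moves (values in GiB).
    let (lower, upper) = (0.98_f64, 1.02_f64);
    let mid = (lower + upper) / 2.0; // 1.0
    // Rounding every label to a whole number collapses all three to "1",
    // so the mid label no longer sits midway between the bounds and the data
    // stops lining up with the axis labels.
    println!("{} {} {}", lower.round(), mid.round(), upper.round()); // prints: 1 1 1
}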

.iter()
.map(|ts| {
    (
        ts.time.trim_matches('"').parse::<f64>().unwrap_or_else(
@andrewjstone commented

Why do we need to parse a timestamp into an f64? Can't we use an actual time type instead?

@karencfv (author) replied

Same as above: the data points in the dataset need to be f64 for Ratatui to render them.

@karencfv (author) left a comment

Thanks for the review @andrewjstone !

@karencfv enabled auto-merge (squash) December 19, 2024 00:03
@karencfv merged commit d337333 into oxidecomputer:main Dec 19, 2024
19 checks passed
@karencfv deleted the clickana branch December 19, 2024 02:06