From 7947f752ba6d226aa65d8ecfb030959c6a6896b2 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Karen=20C=C3=A1rcamo?=
Date: Wed, 20 Nov 2024 08:55:32 +1300
Subject: [PATCH] [clickhouse] Enable `system.metric_log` and `system.asynchronous_metric_log` tables (#7100)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Overview

In order to reliably roll out a replicated ClickHouse cluster, we'll be running a set of long-running tests (stage 1 of RFD [#468](https://rfd.shared.oxide.computer/rfd/0468#_stage_1)).

ClickHouse provides a `system` database with several tables that hold information about the system. For monitoring purposes, the [`system.asynchronous_metric_log`](https://clickhouse.com/docs/en/operations/system-tables/asynchronous_metric_log) and [`system.metric_log`](https://clickhouse.com/docs/en/operations/system-tables/metric_log) tables are particularly useful. With them we can retrieve information about queries per second, CPU usage, memory usage, etc. Full lists of available metrics are [here](https://clickhouse.com/docs/en/operations/system-tables/metrics) and [here](https://clickhouse.com/docs/en/operations/system-tables/asynchronous_metrics).

During our long-running testing, I'd like to give these tables a TTL of 30 days. Once we are confident the system is stable and we roll out the cluster to all racks, we can reduce the TTL to 7 or 14 days.

This PR will only enable the tables themselves.
There will be follow-up PRs to actually retrieve the data we'll be monitoring.

## Manual testing

### Queries per second

```console
oxz_clickhouse_eecd32cc-ebf2-4196-912f-5bb440b104a0.local :) SELECT toStartOfInterval(event_time, INTERVAL 60 SECOND) AS t, avg(ProfileEvent_Query) FROM system.metric_log WHERE event_date >= toDate(now() - 86400) AND event_time >= now() - 86400 GROUP BY t ORDER BY t WITH FILL STEP 60 SETTINGS date_time_output_format = 'iso'

SELECT
    toStartOfInterval(event_time, toIntervalSecond(60)) AS t,
    avg(ProfileEvent_Query)
FROM system.metric_log
WHERE (event_date >= toDate(now() - 86400)) AND (event_time >= (now() - 86400))
GROUP BY t
ORDER BY t ASC WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'

Query id: 1b91946b-fe8b-4074-bc94-f071f72f55f5

┌────────────────────t─┬─avg(ProfileEvent_Query)─┐
│ 2024-11-18T06:40:00Z │      1.3571428571428572 │
│ 2024-11-18T06:41:00Z │      1.3666666666666667 │
│ 2024-11-18T06:42:00Z │      1.3666666666666667 │
│ 2024-11-18T06:43:00Z │      1.3666666666666667 │
│ 2024-11-18T06:44:00Z │      1.3666666666666667 │
```

### Disk usage

```console
oxz_clickhouse_eecd32cc-ebf2-4196-912f-5bb440b104a0.local :) SELECT toStartOfInterval(event_time, INTERVAL 60 SECOND) AS t, avg(value) FROM system.asynchronous_metric_log WHERE event_date >= toDate(now() - 86400) AND event_time >= now() - 86400 AND metric = 'DiskUsed_default' GROUP BY t ORDER BY t WITH FILL STEP 60 SETTINGS date_time_output_format = 'iso'

SELECT
    toStartOfInterval(event_time, toIntervalSecond(60)) AS t,
    avg(value)
FROM system.asynchronous_metric_log
WHERE (event_date >= toDate(now() - 86400)) AND (event_time >= (now() - 86400)) AND (metric = 'DiskUsed_default')
GROUP BY t
ORDER BY t ASC WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'

Query id: bcf9cd9b-7fea-4aea-866b-d69e60a7c0b6

┌────────────────────t─┬─────────avg(value)─┐
│ 2024-11-18T06:42:00Z │  860941425.7777778 │
│ 2024-11-18T06:43:00Z │  865134523.7333333 │
│ 2024-11-18T06:44:00Z │          871888896 │
│ 2024-11-18T06:45:00Z │  874408891.7333333 │
│ 2024-11-18T06:46:00Z │          878761984 │
│ 2024-11-18T06:47:00Z │  881646933.3333334 │
│ 2024-11-18T06:48:00Z │  883998788.2666667 │
```

Related https://github.com/oxidecomputer/omicron/issues/6953
---
 clickhouse-admin/types/src/config.rs          | 32 +++++++++++++++++++
 .../types/testutils/replica-server-config.xml | 32 +++++++++++++++++++
 smf/clickhouse/config.xml                     | 32 +++++++++++++++++++
 3 files changed, 96 insertions(+)

diff --git a/clickhouse-admin/types/src/config.rs b/clickhouse-admin/types/src/config.rs
index 27eb569b91..120ff32312 100644
--- a/clickhouse-admin/types/src/config.rs
+++ b/clickhouse-admin/types/src/config.rs
@@ -132,6 +132,38 @@ impl ReplicaConfig {
 10000 + + system + metric_log
+ + Engine = MergeTree ORDER BY event_time TTL event_date + INTERVAL 30 DAY + 7500 + 1000 + 1048576 + 8192 + 524288 + false +
+ + + system + asynchronous_metric_log
+ + Engine = MergeTree ORDER BY event_time TTL event_date + INTERVAL 30 DAY + 7500 + 1000 + 1048576 + 8192 + 524288 + false +
+ {temp_files_path} {user_files_path} default diff --git a/clickhouse-admin/types/testutils/replica-server-config.xml b/clickhouse-admin/types/testutils/replica-server-config.xml index 3aeacd073d..8a0687e9af 100644 --- a/clickhouse-admin/types/testutils/replica-server-config.xml +++ b/clickhouse-admin/types/testutils/replica-server-config.xml @@ -49,6 +49,38 @@ 10000 + + system + metric_log
+ + Engine = MergeTree ORDER BY event_time TTL event_date + INTERVAL 30 DAY + 7500 + 1000 + 1048576 + 8192 + 524288 + false +
+ + + system + asynchronous_metric_log
+ + Engine = MergeTree ORDER BY event_time TTL event_date + INTERVAL 30 DAY + 7500 + 1000 + 1048576 + 8192 + 524288 + false +
+ ./data/tmp ./data/user_files default diff --git a/smf/clickhouse/config.xml b/smf/clickhouse/config.xml index 58ae5dcaf5..352023300a 100644 --- a/smf/clickhouse/config.xml +++ b/smf/clickhouse/config.xml @@ -13,6 +13,38 @@ 10000 + + system + metric_log
+ + Engine = MergeTree ORDER BY event_time TTL event_date + INTERVAL 30 DAY + 7500 + 1000 + 1048576 + 8192 + 524288 + false +
+ + + system + asynchronous_metric_log
+ + Engine = MergeTree ORDER BY event_time TTL event_date + INTERVAL 30 DAY + 7500 + 1000 + 1048576 + 8192 + 524288 + false +
+ true 9000
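
Each of the three hunks adds the same pair of settings blocks, one per table. The following is a minimal sketch of how such a block could be rendered with a configurable TTL, so that the later move from 30 days to 7 or 14 days touches a single value. The functions `system_log_block` and `metric_log_blocks` are hypothetical and are not part of this patch. The XML element names are an assumption: they are the standard ClickHouse system-log settings, matched to the values in the diff by position, since the element names themselves are not visible in the hunks as shown. Whether `collect_interval_milliseconds` applies to `asynchronous_metric_log` is also an assumption.

```rust
/// Hypothetical helper, not taken from the patch: renders the settings block
/// for one ClickHouse system log table.
///
/// Element names are assumed to be the standard ClickHouse system-log
/// settings; the numeric values and the engine clause are the ones that
/// appear in the diff.
fn system_log_block(table: &str, ttl_days: u32) -> String {
    format!(
        "<{table}>
    <database>system</database>
    <table>{table}</table>
    <engine>Engine = MergeTree ORDER BY event_time TTL event_date + INTERVAL {ttl_days} DAY</engine>
    <flush_interval_milliseconds>7500</flush_interval_milliseconds>
    <collect_interval_milliseconds>1000</collect_interval_milliseconds>
    <max_size_rows>1048576</max_size_rows>
    <reserved_size_rows>8192</reserved_size_rows>
    <buffer_size_rows_flush_threshold>524288</buffer_size_rows_flush_threshold>
    <flush_on_crash>false</flush_on_crash>
</{table}>",
        table = table,
        ttl_days = ttl_days
    )
}

/// Renders the blocks for both tables enabled by this change, using the
/// same TTL for each.
fn metric_log_blocks(ttl_days: u32) -> String {
    ["metric_log", "asynchronous_metric_log"]
        .iter()
        .map(|&table| system_log_block(table, ttl_days))
        .collect::<Vec<_>>()
        .join("\n")
}
```

Under these assumptions, `metric_log_blocks(30)` produces both blocks with the 30-day TTL used during the long-running tests, and passing `7` or `14` changes only the `TTL` clause.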