[clickhouse] Enable system.metric_log and `system.asynchronous_metric_log` tables (#7100)

## Overview

To reliably roll out a replicated ClickHouse cluster, we'll be
running a set of long-running tests (stage 1 of RFD
[#468](https://rfd.shared.oxide.computer/rfd/0468#_stage_1)).

ClickHouse provides a `system` database whose tables hold
information about the system itself. For monitoring purposes, the
[`system.asynchronous_metric_log`](https://clickhouse.com/docs/en/operations/system-tables/asynchronous_metric_log)
and
[`system.metric_log`](https://clickhouse.com/docs/en/operations/system-tables/metric_log)
tables are particularly useful. With them we can retrieve information
about queries per second, CPU usage, memory usage, and so on. The full
lists of available metrics are in the documentation for
[`system.metrics`](https://clickhouse.com/docs/en/operations/system-tables/metrics)
and
[`system.asynchronous_metrics`](https://clickhouse.com/docs/en/operations/system-tables/asynchronous_metrics).

During our long-running tests I'd like to give these tables a TTL of
30 days. Once we are confident the system is stable and we roll out the
cluster to all racks, we can reduce the TTL to 7 or 14 days.
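
To make the TTL trade-off concrete, here is a minimal Rust sketch of how the engine clause could be generated from a retention period, together with a rough row-count estimate for `system.metric_log`. The function names `engine_clause` and `approx_metric_log_rows` are hypothetical and not part of this PR, which hard-codes the clause in the XML config. The estimate assumes one row is written per collection interval (1000 ms in this config) on each server, and it ignores the fact that expired rows are only dropped when parts are merged.

```rust
/// Hypothetical helper: builds the `<engine>` clause used for the metric log
/// tables, with the TTL expressed in days. This PR hard-codes 30 days.
fn engine_clause(ttl_days: u32) -> String {
    format!(
        "Engine = MergeTree ORDER BY event_time TTL event_date + INTERVAL {} DAY",
        ttl_days
    )
}

/// Rough estimate of the rows retained in `system.metric_log` on one server,
/// assuming (not confirmed by this PR) one row per collection interval.
fn approx_metric_log_rows(ttl_days: u64, collect_interval_ms: u64) -> u64 {
    const MS_PER_DAY: u64 = 24 * 60 * 60 * 1000;
    ttl_days * MS_PER_DAY / collect_interval_ms
}
```

Under these assumptions, `approx_metric_log_rows(30, 1000)` gives 2,592,000 rows per server, and reducing the TTL to 7 days would bring that down to 604,800.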

This PR only enables the tables themselves. Follow-up PRs will
retrieve the data we'll be monitoring.

## Manual testing

### Queries per second

```console
oxz_clickhouse_eecd32cc-ebf2-4196-912f-5bb440b104a0.local :) SELECT toStartOfInterval(event_time, INTERVAL 60 SECOND) AS t, avg(ProfileEvent_Query)
FROM system.metric_log
WHERE event_date >= toDate(now() - 86400) AND event_time >= now() - 86400
GROUP BY t
ORDER BY t WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'

SELECT
    toStartOfInterval(event_time, toIntervalSecond(60)) AS t,
    avg(ProfileEvent_Query)
FROM system.metric_log
WHERE (event_date >= toDate(now() - 86400)) AND (event_time >= (now() - 86400))
GROUP BY t
ORDER BY t ASC WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'

Query id: 1b91946b-fe8b-4074-bc94-f071f72f55f5

┌────────────────────t─┬─avg(ProfileEvent_Query)─┐
│ 2024-11-18T06:40:00Z │      1.3571428571428572 │
│ 2024-11-18T06:41:00Z │      1.3666666666666667 │
│ 2024-11-18T06:42:00Z │      1.3666666666666667 │
│ 2024-11-18T06:43:00Z │      1.3666666666666667 │
│ 2024-11-18T06:44:00Z │      1.3666666666666667 │
```

### Disk usage

```console
oxz_clickhouse_eecd32cc-ebf2-4196-912f-5bb440b104a0.local :) SELECT toStartOfInterval(event_time, INTERVAL 60 SECOND) AS t, avg(value)
FROM system.asynchronous_metric_log
WHERE event_date >= toDate(now() - 86400) AND event_time >= now() - 86400
AND metric = 'DiskUsed_default'
GROUP BY t
ORDER BY t WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'

SELECT
    toStartOfInterval(event_time, toIntervalSecond(60)) AS t,
    avg(value)
FROM system.asynchronous_metric_log
WHERE (event_date >= toDate(now() - 86400)) AND (event_time >= (now() - 86400)) AND (metric = 'DiskUsed_default')
GROUP BY t
ORDER BY t ASC WITH FILL STEP 60
SETTINGS date_time_output_format = 'iso'

Query id: bcf9cd9b-7fea-4aea-866b-d69e60a7c0b6

┌────────────────────t─┬─────────avg(value)─┐
│ 2024-11-18T06:42:00Z │  860941425.7777778 │
│ 2024-11-18T06:43:00Z │  865134523.7333333 │
│ 2024-11-18T06:44:00Z │          871888896 │
│ 2024-11-18T06:45:00Z │  874408891.7333333 │
│ 2024-11-18T06:46:00Z │          878761984 │
│ 2024-11-18T06:47:00Z │  881646933.3333334 │
│ 2024-11-18T06:48:00Z │  883998788.2666667 │

```
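
The two manual queries above differ only in the table and in how the metric is selected: `system.metric_log` exposes each metric as its own column, while `system.asynchronous_metric_log` stores a `metric` name and a `value` per row. As a sketch of how a follow-up PR might build such queries, the hypothetical Rust helper below reproduces their shape. `MetricSource` and `avg_per_interval_query` are illustrative names only and do not exist in this repository.

```rust
/// Which of the two tables enabled by this PR to query (hypothetical type).
enum MetricSource<'a> {
    /// `system.metric_log` is wide: each metric is its own column,
    /// e.g. `ProfileEvent_Query`.
    MetricLog { column: &'a str },
    /// `system.asynchronous_metric_log` is narrow: each row carries a
    /// `metric` name and a `value`, e.g. `DiskUsed_default`.
    AsyncMetricLog { metric: &'a str },
}

/// Hypothetical helper that reproduces the shape of the manual queries.
/// Names are interpolated directly, so callers must pass trusted identifiers.
fn avg_per_interval_query(
    source: &MetricSource<'_>,
    interval_secs: u32,
    window_secs: u32,
) -> String {
    let (select, table, filter) = match source {
        MetricSource::MetricLog { column } => {
            (format!("avg({})", column), "system.metric_log", String::new())
        }
        MetricSource::AsyncMetricLog { metric } => (
            "avg(value)".to_string(),
            "system.asynchronous_metric_log",
            format!(" AND metric = '{}'", metric),
        ),
    };
    format!(
        "SELECT toStartOfInterval(event_time, INTERVAL {i} SECOND) AS t, {select} \
         FROM {table} \
         WHERE event_date >= toDate(now() - {w}) AND event_time >= now() - {w}{filter} \
         GROUP BY t \
         ORDER BY t WITH FILL STEP {i} \
         SETTINGS date_time_output_format = 'iso'",
        i = interval_secs,
        w = window_secs,
        select = select,
        table = table,
        filter = filter
    )
}
```

Calling it with `MetricSource::MetricLog { column: "ProfileEvent_Query" }`, an interval of 60 and a window of 86400 yields the queries-per-second query; `MetricSource::AsyncMetricLog { metric: "DiskUsed_default" }` yields the disk usage one.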

Related #6953
karencfv authored Nov 19, 2024
1 parent b4fa875 commit 7947f75
Showing 3 changed files with 96 additions and 0 deletions.
32 changes: 32 additions & 0 deletions clickhouse-admin/types/src/config.rs
@@ -132,6 +132,38 @@ impl ReplicaConfig {
<flush_interval_milliseconds>10000</flush_interval_milliseconds>
</query_log>
<metric_log>
<database>system</database>
<table>metric_log</table>
<!--
TTL will be 30 days until we've finished long running tests.
After that, we can reduce it to a week or two.
-->
<engine>Engine = MergeTree ORDER BY event_time TTL event_date + INTERVAL 30 DAY</engine>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
<collect_interval_milliseconds>1000</collect_interval_milliseconds>
<max_size_rows>1048576</max_size_rows>
<reserved_size_rows>8192</reserved_size_rows>
<buffer_size_rows_flush_threshold>524288</buffer_size_rows_flush_threshold>
<flush_on_crash>false</flush_on_crash>
</metric_log>
<asynchronous_metric_log>
<database>system</database>
<table>asynchronous_metric_log</table>
<!--
TTL will be 30 days until we've finished long running tests.
After that, we can reduce it to a week or two.
-->
<engine>Engine = MergeTree ORDER BY event_time TTL event_date + INTERVAL 30 DAY</engine>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
<collect_interval_milliseconds>1000</collect_interval_milliseconds>
<max_size_rows>1048576</max_size_rows>
<reserved_size_rows>8192</reserved_size_rows>
<buffer_size_rows_flush_threshold>524288</buffer_size_rows_flush_threshold>
<flush_on_crash>false</flush_on_crash>
</asynchronous_metric_log>
<tmp_path>{temp_files_path}</tmp_path>
<user_files_path>{user_files_path}</user_files_path>
<default_profile>default</default_profile>
32 changes: 32 additions & 0 deletions clickhouse-admin/types/testutils/replica-server-config.xml
@@ -49,6 +49,38 @@
<flush_interval_milliseconds>10000</flush_interval_milliseconds>
</query_log>

<metric_log>
<database>system</database>
<table>metric_log</table>
<!--
TTL will be 30 days until we've finished long running tests.
After that, we can reduce it to a week or two.
-->
<engine>Engine = MergeTree ORDER BY event_time TTL event_date + INTERVAL 30 DAY</engine>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
<collect_interval_milliseconds>1000</collect_interval_milliseconds>
<max_size_rows>1048576</max_size_rows>
<reserved_size_rows>8192</reserved_size_rows>
<buffer_size_rows_flush_threshold>524288</buffer_size_rows_flush_threshold>
<flush_on_crash>false</flush_on_crash>
</metric_log>

<asynchronous_metric_log>
<database>system</database>
<table>asynchronous_metric_log</table>
<!--
TTL will be 30 days until we've finished long running tests.
After that, we can reduce it to a week or two.
-->
<engine>Engine = MergeTree ORDER BY event_time TTL event_date + INTERVAL 30 DAY</engine>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
<collect_interval_milliseconds>1000</collect_interval_milliseconds>
<max_size_rows>1048576</max_size_rows>
<reserved_size_rows>8192</reserved_size_rows>
<buffer_size_rows_flush_threshold>524288</buffer_size_rows_flush_threshold>
<flush_on_crash>false</flush_on_crash>
</asynchronous_metric_log>

<tmp_path>./data/tmp</tmp_path>
<user_files_path>./data/user_files</user_files_path>
<default_profile>default</default_profile>
32 changes: 32 additions & 0 deletions smf/clickhouse/config.xml
@@ -13,6 +13,38 @@
<flush_interval_milliseconds>10000</flush_interval_milliseconds>
</query_log>

<metric_log>
<database>system</database>
<table>metric_log</table>
<!--
TTL will be 30 days until we've finished long running tests.
After that, we can reduce it to a week or two.
-->
<engine>Engine = MergeTree ORDER BY event_time TTL event_date + INTERVAL 30 DAY</engine>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
<collect_interval_milliseconds>1000</collect_interval_milliseconds>
<max_size_rows>1048576</max_size_rows>
<reserved_size_rows>8192</reserved_size_rows>
<buffer_size_rows_flush_threshold>524288</buffer_size_rows_flush_threshold>
<flush_on_crash>false</flush_on_crash>
</metric_log>

<asynchronous_metric_log>
<database>system</database>
<table>asynchronous_metric_log</table>
<!--
TTL will be 30 days until we've finished long running tests.
After that, we can reduce it to a week or two.
-->
<engine>Engine = MergeTree ORDER BY event_time TTL event_date + INTERVAL 30 DAY</engine>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
<collect_interval_milliseconds>1000</collect_interval_milliseconds>
<max_size_rows>1048576</max_size_rows>
<reserved_size_rows>8192</reserved_size_rows>
<buffer_size_rows_flush_threshold>524288</buffer_size_rows_flush_threshold>
<flush_on_crash>false</flush_on_crash>
</asynchronous_metric_log>

<mlock_executable>true</mlock_executable>

<tcp_port>9000</tcp_port>