Add write_load to _cat/shards #117947

henrikno · 2024-12-03T23:54:41Z

Description

I was looking at a cluster where a couple of nodes were working really hard and others weren't, i.e. a case of imbalanced shards. I was looking for a way to figure out which shards might be contributing the most to the load on the particular nodes that were showing 90%+ CPU.
My first approach was to capture index stats with shard-level stats twice, a few minutes apart, and diff them, e.g. sorting by the difference in indexing_index_time_in_millis. It worked OK, but it requires multiple API calls and a script to compute the diff.
But then I noticed there's already a write_load at the per-shard level. I'd only seen it at e.g. the node level and data-stream level, but it correlates with the highest indexing time per shard, and the top shards by write load match the nodes that have high CPU.
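
For illustration, the diff-based first approach could look roughly like the sketch below. It is not the original script. It assumes both snapshots are parsed JSON responses from index stats with level=shards, that each shard copy carries `routing.node` and `indexing.index_time_in_millis`, and the function names are made up for this example.

```python
from typing import Dict, List, Tuple

# (index name, shard number, node id) identifies one shard copy
ShardKey = Tuple[str, str, str]


def indexing_time_by_shard(stats: dict) -> Dict[ShardKey, int]:
    """Flatten a shard-level stats response into {shard copy: index_time_in_millis}."""
    times: Dict[ShardKey, int] = {}
    for index, index_stats in stats.get("indices", {}).items():
        for shard_id, copies in index_stats.get("shards", {}).items():
            for copy in copies:
                key = (index, shard_id, copy["routing"]["node"])
                times[key] = copy["indexing"]["index_time_in_millis"]
    return times


def busiest_shards(before: dict, after: dict) -> List[Tuple[ShardKey, int]]:
    """Return shard copies sorted by indexing time spent between the two snapshots."""
    first = indexing_time_by_shard(before)
    second = indexing_time_by_shard(after)
    # Only compare copies present in both snapshots; a shard may have
    # relocated or been created between the two calls.
    deltas = {key: second[key] - first[key] for key in second if key in first}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)
```

Keying on the node id as well as the shard number keeps primaries and replicas of the same shard apart, since they live on different nodes.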

I made a script that calls GET /_stats/docs,indexing,merge?level=shards, shows just the write load per shard, and sorts by it:

python3 shard_write_load.py
                                                                                index shard  primary                   node  write_load
                                             .ds-traces-apm-default-2024.12.03-001590    10    False A0z-R5vDSvOl8ywdKdHAJA      1.9748
                                             .ds-traces-apm-default-2024.12.03-001590     9     True A0z-R5vDSvOl8ywdKdHAJA      1.9660
                                             .ds-traces-apm-default-2024.12.03-001590     2    False ZyoEhPYfTJa3vi8jsNQzJg      1.5078
           .ds-metrics-elasticsearch.stack_monitoring.index-default-2024.11.29-000118     1     True 7U_93ikgR6WRtfnf0KiCmg      1.4968
           .ds-metrics-elasticsearch.stack_monitoring.index-default-2024.11.29-000118     2    False gS7uMutcRYKmRkA0AAD86w      1.4273
                                             .ds-traces-apm-default-2024.12.03-001590     6     True gS7uMutcRYKmRkA0AAD86w      1.4027
                                             .ds-traces-apm-default-2024.12.03-001590    11    False gS7uMutcRYKmRkA0AAD86w      1.4007

Indeed the top nodes here are the ones with high CPU, and now it's easier to see which indices/shards to move.
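
A minimal sketch of what such a script might look like follows; it is not the actual shard_write_load.py. It assumes an unauthenticated cluster at http://localhost:9200 and that each shard copy in the response has `routing.primary`, `routing.node`, and `indexing.write_load`. The function names are hypothetical.

```python
import json
import urllib.request


def rank_by_write_load(stats: dict) -> list:
    """Flatten a shard-level stats response into rows sorted by write_load, highest first."""
    rows = []
    for index, index_stats in stats.get("indices", {}).items():
        for shard_id, copies in index_stats.get("shards", {}).items():
            for copy in copies:
                rows.append({
                    "index": index,
                    "shard": int(shard_id),
                    "primary": copy["routing"]["primary"],
                    "node": copy["routing"]["node"],
                    # Assumption: treat a missing write_load as 0.0
                    "write_load": copy.get("indexing", {}).get("write_load", 0.0),
                })
    return sorted(rows, key=lambda row: row["write_load"], reverse=True)


def fetch_shard_stats(base_url: str = "http://localhost:9200") -> dict:
    """Fetch shard-level stats; base_url is an assumed, unauthenticated endpoint."""
    url = f"{base_url}/_stats/docs,indexing,merge?level=shards"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


if __name__ == "__main__":
    print(f'{"index":>80} {"shard":>5} {"primary":>8} {"node":>22} {"write_load":>11}')
    for row in rank_by_write_load(fetch_shard_stats()):
        print(f'{row["index"]:>80} {row["shard"]:>5} {str(row["primary"]):>8} '
              f'{row["node"]:>22} {row["write_load"]:>11.4f}')
```

Keeping the ranking separate from the HTTP call means it can also be run against a saved stats dump.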

It would be awesome if we had this as a column in _cat/shards, so I could just add ?s=write_load:desc to make this easier.

elasticsearchmachine · 2024-12-04T06:49:15Z

Pinging @elastic/es-data-management (Team:Data Management)

henrikno added the >enhancement and needs:triage labels Dec 3, 2024

dnhatn added the :Data Management/Stats label Dec 4, 2024

elasticsearchmachine added the Team:Data Management label and removed the needs:triage label Dec 4, 2024
