
pgremapper performance limit? #52

Open
k0ste opened this issue Sep 4, 2024 · 7 comments
Comments

@k0ste
Contributor

k0ste commented Sep 4, 2024

Hi, I've found a cluster where cancel-backfill is not as fast as usual and takes minutes.
1423 OSDs | 37920 PGs, so not especially big, but it has 38 rack buckets and 67 hosts.

What I see is that pgremapper uses only ~3 of 24 CPU cores, which seems suspiciously low for a Go program. Is there some limit in the code or the compiler, or is this a scalability ceiling for the application?
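For context, the Go runtime itself does not cap a program at a few cores: GOMAXPROCS defaults to the number of logical CPUs. A minimal check (hypothetical, not part of pgremapper) that prints the relevant values:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) queries the current value without changing it; unless it is
	// explicitly lowered (in code or via the GOMAXPROCS env var), it equals NumCPU.
	fmt.Println("NumCPU:    ", runtime.NumCPU())
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}
```

If GOMAXPROCS matches the core count, a ~3-core ceiling points at the workload (lock contention, GC, or waiting on external commands) rather than a compiler or runtime limit.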

@jbaergen-do
Contributor

Were any PGs unclean at the time? pgremapper will issue quite a few extra commands in that case, and would be limited in performance by command round-tripping. Increasing --concurrency can help with this.

If you run with --verbose it will spit out all of the Ceph commands being run; I'd be interested to know whether it's issuing commands the whole time or if it was sitting there thinking.
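To illustrate why --concurrency matters when command round-tripping dominates, here is a sketch of a bounded worker pool issuing `ceph` commands in parallel (hypothetical, not pgremapper's actual code; the commands shown are only examples):

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

// runCommands executes each command with at most `concurrency` in flight.
// When the tool is mostly waiting on round trips to the cluster, raising the
// number of in-flight commands is what improves throughput.
func runCommands(cmds [][]string, concurrency int) {
	sem := make(chan struct{}, concurrency)
	var wg sync.WaitGroup
	for _, args := range cmds {
		wg.Add(1)
		sem <- struct{}{} // blocks once `concurrency` commands are in flight
		go func(args []string) {
			defer wg.Done()
			defer func() { <-sem }()
			out, err := exec.Command(args[0], args[1:]...).Output()
			if err != nil {
				fmt.Println("command failed:", args, err)
				return
			}
			_ = out // parsing of the JSON output would happen here
		}(args)
	}
	wg.Wait()
}

func main() {
	cmds := [][]string{
		{"ceph", "osd", "dump", "--format=json"},
		{"ceph", "pg", "dump", "pgs_brief", "--format=json"},
	}
	runCommands(cmds, 4)
}
```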

@k0ste
Contributor Author

k0ste commented Sep 5, 2024

> If you run with --verbose it will spit out all of the Ceph commands being run; I'd be interested to know whether it's issuing commands the whole time or if it was sitting there thinking.

It was thinking. Everything I said concerns not what pgremapper would actually do, but the moment when pgremapper plans what to do, i.e. a dry run: cancel-backfill --verbose (without --yes).

@jbaergen-do
Contributor

Hmm, OK. The OSD and PG counts are consistent with some systems we've tested on in the past, and I don't think the rack/host count should affect cancel-backfill. Reviewing the code, I see that we do run the PG calculations in parallel, controlled by the concurrency setting, but depending on how much of the computation time is spent in mappingState.tryRemap(), the lock in that function might be a bottleneck...
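To make that suspicion concrete, here is a toy sketch (names borrowed from the comment above, structure invented) of how a single mutex inside tryRemap() can serialize otherwise-parallel PG calculations, so that extra cores and higher concurrency stop helping:

```go
package main

import "sync"

// mappingState is a stand-in with the same shape of problem: one shared lock.
type mappingState struct {
	mu      sync.Mutex
	mapping map[int][]int // PG id -> target OSDs (placeholder representation)
}

// tryRemap mimics a remap attempt whose bookkeeping runs under a global lock;
// everything inside the critical section executes one worker at a time.
func (m *mappingState) tryRemap(pgID int, target []int) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.mapping[pgID] = target
	return true
}

func main() {
	state := &mappingState{mapping: make(map[int][]int)}
	const workers = 8 // analogous to the concurrency setting
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(worker int) {
			defer wg.Done()
			for pg := worker; pg < 37920; pg += workers {
				state.tryRemap(pg, []int{1, 2, 3}) // all workers contend on one lock
			}
		}(w)
	}
	wg.Wait()
}
```

If most of the per-PG computation time sits inside that critical section, CPU usage flattens out regardless of the worker count, which would match 2-3 busy cores.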

@k0ste
Contributor Author

k0ste commented Sep 5, 2024

> the lock in that function might be a bottleneck...

In perf top I see this:
[Screenshot 2024-09-05 at 18 47 56: perf top output]
This one?
[Screenshot 2024-09-05 at 18 48 15: perf top output]

@jbaergen-do
Contributor

Oh, interesting, that appears to be spending a bunch of time in garbage collection. Maybe there are things we can do to be more memory-allocation-efficient here.
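Two things that could be checked without a patch: how much CPU the GC is actually taking (GODEBUG=gctrace=1, or runtime.MemStats), and whether hot paths allocate per iteration. A sketch of the kind of change that usually helps (hypothetical, not an actual pgremapper patch):

```go
package main

import (
	"fmt"
	"runtime"
)

// candidateTargets builds a slice whose final size is known up front;
// preallocating the capacity turns many small allocations into one.
func candidateTargets(osds []int) []int {
	out := make([]int, 0, len(osds))
	for _, osd := range osds {
		out = append(out, osd)
	}
	return out
}

func main() {
	osds := make([]int, 1423)
	for i := range osds {
		osds[i] = i
	}
	for i := 0; i < 100000; i++ {
		_ = candidateTargets(osds) // simulate a hot allocation path
	}

	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	// GCCPUFraction: fraction of this process's CPU time spent in GC since start.
	fmt.Printf("GC CPU fraction: %.4f, mallocs: %d\n", ms.GCCPUFraction, ms.Mallocs)
}
```

A CPU profile (runtime/pprof or `go tool pprof`) would show whether the GC time correlates with a specific allocation site.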

@k0ste
Contributor Author

k0ste commented Nov 22, 2024

> Maybe there are things we can do to be more memory-allocation-efficient here

We would be happy to test a patch if something can be done here. In my observation, pgremapper can consume 7-8 cores elsewhere; in this cluster it is limited to 2-3 cores.

@jbaergen-do
Contributor

Apologies - this is still on our list but it hasn't been able to bubble to the top yet
