Optimise Memory Consumption during Setup #2005

thorstenhater · 2022-10-19T08:29:17Z

Introduction

Historically, our setup phase is quite cavalier with performance and memory which has given rise
to various concerns for larger networks. As we are preparing for large deployments in the next
HPC generation, we need to address this.

⚠️ Important ⚠️

Tests on MPI are still failing for domain decomposition due to a cyclic shift of groups
which shouldn't influence the test or functionality. But I am too stupid to fix the test to
pass.

Label Resolution

Our label resolution scheme concatenates a global vector of all source labels across all MPI tasks.
In my recent experiments, this exhausts the node-local memory (512GB) at around 150M cells,
with only a few --- $\mathcal{O}(10)$ --- labels per cell.

Fix

In each cell group, build a lazy map of the GIDs we are connected to and the associated source
labels. This reduces the required memory to the actual high-water-mark, but incurs the cost
of instantiating cells multiple times.

If this still not enough, we can proceed to minimize the memory consumption even more by

(simple) evict labels from the cache if we are close to the limit. Costs more instantiations, but
saves memory Done
(optimal) observing which cells connect to a given source GID, instantiating the cell for only as
long as needed. Now memory is O(1), but we need to fiddle with construction order all the while
keeping the connectivity stable. Which might be complicated

Thingification during Cable Cell Construction

Cable cells constantly invoke thingify(region, provider) during construction. Mostly however, we
paint those region in succession, ie:

  # ...
  .paint('(tag 1)', ...)
  .paint('(tag 1)', ...)
  .paint('(tag 1)', ...)
  #...

This costs a lost of temporary allocations.

Fix

Simply cache the last region, following the idea that this is the 90% case.

Temporary data structures during Setup

embed_pwlin is particularly bad with managing allocation so much so that it shows up on time
profiles as a major cost centre. This originates from structures which are nested vectors under the
hood and must return values for reasons of genericity.

Fix

Return references where possible, convert the rest to continuation passing style so we never actually
return a value but take a callback which is invoked with the references.

`mechanism_catalogue` allocates temporaries

operator[] returns a value, where a const reference would be sufficient.

Fix

Obvious, no?

General

Split large functions -- especially in fvm_layout -- into smaller ones. This fences in temporary memory allocations.

Clean-up and Extras

Rename `mc_cell_` to the proper `cable_cell_`

This also renames -- finally! -- mc_cell_* to cable_cell_*

Add RSS field to Profiler

To check for the actual memory consumption recording of the high watermark memory was added to the
profiler. It now prints the memory used at the 1st hit of this cost centre and the associated maximum. Thus,
we can determine allocation during runtime and setup.

Example (not very helpful, but illustrative)

REGION             CALLS  WALL/s  THREAD/s      %  MAX_RSS/kB  1st_RSS/kB
------             -----  ------  --------  -----  ----------  ----------
root                   -   0.686     5.485  100.0           -           -
  advance              -   0.685     5.481   99.9           -           -
    integrate          -   0.685     5.476   99.8           -           -
      current          -   0.281     2.246   40.9           -           -
        pas        40200   0.189     1.515   27.6       29248       29232
        zero       40200   0.035     0.283    5.2       29248       29232
        hh         40200   0.031     0.249    4.5       29248       29232
        expsyn     40200   0.025     0.198    3.6       29248       29232

Add logging `operator new`

This will -- if enabled at compile time -- spit out the amount of Bytes allocated and the callstack
of the allocation site to stderr during runtime. ⚠️ This is hilariously slow, but really helps pinpointing
allocations. Use the script/memory-log-to-profile.py to convert log files to a report like this:

Size/kB   Count
4628      88274     root
4628      88274       start
4627      88269         main
4487      86748           arb::simulation::simulation(arb::recipe const&, std::__1::shared_ptr<arb::execution_context>, arb::domain_decomposition const&, unsigned long long)
4487      86748             arb::simulation::simulation(arb::recipe const&, std::__1::shared_ptr<arb::execution_context>, arb::domain_decomposition const&, unsigned long long)
4487      86747               arb::simulation_state::simulation_state(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)
4487      86747                 arb::simulation_state::simulation_state(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)
3106      69279                   void arb::simulation_state::foreach_group_index<arb::simulation_state::foreach_group_index(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)::$_0>(arb::simulation_state::foreach_group_index(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)::$_0&&)
3102      69276                     arb::simulation_state::foreach_group_index<arb::simulation_state::foreach_group_index(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)::$_0>(arb::simulation_state::foreach_group_index(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)::$_0&&)::{lambda(int)#1}::operator()(int) const
3102      69276                       arb::simulation_state::simulation_state(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)::$_0::operator()(std::__1::unique_ptr<arb::cell_group, std::__1::default_delete<arb::cell_group> >&, int) const

Linked

Fixes #1969

…umonguous-buffer-of-all-the-labels-in-the-world

AdhocMan

Just some minor issues / comments.

arbor/communication/communicator.cpp

arbor/label_resolution.hpp

test/unit-distributed/test_communicator.cpp

…umonguous-buffer-of-all-the-labels-in-the-world

AdhocMan

Looks good to me. I'll leave merging up to you, in case you'd like feedback from others on this change.

boeschf · 2023-01-31T10:48:26Z

do we still need the struct arb::connectivity?

thorstenhater · 2023-01-31T14:04:55Z

No and it should no longer be here.

boeschf · 2023-03-07T10:21:48Z

arbor/profile/profiler.cpp

+    res.max_rss = (n.max_rss == -1) ? "-" : std::to_string(n.max_rss);
+    res.fst_rss = (n.fst_rss == -1) ? "-" : std::to_string(n.fst_rss);


out of curiosity: why are you not using snprintf here?

No particular reason, that was just the way it was done here before.

boeschf · 2023-03-07T10:42:33Z

arbor/communication/communicator.cpp

+            if (!sources.count(sgid)) sources.emplace(sgid, label_map{sgid, rec});
+            auto src_lid = sources.at(sgid).resolve(c.source);


Suggested change

if (!sources.count(sgid)) sources.emplace(sgid, label_map{sgid, rec});

auto src_lid = sources.at(sgid).resolve(c.source);

auto src_lid = sources.try_emplace(sgid, sgid, rec).first->second.resolve(c.source);

So, if I do this, will the value be constructed? If so, it's precisely running against what I want.

… on test 8.13MB -> 4.63MB

Ponder whether we can make MT build more efficient.

brenthuisman · 2023-07-04T11:37:46Z

Looks like this'll be essential for large simulations, the use-case we promote Arbor for. Is it possible to provide either strategy as a user option? Give me fast init, or give me low memory usage? Or even auto-activate the low mem path upon ncells > threshold (e.g. 100M?).

If not that, maybe a troubleshooting checklist for when e.g. people run out of memory or other fail situations.

Use the simple but well-known FNV-1a hash function to map `cell_tag_type` aka `std::string` to an `uint64_t` for label resolution. The former type has a size of 32B or more and the latter 8B, thus cutting the storage and bandwidth requirements by 3/4. The hash function is implemented from the reference given on the authors' website/wikipedia and is extremely simple. If we ever experience issues, we might consider switching this to something of higher quality via an external library, candidates are `xxHASH` and `Murmur3`. https://github.com/Cyan4973/xxHash Note: This should further relieve the memory pressure on larger scale simulation as formulated in #1969 and make #2005 less urgent. There is no performance impact (at laptop scale), but the memory savings are worth it.

Spin-off from #2005. Make the primary load balancing cleaner and faster and more maintainable. Thus: - remove all MPI calls, this is now purely local - remove temporary data structures and/or coral them into their own little scopes - simplify `super_cells` vs `regular_cells` - sort less - sparse connection tables Partially inspired by external feedback

thorstenhater · 2024-08-20T08:52:33Z

This has been splintered and merged in those splinters.

thorstenhater added 2 commits October 17, 2022 17:01

Only accept recipes.

c00ea33

Stop communicator update-connections from concatenating a huge vector.

cacf3d1

thorstenhater requested review from bcumming, noraabiakar and boeschf October 19, 2022 08:29

thorstenhater added 4 commits October 19, 2022 11:38

Merge remote-tracking branch 'origin/master' into hpc/do-not-make-a-h…

16f90ee

…umonguous-buffer-of-all-the-labels-in-the-world

Merge remote-tracking branch 'origin/master' into hpc/do-not-make-a-h…

12c8a2d

…umonguous-buffer-of-all-the-labels-in-the-world

Merge.

3d02e95

Merge.

c99070b

thorstenhater marked this pull request as ready for review January 19, 2023 11:06

Merge remote-tracking branch 'origin/master' into hpc/do-not-make-a-h…

f86c479

…umonguous-buffer-of-all-the-labels-in-the-world

AdhocMan requested changes Jan 26, 2023

View reviewed changes

arbor/communication/communicator.cpp Outdated Show resolved Hide resolved

arbor/label_resolution.hpp Outdated Show resolved Hide resolved

test/unit-distributed/test_communicator.cpp Show resolved Hide resolved

thorstenhater added 2 commits January 27, 2023 09:11

Merge remote-tracking branch 'origin/master' into hpc/do-not-make-a-h…

ca15a1f

…umonguous-buffer-of-all-the-labels-in-the-world

Fix review commits.

adbec3c

AdhocMan previously approved these changes Jan 27, 2023

View reviewed changes

Remove an obsolete hierarchy.

6273fc0

thorstenhater dismissed AdhocMan’s stale review via 6273fc0 January 31, 2023 14:31

thorstenhater added 5 commits February 1, 2023 13:42

Add high-water-mark memory to profiling.

29f79e3

Add first-seen memory mark and iron out printing.

a90fc04

Appease black.

bf1e65c

Fix and refactor source labels/resolver handling.

285a734

Revert ring.cpp.

41b6499

thorstenhater requested a review from AdhocMan February 13, 2023 10:43

boeschf reviewed Mar 7, 2023

View reviewed changes

thorstenhater added 4 commits June 13, 2023 09:35

Nasty merge: binning, fixed-dt, and delivery.

dc84213

More mergen.

3816910

Make pw_zip_iterator less atrocious.

0bda6f8

Implement region and locset caching.

4a17c28

Add logging new, caching and early deletion for get_sources. Memory…

b81674a

… on test 8.13MB -> 4.63MB

thorstenhater changed the title ~~Remove a global data-structure~~ Optimise Memory Consumption during Setup Jun 17, 2023

thorstenhater added 11 commits June 17, 2023 14:31

Add ankerl containers and try them for label resolution.

21b2ade

Shrink discretisation after building.

58a5f13

Ponder whether we can make MT build more efficient.

Make load balance waaaaay simpler at the cost of trivial permutation.

2064268

Modernise mechcat a bit.

a71a3be

Pull out diffusion building.

35925b6

Move more, copy less.

a38759c

Get rid of an imediate vector of all GIDs.

ef14746

Allow mechcat to return const refs.

d7802bf

Add memory logger properly.

836ef5c

Black/Flake

a669ae4

Clean-up map change

903d02d

This was referenced Aug 11, 2023

Improve partition_load_balance #2206

Merged

Reduce transient memory allocations in integrate/pw_lin. #2208

Closed

thorstenhater mentioned this pull request Aug 29, 2023

Switch std::string -> hashes for label resolution. #2215

Merged

2 tasks

thorstenhater closed this Aug 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimise Memory Consumption during Setup #2005

Optimise Memory Consumption during Setup #2005

thorstenhater commented Oct 19, 2022 •

edited

Loading

AdhocMan left a comment

AdhocMan left a comment

boeschf commented Jan 31, 2023

thorstenhater commented Jan 31, 2023

boeschf Mar 7, 2023

thorstenhater Jun 20, 2023

boeschf Mar 7, 2023

thorstenhater Jun 20, 2023

brenthuisman commented Jul 4, 2023 •

edited

Loading

thorstenhater commented Aug 20, 2024

		res.max_rss = (n.max_rss == -1) ? "-" : std::to_string(n.max_rss);
		res.fst_rss = (n.fst_rss == -1) ? "-" : std::to_string(n.fst_rss);

		if (!sources.count(sgid)) sources.emplace(sgid, label_map{sgid, rec});
		auto src_lid = sources.at(sgid).resolve(c.source);

	if (!sources.count(sgid)) sources.emplace(sgid, label_map{sgid, rec});
	auto src_lid = sources.at(sgid).resolve(c.source);
	auto src_lid = sources.try_emplace(sgid, sgid, rec).first->second.resolve(c.source);

Optimise Memory Consumption during Setup #2005

Optimise Memory Consumption during Setup #2005

Conversation

thorstenhater commented Oct 19, 2022 • edited Loading

Introduction

⚠️ Important ⚠️

Label Resolution

Fix

Thingification during Cable Cell Construction

Fix

Temporary data structures during Setup

Fix

mechanism_catalogue allocates temporaries

Fix

General

Clean-up and Extras

Rename mc_cell_* to the proper cable_cell_*

Add RSS field to Profiler

Add logging operator new

Linked

AdhocMan left a comment

Choose a reason for hiding this comment

AdhocMan left a comment

Choose a reason for hiding this comment

boeschf commented Jan 31, 2023

thorstenhater commented Jan 31, 2023

boeschf Mar 7, 2023

Choose a reason for hiding this comment

thorstenhater Jun 20, 2023

Choose a reason for hiding this comment

boeschf Mar 7, 2023

Choose a reason for hiding this comment

thorstenhater Jun 20, 2023

Choose a reason for hiding this comment

brenthuisman commented Jul 4, 2023 • edited Loading

thorstenhater commented Aug 20, 2024

thorstenhater commented Oct 19, 2022 •

edited

Loading

`mechanism_catalogue` allocates temporaries

Rename `mc_cell_` to the proper `cable_cell_`

Add logging `operator new`

brenthuisman commented Jul 4, 2023 •

edited

Loading