Switch `std::string` -> hashes for label resolution. #2215

thorstenhater · 2023-08-29T10:47:56Z

Use the simple but well-known FNV-1a hash function to map cell_tag_type aka std::string to an uint64_t
for label resolution. The former type has a size of 32B or more and the latter 8B, thus cutting the storage and bandwidth
requirements by 3/4.

The hash function is implemented from the reference given on the authors' website/wikipedia and is extremely
simple. If we ever experience issues, we might consider switching this to something of higher quality via an
external library, candidates are xxHASH and Murmur3.

https://github.com/Cyan4973/xxHash

Note: This should further relieve the memory pressure on larger scale simulation as formulated in #1969 and make
#2005 less urgent.

There is no performance impact (at laptop scale), but the memory savings are worth it.

TODO

Implement collision checks.
Gather some performance numbers.

AdhocMan

This looks good to me. It's great to be able to avoid a lot of the internal string handling!
For the high-level network feature, it might be best to expose hashes to the user. So it could be useful to have access to the hash function, for example to allow comparison of hashed labels for custom selection functions. It's probably an uncommon use-case, but what do think about making the hash function part of the arbor interface?

thorstenhater · 2023-09-19T07:59:38Z

So, the user can see the hashes, they are in fact returned from decor::placements and there's the reverse mapping
in decor::hashes. Thus I'd say we can expose the hash function. Since this would make the concrete hash function
part of the interface layer, shall we think once more about using something different? I think xxHash is the only contender.

AdhocMan · 2023-10-06T14:28:16Z

I don't know if our simple hashing requirements justify adding another dependency. For the network generation feature, a change of hash function at a later stage would no longer be an issue. But I'd expect FNV-1a to be good enough for our purpose.
Once there the hash function is part of the interface, this PR is good to merge from my side.

AdhocMan · 2023-10-06T14:51:50Z

Technically, we already have a hash function exposed as part of the interface in hash_def.hpp. So maybe provide the hash as an overload of the hash_value function for string_view and string types if we want to use our own implementation instead of std::hash?

arbor/include/arbor/util/hash_def.hpp

AdhocMan · 2023-10-20T12:53:32Z

arbor/include/arbor/util/hash_def.hpp

+//             that will be here until the end of time.
+
+template <typename T>
+inline constexpr std::size_t internal_hash(T&& data) {


Thanks for moving it here!
I just have one suggestion.
Calling internal_hash and hash_value would give different results for a single type due to the multiplication with the prime number in hash_value_combine. How about merging the internal_hash function into hash_value?
So something like this:

template <typename T> inline constexpr std::size_t hash_value(const T& t) { if constexpr (std::is_convertible_v<T, std::string_view>) { constexpr std::size_t prime = 0x100000001b3; constexpr std::size_t offset_basis = 0xcbf29ce484222325; std::size_t hash = offset_basis; for (uint8_t byte: std::string_view{t}) { hash = hash ^ byte; hash = hash * prime; } return hash; } else { return std::hash<std::decay_t<T>>{}(t); } } template <typename T, typename... TAIL> inline constexpr std::size_t hash_value(const T& t, const TAIL&... tail) { constexpr std::size_t prime = 93481; return prime * hash_value(t) + hash_value(tail...); }

This way we would simply always use hash_value throughout the code.

Also, a comment mentioning the different behaviour for char* and const char* types would be good. For these, the content they point to is hashed. Any other type of pointer will have the hash of the pointer value itself.
Or we do not allow usage with other types of pointers.

Ok. I decided to add a slightly more intricate logic layer:

char*, const char*, char[] are hashed as strings

integral types are hashed byte-wise, to avoid the STL int-hash-is-id 'feature'

void* is hashed like an integral value

other pointer types are denied

everything else is hashed via STL

Idea being that hashes of pointer types are likely be unintended and must be opted-in to via casting to void*.
The deviation from STL is meant to spread out integer-keyed maps a bit better. I hope they do not use a similar
scheme for real values...

I think this is a decent solution for the problem of how to treat different pointer types.
The first comment is not yet addressed however. Calling hash_value and internal_hash for a single type will yield a different hash. As suggested above, we could just overload the hash_value function, such that the current internal_hash function will just become the hash_value function for single types. This way we can also get rid of hash_value_combine, which appears to be only used here.

Ok, I changed the hash_combine function (below) s.t.

hash_combine(vs) = H(0, vs) H(h, v:vs) = H(h * prime + internal_hash(v), vs)

shouldn't that address your comment?

Yeah, I missed that change. I think just having hash_value would still be overall cleaner then internal_hash, hash_combine and hash_value. But from a functionality point of view I'm ok with it as is.

Well, the surface now just is hash_value. I should adjust the usage accordingly and stuff internal_hash and hash_combine into a detail namespace.

AdhocMan

Almost there 😉
Just two small suggestions.

AdhocMan · 2023-11-28T12:57:08Z

arbor/include/arbor/util/hash_def.hpp

+
+#include <iostream>


Suggested change

#include <iostream>

I think this is not needed.

Oh dear my old vice ;)

AdhocMan · 2023-11-28T13:08:26Z

arbor/include/arbor/util/hash_def.hpp

+//             that will be here until the end of time.
+
+template <typename T>
+inline constexpr std::size_t internal_hash(T&& data) {


I think this is a decent solution for the problem of how to treat different pointer types.
The first comment is not yet addressed however. Calling hash_value and internal_hash for a single type will yield a different hash. As suggested above, we could just overload the hash_value function, such that the current internal_hash function will just become the hash_value function for single types. This way we can also get rid of hash_value_combine, which appears to be only used here.

AdhocMan

Looks good to me now 👍

thorstenhater added 3 commits August 29, 2023 12:37

Switch std::string -> hashes for label resolution.

c8421fd

Invariant checks for hash collisions.

d13a5b1

Account for same keys on cell label range.

ee037fc

thorstenhater requested review from AdhocMan and boeschf August 30, 2023 07:02

thorstenhater mentioned this pull request Sep 4, 2023

Feature: High-level network specification #2050

Merged

3 tasks

thorstenhater added 3 commits September 12, 2023 14:18

Push hashes up into decor.

f008f72

Merge remote-tracking branch 'origin/master' into perf/hashed-labels

eb2c7fb

Map back to labels.

d8546fc

AdhocMan reviewed Sep 19, 2023

View reviewed changes

Add internal_hash to hash_def and integrate w/ hash_value.

cf2a5f8

AdhocMan requested changes Oct 20, 2023

View reviewed changes

thorstenhater added 4 commits November 22, 2023 10:20

Add some testing, treat pointers.

682a6aa

Merge remote-tracking branch 'origin/master' into perf/hashed-labels

fbdf107

Merge part two.

ef792c9

The missing test.

97ca7a4

thorstenhater requested a review from AdhocMan November 24, 2023 09:42

AdhocMan requested changes Nov 28, 2023

View reviewed changes

thorstenhater added 2 commits November 29, 2023 09:36

Merge remote-tracking branch 'origin/master' into perf/hashed-labels

0cec44f

Shuffle internal hash and combine into a detail namespace.

c583b2f

thorstenhater requested a review from AdhocMan November 29, 2023 08:53

AdhocMan approved these changes Nov 29, 2023

View reviewed changes

thorstenhater merged commit c391775 into arbor-sim:master Nov 29, 2023
22 checks passed

thorstenhater deleted the perf/hashed-labels branch November 29, 2023 12:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch `std::string` -> hashes for label resolution. #2215

Switch `std::string` -> hashes for label resolution. #2215

thorstenhater commented Aug 29, 2023 •

edited

Loading

AdhocMan left a comment

thorstenhater commented Sep 19, 2023 •

edited

Loading

AdhocMan commented Oct 6, 2023

AdhocMan commented Oct 6, 2023

AdhocMan Oct 20, 2023 •

edited

Loading

AdhocMan Oct 20, 2023

thorstenhater Nov 22, 2023

AdhocMan Nov 28, 2023

thorstenhater Nov 28, 2023

AdhocMan Nov 28, 2023

thorstenhater Nov 28, 2023

AdhocMan left a comment

AdhocMan Nov 28, 2023

thorstenhater Nov 28, 2023

AdhocMan Nov 28, 2023

AdhocMan left a comment

Switch std::string -> hashes for label resolution. #2215

Switch std::string -> hashes for label resolution. #2215

Conversation

thorstenhater commented Aug 29, 2023 • edited Loading

TODO

AdhocMan left a comment

Choose a reason for hiding this comment

thorstenhater commented Sep 19, 2023 • edited Loading

AdhocMan commented Oct 6, 2023

AdhocMan commented Oct 6, 2023

AdhocMan Oct 20, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

AdhocMan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

AdhocMan left a comment

Choose a reason for hiding this comment

Switch `std::string` -> hashes for label resolution. #2215

Switch `std::string` -> hashes for label resolution. #2215

thorstenhater commented Aug 29, 2023 •

edited

Loading

thorstenhater commented Sep 19, 2023 •

edited

Loading

AdhocMan Oct 20, 2023 •

edited

Loading