Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch std::string -> hashes for label resolution. #2215

Merged
merged 13 commits into from
Nov 29, 2023

Conversation

thorstenhater
Copy link
Contributor

@thorstenhater thorstenhater commented Aug 29, 2023

Use the simple but well-known FNV-1a hash function to map cell_tag_type aka std::string to an uint64_t
for label resolution. The former type has a size of 32B or more and the latter 8B, thus cutting the storage and bandwidth
requirements by 3/4.

The hash function is implemented from the reference given on the authors' website/wikipedia and is extremely
simple. If we ever experience issues, we might consider switching this to something of higher quality via an
external library, candidates are xxHASH and Murmur3.

https://github.com/Cyan4973/xxHash

Note: This should further relieve the memory pressure on larger scale simulation as formulated in #1969 and make
#2005 less urgent.

There is no performance impact (at laptop scale), but the memory savings are worth it.

TODO

  • Implement collision checks.
  • Gather some performance numbers.

Copy link
Contributor

@AdhocMan AdhocMan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. It's great to be able to avoid a lot of the internal string handling!
For the high-level network feature, it might be best to expose hashes to the user. So it could be useful to have access to the hash function, for example to allow comparison of hashed labels for custom selection functions. It's probably an uncommon use-case, but what do think about making the hash function part of the arbor interface?

@thorstenhater
Copy link
Contributor Author

thorstenhater commented Sep 19, 2023

So, the user can see the hashes, they are in fact returned from decor::placements and there's the reverse mapping
in decor::hashes. Thus I'd say we can expose the hash function. Since this would make the concrete hash function
part of the interface layer, shall we think once more about using something different? I think xxHash is the only contender.

@AdhocMan
Copy link
Contributor

AdhocMan commented Oct 6, 2023

I don't know if our simple hashing requirements justify adding another dependency. For the network generation feature, a change of hash function at a later stage would no longer be an issue. But I'd expect FNV-1a to be good enough for our purpose.
Once there the hash function is part of the interface, this PR is good to merge from my side.

@AdhocMan
Copy link
Contributor

AdhocMan commented Oct 6, 2023

Technically, we already have a hash function exposed as part of the interface in hash_def.hpp. So maybe provide the hash as an overload of the hash_value function for string_view and string types if we want to use our own implementation instead of std::hash?

arbor/include/arbor/util/hash_def.hpp Show resolved Hide resolved
// that will be here until the end of time.

template <typename T>
inline constexpr std::size_t internal_hash(T&& data) {
Copy link
Contributor

@AdhocMan AdhocMan Oct 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for moving it here!
I just have one suggestion.
Calling internal_hash and hash_value would give different results for a single type due to the multiplication with the prime number in hash_value_combine. How about merging the internal_hash function into hash_value?
So something like this:

template <typename T>
inline constexpr std::size_t hash_value(const T& t) {
    if constexpr (std::is_convertible_v<T, std::string_view>) {
        constexpr std::size_t prime = 0x100000001b3;
        constexpr std::size_t offset_basis = 0xcbf29ce484222325;

        std::size_t hash = offset_basis;

        for (uint8_t byte: std::string_view{t}) {
            hash = hash ^ byte;
            hash = hash * prime;
        }

        return hash;
    }
    else { return std::hash<std::decay_t<T>>{}(t); }
}

template <typename T, typename... TAIL>
inline constexpr std::size_t hash_value(const T& t, const TAIL&... tail) {
    constexpr std::size_t prime = 93481;
    return prime * hash_value(t) + hash_value(tail...);
}

This way we would simply always use hash_value throughout the code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, a comment mentioning the different behaviour for char* and const char* types would be good. For these, the content they point to is hashed. Any other type of pointer will have the hash of the pointer value itself.
Or we do not allow usage with other types of pointers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I decided to add a slightly more intricate logic layer:

  • char*, const char*, char[] are hashed as strings
  • integral types are hashed byte-wise, to avoid the STL int-hash-is-id 'feature'
  • void* is hashed like an integral value
  • other pointer types are denied
  • everything else is hashed via STL

Idea being that hashes of pointer types are likely be unintended and must be opted-in to via casting to void*.
The deviation from STL is meant to spread out integer-keyed maps a bit better. I hope they do not use a similar
scheme for real values...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a decent solution for the problem of how to treat different pointer types.
The first comment is not yet addressed however. Calling hash_value and internal_hash for a single type will yield a different hash. As suggested above, we could just overload the hash_value function, such that the current internal_hash function will just become the hash_value function for single types. This way we can also get rid of hash_value_combine, which appears to be only used here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I changed the hash_combine function (below) s.t.

hash_combine(vs) = H(0, vs)
H(h, v:vs) = H(h * prime + internal_hash(v), vs)

shouldn't that address your comment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I missed that change. I think just having hash_value would still be overall cleaner then internal_hash, hash_combine and hash_value. But from a functionality point of view I'm ok with it as is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the surface now just is hash_value. I should adjust the usage accordingly and stuff internal_hash and hash_combine into a detail namespace.

Copy link
Contributor

@AdhocMan AdhocMan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there 😉
Just two small suggestions.

Comment on lines 22 to 23

#include <iostream>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#include <iostream>

I think this is not needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh dear my old vice ;)

// that will be here until the end of time.

template <typename T>
inline constexpr std::size_t internal_hash(T&& data) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a decent solution for the problem of how to treat different pointer types.
The first comment is not yet addressed however. Calling hash_value and internal_hash for a single type will yield a different hash. As suggested above, we could just overload the hash_value function, such that the current internal_hash function will just become the hash_value function for single types. This way we can also get rid of hash_value_combine, which appears to be only used here.

Copy link
Contributor

@AdhocMan AdhocMan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me now 👍

@thorstenhater thorstenhater merged commit c391775 into arbor-sim:master Nov 29, 2023
22 checks passed
@thorstenhater thorstenhater deleted the perf/hashed-labels branch November 29, 2023 12:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants