This project's goal is to devise, implement, and evaluate techniques for generating optimized hash functions tailored for string keys whose format can be approximated by a regular expression inferred through profiling. These functions will be used to improve the performance of C++'s STL data structures, such as std::unordered_map
, std::unordered_set
, std::unordered_multimap
and std::unordered_multiset
, in addition to any other std::hash
specialization for user-defined C++ types.
These are the most important dependencies for building and running all Sepe programs:
Dependency | Version | Installation Link |
---|---|---|
clang | >= 14.0.0 | llvm.org |
CMake | >= 3.20 | cmake.org |
Rust | >= 1.7 | rust.org |
Python | >= 3.10 | python.org |
Rust is only necessary if you want to run the experiments. If you are only interested in the hash functions generation, only clang
is necessary.
You can follow these two steps to use optimized hash functions generated from this project:
- Obtain your synthesized hash function in one of the two ways:
- Using a set of key examples.
- Using the regular expression of the keys.
- Integrate the optimized hash function into your code .
To synthesize hash functions from key examples, you only need to create a file containing a non-exhaustive but representative key set.
Supposing your key strings are saved in the txt-file-with-strings
file, you can run the following command:
./bin/keysynth "$(./bin/keybuilder < txt-file-with-strings)"
To build the hash function from the regular expression of your keys, use:
make
./scripts/make_hash_from_regex.sh [REGEX]
Example: Generating a custom hash function for IPV4 keys
./scripts/make_hash_from_regex.sh "(([0-9]{3})\.){3}[0-9]{3}" #or single quotes in zshell
See more about regular expressions in the keygen section.
Suppose your code has a C++ STL std::unordered_map with IPV4 std::string as keys and int as values.
void yourCode(void){
std::unordered_map<std::string, int, synthesizedOffXorHash> map;
map["255.255.255.255"] = 42;
// more code that uses map object
}
After running, ./scripts/make_hash_from_regex.sh "(([0-9]{3})\.){3}[0-9]{3}"
, you should get the following output with two function options:
// Helper function, include in your codebase:
inline static uint64_t load_u64_le(const char* b) {
uint64_t Ret;
// This is a way for the compiler to optimize this func to a single movq instruction
memcpy(&Ret, b, sizeof(uint64_t));
return Ret;
}
// Pext Hash Function:
struct synthesizedPextHash {
// Omitted for brevity in this code snippet
};
// OffXor Hash Function:
struct synthesizedOffXorHash {
std::size_t operator()(const std::string& key) const {
const std::size_t hashable0 = load_u64_le(key.c_str()+0);
const std::size_t hashable1 = load_u64_le(key.c_str()+7);
size_t tmp0 = hashable0 ^ hashable1;
return tmp0;
}
};
If in doubt, we always recommend using the synthesizedOffXorHash variant, according to our benchmarks.
Copy and paste the desired hash function, in this example, synthesizedOffXorHash
, into your codebase and then add its name as the third argument in the std::unordered_map template.
inline static uint64_t load_u64_le(const char* b) {
uint64_t Ret;
// This is a way for the compiler to optimize this func to a single movq instruction
memcpy(&Ret, b, sizeof(uint64_t));
return Ret;
}
struct synthesizedOffXorHash {
std::size_t operator()(const std::string& key) const {
const std::size_t hashable0 = load_u64_le(key.c_str()+0);
const std::size_t hashable1 = load_u64_le(key.c_str()+7);
size_t tmp0 = hashable0 ^ hashable1;
return tmp0;
}
};
void yourCode(void){
std::unordered_map<std::string, int, synthesizedOffXorHash> map;
map["255.255.255.255"] = 42;
// more code that uses map object
}
Building and running with default parameters:
./scripts/install_abseil.sh # necessary for keyuser
make && make benchmark
./bin/sepe-runner [REGEXES]
Valid regexes are listed in the Regexes.toml
file.
Example: Benchmarking all IPV4 hash functions with default parameters
./bin/sepe-runner IPV4
./scripts/keyuser_interpreter.py -p IPV4_performance.csv
For more options, see sepe-runner section:
keygen
generates (standard output) n random keys from Regex.
Not all valid regexes are accepted since we did not implement the OR
(|
), Kleene Star
(*
), Plus
(+
), and DOT
(.
) operators.
./bin/keygen REGEX [number_of_elements] [seed]
Example: Generating 2 random IPV4 keys with seed 223554
./bin/keygen "(([0-9]{3})\.){3}[0-9]{3}" -n 2 -s 223554
313.797.178.390
445.982.868.308
For more options, do:
./bin/keygen --help
We recommend using keyuser via sepe-runner
keyuser
benchmarks custom hash functions with keys received from standard input.
<standard_output_keys> | ./bin/keyuser [hashes] <num_operations> <insert> <search> <elimination> [seed] [verbose]
If no [hashes] are specified, only generic hash functions are executed
Example: Benchmarking 2 IPV4 Keys with 10 total operations using STDHashBin PextIPV4 hash functions. 50% insertions, 30% search, and 20% elimination operations.
./bin/keygen "(([0-9]{3})\.){3}[0-9]{3}" -n 2 -s 223554 | ./bin/keyuser --hashes STDHashBin PextIPV4 -n 10 -i 50 -s 30 -e 20
For more options, do:
./bin/keyuser --help
keybuilder
creates a regex from a series of strings passed through standard input, separated by a new line.
./bin/keybuilder < txt-file-with-strings
keysynth
synthesizes the hash functions based on the regex generated by the keybuilder
. It is picky about the regex's format, so it is not recommended to hand-write it. Use keybuilder
instead.
./bin/keysynth "$(./bin/keybuilder < txt-file-with-strings)"
sepe-runner
is a helper program that connects the other programs together as needed.
Regexes.toml
is a configuration file containing all accepted sepe-runner
regular expressions and their associated Hash Functions. Changing this file also requires changing keyuser
.
./bin/sepe-runner Regex-entry-in-Regexes.toml
Some relevant parameters are:
-k, --keys
: Number of keys to generate-o, --operations
: Number of operations to run-i, --insert
: Percentage of insertion operations-s, --search
: Percentage of search operations-e, --elimination
: Percentage of elimination operations--histogram
: Generate the distribution histogram for the given regex, do not run experiments
Example: Running the IPV4 benchmark
./bin/sepe-runner IPV4
For more options, do:
./bin/sepe-runner --help
The scripts
folder contains some helper scripts that may be useful for some people:
align_csv.sh
- pretty printskeyuser
's generated.csv
files for easier analysisbenchmark.sh
- helper to run many benchmarks at onceinstall_abseil.sh
- installs the abseil library locally. Necessary forkeyuser
make_hash_from_regex.sh
- creates a hash function from a user-defined regexkeyuser_interpreter.py
- interprets the results generated fromkeyuser
's benchmarks
This script is used to help interpret the output of keyuser
. It can plot graphs, generate tables, and perform statistical analysis.
The most relevant configurations are:
-d DISTRIBUTION, --distribution DISTRIBUTION
Name of the distribution file to interpret. Exclusive with -p option.
-p [PERFORMANCE ...], --performance [PERFORMANCE ...]
Name of the CSV performance files to interpret. Exclusive with -d option.
-pg, --plot-graph Option to plot the results in graphs.
-hf [HASH_FUNCTIONS ...], --hash-functions [HASH_FUNCTIONS ...]
Name of the hash functions to analyze.
Example for interpreting performance using IPV4 keys:
./bin/sepe-runner IPV4 && ./scripts/keyuser_interpreter.py -p IPV4_performance.csv
Example for interpreting hash distribution using IPV4 keys:
./bin/sepe-runner --histogram IPV4 && ./scripts/keyuser_interpreter.py -d IPV4_distribution.py
The artifact branch reproduces the research questions from the paper. All scripts to reproduce the RQs are available in a Docker container. RQ1 and RQ2 can be reproduced with a single script \texttt{rq1_rq2_benchmark.sh}. All other RQs have an individual script \texttt{rq_benchmark.sh}.