What are we looking for in a benchmarking tool? #112

SWilson4 · 2024-04-12T18:22:31Z

SWilson4
Apr 12, 2024
Maintainer

As discussed in the 2024/04/09 OQS developers call, we (well, I) would like to revamp this subproject so that we can collect useful and current benchmarking data. Before diving in, we should assess what we hope to get out of the work.

Here are some preliminary questions that come to mind:

How do we intend to use benchmarking data?

For example: Do we want to inform "data-driven" decisions about algorithm use? Test for performance regressions after algorithm updates? Provide data for researchers to use in tables in their papers? The answers to this will naturally inform the second question:

What data do we want to collect?

The current (old?) benchmarking code defines five levels for data collection, three of which are/were implemented. Does this still reflect the data we want to capture?

Where do we want this to run?

We had been using an ad-hoc collection of AWS VMs and Douglas's M1 Macbook. Do we want to add more platforms (e.g., s390x)? Should we add some performance-testing guarantees to platform support documentation? Do we want to keep running everything in Docker containers, or should we run on the host OS directly?

Where do we want the code to live?

New repo? New name? Within liboqs? etc.

Let the discussion begin...

SWilson4 · 2024-04-12T18:24:52Z

SWilson4
Apr 12, 2024
Maintainer Author

I'll get things started with my thoughts on the above questions.

The main use cases I see for benchmarking data are

evaluating algorithmic optimizations,
comparing algorithm candidates, and
regression testing.

I think (2) is perhaps the most important to the crypto community at large, especially as we bring on board more candidates from the NIST signature on-ramp. I think (3) would be a good addition to our automated release testing. (We did this manually, sometimes, but it's not clear to me how consistently.)

On that front, I think we ought to keep collecting library-level (keygen / encaps / sign / etc) data as well as application- and network-level data. I imagine that TLS handshake performance data in particular will be valuable for evaluating new signature algorithms.

Somewhat complicated infrastructure setup has given us problems in the past, and I think we have a good chance to correct this with a rewrite. IMO, we should track setup/trigger scripts in GitHub whenever possible. Ideally, I think we would run on a CI platform, but I don't think that we would get reliable data that way. (Worth investigating?) For a start, I think we should collect data on x64/arm64 and add support for other platforms by popular demand/contribution.

To sum it up, I would say that the previous profiling code did a good job of capturing most of the data I would consider valuable; the setup was simply difficult to maintain and debug when things went askew. This is by no means an affront to @baentsch, who (as I understand it) wrote most of the previous code on his own. It seems to me that we now have more resources to throw at the problem and hence make improvements.

As for where to put this... I'm all for a fresh start in a new repo, which links back to this one as appropriate for documentation. I also submit that "profiling" is a word with connotations best avoided and suggest open-quantum-safe/benchmarking as a new name.

Tagging @open-quantum-safe/tsc for input, as this hits both "approving project or system proposals (including, but not limited to, incubation, deprecation, and changes to a sub-project’s scope)" and "organizing sub-projects and removing sub-projects". Also @open-quantum-safe/core.

2 replies

bhess Apr 16, 2024
Collaborator

Thank you @SWilson4 for taking that on! I think both use cases, regression testing and evaluating/comparing algorithm candidates are very valuable.
For testing, another useful thing could be to include the common code in the benchmarking since many algorithms highly depend on, e.g., Keccak performance.
For evaluating/comparing algorithms, another useful metric could be memory/stack usage.
Then more "real-world" scenario such as level 4/5 in the current 'profiling' project would be nice as well, although probably tricker in the setup. Perhaps this could be done in a "less automated" way. I'm interested to contribute as well (not sure about the timing though yet).

davidgca Nov 4, 2024

Hi everyone! After implementing Locust and running some tests on a single VMs with OQS-enabled Nginx, and referring to Spencer’s point 3, I wanted to share my perspective on automated regression tests for performance within CI, in case it’s of interest. Please feel free to use any part of this comment and disregard what doesn’t align with OQS's goals.

The concept is similar to projects where I’ve integrated performance testing into CI, simulating up to 5 million IoT devices, deploying 200 VMs, and reaching loads up to 64,000 transactions per second (TPS).

Architecture:

Load Injection: Locust cluster with 1 master and multiple workers, on dedicated VMs. This setup simulates n clients and allows us to measure response time percentiles, as well as CPU and memory consumption on the machines running the cluster.
Server Side: Nginx cluster. Here, we could deploy a cluster with n OQS-enabled Nginx processes and static mocks, for example: nginx.conf:

    location /clients {
        default_type application/json;
        return 200 'ANY_JSON';
    }

For dynamic mocks, OpenResty could be an option, which essentially brings Lua into Nginx. Lua's capabilities could be used here to simulate server behavior, e.g., setting a normal distribution for response times, percentage of requests with extended response times (to mimic server timeouts), etc.

Automated Analysis: I have some doubts about using GitHub Actions tools for this. Personally, I still using Jenkins plugins like the Performance Plugin and Plot Plugin, which allow p90 response time comparisons against a baseline and historical percentile tracking, respectively. While it might be more relevant to track server-side CPU and memory usage (Nginx), response times are also useful for evaluating performance between different quantum cryptography groups, as well as comparing Post-Quantum and Pre-Quantum Cryptography impacts. In fact, IMHO measuring the performance impact between PostQC and PreQC might be the primary requirement for performance in a given project. What do you think @SWilson4 @baentsch @bhess ?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

What are we looking for in a benchmarking tool? #112

{{title}}

Replies: 1 comment 2 replies

{{title}}

{{title}}

{{title}}

Select a reply

What are we looking for in a benchmarking tool? #112

SWilson4 Apr 12, 2024 Maintainer

How do we intend to use benchmarking data?

What data do we want to collect?

Where do we want this to run?

Where do we want the code to live?

Replies: 1 comment · 2 replies

SWilson4 Apr 12, 2024 Maintainer Author

bhess Apr 16, 2024 Collaborator

davidgca Nov 4, 2024

SWilson4
Apr 12, 2024
Maintainer

Replies: 1 comment 2 replies

SWilson4
Apr 12, 2024
Maintainer Author

bhess Apr 16, 2024
Collaborator