
[WIP] [RFC] Distributed Correctness Testing Framework #3220

Open
Swiddis opened this issue Dec 23, 2024 · 6 comments

@Swiddis
Collaborator

Swiddis commented Dec 23, 2024

[WIP] [RFC] Distributed Correctness Testing Framework

Date: Dec 23, 2024
Status: Work In Progress

Overview

While the current integration testing framework is effective for development, there have been several bug reports around query correctness due to testing blind-spots. Some common sources for issues include:

In addition to the obvious challenges with predicting these edge cases ahead of time, we have the additional issue that the SQL plugin has multiple runtime contexts:

  • Different query engines such as Legacy, V2, Spark, or (soon) Catalyst.
  • Several configuration options that affect the runtime behavior of queries (e.g. pagination mode, size or memory limits, type tolerance).
  • Supporting both SQL and PPL querying.
  • Index-specific settings such as shards or date formats.

The current integration testing process doesn't scale well enough to detect edge cases under all of these scenarios. Historically the method has been "fix the bugs as we go", but given the scale of the project nowadays and the in-progress refactoring initiatives, a testing suite that can keep up with that scale is needed.

Inspired by database testing initiatives in other projects, especially SQLancer, Google's OSS-Fuzz, and the PostgreSQL BuildFarm, this RFC proposes a project to implement a distributed random testing process that can generate a large number of test scenarios to validate correct behavior. With such a suite, the plugin can be "soak tested" by running a large number of randomized tests and reporting errors.

Glossary

  • Crash Bugs: Bugs that are directly visible as erroneous behavior (e.g. error responses, crashes). Distinct from Logic Bugs.
  • Fuzzing: A testing technique that involves providing invalid, unexpected, or random data as inputs to a system. This usually has a heavier emphasis on asserting the system doesn't crash, as opposed to making sure the responses are actually correct.
  • Logic Bugs: Bugs where the system does not have an obviously incorrect behavior (e.g. error responses, crashes), but nonetheless returns incorrect results. Distinct from Crash Bugs.
  • Property-based testing: A testing methodology that focuses on verifying properties of the system under test for a wide range of randomly-selected inputs. Further reading: What is Hypothesis? and ScalaCheck.
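
To illustrate the property-based style (using ScalaCheck, since the implementation sketch later in this RFC proposes it): the framework generates many random inputs and asserts a general property over all of them. The property below is a toy example, not an OpenSearch test.

```scala
import org.scalacheck.Prop.forAll
import org.scalacheck.Properties

// Toy property: instead of asserting on a few hand-picked lists, ScalaCheck
// generates many random lists and checks that reversing twice is the identity.
object ReverseSpec extends Properties("List.reverse") {
  property("double reversal returns the original list") = forAll { (xs: List[Int]) =>
    xs.reverse.reverse == xs
  }
}
```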

Current State

Our current testing infrastructure consists of several components that serve different purposes, but none of them clearly address the gaps identified here.

  1. The Main Integration Testing Suite:
  • Provides comprehensive coverage for current features, individually.
  • Effective for catching regressions in existing functionality.
  • Limited in its ability to detect edge cases or cross-feature interactions.
  2. Comparison Testing Suite:
  • Attempts to validate query results by comparing against SQLite.
  • Useful for basic SQL functionality shared between our implementation and SQLite.
  • Limited by significant differences in feature sets:
    • Cannot test OpenSearch-specific features (e.g. complex data types, PPL).
    • May produce false positives due to intentional differences in behavior.
  • These tests have not actually been run in a very long time, and seem to currently be out of commission.

We may consider reusing elements of the comparison testing framework for the new suite. In particular, both this framework and the proposed solution connect to OpenSearch via JDBC. The main concern is whether we can parallelize the workload easily.

  3. Mutation Testing Suite (PiTest):
  • Helps validate the effectiveness of our existing test suite under code mutations.
  • Identifies areas where our tests may be insufficient.
  • Still constrained by the scope of our existing tests, as it doesn't generate new test cases.
  • These tests also seem to currently be out of commission.

While these tools provide valuable testing capabilities, they fall short in several key areas:

  • Limited ability to generate diverse, representative test data.
  • Insufficient coverage of complex interactions between different SQL and PPL features.
  • Difficulty in testing across multiple runtime contexts and configurations.
  • Challenges in scaling to test large datasets or high-volume query scenarios.

Analysis of Domain

The most important question to ask is, "why can't we use an existing system?" SQL testing in particular already has many similar initiatives. A cursory look at these initiatives (ref: SQLancer, Sqllogictest, SQuaLity) reveals a few issues:

  • These suites aren’t designed with PPL or generally non-SQL query languages in mind, so the work to migrate their code to support both languages would be significant.
  • These suites generally assume that the database is not read-only: they include detailed tests for modifying database files directly (including corrupting them), as well as several tests that depend on the behavior of SQL’s write functions (INSERT, UPDATE). Removing this functionality and replacing it with a system that can set up a correctly-configured OpenSearch cluster would be again similar to writing the system from scratch.
  • Since our SQL implementation has several limitations that aren’t shared by other systems, we also would need to remove tests that rely on these limitations not being present. This is nontrivial since many of these systems rely on generating queries in bulk, and may implicitly use those same assumptions even for functionality we support.

Compared to trying to adapt these solutions, the most reliable long-term approach seems to be writing a custom system. The methods behind these projects are well-documented, so we can likely reach a similar degree of effectiveness with less effort than the migration would cost. This will also give us a lot of flexibility to toggle specific features under test, such as the assortment of unimplemented features in V2 (#1889, #1718, #1487). The flexibility of supporting OpenSearch-specific semantics and options will also open routes for implementing similar testing on core OpenSearch.

Despite these limitations, all of the linked projects have extensive research behind them that can be used to guide our own implementation. Referencing them and taking what we can is likely valuable.

Goals

Looking at the limitations of our current test cases, I propose the following goals:

  • Scalability: Currently, it's nontrivial to run our integration tests in parallel without multiple clusters due to cross-test interference. To run the tests at the scale necessary for these types of suites to find deep bugs, we need to ensure test isolation from the start. This ties in with the next goal:
  • Cost Effectiveness: The costs of the suite should directly correlate with the number of tests we're running. For the most part, running clusters is the largest cost, so the suite should be able to load balance parallelized tests across a small number of clusters.
  • Model Real Usage: The existing suites use a small number of test indices for specific features. The indices generated for the tests should more closely model real indices.

Implementation Strategy

[WIP] This section has a rough outline, but not yet a high-fidelity design.

The rough outline: create a separate project that connects to OpenSearch SQL via the JDBC connector, similar to the comparison testing suite. The JDBC connector lets us use ScalaCheck to write the properties, which fits well with our current development ecosystem. We'll use that connector to run queries against per-test indices (for isolation). At first we will only generate simple queries, then gradually extend the generator to cover newly introduced features as well as existing ones. The tests will more-or-less run independently in the background (likely on EC2), as part of a cluster separate from GitHub Actions, and report bugs asynchronously as we make changes. Common bug types can get specialized tests introduced in the main test suite.
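
As a sketch of what the runner side might look like, assuming the opensearch-sql-jdbc driver and its documented jdbc:opensearch:// URL scheme (the index name and query below are placeholders, not part of any agreed design):

```scala
import java.sql.{Connection, DriverManager, ResultSet}

// Minimal sketch of running a generated, read-only query over JDBC.
// The URL scheme follows the opensearch-sql-jdbc driver's documentation;
// adjust the endpoint/driver details as needed.
object QueryRunner {
  def withConnection[A](url: String)(f: Connection => A): A = {
    val conn = DriverManager.getConnection(url)
    try f(conn) finally conn.close()
  }

  // Collect the first column of the result set as strings, for comparison against an oracle.
  def firstColumn(conn: Connection, query: String): Vector[String] = {
    val stmt = conn.createStatement()
    try {
      val rs: ResultSet = stmt.executeQuery(query)
      val rows = Vector.newBuilder[String]
      while (rs.next()) rows += rs.getString(1)
      rows.result()
    } finally stmt.close()
  }
}

// Hypothetical usage against a local test cluster:
//   QueryRunner.withConnection("jdbc:opensearch://localhost:9200") { conn =>
//     QueryRunner.firstColumn(conn, "SELECT name FROM test_index_0001 LIMIT 10")
//   }
```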

The role of the test suite as a system is shown here. It will run independently of our existing per-PR CI on some schedule.

The test system itself needs two main components: the actual test runner, and a tool that can create clusters to test with various configs (e.g. with/without Spark, with/without Catalyst). Making the cluster manager a separate container will allow running clusters with varying configurations, but for the initial version (testing on just one configuration) we can just have the pipeline spin up a single test cluster that the runner uses.

Tests will follow these steps (a rough sketch of the loop follows this list):

  • Configure a cluster. For each cluster:
    • Generate data sets, each containing (potentially multiple) indices (and accelerations for Spark). For each data set:
      • Generate (read-only) queries for that data set, including multiple languages.
      • Validate query results on the data set in parallel.
    • Once the queries are done, delete their data set. Once all data sets are done, stop the cluster and repeat.
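
A rough, compilable sketch of that loop is below. Every type and helper in it (Cluster, startCluster, generateDataSet, checkQuery, and so on) is a hypothetical stub standing in for functionality that does not exist yet; it is not a proposed API.

```scala
object SoakTestSketch {
  // Placeholder types; real handles would carry endpoints, settings, etc.
  type Cluster = String
  type DataSet = String
  type Query   = String

  // All stubs below are hypothetical; they mark where real components plug in.
  def startCluster(config: String): Cluster = ???
  def stopCluster(cluster: Cluster): Unit = ???
  def generateDataSet(): DataSet = ???                        // indices (+ accelerations for Spark)
  def indexDataSet(cluster: Cluster, ds: DataSet): Unit = ???
  def deleteDataSet(cluster: Cluster, ds: DataSet): Unit = ???
  def generateQueries(ds: DataSet, n: Int): Seq[Query] = ???  // read-only, SQL and PPL
  def checkQuery(cluster: Cluster, q: Query): Unit = ???      // validate against an oracle

  def soakTest(configs: Seq[String], dataSetsPerCluster: Int, queriesPerDataSet: Int): Unit =
    for (config <- configs) {
      val cluster = startCluster(config)
      try {
        for (_ <- 1 to dataSetsPerCluster) {
          val ds = generateDataSet()
          indexDataSet(cluster, ds)
          // In the real suite these checks would run in parallel against the shared data set.
          generateQueries(ds, queriesPerDataSet).foreach(q => checkQuery(cluster, q))
          deleteDataSet(cluster, ds)
        }
      } finally stopCluster(cluster)
    }
}
```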

For properties we can test, the following high-level strategies are available:

  • Property-based testing as a testing methodology is described in the glossary.
  • Pivoted Query Synthesis (PQS), Non-Optimizing Reference Engine Construction (NoREC), Ternary Logic Partitioning (TLP), Query Plan Guidance (QPG), and Differential Query Plans (DQP) are all described in the SQLancer papers. (TODO: will add descriptions/examples of all of these in this doc; a rough TLP sketch follows this list.)
  • SQuaLity also has some techniques aggregated from other database systems.
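
Pending the TODO above, here is a minimal sketch of what a TLP check could look like: for a generated predicate p, the unfiltered result must equal the multiset union of the rows where p evaluates to TRUE, FALSE, and NULL. The runQuery helper (returning rows as a multiset), the table, and the predicate are all hypothetical placeholders.

```scala
object TlpSketch {
  // TLP sketch: partition a table scan by a predicate's three truth values and
  // compare against the unfiltered scan. `runQuery` is a hypothetical helper
  // that returns result rows as a multiset (row rendered as a String -> count).
  def tlpHolds(runQuery: String => Map[String, Int], table: String, p: String): Boolean = {
    def union(parts: Seq[Map[String, Int]]): Map[String, Int] =
      parts.flatten.groupMapReduce(_._1)(_._2)(_ + _)

    val unfiltered  = runQuery(s"SELECT * FROM $table")
    val partitioned = union(Seq(
      runQuery(s"SELECT * FROM $table WHERE ($p)"),
      runQuery(s"SELECT * FROM $table WHERE NOT ($p)"),
      runQuery(s"SELECT * FROM $table WHERE ($p) IS NULL")
    ))
    unfiltered == partitioned
  }
}
```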

Some open questions:

  • Does JDBC actually support PPL? If not, how do we plan on supporting PPL?
  • Where will testing results be reported?
  • How will we handle versioning/compatibility across OpenSearch versions?
  • What metrics will we use to measure the effectiveness of this new testing framework compared to our current testing methods?
  • What mechanisms will we put in place to ensure that the generated test cases are reproducible when bugs are found?
  • How will we handle testing of queries that involve very large result sets or long-running operations?
  • How will we handle testing of queries that involve external resources (e.g., SQL functions that call external APIs)? Spark is the primary example here.
  • How will we handle testing of security features and access controls within this framework?
  • How will we prioritize which features or combinations of features to test first? (Likely those under active development.)
  • What provisions will we make for testing query behavior under different cluster states (e.g., node failures, network partitions)?
@Swiddis Swiddis added the enhancement (New feature or request) label and removed the untriaged label Dec 23, 2024
@Swiddis Swiddis self-assigned this Dec 23, 2024
@RyanL1997

Hi @Swiddis, thanks for putting this together. I know some of the parts are still work in progress, so I just want to leave some general thoughts after reading the current version as my first round of review:

  1. "For the edge cases listed in the overview section, is there a severity or priority ranking to help us identify the most critical issues to address first? Or is the intention to resolve all of them simultaneously? This would provide better clarity on the focus and scope of the implementation."

  2. Interactions between our plugin and supported external interfaces such as JDBC and Spark.

I saw that you also mentioned in the open questions: "How will we handle versioning/compatibility across OpenSearch versions?" Especially for these external dependencies, I have a similar question: will backward compatibility testing be a focus as part of the proposed framework? Are there plans to define specific test cases or scenarios to validate compatibility with older versions, and how will this be managed over time as new features are introduced?

  3. Challenges in scaling to test large datasets or high-volume query scenarios.

Personally, I need to do more homework on the scenario for large datasets. However, at first look, I think it might be useful to include a plan for benchmarking these scenarios to measure performance impact. (This may not be a P0.)

  4. The tests will more-or-less run independently in the background (likely on EC2), as part of a cluster separate from GitHub actions, and report bugs asynchronously as we make changes. Common bug types can get specialized tests introduced in the main test suite.

I noticed the mention of the above. Does this imply the introduction of another internal test pipeline, distinct from the open-source GitHub-based pipeline? If so, how will the results from this EC2-based pipeline be integrated or communicated back to the broader testing framework?

@RyanL1997

RyanL1997 commented Dec 24, 2024

Also, for this open question:

What metrics will we use to measure the effectiveness of this new testing framework compared to our current testing methods?

Here are some general ideas in my mind:

Coverage Metrics

  • Test Coverage Improvement: Measure the percentage of edge cases, scenarios, or code paths covered by the new framework compared to the current one.
  • Edge Case Handling: Number of previously missed edge cases or bugs identified and addressed by the new framework.

Accuracy and Reliability Metrics

  • False Positive Rate: Track reductions in the rate of incorrect bug reports generated by the framework.
  • Bug Detection Accuracy: Percentage of valid bugs detected out of the total issues flagged by the framework.

Efficiency Metrics

  • Execution Time: Compare the time taken to execute tests in the new framework versus the current one.
  • Test Resource Utilization: Assess compute resource usage (e.g., EC2/GH runner?) of the new testing framework versus existing pipelines.

Outcome-Based Metrics
*This one may be a little tricky, since we do have other pre-release procedures, such as sanity testing, to reduce the bug rate

  • Post-Release Bug Rate: Measure the reduction in bugs reported after release due to better pre-release testing.
  • Severity of Undetected Bugs: Evaluate the severity of any bugs that still slip through the framework compared to before.

Integration and Usability Metrics

  • Integration Time: Time required to integrate the new framework into existing pipelines or workflows.
  • Ease of Debugging: Feedback from developers on how easily the framework helps identify and resolve bugs.

@anasalkouz anasalkouz added the RFC Request For Comments label Dec 26, 2024
@Swiddis
Collaborator Author

Swiddis commented Dec 26, 2024

I set up some crude benchmarking code to estimate how much test throughput we could get, using a Locust script that covered two scenarios. Left: many parallel tests that each create, update, and delete their own index. Right: many tests sharing a small number of indices.

In general, insert and query requests are very fast while index create/delete is slow.

  • Index-per-test (benchmark screenshot)

  • Shared indices (benchmark screenshot)

This gives some data to back up having batches of tests run on batch-scoped indices: it's a throughput difference of 25x on equal hardware. It's also more reliable -- with a small number of test-scoped index workers there were timeout failures with even just 50 tests in parallel, while batched workers can handle 1000 tests in parallel without any failures (the cluster generally just slows down, landing at around 1800 requests/second on my machine). For a real soak test we probably want to run at least on the order of a million tests total, which my dev machine can do for this benchmark in ~10 minutes.

@Swiddis
Collaborator Author

Swiddis commented Dec 26, 2024

@RyanL1997

For the edge cases listed in the overview section, is there a severity or priority ranking to help us identify the most critical issues to address first? Or is the intention to resolve all of them simultaneously? This would provide better clarity on the focus and scope of the implementation.

I think we can work backwards from specific past bug reports (such as those linked in the overview) to the features the query generator needs to support, then see if we reproduce them. If the process is able to find specific known issues, we can have some confidence in its ability to find more unknown issues as we add more features.

I saw that you also mentioned in the open questions: "How will we handle versioning/compatibility across OpenSearch versions?" Especially for these external dependencies, I'm actually having the similar question that will backward compatibility testing be a focus as part of the proposed framework?

I think we should leave that out at first. Since we don't typically update older versions, running tests there would mostly be data collection. I do know there have been some bugs involving specific upgrade flows (example), but I'm not convinced the extra complexity would be paid back.

Personally, I need to do more homework on the scenario for large datasets. However, at first look, I think it might be useful to include a plan for benchmarking these scenarios to measure performance impact.

We can probably extend the testing to do some sort of benchmarking, but let's not step on the toes of opensearch-benchmark where we don't have to. The focus for the moment is just correctness. Since we'll be running many tests per index (see above comment), we can probably afford to make each index much larger than if we were doing 1 test per index.

Does this imply the introduction of another internal test pipeline, distinct from the open-source GitHub-based pipeline? If so, how will the results from this EC2-based pipeline be integrated or communicated back to the broader testing framework?

I think of it as a separately-running job; I wouldn't be opposed to having it scheduled in pipelines.

Results will probably be reported via some sort of email mailing list, or a dashboard that's periodically checked. We shouldn't automatically publish issues to GH (both to avoid noise and in case any issues are security-relevant). That said, I think my original idea of running this 24/7 and just tracking failures is also complicated; it'd require a whole live notification system. For simplicity, let's make the suite run a configurable finite number of tests with an existing framework, like how Hypothesis does max_examples, and deposit the HTML report somewhere. We can upgrade it if we need to; I'm not convinced we will.
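
For reference, ScalaCheck has a rough analog of Hypothesis's max_examples via its test parameters. A minimal sketch (the property and the count are placeholders):

```scala
import org.scalacheck.{Prop, Test}
import org.scalacheck.Prop.forAll

object FiniteRunSketch {
  // Placeholder property; in the real suite this would be a query-correctness check.
  val prop: Prop = forAll { (n: Int) => n + 0 == n }

  // Roughly analogous to Hypothesis's max_examples: stop after this many passing cases.
  val params = Test.Parameters.default.withMinSuccessfulTests(100000)

  def main(args: Array[String]): Unit = {
    val result = Test.check(params, prop)
    println(s"passed=${result.passed} status=${result.status}")  // feed this into the report
  }
}
```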

@anasalkouz
Member

Thanks @Swiddis for putting together this proposal. I have a few general comments:

These suites generally assume that the database is not read-only: they include detailed tests for modifying database files directly (including corrupting them), as well as several tests that depend on the behavior of SQL’s write functions (INSERT, UPDATE). Removing this functionality and replacing it with a system that can set up a correctly-configured OpenSearch cluster would be again similar to writing the system from scratch.

If this is the main concern with using existing out-of-the-box frameworks, can we explore the option of providing those write functions, even if only for testing purposes?

  • Generate data sets, each containing (potentially multiple) indices (and accelerations for Spark). For each data set:
  • Generate (read-only) queries for that data set, including multiple languages.

How do we make sure we keep the data sets and queries up to date with new development on the SQL/PPL languages, since those will most probably live in a separate repository?

@Swiddis
Collaborator Author

Swiddis commented Dec 26, 2024

@anasalkouz

If this is the main concern with using existing out-of-the-box frameworks, can we explore the option of providing those write functions, even if only for testing purposes?

It's possible, but I'm not sure it's efficient. One way that could work is to have these frameworks create a SQL table locally, and we write some logic that transforms the SQL table into an OS index before starting the read queries. Getting all the datatypes to work right would be tricky, and we also would still face the other issues with PPL and OpenSearch-specific semantics.

To clarify: I think these frameworks do have valuable lessons to teach, and I think we would benefit a lot from copying select parts of their code where we can. I just think that building something from scratch will give us flexibility that we can't really have otherwise.

How do we make sure we keep the data sets and queries up to date with new development on the SQL/PPL languages, since those will most probably live in a separate repository?

So far the best idea I have is making it a reviewer step: reviewers should check whether we need to add test features for the new functionality.¹ This is already done for OSD with the functional test repository.

Footnotes

  1. Not strictly related to this RFC: to mechanize this, I think we would benefit a lot from introducing a full-featured "reviewer checklist" that specifies what to check for. We already have PR checklists for PR authors but they're mostly ignored. Having a reviewer checklist is a practice that seems pretty common for other organizations and may reduce review churn a lot for both authors and reviewers.
