[WIP] [RFC] Distributed Correctness Testing Framework #3220
Comments
Hi @Swiddis, thanks for putting this together. I know some parts are still work in progress, so I just want to leave some general thoughts after reading the current version as a first round of review:
I saw that you also mentioned this in the open questions: "How will we handle versioning/compatibility across OpenSearch versions?" Especially for these external dependencies, I have a similar question: will backward-compatibility testing be a focus of the proposed framework? Are there plans to define specific test cases or scenarios to validate compatibility with older versions, and how will this be managed over time as new features are introduced?
Personally, I need to do more homework on the large-dataset scenario. At first glance, though, I think it might be useful to include a plan for benchmarking these scenarios to measure performance impact. (This may not be a P0.)
I noticed the mention of the above. Does this imply the introduction of another internal test pipeline, distinct from the open-source GitHub-based pipeline? If so, how will the results from this EC2-based pipeline be integrated or communicated back to the broader testing framework?
Also, for this open question, here are some general ideas in my mind:
- Coverage Metrics
- Accuracy and Reliability Metrics
- Efficiency Metrics
- Outcome-Based Metrics
- Integration and Usability Metrics
I set up some crude benchmarking code to estimate how much test throughput we could get, using a Locust script with two scenarios: (1) many parallel tests that each create, update, and delete their own index, and (2) many tests sharing a small number of indices. In general, insert and query requests are very fast while index create/delete is slow. This gives some data to back up the idea that we should run batches of tests on batch-scoped indices: it's a throughput difference of 25x on equal hardware. It's also more reliable -- with a small number of test-scoped index workers there were timeout failures with even just 50 tests in parallel, while batched workers can handle 1000 tests in parallel without any failures (the cluster generally just slows down, landing at around 1800 requests/second on my machine). For a real soak test we probably want to run at least O(a million) tests total, which my dev machine can do for this benchmark in ~10 minutes.
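For reference, here is a minimal sketch (not the actual benchmark script) of what the batch-scoped Locust scenario could look like. The shared index pool, document shape, and task weights are assumptions made purely for illustration:

```python
# Sketch of the "batch-scoped index" scenario: many simulated tests share a
# small pool of pre-created indices instead of each creating/deleting its own.
import random

from locust import HttpUser, task, between

# Assumed: these indices are created once during test setup, not per test.
INDEX_POOL = [f"correctness-batch-{i}" for i in range(8)]


class BatchScopedTest(HttpUser):
    wait_time = between(0.1, 0.5)

    @task(3)
    def insert_doc(self):
        index = random.choice(INDEX_POOL)
        self.client.post(
            f"/{index}/_doc",
            json={"value": random.randint(0, 1_000_000)},
            name="insert",
        )

    @task(1)
    def run_query(self):
        index = random.choice(INDEX_POOL)
        self.client.post(
            "/_plugins/_sql",
            json={"query": f"SELECT COUNT(*) FROM {index}"},
            name="sql-query",
        )
```

The key difference from the test-scoped variant is that the expensive index create/delete calls happen once during setup rather than inside every test.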
I think we can work backwards from specific past bug reports (such as those linked in the overview) to the features the query generator needs to support, then see if we can reproduce them. If the process is able to find specific known issues, we can have some confidence in its ability to find more unknown issues as we add more features.
I think we should leave that out at first. Since we don't typically update older versions, running tests there would mostly be data collection. I do know there have been some bugs involving specific upgrade flows (example), but I'm not convinced the extra complexity would pay for itself.
We can probably extend the testing to do some sort of benchmarking, but let's not step on the toes of opensearch-benchmark where we don't have to. The focus for the moment is just correctness. Since we'll be running many tests per index (see above comment), we can probably afford to make each index much larger than if we were doing 1 test per index.
I think it's a separately-running job; I don't dislike having it scheduled in pipelines. It will probably be reported via some sort of email mailing list, or a dashboard that's periodically checked. We shouldn't automatically publish issues to GitHub (both to avoid noise and in case any issues are security-relevant). That said, my original idea of running this 24/7 and just tracking failures is also complicated: it'd require a whole live notification system. For simplicity, let's make the suite run a configurable finite number of tests with an existing framework, like Hypothesis does.
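To make the "configurable finite number of tests" point concrete, here's a minimal sketch of how Hypothesis bounds a run via `max_examples`; the property body is a placeholder for illustration, not part of the proposal:

```python
# Sketch: bound a randomized test run to a fixed budget, Hypothesis-style.
from hypothesis import given, settings, strategies as st

TEST_BUDGET = 1000  # assumed: configurable per soak-test run


@settings(max_examples=TEST_BUDGET, deadline=None)
@given(st.integers(min_value=0, max_value=1_000_000))
def test_round_trip(value):
    # Placeholder assertion; a real test would index `value` and query it back
    # through the SQL plugin, comparing the result against an oracle.
    assert value == value
```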
Thanks @Swiddis for putting together this proposal. I have a few general comments:
If this is the main concern with using existing out-of-the-box frameworks, can we explore the option of providing those write functions where possible, even if only for testing purposes?
How do we make sure we keep datasets and queries up to date with new development on the SQL/PPL language, since those will most probably live in a separate repository?
It's possible, but I'm not sure it's efficient. One way that could work is to have these frameworks create a SQL table locally, and we write some logic that transforms the SQL table into an OS index before starting the read queries. Getting all the datatypes to work right would be tricky, and we also would still face the other issues with PPL and OpenSearch-specific semantics. To clarify: I think these frameworks do have valuable lessons to teach, and I think we would benefit a lot from copying select parts of their code where we can. I just think that building something from scratch will give us flexibility that we can't really have otherwise.
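For what it's worth, a rough sketch of that transform idea: a locally created SQLite table bulk-loaded into an OpenSearch index via opensearch-py. The table/index names and the naive type handling are assumptions, and the datatype-fidelity problem mentioned above is exactly what this glosses over:

```python
# Sketch: copy rows from a local SQL table into an OpenSearch index before
# running the read-only queries against it.
import sqlite3

from opensearchpy import OpenSearch, helpers


def copy_table_to_index(db_path: str, table: str, index: str, client: OpenSearch):
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    rows = conn.execute(f"SELECT * FROM {table}")  # assumes a trusted table name

    actions = (
        {"_index": index, "_source": dict(row)}  # naive: no explicit type mapping
        for row in rows
    )
    helpers.bulk(client, actions)


# Example usage (assumed local test cluster):
# client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
# copy_table_to_index("fixtures.db", "people", "people", client)
```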
So far the best idea I have is making it a reviewer step: reviewers should check whether we need to add test features for the new functionality. This is already done for OSD with the functional test repository.
[WIP] [RFC] Distributed Correctness Testing Framework
Date: Dec 23, 2024
Status: Work In Progress
Overview
While the current integration testing framework is effective for development, there have been several bug reports around query correctness due to testing blind-spots. Some common sources for issues include:
- `JOIN`s or wildcard indices.
- `LIMIT` queries or aggregations.

In addition to the obvious challenges with predicting these edge cases ahead of time, we have the additional issue that the SQL plugin has multiple runtime contexts:
The current integration testing process doesn't scale sufficiently to detect edge cases under all of these scenarios. Historically the method has been "fix the bugs as we go", but given the scale of the project nowadays and the in-progress refactoring initiatives, we need a testing suite that can keep up with that scale.
Inspired by the likes of database testing initiatives in other projects, especially SQLancer, Google's OSS-Fuzz, and the PostgreSQL BuildFarm, this RFC proposes a project to implement a distributed random testing process, which will be able to generate a large amount of test scenarios to validate correct behavior. With such a suite, the plugin can be "soak tested" by running a large number of randomized tests and reporting errors.
Glossary
Current State
Our current testing infrastructure consists of several components that serve different purposes, but none of them clearly address the gaps identified here.
We may consider reusing elements of the comparison testing framework for the new suite. In particular, both this framework and the proposed solution connect to OpenSearch via JDBC. The main concern is whether we can parallelize the workload easily.
While these tools provide valuable testing capabilities, they fall short in several key areas:
Analysis of Domain
The most important question to ask is, "why can't we use an existing system?" SQL testing in particular has a lot of existing similar initiatives. A cursory look at these initiatives (ref: SQLancer, Sqllogictest, SQuaLity) reveals a few issues:
- These frameworks depend on SQL write statements (`INSERT`, `UPDATE`). Removing this functionality and replacing it with a system that can set up a correctly-configured OpenSearch cluster would again be similar to writing the system from scratch.

Compared to trying to adapt these solutions, the most reliable long-term option seems to be writing a custom system. In particular, these methods are well-documented, so we can likely build something with a similar degree of effectiveness for less effort than the migration would cost. This will also give us a lot of flexibility to toggle specific features under test, such as the assortment of unimplemented features in V2 (#1889, #1718, #1487). The flexibility of supporting OpenSearch-specific semantics and options will also open routes for implementing similar testing on core OpenSearch.
Despite these limitations, all of the linked projects have extensive research behind them that can be used to guide our own implementation. Referencing them and taking what we can is likely valuable.
Goals
Looking at the limitations of our current test cases, I propose the following goals:
Implementation Strategy
[WIP] I have a rough outline, but not a high-fidelity design.
The rough outline is to create a separate project that connects to OpenSearch SQL via the JDBC connector, similar to the comparison testing. We use the JDBC connector because it lets us write the properties with ScalaCheck, which fits well with our current development ecosystem. We'll use that connector to run queries on test-individual indices (for isolation). At first we will only generate simple queries, but gradually extend the generator to match newly introduced features and cover existing ones. The tests will more-or-less run independently in the background (likely on EC2), as part of a cluster separate from GitHub Actions, and report bugs asynchronously as we make changes. Common bug types can get specialized tests introduced into the main test suite.
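To illustrate the shape of a generated-query check: the RFC proposes writing these properties with ScalaCheck over JDBC, but the sketch below uses Python against the plugin's REST SQL endpoint purely to keep the example short. The index name, columns, and the specific property are assumptions:

```python
# Sketch: generate a simple query and check one correctness property.
import random

import requests

SQL_ENDPOINT = "http://localhost:9200/_plugins/_sql"  # assumed local test cluster


def random_limit_query(index: str) -> tuple[str, int]:
    # Start with a trivially small query grammar; grow it feature-by-feature.
    column = random.choice(["age", "balance"])  # assumed columns in the test index
    limit = random.randint(1, 100)
    return f"SELECT {column} FROM {index} ORDER BY {column} LIMIT {limit}", limit


def check_limit_property(index: str) -> None:
    query, limit = random_limit_query(index)
    response = requests.post(SQL_ENDPOINT, json={"query": query})
    rows = response.json()["datarows"]
    # Property: a LIMIT n query never returns more than n rows.
    assert len(rows) <= limit, f"LIMIT violated for: {query}"
```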
The role of the test suite as a system is shown here. It will run independently of our existing per-PR CI on some schedule.
The test system itself needs two main components: the actual test runner, and a tool that can create clusters to test with various configs (e.g. with/without Spark, with/without Catalyst). Making the cluster manager a separate container will allow running clusters with varying configurations, but for the initial version (testing on just one configuration) we can just have the pipeline spin up a single test cluster that the runner uses.
Tests will follow these steps:
For properties we can test, the following high-level strategies are available:
Some open questions: