
Adding semantic search workload that includes vector and bm25 search #342

Open

martin-gaievski wants to merge 4 commits into main

Conversation

martin-gaievski (Member)

Description

Adds a workload for semantic search based on a customized trec-covid dataset that includes vector, text, and integer fields. This allows running queries such as neural search, and hybrid queries where neural is one of the sub-queries.

The modified version of the trec-covid dataset has ~1M documents: 6 copies of each document from the original dataset.
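The PR doesn't include the script that produced the enlarged dataset, but the duplication step it describes (6 copies of each source document) can be sketched roughly as follows; the function name and the `_id` suffixing scheme are assumptions for illustration, not the actual tooling:

```python
import json


def expand_corpus(docs, copies=6):
    """Duplicate each source document `copies` times with unique ids.

    `docs` is a list of dicts, each carrying an `_id` field; every copy
    gets a suffixed id so the expanded corpus contains no duplicate ids.
    """
    expanded = []
    for doc in docs:
        for i in range(copies):
            dup = dict(doc)  # shallow copy so each duplicate can get its own id
            dup["_id"] = f"{doc['_id']}_{i}"
            expanded.append(dup)
    return expanded


def write_corpus(docs, path):
    # OSB document corpora are newline-delimited JSON, one document per line.
    with open(path, "w") as f:
        for doc in docs:
            f.write(json.dumps(doc) + "\n")
```

With 6 copies per document, the ~171k-document trec-covid corpus lands at roughly 1M documents, matching the figure above.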

Issues Resolved

#341

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following the Developer Certificate of Origin and signing off your commits, please check here.

martin-gaievski (Member, Author):

@VijayanB can I get initial feedback for this PR while we're waiting for feedback on opensearch-project/opensearch-benchmark#591 from other folks?

martin-gaievski force-pushed the adding_vectordataset_for_semantic_search branch 2 times, most recently from 6cb13b7 to 42ddc44 on July 30, 2024
martin-gaievski force-pushed the adding_vectordataset_for_semantic_search branch from 42ddc44 to 79bc1f3 on September 9, 2024

martin-gaievski commented Oct 23, 2024

I'm going to resume work on this PR, and I want to summarize the feedback I got last time:

I've put together the main points from that list in a private branch: https://github.com/martin-gaievski/opensearch-benchmark-workloads/tree/adding_vectordataset_for_semantic_search/treccovid_semantic_search

@VijayanB please let me know if I've missed any important points from your comments.

@@ -0,0 +1,46 @@
{
"settings": {
"index.number_of_shards": {{number_of_shards | default(1)}},
Member:
IMO, we can leave the defaults to the cluster. In other words, let's set this value only if number_of_shards is provided in params.
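Following that suggestion, the index template could emit the setting only when the parameter is supplied, using the Jinja conditionals OSB templates already support. A sketch (not the final template):

```json
"settings": {
  {%- if number_of_shards is defined %}
  "index.number_of_shards": {{ number_of_shards }},
  {%- endif %}
  "default_pipeline": "nlp-ingest-pipeline"
}
```

When `number_of_shards` is absent from params, the rendered settings omit the key entirely and the cluster default applies.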

"default_pipeline": "nlp-ingest-pipeline"
},
"mappings": {
"dynamic": "true",
Member:
Any reason to allow dynamic? Why not strict?
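For reference, a strict variant would reject any field not declared in the mapping at index time, which catches typos in ingested documents. A minimal sketch (the `text` field here is illustrative, not the PR's actual mapping):

```json
"mappings": {
  "dynamic": "strict",
  "properties": {
    "text": { "type": "text" }
  }
}
```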

"dimension": 768,
"method": {
"name": "hnsw",
"space_type": "innerproduct",
Member:
This should be outside of method.
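Recent k-NN mapping syntax accepts space_type at the field level rather than inside the method block, which is presumably what the comment refers to. A sketch of the suggested shape (the field name `passage_embedding` is an assumption):

```json
"passage_embedding": {
  "type": "knn_vector",
  "dimension": 768,
  "space_type": "innerproduct",
  "method": {
    "name": "hnsw"
  }
}
```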

Comment on lines +19 to +37
class UpdateConcurrentSegmentSearchSettings(Runner):

    RUNNER_NAME = "update-concurrent-segment-search-settings"

    async def __call__(self, opensearch, params):
        enable_setting = params.get("enable", "false")
        max_slice_count = params.get("max_slice_count", None)
        # Build the persistent cluster-settings body; include the
        # slice-count setting only when it was provided in params.
        body = {
            "persistent": {
                "search.concurrent_segment_search.enabled": enable_setting
            }
        }
        if max_slice_count is not None:
            body["persistent"]["search.concurrent.max_slice_count"] = max_slice_count
        request_context_holder.on_client_request_start()
        await opensearch.cluster.put_settings(body=body)
        request_context_holder.on_client_request_end()

    def __repr__(self, *args, **kwargs):
        return self.RUNNER_NAME
Member:
Can this be moved to opensearch-benchmark?
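For context, a custom runner like the one above is invoked from the workload's test procedures as an operation keyed by its registered name; a sketch of such an operation (the parameter values are illustrative):

```json
{
  "operation": {
    "operation-type": "update-concurrent-segment-search-settings",
    "enable": "true",
    "max_slice_count": 2
  }
}
```

Moving the runner into opensearch-benchmark itself would make this operation type available to every workload instead of being redefined per workload.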

"corpora": [
{
"name": "trec-covid",
"base-url": "https://github.com/martin-gaievski/neural-search/releases/download/trec_covid_dataset_1M_v1",
Member:
I believe this is temporary. You should use a generic repository to fetch the corpus.

],
"corpora": [
{
"name": "trec-covid",
Member:
Can we add a smaller dataset as well? Something like 1k documents?

Comment on lines +1 to +6
{
"base-url": "https://github.com/martin-gaievski/neural-search/releases/download/trec_covid_queries_knn",
"source-file": "queries.json.zip",
"compressed-bytes" : 193612,
"uncompressed-bytes": 519763
}
Member:
Not sure whether we discussed this or not; can this be converted to a corpus?
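If the queries file were registered as a corpus, workload.json would declare it alongside the documents corpus and OSB would handle the download and decompression itself. A sketch reusing the metadata above (the corpus name is an assumption, and the exact required keys follow the OSB corpus schema):

```json
"corpora": [
  {
    "name": "trec-covid-queries",
    "base-url": "https://github.com/martin-gaievski/neural-search/releases/download/trec_covid_queries_knn",
    "documents": [
      {
        "source-file": "queries.json.zip",
        "compressed-bytes": 193612,
        "uncompressed-bytes": 519763
      }
    ]
  }
]
```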

Comment on lines +144 to +158
if self._params['variable-queries'] > 0:
    with open(script_dir + os.sep + 'workload_queries.json', 'r') as f:
        d = json.loads(f.read())
    source_file = d['source-file']
    base_url = d['base-url']
    compressed_bytes = d['compressed-bytes']
    uncompressed_bytes = d['uncompressed-bytes']
    compressed_path = script_dir + os.sep + source_file
    uncompressed_path = script_dir + os.sep + Path(source_file).stem
    if not os.path.exists(compressed_path):
        downloader = Downloader(False, False)
        downloader.download(base_url, None, compressed_path, compressed_bytes)
    if not os.path.exists(uncompressed_path):
        decompressor = Decompressor()
        decompressor.decompress(compressed_path, uncompressed_path, uncompressed_bytes)
Member:
You don't need this if it is a corpus.


        self._params = params
        self._params['index'] = index
        self._params['type'] = type
Member:
Why do we need type?
