
Adding semantic search workload that includes vector and bm25 search #342

Open

martin-gaievski wants to merge 4 commits into main

Conversation

martin-gaievski (Member)

Description

Adds a workload for semantic search based on a customized trec-covid dataset that includes vector, text, and integer fields. This allows running queries such as neural search, and hybrid queries where neural is one of the sub-queries.

The modified version of the trec-covid dataset has ~1M documents: 6 copies of each document from the original dataset.
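The PR doesn't include the script that produced the enlarged dataset, but the duplication step it describes (6 copies of each source document) can be sketched roughly as follows; the function name and the `_id` suffixing scheme are assumptions for illustration, not the actual tooling:

```python
import json


def expand_corpus(docs, copies=6):
    """Duplicate each source document `copies` times with unique ids.

    `docs` is a list of dicts, each carrying an `_id` field; every copy
    gets a suffixed id so the expanded corpus contains no duplicate ids.
    """
    expanded = []
    for doc in docs:
        for i in range(copies):
            dup = dict(doc)  # shallow copy so each duplicate can get its own id
            dup["_id"] = f"{doc['_id']}_{i}"
            expanded.append(dup)
    return expanded


def write_corpus(docs, path):
    # OSB document corpora are newline-delimited JSON, one document per line.
    with open(path, "w") as f:
        for doc in docs:
            f.write(json.dumps(doc) + "\n")
```

With 6 copies per document, the ~171k-document trec-covid corpus lands at roughly 1M documents, matching the figure above.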

Issues Resolved

#341

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following the Developer Certificate of Origin and signing off your commits, please check here.

martin-gaievski (Member, Author):

@VijayanB can I get initial feedback for this PR while we're waiting for feedback on opensearch-project/opensearch-benchmark#591 from other folks?

martin-gaievski force-pushed the adding_vectordataset_for_semantic_search branch 2 times, most recently from 6cb13b7 to 42ddc44 on July 30, 2024
martin-gaievski force-pushed the adding_vectordataset_for_semantic_search branch from 42ddc44 to 79bc1f3 on September 9, 2024

martin-gaievski commented Oct 23, 2024

I'm going to resume work on this PR, and I want to summarize the feedback I got last time:

I've put together the main points from that list in a private branch: https://github.com/martin-gaievski/opensearch-benchmark-workloads/tree/adding_vectordataset_for_semantic_search/treccovid_semantic_search

@VijayanB please let me know if I've missed any important points from your comments.

@@ -0,0 +1,46 @@
{
"settings": {
"index.number_of_shards": {{number_of_shards | default(1)}},
Member:
IMO, we can leave the defaults to the cluster. In other words, let's set this value only if number_of_shards is provided in params.
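Following that suggestion, the index template could emit the setting only when the parameter is supplied, using the Jinja conditionals OSB templates already support. A sketch (not the final template):

```json
"settings": {
  {%- if number_of_shards is defined %}
  "index.number_of_shards": {{ number_of_shards }},
  {%- endif %}
  "default_pipeline": "nlp-ingest-pipeline"
}
```

When `number_of_shards` is absent from params, the rendered settings omit the key entirely and the cluster default applies.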

"default_pipeline": "nlp-ingest-pipeline"
},
"mappings": {
"dynamic": "true",
Member:
Any reason to allow dynamic? Why not strict?
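For reference, a strict variant would reject any field not declared in the mapping at index time, which catches typos in ingested documents. A minimal sketch (the `text` field here is illustrative, not the PR's actual mapping):

```json
"mappings": {
  "dynamic": "strict",
  "properties": {
    "text": { "type": "text" }
  }
}
```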

"dimension": 768,
"method": {
"name": "hnsw",
"space_type": "innerproduct",
Member:
This should be outside of method.
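Recent k-NN mapping syntax accepts space_type at the field level rather than inside the method block, which is presumably what the comment refers to. A sketch of the suggested shape (the field name `passage_embedding` is an assumption):

```json
"passage_embedding": {
  "type": "knn_vector",
  "dimension": 768,
  "space_type": "innerproduct",
  "method": {
    "name": "hnsw"
  }
}
```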

Comment on lines +19 to +37
class UpdateConcurrentSegmentSearchSettings(Runner):

    RUNNER_NAME = "update-concurrent-segment-search-settings"

    async def __call__(self, opensearch, params):
        enable_setting = params.get("enable", "false")
        max_slice_count = params.get("max_slice_count", None)
        # Build the persistent cluster-settings body; include the
        # slice-count setting only when it was provided in params.
        body = {
            "persistent": {
                "search.concurrent_segment_search.enabled": enable_setting
            }
        }
        if max_slice_count is not None:
            body["persistent"]["search.concurrent.max_slice_count"] = max_slice_count
        request_context_holder.on_client_request_start()
        await opensearch.cluster.put_settings(body=body)
        request_context_holder.on_client_request_end()

    def __repr__(self, *args, **kwargs):
        return self.RUNNER_NAME
Member:
Can this be moved to opensearch-benchmark?
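For context, a custom runner like the one above is invoked from the workload's test procedures as an operation keyed by its registered name; a sketch of such an operation (the parameter values are illustrative):

```json
{
  "operation": {
    "operation-type": "update-concurrent-segment-search-settings",
    "enable": "true",
    "max_slice_count": 2
  }
}
```

Moving the runner into opensearch-benchmark itself would make this operation type available to every workload instead of being redefined per workload.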

"corpora": [
{
"name": "trec-covid",
"base-url": "https://github.com/martin-gaievski/neural-search/releases/download/trec_covid_dataset_1M_v1",
Member:
I believe this is temporary. You should use a generic repository to fetch the corpus.

],
"corpora": [
{
"name": "trec-covid",
Member:
Can we add a smaller dataset as well? Something like 1k documents?

Comment on lines +1 to +6
{
"base-url": "https://github.com/martin-gaievski/neural-search/releases/download/trec_covid_queries_knn",
"source-file": "queries.json.zip",
"compressed-bytes" : 193612,
"uncompressed-bytes": 519763
}
Member:
Not sure whether we discussed this or not; can this be converted to a corpus?
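If the queries file were registered as a corpus, workload.json would declare it alongside the documents corpus and OSB would handle the download and decompression itself. A sketch reusing the metadata above (the corpus name is an assumption, and the exact required keys follow the OSB corpus schema):

```json
"corpora": [
  {
    "name": "trec-covid-queries",
    "base-url": "https://github.com/martin-gaievski/neural-search/releases/download/trec_covid_queries_knn",
    "documents": [
      {
        "source-file": "queries.json.zip",
        "compressed-bytes": 193612,
        "uncompressed-bytes": 519763
      }
    ]
  }
]
```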

Comment on lines +144 to +158
if self._params['variable-queries'] > 0:
    with open(script_dir + os.sep + 'workload_queries.json', 'r') as f:
        d = json.loads(f.read())
    source_file = d['source-file']
    base_url = d['base-url']
    compressed_bytes = d['compressed-bytes']
    uncompressed_bytes = d['uncompressed-bytes']
    compressed_path = script_dir + os.sep + source_file
    uncompressed_path = script_dir + os.sep + Path(source_file).stem
    if not os.path.exists(compressed_path):
        downloader = Downloader(False, False)
        downloader.download(base_url, None, compressed_path, compressed_bytes)
    if not os.path.exists(uncompressed_path):
        decompressor = Decompressor()
        decompressor.decompress(compressed_path, uncompressed_path, uncompressed_bytes)
Member:
You don't need this if it is a corpus.


        self._params = params
        self._params['index'] = index
        self._params['type'] = type
Member:
Why do we need type?
