Push out new blog and edit to releases page #3510

Merged Dec 18, 2024 (11 commits)
25 changes: 25 additions & 0 deletions _community_members/kdooyong.md
@@ -0,0 +1,25 @@
---
name: Dooyong Kim
short_name: kdooyong
title: 'OpenSearch Community Member: Dooyong Kim'
primary_title: Dooyong Kim
breadcrumbs:
  icon: community
  items:
    - title: Community
      url: /community/index.html
    - title: Members
      url: /community/members/index.html
    - title: "Dooyong Kim's Profile"
      url: '/community/members/kdooyong.html'
photo: '/assets/media/community/members/kdooyong.jpg'
github: 0ctopus13prime
linkedin: doo-yong-kim-a8a97b126
job_title_and_company: 'Software engineer at AWS'
personas:
- author
permalink: '/community/members/kdooyong.html'
redirect_from: '/authors/kdooyong/'
---

**Dooyong Kim** is a software engineer at AWS working on the OpenSearch Project and Amazon OpenSearch Service. His primary interests include machine learning and vector search. Outside of work, he enjoys reading and taking a nap on a sofa in the afternoon.

255 changes: 255 additions & 0 deletions _posts/2024-12-04-enable-pluggable-storage-in-opensearch-vectordb.md
@@ -0,0 +1,255 @@
---
layout: post
title: "Deep dive: Enabling pluggable storage in the OpenSearch vector engine"
authors:
- kdooyong
- navneev
- vamshin
- kolchfa
date: 2024-12-18
categories:
- technical-posts
- community
meta_keywords: OpenSearch vector engine, vector search, pluggable storage, Faiss, NMSLIB, k-NN search, searchable snapshots, performance optimization
meta_description: Learn how pluggable storage was enabled in the OpenSearch vector engine, with a practical example of running vector search on searchable snapshots, which previously did not work.
has_math: false
has_science_table: true
---

In 2019, OpenSearch introduced the vector engine, which supports three engines: Non-Metric Space Library (NMSLIB), Facebook AI Similarity Search (Faiss), and Lucene. Unlike Lucene, which is Java based, Faiss and NMSLIB are C++ libraries that OpenSearch accesses through a lightweight Java Native Interface (JNI) layer. However, these native engines handle I/O using file-based APIs, with Faiss relying on `FILE` pointers and NMSLIB using `std::fstream` to manage graph indexes.

This blog post explains how we addressed these limitations by introducing an abstraction layer for loading data into native engines without compromising performance. We'll start with an overview of k-NN search, discuss the challenges of file API dependencies, and explain the solution we implemented. Finally, we'll explore how these changes support searchable snapshots of vector indexes, which involves running approximate k-NN search on remote snapshots using native engines.

## What is k-NN search?

The k-nearest neighbors (k-NN) search algorithm identifies the k closest vectors to a given query vector. It uses a distance metric, such as cosine similarity, to measure similarity between vectors, with closer points considered to be more similar.

In the OpenSearch vector database, you can choose from different vector search algorithms. A popular algorithm for approximate nearest neighbor (ANN) search in high-dimensional spaces is Hierarchical Navigable Small World (HNSW). HNSW organizes data points into a multi-layer graph in which each layer contains connections for efficient data navigation. Inspired by skip lists, HNSW graph layers have varying densities that increase proportionally with depth. This helps narrow the search space from broader to more specific regions---similar to locating an address by starting with a country and then narrowing down to a state, city, and street.
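For example, the following index mapping selects HNSW as the algorithm and Faiss as the engine. This is a minimal sketch: the index name `hotels-index`, the dimension, and the parameter values are illustrative, not prescriptive:

```json
PUT /hotels-index
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "location": {
        "type": "knn_vector",
        "dimension": 2,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "faiss",
          "parameters": {
            "ef_construction": 100,
            "m": 16
          }
        }
      }
    }
  }
}
```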

For more information about building a k-NN similarity search engine with OpenSearch, see our [documentation](https://opensearch.org/docs/latest/search-plugins/knn/index/).

## Challenges with file-based APIs

Native vector engines, such as Faiss and NMSLIB, offer high performance and predictable latencies. However, their reliance on file-based APIs makes them difficult to integrate with storage that is not file system based.

Lucene uses the Java-based **Directory** abstraction for reading and writing files. The **Directory** class abstracts file storage, enabling operations like reading, writing, and managing file metadata across diverse storage systems. This abstraction allows the Lucene vector engine to store files independently of the underlying OpenSearch storage.

Unlike Lucene, native engines tightly couple their operations to file-based I/O. To address these limitations, we applied Lucene's principles to native engines. By abstracting the I/O layer, we eliminated the engines' tight coupling to specific file APIs. This enhancement enables integration with any OpenSearch directory implementation, making vector search compatible with a broader range of storage systems.

## Introducing the loading layer

Both Faiss and NMSLIB load graph-based vector indexes from storage into memory. During this process, they issue file read calls---`fread` in Faiss and `std::fstream` reads in NMSLIB---to fetch the bytes needed to reconstruct the graph.

To improve flexibility, we replaced `fread` with a read interface. Faiss provides an **IOReader** interface for reading index data from various storage systems. For NMSLIB, we introduced a similar read interface called **NmslibIOReader**. These interfaces allow the native engines to read data through an abstraction layer, making integration possible with OpenSearch's directory implementations.
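To make the abstraction concrete, the following is a minimal C++ sketch of a reader that satisfies Faiss's `IOReader` contract while sourcing bytes from an arbitrary backend. The `BackingStorage` type and its `read` method are hypothetical stand-ins for the JNI bridge to Lucene's **IndexInput**; this illustrates the pattern rather than the plugin's actual implementation:

```cpp
#include <faiss/impl/io.h> // defines faiss::IOReader

#include <algorithm>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the JNI bridge that pulls bytes from Lucene's
// IndexInput (or any other OpenSearch Directory implementation).
struct BackingStorage {
    std::vector<char> bytes;
    size_t offset = 0;

    // Copies up to `len` bytes into `dst` and returns the number copied.
    size_t read(void* dst, size_t len) {
        size_t n = std::min(len, bytes.size() - offset);
        std::memcpy(dst, bytes.data() + offset, n);
        offset += n;
        return n;
    }
};

// Faiss invokes operator() instead of fread while deserializing a graph
// index, so any byte source that implements it becomes a valid input.
struct StorageIOReader : faiss::IOReader {
    BackingStorage* storage;

    explicit StorageIOReader(BackingStorage* s) : storage(s) {}

    // Mirrors fread semantics: read nitems items of `size` bytes each and
    // return the number of complete items actually read.
    size_t operator()(void* ptr, size_t size, size_t nitems) override {
        if (size == 0) {
            return 0;
        }
        return storage->read(ptr, size * nitems) / size;
    }
};
```

With a reader like this, Faiss can rebuild the in-memory graph by calling `faiss::read_index` on it, and the same pattern applies to the **NmslibIOReader** interface described above.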

Because k-NN search is conducted after the graph is loaded into memory, this change does not impact average search performance.

The following diagram provides a high-level overview of the loading layer in native engines.

![High-level overview of the loading layer in native engines](/assets/media/blog-images/2024-12-04-enable-pluggable-storage-in-opensearch-vectordb/loading_layer_high_level.png){:style="width: 100%; max-width: 800px;"}

## Performance benchmarks

The following sections present performance benchmark results.

### Benchmarking environment

We ran benchmarking tests in an environment with the following configuration.

| Configuration | Value |
|---|---|
| OpenSearch version | 2.18 |
| vCPUs | 48 |
| Physical memory | 128 GB |
| Storage type | Amazon Elastic Block Store (Amazon EBS) |
| JVM heap | 63 GB |
| Total number of vectors | 1M |
| Dimensions | 128 |

### Benchmarking results

During benchmarking, we observed that introducing the loading layer resulted in search performance on par with the baseline. Additionally, there were no meaningful differences in system metrics or JVM garbage collection (GC) metrics.

Based on these findings, we concluded that we successfully replaced the tight coupling to the file APIs with Lucene's **IndexInput** while maintaining the same search performance. Additionally, with this change you can integrate a custom **Directory** implementation in OpenSearch and save a vector index in your preferred storage system.

The following table presents our benchmarking results, comparing query latency with the loading layer (candidate) to the baseline.

| Engine | Metric | Description | Baseline | Candidate |
|--------|---|---|---|---|
| Faiss | Average query latency | The time taken to process a vector search query. | 3.5832 ms | 3.83349 ms |
| Faiss | p99 query latency | The p99 latency for processing a vector search query. | 22.1628 ms | 23.8439 ms |
| Faiss | Total Young Gen JVM GC time | The time spent on Young GC in the JVM. | 0.338 sec | 0.342 sec |

The results demonstrate that replacing the file-based API with Lucene's **IndexInput** maintains search performance while enabling broader storage compatibility.

## Configuring searchable snapshots for vector search

With the loading layer in place, you can now perform vector searches directly on remote snapshots. At a high level, you'll create a vector index, take a snapshot of the index, and run a vector search on the snapshot. The following diagram illustrates these steps.

![Overview of searchable snapshots](/assets/media/blog-images/2024-12-04-enable-pluggable-storage-in-opensearch-vectordb/searchable_snapshots_overview.png){:style="width: 100%; max-width: 600px;"}

Follow these steps to configure searchable snapshots.

### Prerequisites

Configure your cluster for searchable snapshots. For detailed steps, see [Configuring a node to use searchable snapshots](https://opensearch.org/docs/latest/tuning-your-cluster/availability-and-recovery/snapshots/searchable_snapshot/#configuring-a-node-to-use-searchable-snapshots).
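As a quick sketch, configuring a dedicated search node involves settings like the following in `opensearch.yml`. The node name and cache size shown here are illustrative assumptions; follow the linked documentation for the authoritative steps:

```yaml
# opensearch.yml on the node that will serve searchable snapshot shards
node.name: searchable-snapshots-node
node.roles: [ search ]

# Local disk space reserved for caching remote snapshot data (illustrative size)
node.search.cache.size: 50gb
```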

### Step 1: Create a local index

Use the following request to create a local vector index:

```json
PUT /knn-index/
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 2
      }
    }
  }
}
```

### Step 2: Ingest data

Ingest some data into the index:

```json
POST _bulk?refresh
{ "index": { "_index": "knn-index", "_id": "1" } }
{ "my_vector": [1.5, 2.5], "price": 12.2 }
{ "index": { "_index": "knn-index", "_id": "2" } }
{ "my_vector": [2.5, 3.5], "price": 7.1 }
{ "index": { "_index": "knn-index", "_id": "3" } }
{ "my_vector": [3.5, 4.5], "price": 12.9 }
{ "index": { "_index": "knn-index", "_id": "4" } }
{ "my_vector": [5.5, 6.5], "price": 1.2 }
{ "index": { "_index": "knn-index", "_id": "5" } }
{ "my_vector": [4.5, 5.5], "price": 3.7 }
```

### Step 3: Query the local index

Query the local index to ensure that it's configured correctly:

```json
POST knn-index/_search
{
  "query": {
    "knn": {
      "my_vector": {
        "vector": [2, 3],
        "k": 2
      }
    }
  }
}
```

The response returns the vectors closest to the query vector:

```json
{
  "took": 16,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 0.6666667,
    "hits": [
      {
        "_index": "knn-index",
        "_id": "1",
        "_score": 0.6666667,
        "_source": {
          "my_vector": [1.5, 2.5],
          "price": 12.2
        }
      },
      {
        "_index": "knn-index",
        "_id": "2",
        "_score": 0.6666667,
        "_source": {
          "my_vector": [2.5, 3.5],
          "price": 7.1
        }
      }
    ]
  }
}
```

### Step 4: Take a snapshot

Take a snapshot of the index. For detailed steps, see [Take and restore snapshots](https://opensearch.org/docs/latest/tuning-your-cluster/availability-and-recovery/snapshots/snapshot-restore/). After taking the snapshot, delete the `knn-index` so that it is no longer available locally.
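If you don't yet have a snapshot repository registered, the overall flow looks like the following sketch, ending with the local index deletion described above. The repository name `my-repo`, snapshot name `my-snapshot`, and file system location are illustrative:

```json
PUT _snapshot/my-repo
{
  "type": "fs",
  "settings": {
    "location": "/mnt/snapshots"
  }
}

PUT _snapshot/my-repo/my-snapshot?wait_for_completion=true
{
  "indices": "knn-index"
}

DELETE /knn-index
```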

### Step 5: Create a searchable snapshot index from the snapshot

Use the following request to restore the original index from the snapshot in order to create a searchable snapshot index:

```json
POST _snapshot/<SNAPSHOT_REPO>/<SNAPSHOT_NAME>/_restore
{
  "storage_type": "remote_snapshot",
  "indices": "knn-index"
}
```

To verify the successful creation of the searchable snapshot index, use the following request:

```json
GET /_cat/indices
```
For more information, see [Create a searchable snapshot index](https://opensearch.org/docs/latest/tuning-your-cluster/availability-and-recovery/snapshots/searchable_snapshot/#create-a-searchable-snapshot-index).

### Step 6: Run a vector search query

Run a vector search query on the searchable snapshot index:

```json
POST knn-index/_search
{
  "query": {
    "knn": {
      "my_vector": {
        "vector": [2, 3],
        "k": 2
      }
    }
  }
}
```

The query returns the same results as the local index query in [Step 3](#step-3-query-the-local-index).

## Conclusion

By introducing an I/O layer that uses Lucene's **Directory** abstraction, we eliminated the native engines' dependency on file-based APIs that limit storage to local file systems. This change allows the vector engine to read graph data structures from any storage system supported by OpenSearch's **Directory** implementation. Our extensive benchmarking tests confirmed that this change maintains the search performance of the original file-API-based approach. Notably, we observed no regression in search times after the graphs were loaded into memory (graph loading is a one-time operation for a properly scaled cluster).

With this new read interface, you can now use vector indexes with any OpenSearch **Directory** implementation. This added flexibility makes it possible to store vector data in remote storage solutions like Amazon Simple Storage Service (Amazon S3).

## Next steps

In version 2.18, we introduced the ability to use vector search queries with Lucene's **Directory** and **IndexInput** classes. Looking ahead, version 2.19 will expand this functionality to the native index creation process. Specifically, the k-NN plugin will begin using the **IndexOutput** class to write graph files directly to segments. For more information, see [this GitHub issue](https://github.com/opensearch-project/k-NN/issues/2033).

Additionally, now that the k-NN plugin can stream vector data structure files, there is an opportunity to partially load these files. This enhancement will reduce memory pressure on the cluster and deliver better price-performance, especially under high-stress conditions. For more information, see [this GitHub issue](https://github.com/opensearch-project/k-NN/issues/1693).
Binary file added assets/media/community/members/kdooyong.jpg
2 changes: 1 addition & 1 deletion releases.md
@@ -47,7 +47,7 @@ Note: We have not added a major release to the 2024 schedule yet. If/when we a

OpenSearch [follows semver](https://opensearch.org/blog/technical-post/2021/08/what-is-semver/), which means we will only release breaking changes in major versions. All minor versions are compatible with every other minor version for that major. For example, 1.2.0 will work with 1.3.2, 1.4.1, etc, but may not work with 2.0.

- For minor version releases, OpenSearch follows "release window" model as described in our [releasing documentation](https://github.com/opensearch-project/.github/blob/main/RELEASING.md). The goal is to release a new minor version approximately every six weeks which includes all the new features and fixes that are ready to go. Once we enter a release window, every day we will generate a release candidate. When the exit criteria are met by a release candidate, we will make an announcement, update the release date column above and publish new artifacts on that date.
+ For minor version releases, OpenSearch follows "release window" model as described in our [releasing documentation](https://github.com/opensearch-project/.github/blob/main/RELEASING.md). The goal is to release a new minor version approximately every eight weeks which includes all the new features and fixes that are ready to go. Once we enter a release window, every day we will generate a release candidate. When the exit criteria are met by a release candidate, we will make an announcement, update the release date column above and publish new artifacts on that date.

**Note:** If we cannot pass the exit criteria by 2 weeks after the start of the release cycle window (1 week for 1.x patch releases), we will cancel the minor release and hold changes until the next release window.
