Skip to content
This repository has been archived by the owner on Jun 27, 2022. It is now read-only.

conde-nast-international/chroma-search-foxx

Repository files navigation

Chroma Search Service

This HTTP endpoint provides search capability to Chroma ArangoDb as ArangoDB Foxx Application.

Deployment

The service can be deployed to a running local Arango instance automatically with Foxx CLI with simply a npm script:

# For first-time deployment if this service hasn't been installed yet
npm run deploy:install 
# For deployment updates
npm run deploy 

Indexer for ElasticSearch

A set of APIs to synchronizing documents changes between Arango and ElasticSearch, it use a poll-to-update model using Arango write ahead logs as the source of truth for data changes, currently it updates the assets index only. All data changes persists to chroma db will be monitored, those changes that affect asset data will be updating the ElasticSearch index.

Various options on how the indexer works can be tuned in Web Console

  • /es/index/all Index all tagged assets on ElasticSearch Note: This one could take a while since it's a synchronous api, use it sparsely.
    • The ElasticSearch endpoint is configurable through the elasticsearch_host option.
  • /es/index/start Enqueue a job to sync any incoming changes related to assets to ElasticSearch until forever.
    • The indexing interval is configurable through the elasticsearch_index_interval option.
    • The indexing job retried times is configurable through the elasticsearch_index_max_fails option.

Brand Fuzzy Search

The /fuzzy endpoint provides a rather simple full-text fuzzy search to Chroma taxonomy collections

How does it work

The fuzzy-search works briefly in two stages for the search:

  • When the service starts, it will create search indexes on top of existing collection's texts, the text string will be tokenized into individual words, and each word will be normalized (e.g. remove punctuation) meanwhile a reverse map will be built to point to the original document.
  • In order to support for making a partial match on one string, a set of sub-strings will also indexed, all pointing to the belonging document of the original string
  • When a search query is received on the search endpoint, the query string is used to match to the best of any indexed text and a score will be produced using cosine similarity and Levenshtein distance algorithms. Results above a defined threshold are then returned.

How are the results ordered

When the query has matched for more than one documents, the order will be sorted asc based on conceptual distance between the indexed string and the document, which is defined as follows:

  • If the query matches the nth part of the text, e.g. if query saint matches the text Central Saint Martins, add a distance of n to conceptual distance. Matching on the start is better than matching on the middle.
  • If the query matches a word from the start but leave m chars unmatched at the end, add a distance of m * 0.1 to conceptual distance. Matching more of a word is better than matching less of it.
  • If the query matches a word which is a normalized version of the original, add a distance of 0.5 to conceptual distance.

Asset full-text search

The /search endpoint is an experiment implementation to search the asset collection with Arango 3.4 SearchView.

Pre-requisite

This service only works with the Arango 3.4 version.

About

Chroma fuzzy text search HTTP endpoint

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •