This HTTP service adds search capabilities to the Chroma ArangoDB database and runs as an ArangoDB Foxx application.
The service can be deployed to a running local Arango instance with the Foxx CLI through a single npm script:
# For first-time deployment if this service hasn't been installed yet
npm run deploy:install
# For deployment updates
npm run deploy
A set of APIs for synchronizing document changes between Arango and ElasticSearch. It uses a poll-to-update model with the Arango write-ahead log as the source of truth for data changes. Currently it updates the assets index only. All data changes persisted to the chroma db are monitored, and those that affect asset data update the ElasticSearch index.
Various options controlling how the indexer works can be tuned in the Web Console.
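The poll-to-update idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `pollOnce`, the shape of the log entries (`tick`, `collection`, `key`, `type`), and the set of watched collections do not come from this service, and the real write-ahead log is read through Arango rather than from an in-memory array.

```javascript
// Hypothetical sketch of one poll cycle. `log` stands in for write-ahead log
// entries of the assumed shape { tick, collection, key, type }.
function pollOnce(log, lastTick, watchedCollections) {
  // Only consider entries written after the last tick that was processed.
  const fresh = log.filter((entry) => entry.tick > lastTick);

  // Keep the changes that touch collections feeding the assets index.
  const relevant = fresh.filter((entry) => watchedCollections.has(entry.collection));

  // Collapse to the most recent change per document, so each document
  // is pushed to the search index at most once per cycle.
  const latest = new Map();
  for (const entry of relevant) {
    latest.set(`${entry.collection}/${entry.key}`, entry);
  }

  // Remember how far the log was read, even if nothing relevant was found.
  const nextTick = fresh.reduce((max, entry) => Math.max(max, entry.tick), lastTick);

  return { nextTick, changes: [...latest.values()] };
}
```

Each cycle returns the tick to resume from, so the next poll only reads entries that have not been seen yet.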
/es/index/all
Index all tagged assets on ElasticSearch. Note: this can take a while since it is a synchronous API, so use it sparingly.
- The ElasticSearch endpoint is configurable through the elasticsearch_host option.
/es/index/start
Enqueue a job that keeps syncing incoming asset-related changes to ElasticSearch indefinitely.
- The indexing interval is configurable through the elasticsearch_index_interval option.
- The number of times the indexing job is retried is configurable through the elasticsearch_index_max_fails option.
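As a rough illustration of how these two options could drive a repeating job, the sketch below maps them onto the job options accepted by Foxx queues (`repeatTimes`, `repeatDelay`, `maxFailures`). The helper name `buildJobOptions` is hypothetical, and treating the interval as milliseconds is an assumption; the service may wire its job differently.

```javascript
// Hypothetical helper: turn the service configuration into Foxx queue job options.
// Assumes elasticsearch_index_interval is expressed in milliseconds.
function buildJobOptions(config) {
  return {
    // Repeat indefinitely so the index keeps following incoming changes.
    repeatTimes: Infinity,
    // Wait this long between two runs of the job.
    repeatDelay: config.elasticsearch_index_interval,
    // Retry a failing run this many times before marking it as failed.
    maxFailures: config.elasticsearch_index_max_fails,
  };
}

// Inside a Foxx service the result could be passed as the third argument of
// queue.push(script, data, options); the script and data are left out here.
```

With `{ elasticsearch_index_interval: 5000, elasticsearch_index_max_fails: 3 }` the job would run again five seconds after each run and tolerate three failed attempts.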
The /fuzzy endpoint provides a rather simple full-text fuzzy search over the Chroma taxonomy collections.
Briefly, the fuzzy search works in two stages:
- When the service starts, it creates search indexes over the text of the existing collections. Each text string is tokenized into individual words and each word is normalized (e.g. punctuation is removed), while a reverse map is built that points back to the original document.
  - To support partial matches on a string, a set of sub-strings is also indexed, all pointing to the document that owns the original string.
- When a search query is received on the search endpoint, the query string is matched against the indexed text and a score is produced using the cosine similarity and Levenshtein distance algorithms. Results above a defined threshold are then returned.
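The two stages can be sketched as follows. This is a simplified illustration, not the service's implementation: the function names, the choice of word prefixes as the indexed sub-strings, the minimum prefix length, and the threshold of 0.7 are all assumptions, and the score uses only a normalized Levenshtein distance (the cosine similarity part is left out).

```javascript
// Lowercase a word and strip everything that is not a letter or a digit.
function normalize(word) {
  return word.toLowerCase().replace(/[^\p{L}\p{N}]/gu, "");
}

// Stage 1: build a reverse map from each indexed term to the documents containing it.
// Prefixes of every word stand in for the indexed sub-strings (an assumption).
function buildIndex(docs, minLength = 2) {
  const index = new Map(); // term -> Set of document keys
  for (const doc of docs) {
    for (const raw of doc.text.split(/\s+/)) {
      const word = normalize(raw);
      if (!word) continue;
      for (let end = Math.min(minLength, word.length); end <= word.length; end++) {
        const term = word.slice(0, end);
        if (!index.has(term)) index.set(term, new Set());
        index.get(term).add(doc.key);
      }
    }
  }
  return index;
}

// Classic dynamic-programming edit distance, keeping only one previous row.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Stage 2: score the query against every indexed term and keep documents
// whose best score reaches the threshold, best match first.
function search(index, query, threshold = 0.7) {
  const q = normalize(query);
  const scores = new Map(); // document key -> best score
  for (const [term, keys] of index) {
    const longest = Math.max(q.length, term.length);
    const score = longest === 0 ? 0 : 1 - levenshtein(q, term) / longest;
    if (score < threshold) continue;
    for (const key of keys) {
      scores.set(key, Math.max(score, scores.get(key) || 0));
    }
  }
  return [...scores]
    .map(([key, score]) => ({ key, score }))
    .sort((a, b) => b.score - a.score);
}
```

Scanning every indexed term keeps the sketch short; it tolerates typos such as a query of "sant" for "Saint", at the cost of a linear pass over the index for each query.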
When the query matches more than one document, the results are sorted in ascending order of the conceptual distance between the indexed string and the document, which is defined as follows:
- If the query matches the nth part of the text, e.g. if the query "saint" matches the text "Central Saint Martins", add a distance of n to the conceptual distance. Matching at the start is better than matching in the middle.
- If the query matches a word from the start but leaves m chars unmatched at the end, add a distance of m * 0.1 to the conceptual distance. Matching more of a word is better than matching less of it.
- If the query matches a word which is a normalized version of the original, add a distance of 0.5 to the conceptual distance.
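The three rules above can be combined into one function, sketched below. Several details are assumptions made for the example: the function name `conceptualDistance`, counting word positions from zero, matching by prefix only, and not treating a pure change of letter case as normalization.

```javascript
// Hypothetical sketch of the conceptual distance between a query and a text.
// Returns Infinity when no word of the text starts with the query.
function conceptualDistance(query, text) {
  // Strip everything that is not a letter or a digit, then lowercase.
  const normalize = (word) => word.toLowerCase().replace(/[^\p{L}\p{N}]/gu, "");
  const q = normalize(query);
  const words = text.split(/\s+/).filter(Boolean);

  let best = Infinity;
  words.forEach((original, n) => {
    const word = normalize(original);
    if (!q || !word.startsWith(q)) return;

    // Rule 1: position of the matched word in the text (zero-based here).
    let distance = n;
    // Rule 2: 0.1 for every character of the word left unmatched at the end.
    distance += (word.length - q.length) * 0.1;
    // Rule 3: 0.5 if the indexed word had to be normalized beyond letter case.
    if (original.toLowerCase() !== word) distance += 0.5;

    best = Math.min(best, distance);
  });
  return best;
}

// Sort candidate texts so that the conceptually closest one comes first.
function sortByConceptualDistance(query, texts) {
  return [...texts].sort(
    (a, b) => conceptualDistance(query, a) - conceptualDistance(query, b)
  );
}
```

Under these assumptions the query "saint" has a distance of 1 to "Central Saint Martins" and a distance of 0 to "Saint Martins", so the second text would be listed first.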
The /search endpoint is an experimental implementation that searches the asset collection with an Arango 3.4 SearchView.
This service only works with Arango version 3.4.