I'd like to contribute a change to how InMemoryDocumentStore implements query_by_embedding. #2088

baregawi · 2022-01-31T01:07:14Z

baregawi
Jan 31, 2022

Hello haystack team,

I just came across this awesome package while building systems like this at work and after working with it for a bit I wanted to contribute a few changes. The first one is a very simple change to how InMemoryDocumentStore implements query_by_embedding.

Currently the code loops and does an np.dot or scipy.spatial.distance.cosine for each prospective document. This is the most computationally intense step in that function and it is executing something like 1000x slower than it could. One dot product at a time does not allow numpy to behave in a cache friendly manner. And the fact that numpy is used makes it so CPUs are used even when a GPU is available on the machine.

The change would like this, the following code block from https://github.com/deepset-ai/haystack/blob/master/haystack/document_stores/memory.py#L202:

for doc in document_to_search:
    curr_meta = deepcopy(doc.meta)
    new_document = Document(
        id=doc.id,
        content=doc.content,
        meta=curr_meta,
        embedding=doc.embedding
    )
    new_document.embedding = doc.embedding if return_embedding is True else None

    if self.similarity == "dot_product":
        score = np.dot(query_emb, doc.embedding)
    elif self.similarity == "cosine":
        # cosine similarity score = 1 - cosine distance
        score = 1 - cosine(query_emb, doc.embedding)
    new_document.score = self.finalize_raw_score(score, self.similarity)

Would be changed to something like:

import math
import torch
...
self.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
...

def get_scores(self, query_emb, document_to_search):
    query_emb = torch.as_tensor(query_emb).to(self.device)
    if len(query_emb.shape) == 1:
        query_emb = query_emb.unsqueeze(dim=0)

    if self.similarity == "cosine":
        # cosine similarity is a normed dot product
        query_emb_norm = torch.norm(query_emb, dim=1)
        query_emb  = torch.div(query_emb, query_emb_norm)

    doc_embeds = [ doc.embedding for doc in document_to_search ]

    max_samples_at_once = 5000
    n_slices = math.ceil(len(doc_embeds) / max_samples_at_once)
    curr_pos = 0
    scores = []
    while curr_pos < len(doc_embeds):
        doc_embeds_slice = doc_embeds[curr_pos:curr_pos + max_samples_at_once]
        doc_embeds_slice = torch.as_tensor(doc_embeds_slice).to(self.device)
        if self.similarity == "cosine":
            # cosine similarity is a normed dot product
            doc_embeds_slice_norm = torch.norm(doc_embeds_slice, dim=1)
            doc_embeds_slice  = torch.div(doc_embeds_slice, doc_embeds_slice_norm)
        with torch.no_grad():
            slice_scores = torch.matmul(doc_embeds_slice, query_emb.T).cpu().numpy().tolist()
        scores.extend(slice_scores)

    return scores
...

scores = self.get_scores(query_emb, document_to_search)
for doc, score in zip(document_to_search, scores):
    curr_meta = deepcopy(doc.meta)
    new_document = Document(
        id=doc.id,
        content=doc.content,
        meta=curr_meta,
        embedding=doc.embedding
    )
    new_document.embedding = doc.embedding if return_embedding is True else None
    new_document.score = self.finalize_raw_score(score, self.similarity)

This would execute MUCH quicker. And since we only run max_samples_at_once = 5000 at one time we wouldn't be using to much GPU memory at one time. I also use this pattern for dot products in a lot of places. It is memory stable on a GPU in a production environment.

What do people think?

julian-risch · 2022-01-31T08:32:43Z

julian-risch
Jan 31, 2022
Maintainer

Hi @baregawi that's a nice idea. It would be great if you create a pull request with your changes. I'd be happy to review it then. Looking through the code here it already looks quite good to me.

Regarding the initialization of self.device: I would suggest we make this consistent with the other modules in haystack, where we typically use something like:

self.devices, _ = initialize_device_settings(use_cuda=use_gpu)
device = 0 if self.devices[0].type == "cuda" else -1

Looking forward to your pull request. You can also make it a draft pull request so that we can give feedback early on. 👍

2 replies

baregawi Jan 31, 2022
Author

Looks like you already saw my pull request. Thank you for the quick response. Let's continue any discussions there!

julian-risch Feb 1, 2022
Maintainer

Yes, that's right. Thanks for opening the pull request so quickly.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

I'd like to contribute a change to how InMemoryDocumentStore implements query_by_embedding. #2088

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 2 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

I'd like to contribute a change to how InMemoryDocumentStore implements query_by_embedding. #2088

baregawi Jan 31, 2022

Replies: 1 comment · 2 replies

julian-risch Jan 31, 2022 Maintainer

baregawi Jan 31, 2022 Author

julian-risch Feb 1, 2022 Maintainer

baregawi
Jan 31, 2022

Replies: 1 comment 2 replies

julian-risch
Jan 31, 2022
Maintainer

baregawi Jan 31, 2022
Author

julian-risch Feb 1, 2022
Maintainer