Question about ref_labels, ref_emb and computing training pairs for different sets of embeddings #495
-
Hi, I am working on generating textual embeddings for queries. When I generate pairs of positives and negatives, I don't want to generate them for queries/queries or doc_embeddings/doc_embeddings pairs, only for queries --> doc_embeddings, since my doc_embeddings are already precomputed. query_terms, document_embeddings, and labels all have the same length, and for every query the document_embedding at the same position is the positive match. There can be other positives, since different queries can have the same document_embedding (they will share the same label).
I was looking at ref_labels and ref_emb for this. Anchors are query_embeddings; positives and negatives are contained in document_embeddings; labels is the same array for both. I am expecting the loss to be computed over a (batch_size, batch_size) grid of query-to-document pairs. This would be an example of the simplest thing I am trying to achieve:
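Something like this minimal sketch (random tensors stand in for my real data, and ContrastiveLoss is just a placeholder for whatever loss I end up using):
import torch
from pytorch_metric_learning.losses import ContrastiveLoss
# Stand-ins: the queries would really come from my encoder,
# and the doc embeddings are precomputed and fixed.
query_embeddings = torch.randn(32, 128, requires_grad=True)
doc_embeddings = torch.randn(32, 128)
labels = torch.arange(32)  # query i's positive is doc i
loss_fn = ContrastiveLoss()
# Pairs should only span the two sets: never query/query or doc/doc
loss = loss_fn(query_embeddings, labels, ref_emb=doc_embeddings,
               ref_labels=torch.clone(labels))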
I would also like to know how easy it would be to integrate this with miners, if this is the only way of generating pairs. Thanks in advance.
-
What about it looks incorrect?
These approaches should work:
import torch
from pytorch_metric_learning.losses import TripletMarginLoss
from pytorch_metric_learning.miners import TripletMarginMiner
from pytorch_metric_learning.utils import loss_and_miner_utils as lmu
# query_embeddings, doc_embeddings, and labels are as you described
loss_fn = TripletMarginLoss()
ref_labels = torch.clone(labels)
# All pairs (which get converted to triplets if you're using TripletMarginLoss)
pairs = lmu.get_all_pairs_indices(labels, ref_labels=ref_labels)
loss = loss_fn(query_embeddings, indices_tuple=pairs, ref_emb=doc_embeddings)
# All triplets
triplets = lmu.get_all_triplets_indices(labels, ref_labels=ref_labels)
loss = loss_fn(query_embeddings, indices_tuple=triplets, ref_emb=doc_embeddings)
# All triplets (default behavior of TripletMarginLoss)
loss = loss_fn(query_embeddings, labels=labels, ref_emb=doc_embeddings, ref_labels=ref_labels)
# Using a miner
miner = TripletMarginMiner()
triplets = miner(query_embeddings, labels, ref_emb=doc_embeddings, ref_labels=ref_labels)
loss = loss_fn(query_embeddings, indices_tuple=triplets, ref_emb=doc_embeddings)
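For reference, here is a self-contained version of the first approach (random tensors; the shapes are purely illustrative):
query_embeddings = torch.randn(8, 64, requires_grad=True)  # anchors
doc_embeddings = torch.randn(8, 64)  # precomputed, stays fixed
labels = torch.arange(8)  # query i's positive is doc i
ref_labels = torch.clone(labels)
pairs = lmu.get_all_pairs_indices(labels, ref_labels=ref_labels)
loss = loss_fn(query_embeddings, indices_tuple=pairs, ref_emb=doc_embeddings)
loss.backward()  # gradients flow only into query_embeddings
Note that torch.clone matters: if ref_labels is the same tensor object as labels, the diagonal (query i compared with doc i) is treated as a self-comparison and removed, so cloning keeps those positive pairs. And since ref_labels is provided, anchors always come from query_embeddings and positives/negatives from doc_embeddings, so query/query and doc/doc pairs never occur.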