Add support for hex representation for mixed tensors in queries #32231

Open
jobergum opened this issue Aug 23, 2024 · 7 comments

@jobergum
Member

We support hex format in document JSON but not for queries.

{
    "put": "id:doc:doc::1",
    "fields": {
        "text": "To transport goods on water, use a boat",
        "embedding": {
            "0": "3DE38E393E638E393EAAAAAB",
            "1": "3EE38E393F0E38E43F2AAAAB",
            "2": "3F471C723F638E393F800000"
        }
    }
}

This is valid, but attempting to send the same format in a query results in a 400 Bad Request:

vespa query 'yql=select * from doc where true' 'ranking=full' 'input.query(qt)={"0":"3DE38E393E638E393EAAAAAB"}'
{ "errors": [
            {
                "code": 3,
                "summary": "Illegal query",
                "message": "Could not set 'ranking.features.query(qt)' to '{\"0\":\"3DE38E393E638E393EAAAAAB\"}': Could not parse '{\"0\":\"3DE38E393E638E393EAAAAAB\"}' as a tensor of type tensor<float>(querytoken{},v[3]): At value position 0: Expected a '[' but got '\"'"
            }
]}
@kkraune added this to the soon milestone Aug 28, 2024
@jobergum
Member Author

I still experience the same behavior with 8.424.11.

@arnej27959
Member

Use 'input.query(qt)={"0":3DE38E393E638E393EAAAAAB}' (without quotes around the hex value).

@jobergum
Member Author

It's IMHO unfortunate that one needs one format for the JSON feed and a different string format, without quotes, for queries. When I have a dict<string,string>, I now have to write a custom routine to produce the query string instead of reusing the JSON representation of the dict<string,string>.
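For reference, a small helper along these lines (hypothetical, not part of pyvespa) can render the same dict used for feeding into the unquoted literal form that the query API accepts today:

def to_literal_form(hex_cells: dict) -> str:
    """Render {"0": "3DE38E39..."} as the literal tensor form accepted in queries,
    i.e. with the hex values unquoted: {"0":3DE38E39...}."""
    return "{" + ",".join(f'"{k}":{v}' for k, v in hex_cells.items()) + "}"

# e.g. to_literal_form({"0": "3DE38E393E638E393EAAAAAB"})
# returns '{"0":3DE38E393E638E393EAAAAAB}'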

@jobergum
Member Author

Snippet from a notebook

import struct
from typing import List

import numpy as np
import torch

# Assumed pyvespa imports for the query helper below (adjust to your pyvespa version);
# ScoredDoc is defined elsewhere in the notebook.
from vespa.application import VespaAsync
from vespa.io import VespaQueryResponse


def binarize_tensor(tensor: torch.Tensor) -> str:
    """
    Binarize a floating-point 1-d tensor by thresholding at zero
    and packing the bits into bytes. Returns the hex string representation of the bytes.
    """
    if not tensor.is_floating_point():
        raise ValueError("Input tensor must be of floating-point type.")
    return np.packbits(np.where(tensor > 0, 1, 0), axis=0).tobytes().hex()


def tensor_to_hex_bfloat16(tensor: torch.Tensor) -> str:
    """Convert a 1-d float tensor to a hex string of bfloat16 values (4 hex chars per value)."""
    if not tensor.is_floating_point():
        raise ValueError("Input tensor must be of floating-point type.")

    def float_to_bfloat16_hex(f: float) -> str:
        # Truncate float32 to bfloat16 by keeping the two most significant bytes
        # (assumes a little-endian host, where those are the last two bytes).
        packed_float = struct.pack('=f', f)
        bfloat16_bits = struct.unpack('=H', packed_float[2:])[0]
        return format(bfloat16_bits, '04X')

    hex_list = [float_to_bfloat16_hex(float(val)) for val in tensor.flatten()]
    return "".join(hex_list)

async def get_vespa_response(
        embedding: torch.Tensor,
        qid: str,
        session: VespaAsync,
        depth: int = 20,
        profile: str = "float-float") -> List[ScoredDoc]:
    
    # The query tensor API does not support hex formats yet
    # so this format will throw a parse error
    float_embedding = {index: tensor_to_hex_bfloat16(vector)
                       for index, vector in enumerate(embedding)}
    binary_embedding = {index: binarize_tensor(vector)
                        for index, vector in enumerate(embedding)}
    response: VespaQueryResponse = await session.query(
        yql="select id from pdf_page where true", # brute force search, rank all pages
        ranking=profile,
        hits=5,
        timeout=10,
        body={
            "input.query(qt)" : float_embedding,
            "input.query(qtb)" : binary_embedding,
            "ranking.rerankCount": depth
        }
    )
    assert response.is_successful()
    scored_docs = []
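
For illustration, the two helpers produce hex strings like this for a toy 3-dimensional vector (values chosen arbitrarily):

example = torch.tensor([0.111, -0.222, 0.333])
print(tensor_to_hex_bfloat16(example))  # 12 hex chars: 4 per bfloat16 value
print(binarize_tensor(example))         # 2 hex chars: 3 sign bits packed into one padded byte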

This will not work with the custom tensor format in queries, but it works for feeding:

vespa_docs = []

for row, embedding in zip(ds, embeddings):
    embedding_full = dict()
    embedding_binary = dict()
    # You can experiment with pooling if you want to reduce the number of embeddings
    #pooled_embedding = pool_embeddings(embedding, pool_factor=2) # reduce the number of embeddings by a factor of 2
    for j, emb in enumerate(embedding):
        embedding_full[j] = tensor_to_hex_bfloat16(emb)
        embedding_binary[j] = binarize_tensor(emb)
    vespa_doc = {
        "id": row['docId'],
        "embedding": embedding_full,
        "binary_embedding": embedding_binary
    }
    vespa_docs.append(vespa_doc)
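
Each of these entries can then be serialized to the document JSON put format shown at the top of the issue (the namespace and document type below are placeholders):

import json

put_operations = [
    {
        "put": f"id:mynamespace:pdf_page::{doc['id']}",
        "fields": {
            "embedding": doc["embedding"],
            "binary_embedding": doc["binary_embedding"],
        },
    }
    for doc in vespa_docs
]
print(json.dumps(put_operations[0], indent=2))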

@arnej27959
Member

There are many differences between the JSON formats and the "literal form". We can try to smooth over some of these differences, but there's no way to get rid of them all.

@bratseth
Member

Maybe we should support inputting tensors in JSON format somehow?

@jobergum
Member Author

I understand that not all tensor formats translate to something representable in JSON, but I do think that mixed tensors with one mapped dimension and one indexed dimension could. Right now I need two functions: one for feeding and one for queries.
