Add support for hex representation for mixed tensors in queries #32231

Open
jobergum opened this issue Aug 23, 2024 · 7 comments

@jobergum
Member

We support hex format in document JSON but not for queries.

{
    "put": "id:doc:doc::1",
    "fields": {
        "text": "To transport goods on water, use a boat",
        "embedding": {
            "0": "3DE38E393E638E393EAAAAAB",
            "1": "3EE38E393F0E38E43F2AAAAB",
            "2": "3F471C723F638E393F800000"
        }
    }
}

This is valid, but attempting to send the same format in a query results in a 400 Bad Request:

vespa query 'yql=select * from doc where true' 'ranking=full' 'input.query(qt)={"0":"3DE38E393E638E393EAAAAAB"}'
{ "errors": [
            {
                "code": 3,
                "summary": "Illegal query",
                "message": "Could not set 'ranking.features.query(qt)' to '{\"0\":\"3DE38E393E638E393EAAAAAB\"}': Could not parse '{\"0\":\"3DE38E393E638E393EAAAAAB\"}' as a tensor of type tensor<float>(querytoken{},v[3]): At value position 0: Expected a '[' but got '\"'"
            }
]}
@kkraune added this to the soon milestone Aug 28, 2024
@jobergum
Member Author

I still experience the same behavior with 8.424.11.

@arnej27959
Member

Use 'input.query(qt)={"0":3DE38E393E638E393EAAAAAB}' (without quotes around the hex value).

@jobergum
Member Author

It's IMHO unfortunate that one needs one format for the JSON feed and a different string format, without quotes, for queries. When I have a dict<string,string>, I now have to write a custom routine to produce the query string instead of reusing the JSON representation of the dict<string,string>.
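For reference, a small helper along these lines (hypothetical, not part of pyvespa) can render the same dict used for feeding into the unquoted literal form that the query API accepts today:

def to_literal_form(hex_cells: dict) -> str:
    """Render {"0": "3DE38E39..."} as the literal tensor form accepted in queries,
    i.e. with the hex values unquoted: {"0":3DE38E39...}."""
    return "{" + ",".join(f'"{k}":{v}' for k, v in hex_cells.items()) + "}"

# e.g. to_literal_form({"0": "3DE38E393E638E393EAAAAAB"})
# returns '{"0":3DE38E393E638E393EAAAAAB}'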

@jobergum
Member Author

Snippet from a notebook

import struct
from typing import List

import numpy as np
import torch

# Assumed pyvespa imports for the query helper below (adjust to your pyvespa version);
# ScoredDoc is defined elsewhere in the notebook.
from vespa.application import VespaAsync
from vespa.io import VespaQueryResponse


def binarize_tensor(tensor: torch.Tensor) -> str:
    """
    Binarize a floating-point 1-d tensor by thresholding at zero
    and packing the bits into bytes. Returns the hex string representation of the bytes.
    """
    if not tensor.is_floating_point():
        raise ValueError("Input tensor must be of floating-point type.")
    return np.packbits(np.where(tensor > 0, 1, 0), axis=0).tobytes().hex()


def tensor_to_hex_bfloat16(tensor: torch.Tensor) -> str:
    """Convert a 1-d float tensor to a hex string of bfloat16 values (4 hex chars per value)."""
    if not tensor.is_floating_point():
        raise ValueError("Input tensor must be of floating-point type.")

    def float_to_bfloat16_hex(f: float) -> str:
        # Truncate float32 to bfloat16 by keeping the two most significant bytes
        # (assumes a little-endian host, where those are the last two bytes).
        packed_float = struct.pack('=f', f)
        bfloat16_bits = struct.unpack('=H', packed_float[2:])[0]
        return format(bfloat16_bits, '04X')

    hex_list = [float_to_bfloat16_hex(float(val)) for val in tensor.flatten()]
    return "".join(hex_list)

async def get_vespa_response(
        embedding: torch.Tensor,
        qid: str,
        session: VespaAsync,
        depth: int = 20,
        profile: str = "float-float") -> List[ScoredDoc]:
    
    # The query tensor API does not support hex formats yet
    # so this format will throw a parse error
    float_embedding = {index: tensor_to_hex_bfloat16(vector)
                       for index, vector in enumerate(embedding)}
    binary_embedding = {index: binarize_tensor(vector)
                        for index, vector in enumerate(embedding)}
    response: VespaQueryResponse = await session.query(
        yql="select id from pdf_page where true", # brute force search, rank all pages
        ranking=profile,
        hits=5,
        timeout=10,
        body={
            "input.query(qt)" : float_embedding,
            "input.query(qtb)" : binary_embedding,
            "ranking.rerankCount": depth
        }
    )
    assert response.is_successful()
    scored_docs = []
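
For illustration, the two helpers produce hex strings like this for a toy 3-dimensional vector (values chosen arbitrarily):

example = torch.tensor([0.111, -0.222, 0.333])
print(tensor_to_hex_bfloat16(example))  # 12 hex chars: 4 per bfloat16 value
print(binarize_tensor(example))         # 2 hex chars: 3 sign bits packed into one padded byte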

This will not work with the custom tensor format in queries, but it works for feeding:

vespa_docs = []

for row, embedding in zip(ds, embeddings):
    embedding_full = dict()
    embedding_binary = dict()
    # You can experiment with pooling if you want to reduce the number of embeddings
    #pooled_embedding = pool_embeddings(embedding, pool_factor=2) # reduce the number of embeddings by a factor of 2
    for j, emb in enumerate(embedding):
        embedding_full[j] = tensor_to_hex_bfloat16(emb)
        embedding_binary[j] = binarize_tensor(emb)
    vespa_doc = {
        "id": row['docId'],
        "embedding": embedding_full,
        "binary_embedding": embedding_binary
    }
    vespa_docs.append(vespa_doc)
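
Each of these entries can then be serialized to the document JSON put format shown at the top of the issue (the namespace and document type below are placeholders):

import json

put_operations = [
    {
        "put": f"id:mynamespace:pdf_page::{doc['id']}",
        "fields": {
            "embedding": doc["embedding"],
            "binary_embedding": doc["binary_embedding"],
        },
    }
    for doc in vespa_docs
]
print(json.dumps(put_operations[0], indent=2))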

@arnej27959
Member

There are many differences between the JSON formats and the "literal form". We can try to smooth over some of these differences, but there's no way to get rid of them all.

@bratseth
Member

Maybe we should support inputting tensors in JSON format somehow?

@jobergum
Member Author

I understand that not all tensor formats translate to something representable in JSON, but I do think that mixed tensors with one mapped dimension and one indexed dimension could. Right now I need two functions: one for feeding and one for queries.
