feat(document-search): Option to choose between image and text embeddings for ImageElements #175

Closed
ludwiktrammer opened this issue Nov 6, 2024 · 2 comments · Fixed by #205
Labels: document search (Changes to the document search package), feature (New feature or request)

Comments

@ludwiktrammer
Collaborator

ludwiktrammer commented Nov 6, 2024

Motivation

An ImageElement always has an image and often also has text (added during processing). A single embedding can't include both visual and textual information, so currently (since #172) we try to create two embeddings (vector store entries) for each image element. That way we can retrieve them using either.

We should add a way to decide which part of an ImageElement we want to create an embedding for: textual, visual or both.

This decision should probably be made at the processing level, because creating a textual representation of the image (with OCR / LLM) and then deciding not to use it during ingestion would be wasteful.

Implementation idea

Deciding whether to embed the textual representation

To support a flow where only the visual representation is embedded and the textual representation is not needed:

  • The description and ocr_extracted_text properties of ImageElement should be made optional
  • The return type of get_key() method of Element should be made optional
  • ImageElement should return the value None from get_key() for images with no textual representation
  • DocumentSearch's ingest_elements method should skip creating a textual embedding for elements that have get_key() returning None

The user would then decide whether to embed the textual representation of the image by choosing the appropriate processor (one that creates the textual representation or one that doesn't).
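The list above could be sketched roughly as follows. This is a minimal illustration, not the actual ragbits code: `ImageElementSketch` and `texts_to_embed` are hypothetical names, and joining the description and OCR text with a space is an assumption about how the key might be built.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImageElementSketch:
    """Hypothetical stand-in for ImageElement; not the real ragbits class."""

    image_bytes: bytes
    description: Optional[str] = None
    ocr_extracted_text: Optional[str] = None

    def get_key(self) -> Optional[str]:
        # Join whatever textual parts exist (assumed format).
        # None signals "this element has no textual representation".
        parts = [p for p in (self.description, self.ocr_extracted_text) if p]
        return " ".join(parts) if parts else None


def texts_to_embed(elements: list[ImageElementSketch]) -> list[str]:
    """Collect keys for textual embedding, skipping elements whose key is None."""
    return [key for element in elements if (key := element.get_key()) is not None]
```

With this shape, a processor that never fills `description` or `ocr_extracted_text` yields elements that are skipped for textual embedding without any extra configuration.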

Deciding whether to embed the visual representation

This part doesn't currently need any changes to our codebase - users can already decide whether they want a visual embedding by choosing an appropriate embedding model.

Supporting cases where both representations are to be embedded

Sometimes the configuration (described above) may indicate that both the textual and the visual representation should be embedded, which would mean creating two embeddings for a single Element. Currently we don't have good support for this case. To enable it, we should:

  • Add an embedding_type field to VectorStoreEntry (current options: text, image)
  • Change the VectorStoreEntry.id field so it depends on both Element.id and VectorStoreEntry.embedding_type (so the uuid hashing should probably be moved from Element.id here?)
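One possible shape for these two changes is sketched below. The class and enum names are hypothetical, and deriving the id with `uuid.uuid5` over the element id plus the embedding type is only one way the hashing could be moved; the real implementation may differ.

```python
import uuid
from dataclasses import dataclass
from enum import Enum


class EmbeddingType(str, Enum):
    """The two options named in the proposal."""

    TEXT = "text"
    IMAGE = "image"


@dataclass(frozen=True)
class VectorStoreEntrySketch:
    """Hypothetical stand-in for VectorStoreEntry; not the real ragbits class."""

    element_id: str
    embedding_type: EmbeddingType
    vector: tuple[float, ...]

    @property
    def id(self) -> uuid.UUID:
        # Deterministic id that depends on both the element and the embedding type,
        # so the text and image entries of one element never collide.
        name = f"{self.element_id}:{self.embedding_type.value}"
        return uuid.uuid5(uuid.NAMESPACE_OID, name)
```

Because the id is derived deterministically, re-ingesting the same element overwrites its existing text or image entry instead of adding a duplicate.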

Supporting cases where neither representation is to be embedded

Sometimes the configuration (described above) may indicate that no representation should be embedded - this should generate a warning. It shouldn't raise an exception, because we still want to try to ingest the rest of the documents.
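A minimal sketch of the warn-and-continue behaviour, assuming a hypothetical helper `plan_embeddings` and two hypothetical flags (whether the element has a text key, and whether the chosen embedding model handles images). None of these names come from ragbits.

```python
import warnings


def plan_embeddings(element_id: str, has_text_key: bool, embedder_supports_images: bool) -> list[str]:
    """Return the embedding types to create for one element (hypothetical helper)."""
    planned: list[str] = []
    if has_text_key:
        planned.append("text")
    if embedder_supports_images:
        planned.append("image")
    if not planned:
        # Warn instead of raising so the rest of the batch is still ingested.
        warnings.warn(f"Element {element_id} has nothing to embed; skipping it", stacklevel=2)
    return planned
```

The caller can simply skip elements for which the returned list is empty and carry on with the remaining documents.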

@ludwiktrammer ludwiktrammer added document search Changes to the document search package feature New feature or request labels Nov 6, 2024
@ludwiktrammer ludwiktrammer moved this to Backlog in ragbits Nov 6, 2024
@ludwiktrammer ludwiktrammer self-assigned this Nov 6, 2024
@ludwiktrammer ludwiktrammer moved this from Backlog to Ready in ragbits Nov 18, 2024
@ludwiktrammer ludwiktrammer removed their assignment Nov 18, 2024
@mhordynski mhordynski added this to the Ragbits 0.5 milestone Nov 18, 2024
@kdziedzic68
Collaborator

if the element id already depends on element_type isn't it enough to make id of vector store entry dependent on it ?

@ludwiktrammer
Collaborator Author

if the element id already depends on element_type isn't it enough to make id of vector store entry dependent on it ?

No, because in both cases (an embedding based on the visual representation and an embedding based on the textual representation) it's a vector based on ImageElement, so the element type will be the same.

@github-project-automation github-project-automation bot moved this from In review to Done in ragbits Nov 29, 2024
Projects
Status: Done
3 participants