For example, let's say I have a generator that yields my documents. If I try to use the OpenAIDocumentEmbedder with it, I get an error telling me I must pass a list, so I am forced to exhaust the Python generator and load all documents into memory.
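Roughly what I'm doing (the generator body here is just a stand-in for my real data source, and it assumes `OPENAI_API_KEY` is set in the environment):

```python
from haystack import Document
from haystack.components.embedders import OpenAIDocumentEmbedder


def iter_documents():
    # Placeholder: in reality this streams records out of a large data source,
    # so materializing everything in memory at once is not an option.
    for i in range(1_000_000):
        yield Document(content=f"document {i}")


embedder = OpenAIDocumentEmbedder()

# This raises a TypeError: run() insists on a List[Document], so the generator
# has to be exhausted with list(iter_documents()) before it can be passed in.
embedder.run(documents=iter_documents())
```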
Many components in the Haystack library require a List when an Iterable would do just fine. For example, OpenAIDocumentEmbedder could be updated to accept Python generators (or any iterable) instead of raw lists, making the process more memory efficient.
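What I have in mind is roughly the following. This is not the component's current implementation, just a sketch of the direction, using hypothetical `batched` and `embed_all` helpers:

```python
from itertools import islice
from typing import Iterable, Iterator, List

from haystack import Document


def batched(documents: Iterable[Document], batch_size: int) -> Iterator[List[Document]]:
    """Consume any iterable of documents in fixed-size batches without ever
    holding the whole input in memory."""
    iterator = iter(documents)
    while batch := list(islice(iterator, batch_size)):
        yield batch


# Hypothetical: if OpenAIDocumentEmbedder.run() accepted Iterable[Document],
# it could loop like this internally, sending each batch of 32 to the API as
# soon as it is produced by an upstream generator or streaming splitter.
def embed_all(documents: Iterable[Document], batch_size: int = 32) -> List[Document]:
    embedded: List[Document] = []
    for batch in batched(documents, batch_size):
        embedded.extend(batch)  # placeholder for the actual embedding request
    return embedded
```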
bendavis78 changed the title from "Components expecting List[...] input should use Iterable" to "Components expecting List[...] input should use Iterable[...] instead" on Oct 26, 2024
Here's another example of this issue I came across:
Because there is a large number of documents in my source data, I'm having to batch my runs (calling pipeline.run() once per batch). This is fine as long as the number of documents stays constant throughout the pipeline. I would have liked to write a fetcher component that yields documents into the pipeline, but since it's the first step of the pipeline I can easily pull it out into a separate function instead.
Now, let's say we add a splitter to our pipeline. Ideally the splitter could yield each chunk, and the embedder could just consume chunks in batches of 32. However, the way components are currently designed makes this impossible.
Because of this, we now have two distinct batch sizes to consider: 1) the number of input documents prior to splitting, and 2) the number of documents (chunks) after splitting that are sent in each request to the embedding model. This can result in uneven batches being sent to the embedder, potentially causing more requests than are actually needed.
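Roughly what my setup looks like (the fetcher, the outer batch size of 10, and the in-memory store are placeholder details for illustration; the embedder's default batch_size of 32 handles the inner batching, and `OPENAI_API_KEY` is assumed to be set):

```python
from haystack import Document, Pipeline
from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore


def fetch_documents():
    """Placeholder for the real fetcher, which streams documents from my source."""
    for i in range(100_000):
        yield Document(content=f"long source document {i} ...")


document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("splitter", DocumentSplitter(split_by="word", split_length=200))
pipeline.add_component("embedder", OpenAIDocumentEmbedder())
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("splitter.documents", "embedder.documents")
pipeline.connect("embedder.documents", "writer.documents")

# Batch size 1: the number of *input* documents fed into each run.
# The embedder batches internally (32 chunks per request), but it only ever
# sees the chunks produced within a single run, so the last request of every
# run is usually a partial, "uneven" one.
batch = []
for doc in fetch_documents():
    batch.append(doc)
    if len(batch) == 10:
        pipeline.run({"splitter": {"documents": batch}})
        batch = []
if batch:
    pipeline.run({"splitter": {"documents": batch}})
```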