Components expecting List[...] input should use Iterable[...] instead #8494

Open

bendavis78 opened this issue Oct 26, 2024 · 1 comment

bendavis78 commented Oct 26, 2024

For example, let's say I have a generator that yields my documents:

def get_documents():
    # Lazily yield documents one at a time instead of collecting them into a list
    yield from some_other_generator()

If I try to pass this generator to the OpenAIDocumentEmbedder, I get an error saying the input must be a list, so I am forced to exhaust the Python generator and load all documents into memory at once.

Many components in the Haystack library require a List where an Iterable would do just fine. For example, OpenAIDocumentEmbedder could be updated to accept Python generators as well as lists, making the process more memory efficient.
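
As a workaround today (and as a sketch of what Iterable support could look like), the caller can re-batch a generator into fixed-size lists before handing them to the embedder. This is only a minimal sketch: the batched helper below is hypothetical (not part of Haystack), and embedder is assumed to be an already-constructed OpenAIDocumentEmbedder:

from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    # Yield lists of up to `size` items without materializing the whole input
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

# Feed the embedder fixed-size batches instead of one giant list
for batch in batched(get_documents(), 32):
    embedder.run(documents=batch)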

bendavis78 changed the title from "Components expecting List[...] input should use Iterable" to "Components expecting List[...] input should use Iterable[...] instead" on Oct 26, 2024
bendavis78 (Author) commented Nov 18, 2024

Here's another example of this issue I came across:

Because there are a large number of documents in my source data, I have to batch my runs, calling pipeline.run() once per batch. This is fine as long as the number of documents stays constant throughout the pipeline. (I would have liked to write a fetcher component that yields documents into the pipeline, but since fetching is the first step, I can easily pull it out into a separate function.)

Now, let's say we add a splitter to our pipeline. Ideally the splitter could yield each chunk, and the embedder could consume those chunks in batches of 32. However, the way components are currently designed makes this impossible.

Because of this, we now have two distinct batch sizes to consider: 1) the number of input documents prior to splitting, and 2) the number of post-split documents (chunks) sent in each request to the embedding model. This can result in uneven batches being sent to the embedder, potentially causing more requests than are actually needed. If components consumed iterables, the two could collapse into one, as sketched below.
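
Here is a rough sketch of what that could look like. It reuses the hypothetical batched helper and the get_documents generator from above, and assumes splitter.run(documents=[...]) returns its chunks under a "documents" key, as Haystack components conventionally do; none of this works with the current List-only signatures:

def iter_chunks(docs, splitter):
    # Lazily split each document and stream out the resulting chunks
    for doc in docs:
        yield from splitter.run(documents=[doc])["documents"]

# Every request to the embedding model now carries exactly 32 chunks
# (except possibly the last), regardless of how the inputs were batched.
for batch in batched(iter_chunks(get_documents(), splitter), 32):
    embedder.run(documents=batch)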
