Document Splitter always returns 1 document for split_type="passage" in pdfs #8491
Comments
Hello! The fact that your input is PDF or text does not have an impact on splitting. This works, for example:
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter
doc = Document(content="Hello World.\n\nMy name is John Doe.\n\nI live in Berlin", meta={"name": "doc1"})
splitter = DocumentSplitter(split_length=1, split_by="passage")
docs = splitter.run([doc])
print(docs)
# I get 3 documents
Thanks @anakin87 for your response. I did mention above that it works with txt files, as your example confirms. My issue is with PDFs (using pypdf). I will revisit this by creating a simpler PDF with two distinct line breaks between paragraphs.
I also ran into the issue, but I think it might just be that the PDF format is a bit of a mess when people create documents. Some of the PDFs I have can be split correctly, but some can't.
Hmm. This was a simple text-based PDF created via Google Docs. I thought it would be a "simple" PDF :-). For now, I am testing with txt and md files until I get a handle on this. Thanks @lbux for sharing.
I'll give it another shot tomorrow just to be sure. I do remember that the same document had no issues when splitting by words or sentences, but it did fail when I split by passage. I wonder if it's failing to read the
For what it is worth, I experimented with 3 PDFs and found that the entire document was always a single chunk.
You are correct. The issue persists regardless of what converter is used (
The issue is with the converters. With no parameters set, PyPDF handles PDFs a bit worse than PDFMiner. However, neither implementation is configured to infer paragraph breaks from spacing or layout analysis. When we see something we consider a paragraph, it is stored as "\n" as opposed to the "\n\n" that the splitter expects.
And this is how PDFMiner converts it:
This is how PyPDF converts it:
@anakin87 I would love it if someone from the team could take a deeper look into this. If my findings are right, many customers could be converting their documents incorrectly.
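The mismatch described in this comment can be reproduced without any PDF. Below is a minimal sketch in plain Python, assuming that passage splitting amounts to cutting the text on "\n\n". `split_passages` is a hypothetical stand-in, not Haystack's actual implementation, and `pdf_like` is made-up text that imitates converter output with single newlines.

```python
from typing import List


def split_passages(text: str) -> List[str]:
    """Hypothetical stand-in for passage splitting: cut on blank lines."""
    return [chunk for chunk in text.split("\n\n") if chunk.strip()]


# Text as a plain-text file would provide it: paragraphs separated by "\n\n".
txt_like = "Hello World.\n\nMy name is John Doe.\n\nI live in Berlin"

# Assumed shape of the converter output: only a single "\n" between paragraphs.
pdf_like = "Hello World.\nMy name is John Doe.\nI live in Berlin"

print(len(split_passages(txt_like)))  # 3
print(len(split_passages(pdf_like)))  # 1
```

With only single newlines in the input there is no "\n\n" to cut on, so the whole document comes back as one passage, which matches the behavior reported in this thread.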
@sjrl Have you by any chance encountered this problem in the past?
@anakin87 We also use PyPDF a lot, but we typically use either:
so we don't have much experience with splitting by passage and may not have noticed the missing newlines.
It seems PDFMiner provides options that are easier to understand and customize. Specifically, if you mess with the
I think the main issue (with PDFMiner and probably PyPDF) is in the implementation of the converter:
Our downstream task expects "\n\n", but the current implementation does not add the delimiters. It seems to concatenate all the text of each LTTextContainer, expecting the delimiters to be there already, which is not the case. If we explicitly add them like so:
It is true that
With these changes, I was able to get the correct, expected output: This is a test document. Here we have a sentence.\n\n
Further testing would probably have to be done, and I stripped some of the text to make it easier to debug (which should actually be done in the cleaner); however, I am confident this is the root cause.
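As a rough illustration of the proposal in this comment, the sketch below contrasts plain concatenation with joining on an explicit "\n\n". It assumes each layout container yields the text of one paragraph ending in a single "\n". The names `containers`, `join_naive`, and `join_with_breaks` are hypothetical, the strings stand in for real container text, and the converter's actual code is not reproduced here.

```python
from typing import List


def join_naive(containers: List[str]) -> str:
    """Concatenate container text as-is, relying on delimiters already present."""
    return "".join(containers)


def join_with_breaks(containers: List[str]) -> str:
    """Strip each container and join with an explicit paragraph delimiter."""
    return "\n\n".join(c.strip() for c in containers if c.strip())


# Hypothetical container text, one entry per detected paragraph.
containers = ["This is a test document.\n", "Here we have a sentence.\n"]

print(repr(join_naive(containers)))
# 'This is a test document.\nHere we have a sentence.\n'
print(repr(join_with_breaks(containers)))
# 'This is a test document.\n\nHere we have a sentence.'
```

Only the second form contains the "\n\n" that a passage splitter can cut on. Stripping inside the join is a simplification for the sketch; as noted above, that cleanup arguably belongs in the cleaner.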
@lbux thanks for debugging this issue! We should probably investigate further, but your proposal sounds reasonable.
Haystack's PyPDF's conversion method is this:
This also has the same limitation where we don't add the
I then looked more into
Of course! My current project is explicitly dependent on retrieving exact paragraphs, so I had to fix this one way or another haha. Hoping for a permanent solution when time allows.
Describe the bug
When using Document Splitter with a PDF and split_by="passage", the result is always one document. This is using pypdf.
Expected behavior
My understanding is that it splits on at least two line breaks (\n\n).
Additional context
When I tested with plain text, it seems to split correctly.
To Reproduce
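from pathlib import Path

from haystack.components.converters import MarkdownToDocument, PyPDFToDocument, TextFileConverter
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.routers import FileTypeRouter

# The component setup is not shown in the original report; these instances are assumed.
file_type_router = FileTypeRouter(mime_types=["text/plain", "application/pdf", "text/markdown"])
text_file_converter = TextFileConverter()
pdf_converter = PyPDFToDocument()
markdown_converter = MarkdownToDocument()

dir = '...'
files = [
    {"filename": "rules.pdf", "meta": {"split_by": "passage", "split_length": 1, "split_overlap": 0, "split_threshold": 0}},
    {"filename": "rules.txt", "meta": {"split_by": "passage", "split_length": 1, "split_overlap": 0, "split_threshold": 0}},
]
for file in files:
    # set the filepath
    file_path = Path(dir) / file["filename"]
    # route the file to a converter based on its MIME type
    router_res = file_type_router.run(sources=[file_path])
    txt_docs = []
    if 'text/plain' in router_res:
        txt_docs = text_file_converter.run(sources=router_res['text/plain'])
    elif 'application/pdf' in router_res:
        txt_docs = pdf_converter.run(sources=router_res['application/pdf'])
    elif 'text/markdown' in router_res:
        txt_docs = markdown_converter.run(sources=router_res['text/markdown'])
    document_splitter = DocumentSplitter(
        split_by=file['meta']['split_by'],
        split_length=file['meta']['split_length'],
        split_overlap=file['meta']['split_overlap'],
        split_threshold=file['meta']['split_threshold'],
    )
    # split only the first converted document and report the number of chunks
    splitter_res = document_splitter.run([txt_docs['documents'][0]])
    print(len(splitter_res['documents']))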
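Before blaming the splitter, it can help to look at what the converter actually produced. The helper below is a small diagnostic sketch in plain Python. `newline_report` is a hypothetical name, and the counts are only a rough indicator of whether the text contains the blank-line breaks that passage splitting relies on.

```python
def newline_report(text: str) -> dict:
    """Summarize the line-break structure of converted text."""
    return {
        "total_newlines": text.count("\n"),
        # Non-overlapping occurrences of a blank-line break.
        "paragraph_breaks": text.count("\n\n"),
        # repr() makes the escape sequences visible when printed.
        "preview": repr(text[:80]),
    }


# Made-up samples imitating PDF-converted and plain-text content.
print(newline_report("Rule one.\nRule two.\nRule three."))
print(newline_report("Rule one.\n\nRule two.\n\nRule three."))
```

In the reproduction script above, the same helper could be applied to the content of the first converted document. A result with zero paragraph breaks would explain why only a single chunk comes back.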