Add a Recursive Chunking strategy #8548

Open
davidsbatista opened this issue Nov 15, 2024 · 3 comments

@davidsbatista
Contributor

davidsbatista commented Nov 15, 2024

Use a set of predefined separators to split text recursively. The process follows these steps:

  • It starts with a list of separators, typically ordered from coarsest to finest (e.g., ["\n\n", "\n", " ", ""]: paragraph, line, word, character).
  • The splitter attempts to divide the text using the first separator ("\n\n" in this case).
  • If the resulting chunks are still larger than the specified chunk size, it moves to the next separator in the list ("\n").
  • This process continues recursively, using progressively less specific separators until the chunks meet the desired size criteria.
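
The steps above can be sketched roughly as follows. This is only a minimal illustration, not the proposed implementation: the function name `recursive_split` and its parameters are hypothetical, chunk size is measured in characters, separators are dropped from the output, and small neighbouring pieces are not merged back together.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split `text` so that every returned chunk has at most `chunk_size` characters.

    Hypothetical helper for illustration only.
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    def hard_cut(s: str) -> List[str]:
        # Last resort: cut purely by character count.
        return [s[i:i + chunk_size] for i in range(0, len(s), chunk_size)]

    if not separators:
        return hard_cut(text)

    sep, rest = separators[0], separators[1:]
    if sep == "":
        # The empty separator means "split between characters".
        return hard_cut(text)

    chunks: List[str] = []
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            # Still too large: retry this piece with the next, finer separator.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks


print(recursive_split("aaa bbb\nccc\n\nddddddd eee", ["\n\n", "\n", " ", ""], 7))
```

Only the pieces that are still too large are passed down to the next separator, so a short paragraph stays intact even when a long one next to it is broken up by line or by word.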
@davidsbatista davidsbatista self-assigned this Nov 15, 2024
@sjrl
Contributor

sjrl commented Nov 20, 2024

@davidsbatista This sounds great! One idea: could we have a way to indicate that a tool such as NLTK should do the sentence splitting? Normally, I think the list of separators would look like ["\n\n", ".", " "] to split by paragraph, then by sentence, and then by word. I was wondering whether we could replace "." with something like "nltk", or some other tag, to indicate that a separate algorithm should handle that level of splitting.
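
One possible shape for this, as a rough sketch only: a small registry maps reserved tags to splitting functions, and anything not in the registry is treated as a literal separator. The tag name `"sentence"`, the `SPECIAL_SPLITTERS` registry, and `split_once` are all made up for illustration, and the regex-based `naive_sentences` is just a stand-in so the example runs without extra dependencies; a real sentence tokenizer such as NLTK's could be registered in its place.

```python
import re
from typing import Callable, Dict, List

Splitter = Callable[[str], List[str]]


def naive_sentences(text: str) -> List[str]:
    # Stand-in for a real sentence tokenizer: split on whitespace
    # that follows '.', '!' or '?'. It will mis-split abbreviations.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


# Hypothetical registry: reserved tag -> splitting function.
SPECIAL_SPLITTERS: Dict[str, Splitter] = {"sentence": naive_sentences}


def split_once(text: str, separator: str) -> List[str]:
    """Split with a registered algorithm if `separator` is a tag, else literally."""
    splitter = SPECIAL_SPLITTERS.get(separator)
    if splitter is not None:
        return splitter(text)
    return [piece for piece in text.split(separator) if piece]


print(split_once("First one. Second one! Third?", "sentence"))
print(split_once("a b  c", " "))
```

A plain string tag can collide with a literal separator of the same spelling, so an actual design might prefer an explicit marker (for example a dedicated object or a prefix) over a bare word.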

What do you think?

@sjrl
Contributor

sjrl commented Nov 20, 2024

Also, will the splitting by separators (e.g. ["\n\n", ".", " "]) be handled by a regex splitter? I think supporting regex would be great: we could provide more complicated separators to better handle complex documents and do things like header detection.
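
To make the header-detection case concrete, here is a small sketch of what a regex separator could do. The helper name `regex_split` is hypothetical, and the Markdown-style header pattern is just one example of a "complicated separator"; the text is cut before each match so the header stays attached to the section it introduces.

```python
import re
from typing import List


def regex_split(text: str, pattern: str) -> List[str]:
    """Cut `text` immediately before every match of `pattern`.

    Hypothetical helper for illustration; the matched text is kept
    at the start of the chunk that follows it.
    """
    starts = [m.start() for m in re.finditer(pattern, text, flags=re.MULTILINE)]
    # Chunk boundaries: start of text, every match position, end of text.
    bounds = sorted({0, *starts, len(text)})
    pieces = [text[a:b] for a, b in zip(bounds, bounds[1:])]
    return [p for p in pieces if p.strip()]


doc = "# Title\nintro\n## Section\nbody\n"
# One to six '#' characters at the start of a line, followed by a space.
print(regex_split(doc, r"^#{1,6} "))
```

Keeping the match inside the chunk, rather than discarding it as `str.split` does, matters for headers because the header text is usually useful context for whatever follows it.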

@davidsbatista
Contributor Author

Those are good suggestions; I will take them into consideration.
