-
Notifications
You must be signed in to change notification settings - Fork 196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add split function to Data.Text.Conduit (closes #205) #207
base: master
Are you sure you want to change the base?
Conversation
splitUp = T.chunksOf positiveInt t1 | ||
l1 <- (CL.sourceList [t1] C.$= CT.split t2 C.$$ CL.consume) | ||
l2 <- (CL.sourceList splitUp C.$= CT.split t2 C.$$ CL.consume) | ||
l1 `shouldBe` l2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've pretty much never used QuickCheck before, so I'm not sure how valuable this test is. I can replace it with manual tests checking the same thing, if necessary.
OK, I see it now, you're correct. The problematic thing is that this will take up unbounded memory unfortunately, which ideally would not be necessary. An algorithm like those available in http://www.stackage.org/package/stringsearch would be necessary to avoid that. |
@snoyberg I don't fully follow, could you clarify? It seems like taking up unbounded memory is inherent to this type of function since the separator may never come, and the text read so far has to be kept around in case it does. It seems like the algorithms in stringsearch could help with running time performance but not necessarily memory usage? On the other hand, I am getting really high memory usage when running this function on a 50 MB file that doesn't have the specified separator in it. |
Imagine you're looking to break on the string "abc". If I get a chunk "qwertya", I know that "qwerty" will be before the chunk boundary, but "a" may be the beginning of it. I can therefore break it off, send it downstream, and continue. If the next chunk is "bc...", then I've found a boundary. If it starts with anything else, I know that it's not a boundary. You're completely correct about |
I will attempt to summarize: @snoyberg is pointing out that It seems to me that
|
@gregwebs I'm not positive that's what @snoyberg was saying, but I don't think that I think that's that this line from the Data.Text haddocks is getting at:
|
To clarify my thoughts: the currently proposed API necessitates collecting multiple chunks into memory until it finds a boundary. This defeats the streaming nature of conduit. Instead, I'd prefer an API that provides individual streams for each piece of text between chunks. This is very similar to the difference between the |
@MaxGabriel thanks for pointing to the docs. I think the worst case could still be greater than 2x memory usage though: if every other character is split that creates a lot of new Text objects. |
Don't merge; this is missing changelog updates and documentation. I just thought these tests could address @snoyberg's comment about not splitting correctly when the input is chunked. This test in particular: