Add split function to Data.Text.Conduit (closes #205) #207

Open

MaxGabriel wants to merge 1 commit into master
Conversation

MaxGabriel
Contributor

Don't merge; this is missing changelog updates and documentation. I just thought these tests could address @snoyberg's comment about not splitting correctly when the input is chunked. This test in particular:

it "handles separators on a chunk boundary" $ do
    (CL.sourceList ["aX","Xb"] C.$= CT.split "XX" C.$$ CL.consume) ==
        [["a","b"]]

-- t1, t2, and positiveInt are QuickCheck-generated inputs: splitting the
-- chunked input should give the same result as splitting the whole text.
let splitUp = T.chunksOf positiveInt t1
l1 <- CL.sourceList [t1] C.$= CT.split t2 C.$$ CL.consume
l2 <- CL.sourceList splitUp C.$= CT.split t2 C.$$ CL.consume
l1 `shouldBe` l2
@MaxGabriel
Contributor Author

I've pretty much never used QuickCheck before, so I'm not sure how valuable this test is. I can replace it with manual tests checking the same thing, if necessary.

@snoyberg
Owner

OK, I see it now; you're correct. The problem is that this will unfortunately take up unbounded memory, which ideally would not be necessary. An algorithm like those available in http://www.stackage.org/package/stringsearch would be necessary to avoid that.

@MaxGabriel
Contributor Author

@snoyberg I don't fully follow, could you clarify? It seems like taking up unbounded memory is inherent to this type of function since the separator may never come, and the text read so far has to be kept around in case it does. lines has the same issue in the case where there's no newline separator.

It seems like the algorithms in stringsearch could help with running time performance but not necessarily memory usage?

On the other hand, I am getting really high memory usage when running this function on a 50 MB file that doesn't have the specified separator in it.

@snoyberg
Owner

Imagine you're looking to break on the string "abc". If I get a chunk "qwertya", I know that "qwerty" will be before the chunk boundary, but "a" may be the beginning of it. I can therefore break it off, send it downstream, and continue. If the next chunk is "bc...", then I've found a boundary. If it starts with anything else, I know that it's not a boundary.
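
A minimal sketch of that chunk-boundary check (safeSplit is a name made up here for illustration, not code from the PR; it assumes complete separator occurrences inside the chunk are handled separately):

import qualified Data.Text as T

-- Split a chunk into (safe, held): `safe` provably cannot contain the
-- start of a separator match and can be sent downstream immediately;
-- `held` is the longest suffix of the chunk that is a proper prefix of
-- the separator, so it must wait for the next chunk.
safeSplit :: T.Text -> T.Text -> (T.Text, T.Text)
safeSplit sep chunk = go (min (T.length sep - 1) (T.length chunk))
  where
    go 0 = (chunk, T.empty)
    go n
      | T.takeEnd n chunk `T.isPrefixOf` sep =
          (T.dropEnd n chunk, T.takeEnd n chunk)
      | otherwise = go (n - 1)

With the example above, safeSplit "abc" "qwertya" gives ("qwerty", "a"): "qwerty" goes downstream right away, and only "a" is held as a possible start of a match.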

You're completely correct that lines has unbounded memory usage, which is why it should only be used on data that you have control over (or with some extra method to ensure memory protection). Instead, an interface like line is probably better for a function like this, since it allows for completely deterministic memory usage.
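
For comparison, a sketch of how a line-style interface bounds memory, assuming conduit-combinators' line (which runs an inner consumer on a single line and discards the rest of that line), takeE, and fold:

import Data.Conduit (ConduitM, (=$=))
import qualified Data.Conduit.Combinators as CC
import Data.Text (Text)

-- The inner consumer decides how much of the line to keep: here at most
-- 80 characters, no matter how long the line actually is.
firstEighty :: Monad m => ConduitM Text o m Text
firstEighty = CC.line (CC.takeE 80 =$= CC.fold)

The same shape for split would let the caller bound memory per piece in exactly the same way.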

@gregwebs
Contributor

gregwebs commented Apr 2, 2015

I will attempt to summarize:

  • @snoyberg is pointing out that splitOn is not streaming: it builds up a list of results.
  • @MaxGabriel is pointing out that, in terms of memory consumption, this should at worst double memory usage, since the original string is still in memory.

It seems to me that

  • to avoid unbounded memory usage this would need to ask for a text of bounded length.
  • this function should avoid the doubling of memory and send results back immediately.

@MaxGabriel
Contributor Author

@gregwebs I'm not positive that's what @snoyberg was saying, but I don't think that Data.Text.splitOn doubles memory usage. Its implementation takes in a Text value (backed by an Array) to be split up. Then, instead of creating new arrays for the split-up substrings, it reuses the original array with different indices for the beginning and end of each substring.

I think that's what this line from the Data.Text haddocks is getting at:

Splitting functions in this library do not perform character-wise copies to create substrings; they just construct new Texts that are slices of the original.

@snoyberg
Owner

snoyberg commented Apr 2, 2015

To clarify my thoughts: the currently proposed API necessitates collecting multiple chunks into memory until it finds a boundary. This defeats the streaming nature of conduit. Instead, I'd prefer an API that provides an individual stream for each piece of text between separators. This is very similar to the difference between the lines and line functions in conduit-combinators.
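
To make that concrete, here is a minimal sketch of one fully streaming shape (splitStream and the Piece type are invented here for illustration, not taken from the PR; sep must be non-empty, as splitOn itself requires). Each piece is delivered as a run of Fragments ended by a Boundary marker, so the conduit itself never holds more than length sep - 1 characters between chunks:

import Control.Monad (forM_, unless)
import Data.Conduit (Conduit, await, yield)
import qualified Data.Text as T

data Piece = Fragment T.Text | Boundary
    deriving (Show, Eq)

splitStream :: Monad m => T.Text -> Conduit T.Text m Piece
splitStream sep = go T.empty
  where
    go buf = do
        mchunk <- await
        case mchunk of
            Nothing -> unless (T.null buf) (yield (Fragment buf))
            Just chunk -> do
                -- buf is shorter than sep, so a separator spanning the
                -- previous chunk boundary appears inside buf <> chunk.
                let pieces = T.splitOn sep (buf `T.append` chunk)
                forM_ (init pieces) $ \p -> do
                    unless (T.null p) (yield (Fragment p))
                    yield Boundary
                -- Keep only the longest suffix that could still begin a
                -- separator match; everything before it goes downstream.
                let lastPiece = last pieces
                    m = min (T.length sep - 1) (T.length lastPiece)
                    holdLen = head ([ n | n <- [m, m - 1 .. 1]
                                        , T.takeEnd n lastPiece `T.isPrefixOf` sep
                                    ] ++ [0])
                    (out, held) = T.splitAt (T.length lastPiece - holdLen) lastPiece
                unless (T.null out) (yield (Fragment out))
                go held

On the test input from earlier, CL.sourceList ["aX","Xb"] C.$= splitStream "XX" produces Fragment "a", Boundary, Fragment "b": both pieces come out without any whole piece ever being buffered.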

@gregwebs
Contributor

gregwebs commented Apr 2, 2015

@MaxGabriel thanks for pointing to the docs. I think the worst case could still be more than 2x memory usage, though: if the input is split on every other character, that creates a lot of new Text values.
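
To put a rough number on that: each slice shares the original array, but is still a separate Text value carrying its own offset and length fields, so the overhead is a few machine words per slice:

import qualified Data.Text as T

-- "aXaXaXaX" split on "X" yields five slices: ["a","a","a","a",""].
-- All five share one underlying array, but each is a distinct Text value.
worstCaseSlices :: Int
worstCaseSlices =
    length (T.splitOn (T.pack "X") (T.replicate 4 (T.pack "aX")))  -- 5

So splitting on every other character produces roughly one Text value per two input characters, on top of the shared array.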
