Add split function to Data.Text.Conduit (closes #205) #207

Open

MaxGabriel wants to merge 1 commit into master
Conversation

MaxGabriel
Contributor

Don't merge; this is missing changelog updates and documentation. I just thought these tests could address @snoyberg's comment about not splitting correctly when the input is chunked. This test in particular:

it "handles separators on a chunk boundary" $ do
    (CL.sourceList ["aX","Xb"] C.$= CT.split "XX" C.$$ CL.consume) ==
        [["a","b"]]

-- t1, t2, and positiveInt are QuickCheck-generated inputs: splitting the
-- chunked input should give the same result as splitting the whole text.
let splitUp = T.chunksOf positiveInt t1
l1 <- CL.sourceList [t1] C.$= CT.split t2 C.$$ CL.consume
l2 <- CL.sourceList splitUp C.$= CT.split t2 C.$$ CL.consume
l1 `shouldBe` l2
@MaxGabriel
Contributor Author

I've pretty much never used QuickCheck before, so I'm not sure how valuable this test is. I can replace it with manual tests checking the same thing, if necessary.

@snoyberg
Owner

OK, I see it now; you're correct. The problem is that this will unfortunately take up unbounded memory, which ideally would not be necessary. An algorithm like those available in http://www.stackage.org/package/stringsearch would be necessary to avoid that.

@MaxGabriel
Contributor Author

@snoyberg I don't fully follow, could you clarify? It seems like taking up unbounded memory is inherent to this type of function since the separator may never come, and the text read so far has to be kept around in case it does. lines has the same issue in the case where there's no newline separator.

It seems like the algorithms in stringsearch could help with running time performance but not necessarily memory usage?

On the other hand, I am getting really high memory usage when running this function on a 50 MB file that doesn't have the specified separator in it.

@snoyberg
Owner

Imagine you're looking to break on the string "abc". If I get a chunk "qwertya", I know that "qwerty" will be before the chunk boundary, but "a" may be the beginning of it. I can therefore break it off, send it downstream, and continue. If the next chunk is "bc...", then I've found a boundary. If it starts with anything else, I know that it's not a boundary.
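
A minimal sketch of that chunk-boundary check (safeSplit is a name made up here for illustration, not code from the PR; it assumes complete separator occurrences inside the chunk are handled separately):

import qualified Data.Text as T

-- Split a chunk into (safe, held): `safe` provably cannot contain the
-- start of a separator match and can be sent downstream immediately;
-- `held` is the longest suffix of the chunk that is a proper prefix of
-- the separator, so it must wait for the next chunk.
safeSplit :: T.Text -> T.Text -> (T.Text, T.Text)
safeSplit sep chunk = go (min (T.length sep - 1) (T.length chunk))
  where
    go 0 = (chunk, T.empty)
    go n
      | T.takeEnd n chunk `T.isPrefixOf` sep =
          (T.dropEnd n chunk, T.takeEnd n chunk)
      | otherwise = go (n - 1)

With the example above, safeSplit "abc" "qwertya" gives ("qwerty", "a"): "qwerty" goes downstream right away, and only "a" is held as a possible start of a match.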

You're completely correct that lines has unbounded memory usage, which is why it should only be used on data that you have control over (or with some extra method to ensure memory protection). Instead, an interface like line is probably better for a function like this, since it allows for completely deterministic memory usage.
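
For comparison, a sketch of how a line-style interface bounds memory, assuming conduit-combinators' line (which runs an inner consumer on a single line and discards the rest of that line), takeE, and fold:

import Data.Conduit (ConduitM, (=$=))
import qualified Data.Conduit.Combinators as CC
import Data.Text (Text)

-- The inner consumer decides how much of the line to keep: here at most
-- 80 characters, no matter how long the line actually is.
firstEighty :: Monad m => ConduitM Text o m Text
firstEighty = CC.line (CC.takeE 80 =$= CC.fold)

The same shape for split would let the caller bound memory per piece in exactly the same way.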

@gregwebs
Contributor

gregwebs commented Apr 2, 2015

I will attempt to summarize:

  • @snoyberg is pointing out that splitOn is not streaming: it builds up a list of results.
  • @MaxGabriel is pointing out that, in terms of memory consumption, this should at worst double memory usage, since the original string is still in memory.

It seems to me that

  • to avoid unbounded memory usage this would need to ask for a text of bounded length.
  • this function should avoid the doubling of memory and send results back immediately.

@MaxGabriel
Contributor Author

@gregwebs I'm not positive that's what @snoyberg was saying, but I don't think that Data.Text.splitOn doubles memory usage. Its implementation takes in a Text value (backed by an Array) to be split up. Then, instead of creating new arrays for the split-up substrings, it reuses the original array with different indices for the beginning and end of each substring.

I think that's what this line from the Data.Text haddocks is getting at:

Splitting functions in this library do not perform character-wise copies to create substrings; they just construct new Texts that are slices of the original.

@snoyberg
Owner

snoyberg commented Apr 2, 2015

To clarify my thoughts: the currently proposed API necessitates collecting multiple chunks into memory until it finds a boundary. This defeats the streaming nature of conduit. Instead, I'd prefer an API that provides an individual stream for each piece of text between separators. This is very similar to the difference between the lines and line functions in conduit-combinators.
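
To make that concrete, here is a minimal sketch of one fully streaming shape (splitStream and the Piece type are invented here for illustration, not taken from the PR; sep must be non-empty, as splitOn itself requires). Each piece is delivered as a run of Fragments ended by a Boundary marker, so the conduit itself never holds more than length sep - 1 characters between chunks:

import Control.Monad (forM_, unless)
import Data.Conduit (Conduit, await, yield)
import qualified Data.Text as T

data Piece = Fragment T.Text | Boundary
    deriving (Show, Eq)

splitStream :: Monad m => T.Text -> Conduit T.Text m Piece
splitStream sep = go T.empty
  where
    go buf = do
        mchunk <- await
        case mchunk of
            Nothing -> unless (T.null buf) (yield (Fragment buf))
            Just chunk -> do
                -- buf is shorter than sep, so a separator spanning the
                -- previous chunk boundary appears inside buf <> chunk.
                let pieces = T.splitOn sep (buf `T.append` chunk)
                forM_ (init pieces) $ \p -> do
                    unless (T.null p) (yield (Fragment p))
                    yield Boundary
                -- Keep only the longest suffix that could still begin a
                -- separator match; everything before it goes downstream.
                let lastPiece = last pieces
                    m = min (T.length sep - 1) (T.length lastPiece)
                    holdLen = head ([ n | n <- [m, m - 1 .. 1]
                                        , T.takeEnd n lastPiece `T.isPrefixOf` sep
                                    ] ++ [0])
                    (out, held) = T.splitAt (T.length lastPiece - holdLen) lastPiece
                unless (T.null out) (yield (Fragment out))
                go held

On the test input from earlier, CL.sourceList ["aX","Xb"] C.$= splitStream "XX" produces Fragment "a", Boundary, Fragment "b": both pieces come out without any whole piece ever being buffered.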

@gregwebs
Contributor

gregwebs commented Apr 2, 2015

@MaxGabriel thanks for pointing to the docs. I think the worst case could still be more than 2x memory usage, though: if the input is split on every other character, that creates a lot of new Text values.
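
To put a rough number on that: each slice shares the original array, but is still a separate Text value carrying its own offset and length fields, so the overhead is a few machine words per slice:

import qualified Data.Text as T

-- "aXaXaXaX" split on "X" yields five slices: ["a","a","a","a",""].
-- All five share one underlying array, but each is a distinct Text value.
worstCaseSlices :: Int
worstCaseSlices =
    length (T.splitOn (T.pack "X") (T.replicate 4 (T.pack "aX")))  -- 5

So splitting on every other character produces roughly one Text value per two input characters, on top of the shared array.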
