folding over (map something-expensive lazy-seq) #1

Open
mjwillson opened this issue Jul 15, 2013 · 0 comments
So this foldable-seq thing has still been bugging me :)

I think you mentioned this yourself Paul, but just wanted to note this important (and I think inherent) limitation in parallelizing fold over a lazy sequence. When running code like this:

(->> s
     (map something-expensive)
     (foldable-seq)
     (reducers/fold +))

The calls to something-expensive aren't parallelized and must happen in serial. There's no way around this that I can see -- the nature of lazy sequences is that you can't ask for the next item until the current one has been returned to you.

Now arguably you could get around this via:

(->> s
     (foldable-seq)
     (reducers/map something-expensive)
     (reducers/fold +))

Although (a) you kinda shouldn't have to worry about this, and (b) in a lot of cases it's too late -- someone's already given you a lazy sequence over things which take time to compute. (They might not even take much CPU; it could be network or disk IO, for example.)

To my mind this is a decent argument that lazy sequences just aren't the right data structure for parallelised processing of streams. You need a stream abstraction which lets you start computing the next item before the current item has been returned to you (where this is possible -- it isn't always, of course, but e.g. for a mapped stream you can certainly start processing the next item while the mapping function is still being applied to the current one).

Thought there might be some useful lessons on how to do this stuff better in Scala's parallel collections library, but it turns out not so much for streams which don't fit in memory.

My idea for a different approach (which may or may not have legs -- I'm still playing with it, but it initially seems promising) is a stream interface which is essentially "Iterable with concurrency-safe iterators".

The underlying iterators have a single next method which you can safely call concurrently, and where implementors try to avoid synchronising around the entire next implementation (and so serialising the iteration) wherever possible. With this interface it's possible, for example, to create a mapped iterator which lets consumer threads each do their own mapping in parallel, even if the underlying sequence has to be iterated over sequentially.
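Here's a minimal sketch of what I mean, assuming a hypothetical protocol -- ConcurrentIterator, next!, seq-iterator and mapped-iterator are all made-up names, not an existing API. Only the pull of the next cell from the underlying seq is atomic; the mapping function runs outside it, so consumer threads map in parallel:

```clojure
;; Hypothetical sketch -- these names are illustrative, not an existing API.
(defprotocol ConcurrentIterator
  (next! [it]
    "Returns the next item, or ::done when exhausted.
     Safe to call concurrently from multiple threads."))

;; Wrap a plain sequential seq: only the pull of the next cell is
;; atomic, nothing else is synchronised. (swap-vals! needs Clojure 1.9+.)
(defn seq-iterator [coll]
  (let [state (atom (seq coll))]
    (reify ConcurrentIterator
      (next! [_]
        (let [[old _] (swap-vals! state next)]
          (if old (first old) ::done))))))

;; A mapped iterator: f runs *outside* the atomic pull, so consumer
;; threads each do their own mapping in parallel even though the
;; underlying sequence is iterated serially.
(defn mapped-iterator [f it]
  (reify ConcurrentIterator
    (next! [_]
      (let [x (next! it)]
        (if (= ::done x) ::done (f x))))))
```

So e.g. four threads can drain a mapped iterator together and combine their partial sums:

```clojure
(let [it   (mapped-iterator #(* % %) (seq-iterator (range 5)))
      pull (fn [] (loop [acc 0]
                    (let [x (next! it)]
                      (if (= ::done x) acc (recur (+ acc x))))))
      parts (mapv deref (doall (repeatedly 4 #(future (pull)))))]
  (reduce + parts))
;; => 30, whichever threads happened to pull which items
```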

I'm still struggling with how best to handle non-commutative reductions in this sort of framework, though, since the ordering gets a bit muddled when you have multiple threads pulling concurrently off an iterator (not only during the final reduction, but also in things like filter). The order in which they receive items isn't necessarily the order in which they asked for them. I think there may be a fix if you require .next to hand you a sequence number with each item, which can be used to reconstruct the ordering later, but this complicates things a bit.
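One way to sketch the sequence-number fix (indexed-puller and ordered-items are made-up names for illustration): each pull atomically hands back a [sequence-number item] pair, consumers keep the pairs with whatever they compute, and the final ordered step sorts by index before combining:

```clojure
;; Hypothetical sketch -- names are illustrative, not an existing API.
;; Each pull atomically hands back [sequence-number item], so the
;; source ordering can be reconstructed after concurrent consumption.
(defn indexed-puller [coll]
  (let [state (atom (map vector (range) coll))]   ; ([0 x0] [1 x1] ...)
    (fn []
      (let [[old _] (swap-vals! state next)]      ; Clojure 1.9+
        (if old (first old) ::done)))))

;; n-threads pull concurrently; items may interleave across threads,
;; but sorting by the index restores source order, which is what a
;; non-commutative reduction needs.
(defn ordered-items [coll n-threads]
  (let [pull   (indexed-puller coll)
        worker (fn [] (loop [acc []]
                        (let [x (pull)]
                          (if (= ::done x) acc (recur (conj acc x))))))
        parts  (mapv deref (doall (repeatedly n-threads #(future (worker)))))]
    (map second (sort-by first (apply concat parts)))))
```

For example, `(ordered-items [:a :b :c :d] 3)` gives back `(:a :b :c :d)` regardless of which thread pulled which item. The cost is exactly the complication mentioned above: every intermediate stage has to carry the index along.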

(One other advantage worth mentioning for iterator-based approaches is you can require the iterators to be closeable as part of the interface, and then have all the consumers (reduce, fold etc) make sure to close once they're done, getting around all those annoying issues around the interaction between resource scope and laziness in clojure. Also you can do things like concat a bunch of file streams together and have the concatted stream open and close each file at the appropriate point...)
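A sketch of that closeable variant (CloseableIter, inext, iclose, reader-lines and reduce-iter are all illustrative names): the consumer owns the close, so resource scope no longer leaks through laziness:

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical sketch -- these names are illustrative, not an existing API.
(defprotocol CloseableIter
  (inext  [it] "Next item, or ::done when exhausted.")
  (iclose [it] "Release the underlying resource."))

(defn reader-lines
  "An iterator over the lines of a file; closing it closes the reader."
  [path]
  (let [rdr   (io/reader path)
        state (atom (line-seq rdr))]
    (reify CloseableIter
      (inext [_]
        (let [[old _] (swap-vals! state next)]   ; Clojure 1.9+
          (if old (first old) ::done)))
      (iclose [_] (.close rdr)))))

(defn reduce-iter
  "A reduce over a CloseableIter which guarantees the iterator is
   closed when the reduction finishes (or throws)."
  [f init it]
  (try
    (loop [acc init]
      (let [x (inext it)]
        (if (= ::done x) acc (recur (f acc x)))))
    (finally
      (iclose it))))
```

A concat over several such iterators could then hold `(fn [] (reader-lines path))` thunks, opening each file only when the previous iterator has returned ::done and been closed.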
