So this foldable-seq thing has still been bugging me :)
I think you mentioned this yourself, Paul, but I just wanted to note this important (and, I think, inherent) limitation in parallelizing fold over a lazy sequence. When running code like this:
```clojure
(->> s
     (map something-expensive)
     (foldable-seq)
     (reducers/fold +))
```
The calls to `something-expensive` aren't parallelized and must happen in serial. There's no way around this that I can see -- the nature of lazy sequences is that you can't ask for the next item until the current one has been returned to you.
Now arguably you could get around this via:
```clojure
(->> s
     (foldable-seq)
     (reducers/map something-expensive)
     (reducers/fold +))
```
Although (a) you kinda shouldn't have to worry about this, and (b) in a lot of cases it's too late -- someone's already given you a lazy sequence over things which take time to compute. (It might not even take much CPU -- it might be network or disk IO, for example.)
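The difference between the two pipelines can be mimicked on the JVM with plain `java.util.stream` (a loose analogy only, not the reducers API; `expensive` is a stand-in for `something-expensive`): mapping while the source is pulled serially versus mapping on the worker threads after the source has been split.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class MapPlacement {
    // Stand-in for something-expensive; pretend each call takes 100ms.
    static int expensive(int x) { return x * x; }

    public static void main(String[] args) {
        // Pipeline 1 analogue: the map runs while the source is consumed
        // serially, one call at a time -- like mapping a lazy seq first.
        List<Integer> mapped = IntStream.rangeClosed(1, 100)
            .boxed()
            .map(MapPlacement::expensive)
            .collect(Collectors.toList());
        int a = mapped.parallelStream().reduce(0, Integer::sum);

        // Pipeline 2 analogue: split the source first, so the expensive
        // calls land on the fork-join worker threads and overlap.
        int b = IntStream.rangeClosed(1, 100).boxed()
            .collect(Collectors.toList())
            .parallelStream()
            .map(MapPlacement::expensive)
            .reduce(0, Integer::sum);

        // Same answer either way; only the scheduling of expensive differs.
        System.out.println(a + " " + b);
    }
}
```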
To my mind this is a decent argument that lazy sequences just aren't the right data structure for parallelised processing of streams. You need a stream abstraction which lets you start computing the next item before the current item has been returned to you (where this is possible -- it isn't always of course, but e.g. for a mapped stream you can certainly start processing the next item while the mapping function is still being applied to the current item.)
~~Think there may be some useful lessons on how to do this stuff better in Scala's parallel collection library~~ turns out not so much for streams which don't fit in memory.
My idea for a different approach (which may or may not have legs, I'm still playing with it but initially seems promising) is a stream interface which is essentially "Iterable with concurrency-safe iterators".
The underlying iterators have a single `next` method which you can safely call concurrently, and where implementors try to avoid synchronising around the entire `next` implementation (and so serializing the iteration) wherever possible. With this interface it's possible (for example) to create a mapped iterator which lets consumer threads each do their own mapping in parallel, even if the underlying sequence has to be iterated over sequentially.
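Since the proposal is really a JVM-level interface, here's a minimal Java sketch of that mapped iterator (all names invented for illustration): the lock covers only the raw pull from the underlying source, and the mapping function runs outside it, so consumer threads overlap the expensive work even though the source itself is consumed sequentially.

```java
import java.util.Iterator;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.stream.IntStream;

public class MappedConcurrentIterator<A, B> {
    private final Iterator<A> source;   // touched only while holding the lock
    private final Function<A, B> f;

    public MappedConcurrentIterator(Iterator<A> source, Function<A, B> f) {
        this.source = source;
        this.f = f;
    }

    /** Next mapped item, or null when the source is exhausted. Thread-safe. */
    public B next() {
        A item;
        synchronized (source) {         // serialise only the raw pull
            if (!source.hasNext()) return null;
            item = source.next();
        }
        return f.apply(item);           // the expensive part runs unlocked
    }

    public static void main(String[] args) throws InterruptedException {
        final MappedConcurrentIterator<Integer, Integer> it =
            new MappedConcurrentIterator<>(
                IntStream.rangeClosed(1, 100).boxed().iterator(), x -> x * x);
        final AtomicInteger total = new AtomicInteger();
        // Two consumer threads each pull and map their own items.
        Runnable worker = () -> {
            Integer v;
            while ((v = it.next()) != null) total.addAndGet(v);
        };
        Thread t1 = new Thread(worker), t2 = new Thread(worker);
        t1.start(); t2.start(); t1.join(); t2.join();
        System.out.println(total.get());   // sum of squares 1..100
    }
}
```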
I'm still struggling with how best to handle non-commutative reductions in this sort of framework though, since the ordering gets a bit muddled when you have multiple threads pulling concurrently off an iterator. (Not only during the final reduction, but also in things like `filter`.) The order in which they receive items isn't necessarily the same as the order in which they asked for them. I think there may be a fix if you require `.next` to hand you a sequence number with each item which can be used to reconstruct the ordering later, but this complicates things a bit.
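The sequence-number idea might look something like this (again a hypothetical sketch, all names invented): `next` stamps each item with its position under the lock, concurrent consumers filter in whatever order they like, and the stamps let a non-commutative consumer rebuild the original order at the end.

```java
import java.util.*;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.stream.Collectors;

public class IndexedIterator<T> {
    // An item tagged with its position in the underlying sequence.
    public static class Indexed<T> {
        final long index; final T value;
        Indexed(long index, T value) { this.index = index; this.value = value; }
    }

    private final Iterator<T> source;
    private long nextIndex = 0;

    public IndexedIterator(Iterator<T> source) { this.source = source; }

    // The only synchronised step: pull an item and stamp it.
    public synchronized Indexed<T> next() {
        if (!source.hasNext()) return null;
        return new Indexed<T>(nextIndex++, source.next());
    }

    public static void main(String[] args) throws InterruptedException {
        final IndexedIterator<String> it =
            new IndexedIterator<>(Arrays.asList("a", "b", "c", "d", "e").iterator());
        // Two consumers pull concurrently and filter; kept items arrive
        // in a nondeterministic order.
        final Queue<Indexed<String>> kept = new ConcurrentLinkedQueue<>();
        Runnable worker = () -> {
            Indexed<String> x;
            while ((x = it.next()) != null)
                if (!x.value.equals("c")) kept.add(x);   // a concurrent filter
        };
        Thread t1 = new Thread(worker), t2 = new Thread(worker);
        t1.start(); t2.start(); t1.join(); t2.join();
        // The sequence numbers reconstruct the original ordering.
        List<String> ordered = kept.stream()
            .sorted(Comparator.comparingLong(i -> i.index))
            .map(i -> i.value)
            .collect(Collectors.toList());
        System.out.println(ordered);
    }
}
```

The cost, as noted above, is that every item now carries a tag and every order-sensitive consumer needs a reorder step.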
(One other advantage worth mentioning for iterator-based approaches is you can require the iterators to be closeable as part of the interface, and then have all the consumers (`reduce`, `fold` etc) make sure to close once they're done, getting around all those annoying issues around the interaction between resource scope and laziness in Clojure. Also you can do things like concat a bunch of file streams together and have the concatted stream open and close each file at the appropriate point...)
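The closeable-iterator point can be sketched like this (hypothetical names again): the iterator extends `AutoCloseable`, and the consumer owns its lifetime, so the resource is released exactly when the reduction finishes rather than whenever a lazy tail happens to be realised.

```java
import java.util.function.BinaryOperator;

public class CloseableReduce {
    interface CloseableIterator<T> extends AutoCloseable {
        T next();                   // null when exhausted
        @Override void close();     // release files, sockets, etc.
    }

    // The consumer owns the iterator's lifetime: close is guaranteed to
    // run when the reduction completes (or throws).
    static <T> T reduce(CloseableIterator<T> it, T init, BinaryOperator<T> op) {
        try (CloseableIterator<T> scoped = it) {
            T acc = init;
            T x;
            while ((x = scoped.next()) != null) acc = op.apply(acc, x);
            return acc;
        }
    }

    public static void main(String[] args) {
        final boolean[] closed = {false};
        // A toy iterator over 1..3 that records when it's closed,
        // standing in for something holding a real resource.
        CloseableIterator<Integer> it = new CloseableIterator<Integer>() {
            int i = 0;
            public Integer next() { return i < 3 ? ++i : null; }
            public void close() { closed[0] = true; }
        };
        int sum = reduce(it, 0, Integer::sum);
        System.out.println(sum + " " + closed[0]);
    }
}
```

A concatenated stream of file-backed iterators would do the same thing one level down: open each file lazily when its segment is first pulled, and close it as soon as that segment is exhausted.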