Expr instances are serialized as part of the graph #323
Comments
In this case,
Well, my mistake. Sorry. I haven't really tried to find out where this is happening yet; I mixed it up with something else. What's interesting is that whatever objects trigger this path have empty caches, according to my tests.
No problem - I just wanted to put your mind at ease about that particular detail. I'm glad to see you investigating serialization behavior in general.
I suppose I don't really understand why expressions are being pickled at all yet. Are you using a version of dask/distributed where expressions are shipped to the scheduler?
This is just using standard main, nothing sophisticated. This is also a worker profile, so even if we did send expressions to the scheduler, the worker should never see this. The only way this makes sense is if we accidentally pickle something by value. I'll spin up a debugger. This should be easy to track down.
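One way to track this down (a sketch only; the thread doesn't say how the debugging was actually done, and the import path dask_expr._expr.Expr is an assumption) is to run the graph through a custom pickler whose reducer_override raises as soon as it sees an expression instance, so the traceback points at the offending task:

```python
import io
import pickle

from dask_expr._expr import Expr  # assumed import path


class NoExprPickler(pickle.Pickler):
    """Pickler that raises as soon as it encounters an Expr instance.

    Running graph serialization through this pickler turns by-value
    pickling of an expression into an immediate traceback, which points
    at the task that carries the instance.
    """

    def reducer_override(self, obj):
        if isinstance(obj, Expr):
            raise RuntimeError(f"Expr instance pickled by value: {obj!r}")
        # NotImplemented falls back to the normal pickling behaviour
        return NotImplemented


def dumps_no_expr(obj):
    buf = io.BytesIO()
    NoExprPickler(buf, protocol=pickle.HIGHEST_PROTOCOL).dump(obj)
    return buf.getvalue()


# Hypothetical usage, where `df` is some dask-expr collection:
# dumps_no_expr(dict(df.__dask_graph__()))
```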
Okay, I see. Thanks for clarifying! Then I agree that we shouldn't see any expression objects show up at all here (and should fix the fact that we do).
So, the expression that ends up on the worker is deeply buried in a fused task.
Agreed that serializing task graphs shouldn't carry along expressions. Probably some method is getting carried along somewhere. If we do serialize, then I'd expect it to be cheap. It looks like the errant piece of code is coming out of the Parquet stuff, which does do more interesting caching than is common in other Expr subclasses. It might make sense to review that in particular.
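For reference, a minimal sketch of what "cheap" serialization could mean here, using a simplified stand-in class rather than dask-expr's actual Expr (the operands attribute mirrors the Expr convention; the cache attribute name is made up):

```python
class SimpleExpr:
    """Simplified stand-in for an expression node, not dask-expr's Expr."""

    def __init__(self, *operands):
        self.operands = list(operands)
        self._meta_cache = None  # hypothetical cached metadata

    def __reduce__(self):
        # Rebuild purely from the class and its operands. Cached state is
        # deliberately left out, so the pickle stays small and the receiving
        # side only recomputes things like metadata if it actually needs them.
        return (type(self), tuple(self.operands))
```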
A simple, albeit less elegant, way to ensure this isn't serialized is to just increase a counter whenever an object is serialized and assert that this counter hasn't changed.
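A rough sketch of that counter idea (the helper name is made up, the import path is assumed, and the hook into __reduce__ is illustrative rather than existing dask-expr code):

```python
import contextlib

from dask_expr._expr import Expr  # assumed import path


@contextlib.contextmanager
def count_expr_serializations():
    """Count how often an Expr is reduced for pickling inside the block."""
    counter = {"n": 0}
    had_own = "__reduce__" in Expr.__dict__  # Expr may only inherit object.__reduce__
    original = Expr.__reduce__

    def counting_reduce(self):
        counter["n"] += 1
        return original(self)

    Expr.__reduce__ = counting_reduce
    try:
        yield counter
    finally:
        if had_own:
            Expr.__reduce__ = original
        else:
            del Expr.__reduce__


# Hypothetical usage in a test, where `df` is some dask-expr collection:
# import pickle
# with count_expr_serializations() as counter:
#     pickle.dumps(dict(df.__dask_graph__()))
# assert counter["n"] == 0, "an Expr instance leaked into the serialized graph"
```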
Yes, I can also see a ParquetReader expression... Digging in here a bit points to https://github.com/dask-contrib/dask-expr/blob/52346744cf530ca86108553fb0427f552d91854e/dask_expr/_groupby.py#L43. I'll have to run this again.
Thanks for investigating that @fjetter.
I think the case we've seen has been addressed, but I don't think we have tests for this yet. We should have tests, so I suggest keeping this open until we have them.
Yeah, I agree that we should keep this open until we have test coverage. Still not sure about the best way to do this.
For constraints that have to apply throughout the project, my best experience comes from using assert_eq-like functions; they're broadly used throughout the tests. If we can make a check that is sufficiently cheap, then we can apply it to every Expr that comes through that function. Such a check might be serializing and then seeing if some reliably bad signal comes through (such as any class that is part of the dask_expr module), or having a counter on the reduce method (although I don't particularly like adding state to expressions just for testing). Another approach might be to monkey-patch an exception into the Expr reduce method during testing.
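Sticking with the "reliably bad signal" idea, here is a sketch of what such an assert_eq-side check could look like. The helper name is made up, it assumes the collection exposes __dask_graph__, and it is intentionally coarse: a dask_expr function pickled by reference would trip it just like a class pickled by value, so the signal might need refining.

```python
import pickle
import pickletools


def assert_no_expr_in_graph(collection):
    """Fail if the serialized task graph references anything from dask_expr.

    An expression pickled by value shows up as a module/name reference in
    the pickle stream; so would a dask_expr function pickled by reference,
    hence this is a starting point rather than a precise test.
    """
    graph = dict(collection.__dask_graph__())
    payload = pickle.dumps(graph, protocol=pickle.HIGHEST_PROTOCOL)
    for _opcode, arg, _pos in pickletools.genops(payload):
        if isinstance(arg, str) and arg.startswith("dask_expr"):
            raise AssertionError(
                f"serialized graph references {arg!r}; "
                "an expression object may be pickled by value"
            )
```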
I submitted a rough solution in #331 - suggestions are very welcome.
Running some profiling on the TPC-H benchmarks, I noticed a significant portion of the main thread being blocked by the calculation of _meta during the deserialization of Expr objects. This screenshot indicates that the expr instance being (de-)serialized is the Merge expression.
Looking through the traces a bit more, I can also find references to a parquet layer, which almost looks like it would fetch some data (but I hope this is false). It is definitely performing some pyarrow table conversion.
speedscope file (a Worker)
tls-10_0_32_152-45157.json.zip
Just skimming the code, I saw stuff like this
https://github.com/dask-contrib/dask-expr/blob/52346744cf530ca86108553fb0427f552d91854e/dask_expr/_concat.py#L183
where a method of an expression is used to define a dask task. This inevitably requires us to serialize the instance itself, which is something we should avoid. Ideally, the graph only includes global functions, so that we can pickle by reference and not by value.
I don't know if there are other places where this pattern comes up.
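To make the concern concrete, here is a toy illustration of the two patterns (SomeExpr, _combine, and combine are all made-up names, not the code at the linked line): a bound method in a task drags the whole instance along by value, while a module-level function is pickled by reference.

```python
import pickle


class SomeExpr:
    """Hypothetical stand-in for an expression class."""

    def __init__(self, payload):
        self.payload = payload  # imagine cached metadata, readers, etc.

    def _combine(self, *parts):
        return parts


def combine(*parts):
    """Module-level equivalent, picklable by reference (just its name)."""
    return parts


expr = SomeExpr(payload="large cached state" * 1000)

# Anti-pattern: a task holding a bound method. Pickling it serializes the
# whole `expr` instance by value, because the bound method carries __self__.
bad_task = (expr._combine, "part-0", "part-1")
print(len(pickle.dumps(bad_task)))   # grows with the size of `expr`

# Preferred: a task holding a module-level function; pickle stores only a
# reference to the function plus the explicit arguments.
good_task = (combine, "part-0", "part-1")
print(len(pickle.dumps(good_task)))  # small and independent of any instance
```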
related #284