
feat: when adding dask_histgram.boost.Histograms delay creation of task graph and use multi-source tree reduction #126

Status: closed · 14 commits

Conversation

@lgray (Collaborator) commented Feb 11, 2024

This is definitely a prototype, but it outlines a nice pattern that allows the interface to scale a little further.

This results in significantly simpler graphs for single histograms with lots of fills (i.e. over systematic uncertainty categories).
Now, instead of multiple layers of tree reduces for boost histograms, there is just one tree reduce that aggregates over all of the hist-on-block operations generated by each histogram fill. In other words, it can handle a tree reduce over multiple input collections.

Memory use is a little lower, and the graph is pleasantly cleaner.

before: (image: starting_graph)

after: (image: with_new_agg)

It may not look like a big difference with a smaller graph, but it becomes very apparent as you increase the number of fills.
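The multi-source tree reduction described above can be sketched in plain Python. This is a minimal illustration, not dask-histogram's actual code: the partial "histograms" here are just dicts of bin counts, and `tree_reduce`/`combine` are invented names. The point is that partials from *different* fills all feed one reduction tree instead of one tree per fill.

```python
def tree_reduce(parts, combine, fanin=2):
    """Combine `parts` pairwise (fan-in `fanin`) until one result remains."""
    while len(parts) > 1:
        parts = [
            parts[i] if i + 1 >= len(parts) else combine(parts[i], parts[i + 1])
            for i in range(0, len(parts), fanin)
        ]
    return parts[0]

def combine(h1, h2):
    # Merge two partial histograms by summing bin counts.
    out = dict(h1)
    for k, v in h2.items():
        out[k] = out.get(k, 0) + v
    return out

# Partial results produced by hist-on-block operations from two different fills:
fill_a = [{"x": 1}, {"x": 2}]
fill_b = [{"x": 3}, {"y": 4}]

# One tree reduce aggregates across both fill sources at once.
total = tree_reduce(fill_a + fill_b, combine)
# → {"x": 6, "y": 4}
```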

This PR also implements multi-fill syntax, though with a more concise implementation that may turn out to be unnecessary.

@lgray lgray marked this pull request as draft February 11, 2024 00:38
@lgray (Collaborator, author) commented Feb 11, 2024

@martindurant in any case between this one and #125 we have some things that improve the situation when filling lots of histograms.

I have a feeling this + #125 (the multi-fill part) will address most of the issues analysis users are running into.
Though clearly we can think about what's the best way to compose it all together.

This alteration definitely improves the structure of the resulting task graph, but it doesn't mitigate the time needed to construct many thousands of variations. The multi-fill approach in the other draft PR takes care of that rather handily, though, and results in rather snappy performance.

Getting the interface reasonable may be a challenge. If we can figure out how to make multiple fills turn into something like the multi-fill interface behind the scenes that would also work quite well.

@lgray (Collaborator, author) commented Feb 13, 2024

Example of multi-fill syntax:

```python
axes_fill_info_dict = {
    dense_axis_name: dense_variables_array_with_cuts["lep_chan_lst"][sr_cat][dense_axis_name],
    "weight":     tuple(masked_weights),
    "process":    histAxisName,
    "category":   sr_cat,
    "systematic": tuple(wgt_var_lst),
}
hout[dense_axis_name].fill(**axes_fill_info_dict)
```

This shows a fill where we pass multiple weights corresponding to systematic variations.
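One way to read the multi-fill dict above: tuple-valued entries get zipped into one fill call per element, while scalar or array entries are shared across every call. The sketch below is an invented illustration of that expansion (`expand_multifill` is not a dask-histogram function), not the PR's actual implementation.

```python
def expand_multifill(fill_kwargs):
    """Expand tuple-valued kwargs into one kwargs dict per fill call."""
    tuples = {k: v for k, v in fill_kwargs.items() if isinstance(v, tuple)}
    if not tuples:
        return [fill_kwargs]
    n = len(next(iter(tuples.values())))
    assert all(len(v) == n for v in tuples.values()), "tuple lengths must match"
    calls = []
    for i in range(n):
        call = dict(fill_kwargs)  # shared entries repeated in every call
        for k, v in tuples.items():
            call[k] = v[i]        # tuple entries indexed per call
        calls.append(call)
    return calls

calls = expand_multifill({
    "pt": [1.0, 2.0],             # per-event values, shared by every fill
    "weight": (0.9, 1.0, 1.1),    # one weight per systematic variation
    "systematic": ("down", "nominal", "up"),
})
# → three fill-call kwargs, one per systematic variation
```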

@lgray lgray requested a review from martindurant February 13, 2024 16:50
@lgray lgray changed the title feat: when adding AggHistograms continuously update a map reduce over all inputs, cull unnecessary layers feat: when adding AggHistograms continuously update a map reduce over all inputs, cull unnecessary layers, also enable "multi-fill" syntax Feb 13, 2024
@lgray lgray marked this pull request as ready for review February 13, 2024 17:14
@lgray (Collaborator, author) commented Feb 14, 2024

After talking to some users, it seems another way to do this that's ~reasonable is to pass a function that specifies exactly the filling we would like to do, along with a list of the arguments for each fill call.
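A minimal sketch of that "fill function plus argument list" idea, under the assumption that the user supplies both; `apply_fills` and the call protocol here are invented for illustration, not a proposed dask-histogram API.

```python
def apply_fills(hist, fill_fn, arg_list):
    """Run the user's fill function once per kwargs dict in arg_list."""
    for kwargs in arg_list:
        fill_fn(hist, **kwargs)
    return hist

# Toy "histogram": a dict accumulating weights per systematic label.
hist = {}

def fill_fn(h, *, systematic, weight):
    h[systematic] = h.get(systematic, 0.0) + weight

apply_fills(hist, fill_fn, [
    {"systematic": "nominal", "weight": 1.0},
    {"systematic": "up", "weight": 1.1},
])
# → hist == {"nominal": 1.0, "up": 1.1}
```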

@martindurant (Collaborator)

> After talking to some users it seems another way to do this that's ~reasonable is to pass a function that specifies exactly the filling we would like to do, and we can pass a list of the arguments to each fill call.

Of course you can do that, but it's circumventing our public API. Fine if this is really the exception... but dealing with the repeated fill (or repeated with_field) seems like it might be worthwhile.

@lgray (Collaborator, author) commented Feb 14, 2024

Yeah, there are a few patterns that come out of this.
The problem is that a typical high-energy-physics analysis will use all of those patterns, and if the task graph is most efficiently built with the fewest total fill calls, then you need the leak in the API to do all those patterns at once.

It is a bit chicken and egg.

@martindurant (Collaborator)

> if the task graph is most efficiently made with the fewest total fill calls

if... the question is: which is easier, forcing users to do their own batching, or building our own batching mechanism? I haven't reviewed the code in this PR yet, but the fact that it already exists means that the latter might be the case. Are there other cases of repeatedly applying the same operation that this pattern would apply to? Or other cases where "bury these in a single map_partitions" is the best current advice we can give?

@lgray (Collaborator, author) commented Feb 14, 2024

The other choice is to write an optimization layer that fuses the histogram fills after the fact into something reasonable (and thus assembles the batched fills). But then you pay in time calling awkward functions to do all the fills.

I think the pattern to search for is actually straightforward.

@martindurant (Collaborator)

> write an optimization layer that fuses the histogram fills

This doesn't help with graph build time or memory use during build.

@lgray (Collaborator, author) commented Feb 14, 2024

We need to dig into why you're not seeing the memory improvements I see with this PR. I think it might be macOS.

But yeah, it won't remove the time-domain part of the problem unless awkward array's broadcast_and_apply time is improved.

@martindurant (Collaborator)

> it won't remove the time-domain part

Batching with_field calls should, but batching hist fills won't.
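Why batching `with_field` helps on the time side: each `with_field` call rebuilds the record layout, so n calls mean n rebuilds, while one batched call rebuilds it once. The sketch below illustrates that on a plain dict standing in for an awkward record; `with_fields` is an invented name, not an awkward-array function.

```python
def with_field(record, name, value):
    """One-field-at-a-time update: copies the record on every call."""
    out = dict(record)  # stand-in for rebuilding the record layout
    out[name] = value
    return out

def with_fields(record, new_fields):
    """Batched update: one copy, however many fields are added."""
    out = dict(record)
    out.update(new_fields)
    return out

rec = {"pt": 30.0}

# Repeated calls: two rebuilds...
a = with_field(with_field(rec, "eta", 1.2), "phi", 0.3)
# ...versus one batched call: a single rebuild, same result.
b = with_fields(rec, {"eta": 1.2, "phi": 0.3})
# → both equal {"pt": 30.0, "eta": 1.2, "phi": 0.3}
```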

@lgray (Collaborator, author) commented Feb 14, 2024

Lemme give a simple with_fields impl a try. I need a break from other things.

@lgray lgray changed the title feat: when adding AggHistograms continuously update a map reduce over all inputs, cull unnecessary layers, also enable "multi-fill" syntax feat: when adding AggHistograms continuously update a map reduce over all inputs, cull unnecessary layers Feb 20, 2024
@lgray (Collaborator, author) commented Feb 20, 2024

@martindurant this one is largely well-focused now. Please give it a try. It looks like with this we can avoid batching the histogram fills together, or do a much better job of it.

@lgray lgray changed the title feat: when adding AggHistograms continuously update a map reduce over all inputs, cull unnecessary layers feat: when adding dask_histgram.boost.Histograms delay creation of task graph and use multi-source tree reduction Feb 23, 2024
@lgray (Collaborator, author) commented Feb 26, 2024

@martindurant is there anything we should add to this one? I think it's a clear improvement on what's there and doesn't rock the boat too much.

@lgray (Collaborator, author) commented Feb 26, 2024

It also paves the way for pooling the fills when we generate the task graph (which still needs study).

@lgray (Collaborator, author) commented Feb 28, 2024

@martindurant here's the real first try at multi-fill. It works for dask-awkward arrays.

Here's a recipe that should work to see the stalling issues and recursion error:
https://gist.github.com/lgray/8a28c5fcd707a2a6778f92cd598f0ca6
