Multiple calls to ImpactCalc surprisingly slow #874

Open
peanutfun opened this issue Apr 19, 2024 · 3 comments
peanutfun commented Apr 19, 2024

TLDR: Calling ImpactCalc.impact multiple times and concatenating the result is much slower than calling it once on a larger event set. This has some unfortunate implications for use cases where the exposure or the impact function changes between hazard events.

Problem

In my use case I have a hazard event set of one event per year (flood footprints). I also have one exposure layer per year (WorldPop data). To compute an impact over all years, I must call ImpactCalc independently for each exposure layer, because it does not support multiple exposures. I then concatenate the results into a single impact.

This was my general idea on how to implement this:

# Imports (exposure, impfset, hazard, and exposure_map are defined elsewhere)
from climada.engine import Impact, ImpactCalc

# Usual setting (one exposure, multiple hazard events)
impact = ImpactCalc(exposure, impfset, hazard).impact(save_mat=False, assign_centroids=False)

# Multi-exposure (one for each hazard event)
# NOTE: This is conceptually the same work as before, but now with different exposure values
#       for each event.
impact = Impact.concat([
    ImpactCalc(
        exposure_map[event_id],
        impfset,
        hazard.select(event_id=[event_id]),
    ).impact(save_mat=False, assign_centroids=False)
    for event_id in hazard.event_id.flat
])

I was quite bummed to see that, in this case, the impact calculation took more than 5 times as long as an impact calculation with a single exposure and multiple hazard events. Note that a factor of 5 might mean 5 s vs. 25 s for a single impact calculation, but 5 min vs. 25 min for a calibration.

Profiling results

I created a stub test case. It does not use real data, and in this case the slowdown was "only" a factor of 2; see the code linked below. I made sure that centroids were assigned beforehand, the impact matrix was not saved, and only the actual impact calculation was profiled. The profiles were visualized with snakeviz.
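For reference, a minimal sketch of how such a profile can be collected; run_multi_exposure_case is a placeholder for the corresponding function in the linked code, not its actual name:

# Collect a cProfile profile of the multi-exposure impact calculation and
# write it to a file that snakeviz can open.
import cProfile

cProfile.run("run_multi_exposure_case()", filename="multi_exposure.prof")
# Then, in a shell: snakeviz multi_exposure.prof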

Single exposure: [snakeviz profile screenshot, 2024-04-19 17:28]

Multi-exposure: [snakeviz profile screenshot, 2024-04-19 17:29]

Results:

  • In both cases, the code spends about the same amount of time creating the actual impact in imp_mat_gen.
  • In the multi-exposure case, the overhead comes mainly from
    • selecting the hazard
    • calling Impact.from_eih multiple times
    • calling ImpactCalc.minimal_exp_gdf multiple times
  • A significant amount of time is spent slicing the hazard intensity matrix

Discussion points

  • We should definitely implement support for multiple exposure "layers" within the impact calculation. This could be implemented via multiple "value" columns. The workaround of calling ImpactCalc multiple times has a strong performance penalty.
  • Impact.from_eih spends a lot of time extracting and stacking the exposure coordinates. It could simply store the exposure or its geometry column instead of copying the coordinates.
  • Is it really faster to call minimal_exp_gdf instead of just using the full exposure gdf?
  • In both Hazard.select and Hazard.get_mdr, the intensity matrix is sliced both row- and column-wise. This slicing takes significantly longer than the actual data multiplication. While it is nice that the multiplication is fast with the csr_matrix, column-wise slicing is documented as slow for this data type. Are we using the most efficient setup here? (See the sketch after this list.)
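As a rough, self-contained illustration of the last point (the matrix shape and density below are arbitrary stand-ins, not taken from the test case), column slicing on a csr_matrix can be compared with converting to csc_matrix first:

# Compare column slicing on a CSR matrix with converting to CSC first.
# Shape and density are arbitrary stand-ins for an intensity matrix
# (events x centroids); the real numbers depend on the hazard.
import numpy as np
from scipy import sparse
from timeit import timeit

rng = np.random.default_rng(0)
intensity = sparse.random(1_000, 100_000, density=0.01, format="csr", random_state=rng)
cols = rng.choice(intensity.shape[1], size=10_000, replace=False)

t_csr = timeit(lambda: intensity[:, cols], number=10)
t_csc = timeit(lambda: intensity.tocsc()[:, cols], number=10)
print(f"CSR column slice: {t_csr:.3f} s, CSC slice incl. conversion: {t_csc:.3f} s")

Whether building a CSC copy once and reusing it across events pays off will depend on the matrix shape and density.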

Code

https://polybox.ethz.ch/index.php/s/y2aqu3JX4w6qgfy


chahank commented Apr 22, 2024

Interesting! Just a few quick points of feedback:

  • Why would you do hazard.select(event_id=[event_id])? This is very suboptimal, I think, as ImpactCalc.impact handles multiple events efficiently. You might already regain 5 seconds of the total 25 by not doing this (see the sketch at the end of this comment).

  • The minimal_exp_gdf step is central for the case where the exposure is larger than the hazard.

  • Regarding "This slicing takes significantly longer than the actual data multiplication": this is only the case for small datasets, if I remember correctly, but I am happy to improve that.

In brief, the current setup was obtained after trying to optimize computation times for a large number of different use cases (small/large exposures, small/large hazards). But there is certainly room for improvement! Maybe one could make the slicing inside ImpactCalc optional? Or, what I actually had in mind while writing the module is to add other types of computation methods over time, here something like ImpactCalc.multi_exposures_impact.
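To illustrate the first point, a rough sketch (untested; exposure_event_groups is a hypothetical list of (exposure, event_ids) pairs, not an existing CLIMADA structure): if several events share the same exposure layer, one Hazard.select and one ImpactCalc call per layer is enough, instead of one per event.

# Rough sketch (untested): one ImpactCalc call per exposure layer.
# `exposure_event_groups` is a hypothetical list of (exposure, event_ids) pairs,
# e.g. [(expo_2020, [1, 2, 3]), (expo_2021, [4, 5]), ...].
impact = Impact.concat([
    ImpactCalc(exposure, impfset, hazard.select(event_id=event_ids))
    .impact(save_mat=False, assign_centroids=False)
    for exposure, event_ids in exposure_event_groups
])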

peanutfun commented

@chahank Could you supply the Jupyter script and data you used for benchmarking ImpactCalc?


chahank commented May 29, 2024

It is a bit messy, so let me clean it up and send it directly.
