Multiple calls to ImpactCalc surprisingly slow #874

Open
peanutfun opened this issue Apr 19, 2024 · 3 comments
peanutfun commented Apr 19, 2024

TLDR: Calling ImpactCalc.impact multiple times and concatenating the result is much slower than calling it once on a larger event set. This has some unfortunate implications for use cases where the exposure or the impact function changes between hazard events.

Problem

In my use case I have a hazard event set of one event per year (flood footprints). I also have one exposure layer per year (WorldPop data). To compute an impact over all years, I must call ImpactCalc independently for each exposure layer, because it does not support multiple exposures. I then concatenate the results into a single impact.

This was my general idea on how to implement this:

# Imports (exposure, impfset, hazard, and exposure_map are defined elsewhere)
from climada.engine import Impact, ImpactCalc

# Usual setting (one exposure, multiple hazard events)
impact = ImpactCalc(exposure, impfset, hazard).impact(save_mat=False, assign_centroids=False)

# Multi-exposure (one for each hazard event)
# NOTE: This is conceptually the same work as before, but now with different exposure values
#       for each event.
impact = Impact.concat([
    ImpactCalc(
        exposure_map[event_id],
        impfset,
        hazard.select(event_id=[event_id]),
    ).impact(save_mat=False, assign_centroids=False)
    for event_id in hazard.event_id.flat
])

I was quite bummed to see that, in this case, the impact calculation took more than 5 times as long as an impact calculation with a single exposure and multiple hazard events. Note that a factor of 5 might mean 5 s vs. 25 s for a single impact calculation, but 5 min vs. 25 min for a calibration.

Profiling results

I created a stub test case. It does not use real data, and in this case the slowdown was "only" a factor of 2; see the code linked below. I made sure that centroids were assigned beforehand, the impact matrix was not saved, and only the actual impact calculation was profiled. The profiles were visualized with snakeviz.
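For reference, a minimal sketch of how such a profile can be collected; run_multi_exposure_case is a placeholder for the corresponding function in the linked code, not its actual name:

# Collect a cProfile profile of the multi-exposure impact calculation and
# write it to a file that snakeviz can open.
import cProfile

cProfile.run("run_multi_exposure_case()", filename="multi_exposure.prof")
# Then, in a shell: snakeviz multi_exposure.prof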

Single exposure: [snakeviz profile screenshot, 2024-04-19 17:28]

Multi-exposure: [snakeviz profile screenshot, 2024-04-19 17:29]

Results:

  • In both cases, the code spends about the same amount of time creating the actual impact in imp_mat_gen.
  • In the multi-exposure case, the overhead comes mainly from
    • selecting the hazard
    • calling Impact.from_eih multiple times
    • calling ImpactCalc.minimal_exp_gdf multiple times
  • A significant amount of time is spent slicing the hazard intensity matrix

Discussion points

  • We should definitely implement support for multiple exposure "layers" within the impact calculation. This could be implemented via multiple "value" columns. The workaround of calling ImpactCalc multiple times has a strong performance penalty.
  • Impact.from_eih spends a lot of time extracting and stacking the exposure coordinates. It could simply store the exposure or its geometry column instead of copying the coordinates.
  • Is it really faster to call minimal_exp_gdf instead of just using the full exposure gdf?
  • In both Hazard.select and Hazard.get_mdr, the intensity matrix is sliced both row- and column-wise. This slicing takes significantly longer than the actual data multiplication. While it is nice that the multiplication is fast with the csr_matrix, column-wise slicing is documented as slow for this data type. Are we using the most efficient setup here? (See the sketch after this list.)
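As a rough, self-contained illustration of the last point (the matrix shape and density below are arbitrary stand-ins, not taken from the test case), column slicing on a csr_matrix can be compared with converting to csc_matrix first:

# Compare column slicing on a CSR matrix with converting to CSC first.
# Shape and density are arbitrary stand-ins for an intensity matrix
# (events x centroids); the real numbers depend on the hazard.
import numpy as np
from scipy import sparse
from timeit import timeit

rng = np.random.default_rng(0)
intensity = sparse.random(1_000, 100_000, density=0.01, format="csr", random_state=rng)
cols = rng.choice(intensity.shape[1], size=10_000, replace=False)

t_csr = timeit(lambda: intensity[:, cols], number=10)
t_csc = timeit(lambda: intensity.tocsc()[:, cols], number=10)
print(f"CSR column slice: {t_csr:.3f} s, CSC slice incl. conversion: {t_csc:.3f} s")

Whether building a CSC copy once and reusing it across events pays off will depend on the matrix shape and density.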

Code

https://polybox.ethz.ch/index.php/s/y2aqu3JX4w6qgfy


chahank commented Apr 22, 2024

Interesting! Just a few quick points of feedback:

  • Why would you do hazard.select(event_id=[event_id])? This is very suboptimal, I think, as ImpactCalc.impact handles multiple events efficiently. You might already regain 5 seconds of the total 25 by not doing this (see the sketch at the end of this comment).

  • The minimal_exp_gdf step is central for the case where the exposure is larger than the hazard.

  • Regarding "This slicing takes significantly longer than the actual data multiplication": this is only the case for small datasets, if I remember correctly, but I am happy to improve that.

In brief, the current setup was obtained after trying to optimize computation times for a large number of different use cases (small/large exposures, small/large hazards). But there is certainly room for improvement! Maybe one could make the slicing inside ImpactCalc optional? Or, what I actually had in mind while writing the module is to add other types of computation methods over time, here something like ImpactCalc.multi_exposures_impact.
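To illustrate the first point, a rough sketch (untested; exposure_event_groups is a hypothetical list of (exposure, event_ids) pairs, not an existing CLIMADA structure): if several events share the same exposure layer, one Hazard.select and one ImpactCalc call per layer is enough, instead of one per event.

# Rough sketch (untested): one ImpactCalc call per exposure layer.
# `exposure_event_groups` is a hypothetical list of (exposure, event_ids) pairs,
# e.g. [(expo_2020, [1, 2, 3]), (expo_2021, [4, 5]), ...].
impact = Impact.concat([
    ImpactCalc(exposure, impfset, hazard.select(event_id=event_ids))
    .impact(save_mat=False, assign_centroids=False)
    for exposure, event_ids in exposure_event_groups
])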

peanutfun commented

@chahank Could you supply the Jupyter script and data you used for benchmarking ImpactCalc?


chahank commented May 29, 2024

It is a bit messy, so let me clean it up and send it directly.
