TLDR: Calling ImpactCalc.impact multiple times and concatenating the result is much slower than calling it once on a larger event set. This has some unfortunate implications for use cases where the exposure or the impact function changes between hazard events.
Problem
In my use case, I have a hazard event set with one event per year (flood footprints). I also have one exposure layer per year (WorldPop data). To compute an impact over all years, I must call `ImpactCalc` independently for each exposure layer, because it does not support multiple exposures, and then concatenate the results into a single impact.
This was my general idea on how to implement this:
```python
from climada.engine import Impact, ImpactCalc

# Usual setting (one exposure, multiple hazard events)
impact = ImpactCalc(exposure, impfset, hazard).impact(save_mat=False, assign_centroids=False)

# Multi-exposure (one for each hazard event)
# NOTE: This is conceptually the same work as before, but now with different
# exposure values for each event.
impact = Impact.concat([
    ImpactCalc(
        exposure_map[event_id],
        impfset,
        hazard.select(event_id=[event_id]),
    ).impact(save_mat=False, assign_centroids=False)
    for event_id in hazard.event_id.flat
])
```
I was quite bummed to see that in this case, the impact calculation took more than 5 times longer than an impact calculation with a single exposure and multiple hazard events. Note that a factor of 5 might mean 5 s vs 25 s for a single impact calculation, but 5 min vs 25 min for a calibration.
Profiling results
I created a stub test case. It does not use real data, and in this case the slowdown was "only" a factor of 2; see the code linked below. I made sure that I assigned the centroids beforehand, did not save the impact matrix, and only profiled the actual impact calculation. The profiles were visualized with snakeviz.
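For reference, this is roughly how such a profile can be produced; `run_impact` is a placeholder for a zero-argument function wrapping the impact calculation, not part of the original setup:

```python
# Profile a wrapped impact calculation and dump the stats to a file; the result
# can then be inspected in the browser with: snakeviz impact.prof
import cProfile

cProfile.run("run_impact()", filename="impact.prof")
```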
Single exposure: (snakeviz profile)
Multi-exposure: (snakeviz profile)
Results:
- In both cases, the code spends about the same amount of time creating the actual impact in `imp_mat_gen`.
- In the multi-exposure case, the overhead comes mainly from
  - selecting the hazard,
  - calling `Impact.from_eih` multiple times,
  - calling `ImpactCalc.minimal_exp_gdf` multiple times.
- A significant amount of time is spent slicing the hazard intensity matrix.
Discussion points
- We should definitely implement support for multiple exposure "layers" within the impact calculation, for example via multiple "value" columns. The workaround of calling `ImpactCalc` multiple times carries a strong performance penalty.
- `Impact.from_eih` spends a lot of time extracting and stacking the exposure coordinates. It could simply store the exposure or its geometry column instead of copying the coordinates.
- Is it really faster to call `minimal_exp_gdf` than to just use the full exposure GeoDataFrame?
- In both `Hazard.select` and `Hazard.get_mdr`, the intensity matrix is sliced both row- and column-wise. This slicing takes significantly longer than the actual data multiplication. While it is nice that the multiplication is fast with the `csr_matrix`, column-wise slicing is explicitly slow for this format. Are we using the most efficient setup here? (A small timing sketch follows this list.)
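To make the last point easy to check, here is a minimal timing sketch in plain scipy (not CLIMADA internals); the matrix shape, density, and number of selected columns are made-up stand-ins for an intensity matrix of events × centroids:

```python
# Compare column-wise fancy indexing directly on a CSR matrix against a one-off
# conversion to CSC followed by the same slicing.
import time

import numpy as np
from scipy import sparse

rng = np.random.default_rng(1)
intensity = sparse.random(1_000, 100_000, density=0.01, format="csr", random_state=1)
cols = rng.choice(intensity.shape[1], size=10_000, replace=False)

t0 = time.perf_counter()
_ = intensity[:, cols]  # column slicing directly on the CSR matrix
t_csr = time.perf_counter() - t0

t0 = time.perf_counter()
_ = intensity.tocsc()[:, cols]  # convert to CSC, then slice columns
t_csc = time.perf_counter() - t0

print(f"CSR column slice:    {t_csr:.3f} s")
print(f"CSC convert + slice: {t_csc:.3f} s")
```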
Code
https://polybox.ethz.ch/index.php/s/y2aqu3JX4w6qgfy

Why would you do the line `hazard.select(event_id=[event_id])`? This is very suboptimal, I think, as `ImpactCalc.impact` handles multiple events efficiently. You might already regain 5 seconds of the total 25 by not doing this.

The `minimal_exp_gdf` is central for the case when the exposure is larger than the hazard.

> This slicing takes significantly longer than the actual data multiplication.

This is only the case for small datasets, if I remember correctly, but I am happy to improve that.

In brief, the current setup was obtained after trying to optimize computation times for a large number of different use cases (small/large exposures, small/large hazards). But there is certainly room for improvement! Maybe one could make the slicing inside `ImpactCalc` optional? Or, actually, what I had in mind while writing the module is to add other types of computation methods over time, here something like `ImpactCalc.multi_exposures_impact`.
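To illustrate the multi-exposure idea at the matrix level, here is a rough sketch in plain numpy/scipy; it is not existing CLIMADA API (`ImpactCalc.multi_exposures_impact` does not exist yet), and all names, shapes, and numbers are made up. One row of exposure values per event is multiplied element-wise with the per-event mean damage ratios, so the whole event set is processed in one pass instead of one `ImpactCalc` call per event:

```python
# Vectorised multi-exposure impact sketch: `mdr` stands in for the mean damage
# ratios at the exposure points (one row per event); `values` holds one exposure
# "value" layer per event, e.g. one WorldPop year per flood footprint.
import numpy as np
from scipy import sparse

n_events, n_points = 40, 50_000
rng = np.random.default_rng(0)

values = rng.random((n_events, n_points))      # per-event exposure values
mdr = sparse.random(n_events, n_points, density=0.05, format="csr", random_state=0)
frequency = np.full(n_events, 1 / n_events)    # e.g. one event per year

# Element-wise multiplication keeps the sparsity pattern of `mdr`:
# imp_mat[i, j] = mdr[i, j] * values[i, j]
imp_mat = mdr.multiply(values).tocsr()

at_event = np.asarray(imp_mat.sum(axis=1)).ravel()  # impact per event
eai_exp = imp_mat.T.dot(frequency)                  # expected annual impact per point
aai_agg = eai_exp.sum()                             # aggregated average annual impact
```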