Notes on Reductions
Arbor performs reductions on the current, conductance, and ion concentration contributions made by mechanisms as part of the generated mechanism code. It would make more sense for the reductions to be calculated in the Arbor library, which would give a better separation of concerns and allow for more reproducible results. However, performing the reductions in the mechanism code is an optimization that makes the most of cache locality and reduces memory costs.
We study the effect of moving the reduction code into the Arbor library, both on runtime and on memory consumption. Ideally, we would like to improve both metrics, but we could accept some degradation (in one or both) for the reasons given above.
We describe 3 approaches to reductions, focusing only on the total current (`vec_i`) to illustrate the methods, though the same applies to the total conductance (`vec_g`), the ionic internal and external concentrations (`Xi` and `Xo`), and the ionic currents (`iX`).
1. Present implementation:
A single `vec_i` that is read, modified and written for every mechanism. `vec_i` is not accessed in order.
```
for every mechanism m:
    for i in range(m.num_cvs):
        vec_i[m.index[i]] += local_curr
```
Complexity: `num_mechanisms * (cost(gather) + cost(scatter))`
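As a minimal runnable sketch of this access pattern, in NumPy (the sizes, indices and the `mechanisms` list below are made up for illustration; they are not Arbor's data structures):

```python
import numpy as np

num_cvs = 8
vec_i = np.zeros(num_cvs)  # global current vector shared by all mechanisms

# Each mechanism scatters its per-CV contributions straight into vec_i.
# np.add.at is an unbuffered scatter-add, so repeated indices accumulate,
# mirroring the read-modify-write in the generated mechanism code.
mechanisms = [
    {"index": np.array([0, 2, 4, 6]), "curr": np.array([1.0, 1.0, 1.0, 1.0])},
    {"index": np.array([1, 2, 3]),    "curr": np.array([0.5, 0.5, 0.5])},
]
for m in mechanisms:
    np.add.at(vec_i, m["index"], m["curr"])
```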
2. Split reduction from local current calculation #1:
A `local_i` vector per mechanism that is written in order. The `local_i` vectors are merged and reduced into the main `vec_i` after all mechanism current updates are performed.
```
for every mechanism m:
    for i in range(m.num_cvs):
        m.local_i[i] = local_curr
// "Smart" merge and reduce the m.local_i vectors of the mechanisms,
// then accumulate into the main vec_i
```
Complexity: `num_mechanisms * cost(sequential_store) + cost(merge_reduce) + cost(scatter) + cost(gather)`
Memory overhead: `sum(num_cvs_per_mechanism)`
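A sketch of the two phases, again in NumPy with illustrative data (the `local_current` helper is hypothetical, and the merge-reduce is shown as a plain scatter-add only to keep the example short):

```python
import numpy as np

num_cvs = 8
vec_i = np.zeros(num_cvs)
mechanisms = [
    {"index": np.array([0, 2, 4, 6])},
    {"index": np.array([1, 2, 3])},
]

def local_current(m, i):
    """Stand-in for the per-CV current computed by mechanism m."""
    return 1.0

# Phase 1: purely sequential, in-order stores into per-mechanism vectors.
for m in mechanisms:
    m["local_i"] = np.array([local_current(m, i) for i in range(len(m["index"]))])

# Phase 2: merge and reduce the local_i vectors into the main vec_i.
# A real "smart" merge-reduce would walk pre-merged indices in order;
# the scatter-add here only demonstrates the accumulation semantics.
for m in mechanisms:
    np.add.at(vec_i, m["index"], m["local_i"])
```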
3. Split reduction from local current calculation #2:
A single `local_i` vector for all the mechanisms (with size = `sum(num_cvs_per_mechanism)`) that is written out of order. The mechanisms write their contributions into the vector such that all current contributions per CV are consecutive in memory. The `local_i` vector is then reduced in order into `vec_i`.
```
for every mechanism m:
    for i in range(m.num_cvs):
        local_i[m.new_index[i]] = local_curr
// Reduce local_i into the main vec_i
```
Complexity: `num_mechanisms * cost(scatter) + cost(reduce) + cost(gather) + cost(scatter)`
Memory overhead: `sum(num_cvs_per_mechanism) * 2` // additional memory for the `new_index` vector
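A NumPy sketch of this layout. The `new_index` and segment arrays below are hand-built for two toy mechanisms; in the real scheme they would be pre-computed at model initialization. `np.add.reduceat` stands in for the in-order segmented reduction:

```python
import numpy as np

# Two illustrative mechanisms and the CVs they cover.
mechanisms = [
    {"cvs": np.array([0, 1, 2, 3]), "curr": np.array([1.0, 1.0, 1.0, 1.0])},
    {"cvs": np.array([1, 2]),       "curr": np.array([0.5, 0.5])},
]

# new_index places each contribution so that all entries for one CV are
# contiguous. Layout here: [cv0: m0 | cv1: m0 m1 | cv2: m0 m1 | cv3: m0]
mechanisms[0]["new_index"] = np.array([0, 1, 3, 5])
mechanisms[1]["new_index"] = np.array([2, 4])
seg_starts = np.array([0, 1, 3, 5])  # first slot of each CV's segment
seg_cvs    = np.array([0, 1, 2, 3])  # CV that owns each segment

local_i = np.zeros(6)  # size = sum(num_cvs_per_mechanism)

# Out-of-order scatter of the mechanism contributions.
for m in mechanisms:
    local_i[m["new_index"]] = m["curr"]

# In-order segmented reduction into the main vec_i.
vec_i = np.zeros(4)
vec_i[seg_cvs] = np.add.reduceat(local_i, seg_starts)
```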
If we have any hope of improving performance, we at least need the mechanism current updates to get faster when the reduction is no longer performed inside the mechanism. Because both approaches (2) and (3) touch more memory (size = `sum(num_cvs_per_mechanism)` instead of size = `num_cvs_in_cell`), cache effects may hurt the runtime unless we use non-temporal stores.
My experiments so far show that modifying the mechanism current updates to write into local vectors, as in approaches (2) and (3), slightly degrades performance relative to approach (1) (the add and scatter in approach (1) appear to come for free), and that is before accounting for the extra reduction step that must be performed afterwards.
If we aim not to improve performance but merely to limit the regression, then we can implement either approach (2) or (3). Approach (3) has proven to be faster: the "smart merge" indices from approach (2) are pre-computed during model initialization and saved in the `new_index` vector of approach (3).
With both approaches (2) and (3), we need a highly optimized reduction step, because in all tests of the present approach (1) the reduction seems to happen for free. We consider two ways to vectorize this reduction:
1. Vectorized reduction across CVs:
`local_i` is stored as follows (assuming 2 mechanisms present on all compartments):
```
i(0,0) i(1,0) i(0,1) i(1,1) i(0,2) i(1,2) ...
```
where `i(x,y)` is the contribution of mechanism `x` on CV `y`. We would use vector instructions to reduce `i(0,0) i(1,0)`; `i(0,1) i(1,1)`; and `i(0,2) i(1,2)` separately.
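In NumPy terms (illustrative only), the per-CV reduction over this interleaved layout is a sum along the last axis of a reshaped view; this corresponds to a horizontal reduction within each vector:

```python
import numpy as np

num_mechs, num_cvs = 2, 4

# Interleaved layout: the contributions for one CV sit next to each
# other: [i(0,0), i(1,0), i(0,1), i(1,1), ...]
local_i = np.array([1.0, 0.5,  1.0, 0.5,  1.0, 0.5,  1.0, 0.5])

# Row y holds all contributions for CV y; summing along the row is
# the horizontal reduction a vector instruction would perform per CV.
vec_i = local_i.reshape(num_cvs, num_mechs).sum(axis=1)  # -> [1.5, 1.5, 1.5, 1.5]
```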
2. Smart vectorize, 1 lane per CV:
`local_i` is stored as follows (assuming `k` mechanisms present on all compartments and a vector length (or warp size) of 4):
```
i(0,0) i(0,1) i(0,2) i(0,3) | i(1,0) i(1,1) i(1,2) i(1,3) | ... | i(k,0) i(k,1) i(k,2) i(k,3) | i(0,4) i(0,5) i(0,6) i(0,7) | i(1,4) i(1,5) i(1,6) i(1,7) | ...
```
where `i(x,y)` is the contribution of mechanism `x` on CV `y`. We would use vector instructions to compute `[i(0,0) i(0,1) i(0,2) i(0,3)] + [i(1,0) i(1,1) i(1,2) i(1,3)] + ... + [i(k,0) i(k,1) i(k,2) i(k,3)]`, and so on.
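The same idea in NumPy (illustrative): with the blocked layout, each mechanism's block is added lane-wise, so every vector add accumulates one CV per lane and no horizontal reduction is needed:

```python
import numpy as np

k, vlen, num_cvs = 2, 4, 8  # mechanisms, vector length, CVs

# Blocked layout: within each group of vlen CVs, mechanism 0's block
# comes first, then mechanism 1's, etc.:
# [i(0,0..3) | i(1,0..3) | i(0,4..7) | i(1,4..7)]
local_i = np.array([1.0, 1.0, 1.0, 1.0,  0.5, 0.5, 0.5, 0.5,
                    1.0, 1.0, 1.0, 1.0,  0.5, 0.5, 0.5, 0.5])

# Summing over the mechanism axis adds whole vlen-wide blocks
# lane-wise: one vector add per mechanism per block.
vec_i = local_i.reshape(num_cvs // vlen, k, vlen).sum(axis=1).ravel()
```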
Performance numbers were obtained for the reduction implementation from approach (3), using vectorization approach (1).
The changes were implemented only for the total current and total conductance (`vec_i` and `vec_g`). This was enough to run the following:
A simulation of a model with 100 cells of around 6600 compartments each, where the dendrites have 2 mechanisms per CV.
The runtime increases by 7% on GPU, and 2% on CPU.
The best performance (besides the present approach (1)) would likely come from implementation approach (3) combined with vectorization approach (2), which has not yet been implemented.