SIMD performance: suboptimal nrn_current performance
Purkinje cell with ~4000 compartments; only 9 are painted with mechanisms: the soma and 8 compartments of the axon.
The soma has 17 mechanisms, and the axon compartments share 6 of them.
- Cell group size:
- 1 cell per cell group:
For the shared 6 mechanisms, the SIMD index vectors are arranged as follows:
|___0___|___x___|__x+1__|__x+2__|__x+3__|__x+4__|__x+5__|__x+6__|__x+7__|__x+7__|__x+7__|__x+7__|
{__________INDEPENDENT__________}{__________CONTIGUOUS__________}{___________CONSTANT___________}
For the rest of the mechanisms, the SIMD index vectors are arranged as follows:
|___0___|___0___|___0___|___0___|
{___________CONSTANT___________}
In both layouts, the last 3 elements of the final vector are padding. They have zero weight and therefore contribute nothing to the current and state updates.
- 4 cells per cell group:
For the shared 6 mechanisms, the SIMD index vectors are arranged as follows:
|___0___|___x___|__x+1__|__x+2__|__x+3__|__x+4__|__x+5__|__x+6__|__x+7__|___y___|__y+x__|_y+x+1_|
{__________INDEPENDENT__________}{__________CONTIGUOUS__________}{__________INDEPENDENT__________}
|_y+x+2_|_y+x+3_|_y+x+4_|_y+x+5_|_y+x+6_|_y+x+7_|___z___|__z+x__|_z+x+1_|_z+x+2_|_z+x+3_|_z+x+4_|
{__________CONTIGUOUS__________}{__________INDEPENDENT__________}{__________CONTIGUOUS__________}
|_z+x+5_|_z+x+6_|_z+x+7_|___w___|__w+x__|_w+x+1_|_w+x+2_|_w+x+3_|_w+x+4_|_w+x+5_|_w+x+6_|_w+x+7_|
{__________INDEPENDENT__________}{__________CONTIGUOUS__________}{__________CONTIGUOUS__________}
For the rest of the mechanisms, the SIMD index vectors are arranged as follows:
|___0___|___x___|___y___|___z___|
{__________INDEPENDENT__________}
- Number of cells: 2048
- Network configuration: ring with additional randomly connected synapses (9) with zero weight.
With the current setup, when considering density mechanisms:
- 1 cell per cell group:
  - `CONTIGUOUS`: 6 SIMD vectors per cell group = 20.7%
  - `CONSTANT`: 17 SIMD vectors per cell group = 58.6%
  - `INDEPENDENT`: 6 SIMD vectors per cell group = 20.7%
  - `NONE`: 0 SIMD vectors per cell group
- 4 cells per cell group:
  - `CONTIGUOUS`: 30 SIMD vectors per cell group = 46.15%
  - `CONSTANT`: 0 SIMD vectors per cell group
  - `INDEPENDENT`: 35 SIMD vectors per cell group = 53.85%
  - `NONE`: 0 SIMD vectors per cell group
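The counts above can be cross-checked with a small sketch that builds the index vectors and classifies each 4-wide chunk. This is not the simulator's code: the category definitions in `classify`, the padding rule (repeat the last index), and the concrete offsets (`axon_start=100`, cells spaced 4000 apart) are assumptions chosen to mirror the diagrams, and the helper names are hypothetical.

```python
from collections import Counter

WIDTH = 4  # SIMD lanes per vector, matching the 4-wide vectors in the diagrams


def classify(chunk):
    """Classify one SIMD index vector (assumed category definitions)."""
    if all(i == chunk[0] for i in chunk):
        return "CONSTANT"      # every lane hits the same element
    if all(b == a + 1 for a, b in zip(chunk, chunk[1:])):
        return "CONTIGUOUS"    # lanes hit consecutive elements
    if len(set(chunk)) == len(chunk):
        return "INDEPENDENT"   # distinct but scattered elements
    return "NONE"              # some lanes collide, but not all


def count_constraints(indices, width=WIDTH):
    """Pad to a multiple of the width by repeating the last index, then classify."""
    padding = -len(indices) % width
    padded = indices + [indices[-1]] * padding
    chunks = [padded[i:i + width] for i in range(0, len(padded), width)]
    return Counter(classify(c) for c in chunks)


def cell_group_stats(cell_offsets, axon_start, n_shared=6, n_soma_only=11):
    """Count SIMD vectors per category for one cell group.

    n_shared mechanisms cover the soma and 8 axon compartments of each cell;
    the remaining n_soma_only (17 - 6) mechanisms cover only the soma.
    """
    shared = []
    for offset in cell_offsets:
        shared.append(offset)  # soma of this cell
        shared.extend(offset + axon_start + k for k in range(8))  # axon
    soma_only = list(cell_offsets)

    total = Counter()
    for name, n in count_constraints(shared).items():
        total[name] += n * n_shared
    for name, n in count_constraints(soma_only).items():
        total[name] += n * n_soma_only
    return total


one_cell = cell_group_stats([0], axon_start=100)
four_cells = cell_group_stats([0, 4000, 8000, 12000], axon_start=100)

for label, stats in (("1 cell", one_cell), ("4 cells", four_cells)):
    n_vectors = sum(stats.values())
    for name in ("CONTIGUOUS", "CONSTANT", "INDEPENDENT", "NONE"):
        print(f"{label:8s} {name:12s} {stats[name]:3d}  {stats[name] / n_vectors:.2%}")
```

Under these assumptions the sketch yields 6/17/6/0 vectors (of 29) for one cell per group and 30/0/35/0 (of 65) for four cells per group, which matches the percentages listed above.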
The `CONSTANT` vector stores require a vector reduction followed by a single-element store, and their loads require a single-element load followed by a vector broadcast. The `INDEPENDENT` vector loads/stores are vector gathers/scatters, but on Broadwell they are essentially serialized per element. The `CONTIGUOUS` loads/stores are plain vector loads/stores.
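The difference between the three access patterns can be illustrated with a scalar Python emulation. It only models the semantics described above, not real SIMD instructions; the function names `simd_load` and `simd_accumulate` are hypothetical, and treating a store as an accumulation into the target array is an assumption.

```python
def simd_load(data, idx, constraint):
    """Return the lanes read from `data` at the positions in `idx`."""
    if constraint == "CONSTANT":
        # One scalar load, then broadcast to every lane.
        return [data[idx[0]]] * len(idx)
    if constraint == "CONTIGUOUS":
        # One vector load starting at the first index.
        return list(data[idx[0]:idx[0] + len(idx)])
    # INDEPENDENT: gather, effectively one element at a time.
    return [data[i] for i in idx]


def simd_accumulate(data, idx, lanes, constraint):
    """Add the lane values into `data` at the positions in `idx`."""
    if constraint == "CONSTANT":
        # Horizontal reduction across lanes, then a single scalar store.
        data[idx[0]] += sum(lanes)
    elif constraint == "CONTIGUOUS":
        # One vector load, one vector add, one vector store.
        start, n = idx[0], len(idx)
        data[start:start + n] = [d + v for d, v in zip(data[start:start + n], lanes)]
    else:
        # INDEPENDENT: scatter, effectively one element at a time.
        for i, v in zip(idx, lanes):
            data[i] += v


values = [10, 20, 30, 40, 50, 60]
current = [0] * 6

broadcast = simd_load(values, [2, 2, 2, 2], "CONSTANT")
# Padded lanes carry zero weight, so only the first lane contributes here.
simd_accumulate(current, [2, 2, 2, 2], [7, 0, 0, 0], "CONSTANT")
simd_accumulate(current, [1, 2, 3, 4], [1, 1, 1, 1], "CONTIGUOUS")
simd_accumulate(current, [0, 3, 4, 5], [5, 5, 5, 5], "INDEPENDENT")
```

The `CONSTANT` path is the only one that needs a cross-lane reduction before the store, which is the extra cost the paragraph above refers to.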
The number of vector operations in `nrn_state` and `nrn_current` differs from mechanism to mechanism and depends on the number of accessed elements and the intended operation. In general, however, vector stores in `nrn_state` are always contiguous, while vector loads fall into the categories above in the stated proportions. Arithmetic operations in `nrn_state` can be quite complex. In `nrn_current`, arithmetic operations are few and simple, and both vector loads and stores fall into the categories above in the stated proportions.
- exp 1: 1 cell/cell group - non-vectorized
- exp 2: 4 cells/cell group - non-vectorized
- exp 3: 1 cell/cell group - vectorized
- exp 4: 4 cells/cell group - vectorized
NON-VECTORIZED
time\config | 1 cell/cell group | 4 cells/cell group |
---|---|---|
nrn_state | 15.74 s | 15.083 s |
nrn_current | 3.26 s | 3.167 s |
matrix | 22.217 s | 24.620 s |
total | 45.027 s | 47.079 s |
VECTORIZED
time\config | 1 cell/cell group | 4 cells/cell group |
---|---|---|
nrn_state | 12.327 s | 7.213 s |
nrn_current | 3.815 s | 3.137 s |
matrix | 21.901 s | 24.980 s |
total | 41.7 s | 39.491 s |
`nrn_current` is faster in the non-vectorized version than in the vectorized version with 1 cell/cell group (3.26 s vs. 3.815 s). This is potentially because 58.6% of the SIMD vectors are `CONSTANT`, and each of their stores requires an additional SIMD reduction.
However, with 4 cells/cell group, the vectorized version becomes faster than the non-vectorized version (3.137 s vs. 3.167 s). It is even faster than the 1 cell/cell group non-vectorized case (3.26 s), but only slightly. There are no `CONSTANT` SIMD vectors left, so the lack of significant speedup could be attributed to other causes, such as the high percentage of `INDEPENDENT` stores, which are no better than non-vectorized serial stores, or the fact that the arithmetic operations in `nrn_current` are not computationally intensive.
`nrn_state` benefits much more from the shift from 1 cell/cell group to 4 cells/cell group (12.327 s to 7.213 s in the vectorized version). However, the speedup does not yet warrant switching the default to 4 cells/cell group, which would require parallelizing the matrix `assemble` and `solve` functions.
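The comparisons above can be quantified as speedup ratios (non-vectorized time divided by vectorized time) computed directly from the two timing tables. The dictionary layout and the `speedup` helper are illustrative only; the numbers are copied from the tables.

```python
# Timings in seconds, copied from the NON-VECTORIZED and VECTORIZED tables.
timings = {
    "non-vectorized": {
        "1 cell/cell group": {"nrn_state": 15.74, "nrn_current": 3.26,
                              "matrix": 22.217, "total": 45.027},
        "4 cells/cell group": {"nrn_state": 15.083, "nrn_current": 3.167,
                               "matrix": 24.620, "total": 47.079},
    },
    "vectorized": {
        "1 cell/cell group": {"nrn_state": 12.327, "nrn_current": 3.815,
                              "matrix": 21.901, "total": 41.7},
        "4 cells/cell group": {"nrn_state": 7.213, "nrn_current": 3.137,
                               "matrix": 24.980, "total": 39.491},
    },
}


def speedup(config, phase):
    """Ratio > 1 means the vectorized version is faster for this phase."""
    scalar = timings["non-vectorized"][config][phase]
    vector = timings["vectorized"][config][phase]
    return scalar / vector


for config in ("1 cell/cell group", "4 cells/cell group"):
    for phase in ("nrn_state", "nrn_current", "matrix", "total"):
        print(f"{config:20s} {phase:12s} {speedup(config, phase):.2f}x")
```

A ratio below 1 for `nrn_current` with 1 cell/cell group reflects the slowdown discussed above, while the `nrn_state` ratio roughly doubles when moving to 4 cells/cell group.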