-
-
Notifications
You must be signed in to change notification settings - Fork 340
/
03-anatomy.rst
498 lines (375 loc) · 19 KB
/
03-anatomy.rst
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
Anatomy of an array
===============================================================================
.. contents:: **Contents**
:local:
Introduction
------------
As explained in the Preface_, you should have a basic experience with NumPy to
read this book. If this is not the case, you'd better start with a beginner
tutorial before coming back here. Consequently I'll only give here a quick
reminder on the basic anatomy of NumPy arrays, especially regarding the memory
layout, view, copy and the data type. They are critical notions to
understand if you want your computation to benefit from NumPy philosophy.
Let's consider a simple example where we want to clear all the values from an
array which has the dtype `np.float32`. How does one write it to maximize speed? The
below syntax is rather obvious (at least for those familiar with NumPy) but the
above question asks to find the fastest operation.
.. code-block:: python
>>> Z = np.ones(4*1000000, np.float32)
>>> Z[...] = 0
If you look more closely at both the dtype and the size of the array, you can
observe that this array can be casted (i.e. viewed) into many other
"compatible" data types. By compatible, I mean that `Z.size * Z.itemsize` can
be divided by the new dtype itemsize.
.. code-block:: python
>>> timeit("Z.view(np.float16)[...] = 0", globals())
100 loops, best of 3: 2.72 msec per loop
>>> timeit("Z.view(np.int16)[...] = 0", globals())
100 loops, best of 3: 2.77 msec per loop
>>> timeit("Z.view(np.int32)[...] = 0", globals())
100 loops, best of 3: 1.29 msec per loop
>>> timeit("Z.view(np.float32)[...] = 0", globals())
100 loops, best of 3: 1.33 msec per loop
>>> timeit("Z.view(np.int64)[...] = 0", globals())
100 loops, best of 3: 874 usec per loop
>>> timeit("Z.view(np.float64)[...] = 0", globals())
100 loops, best of 3: 865 usec per loop
>>> timeit("Z.view(np.complex128)[...] = 0", globals())
100 loops, best of 3: 841 usec per loop
>>> timeit("Z.view(np.int8)[...] = 0", globals())
100 loops, best of 3: 630 usec per loop
Interestingly enough, the obvious way of clearing all the values is not the
fastest. By casting the array into a larger data type such as `np.float64`, we
gained a 25% speed factor. But, by viewing the array as a byte array
(`np.int8`), we gained a 50% factor. The reason for such speedup are to be
found in the internal NumPy machinery and the compiler optimization. This
simple example illustrates the philosophy of NumPy as we'll see in the next
section below.
Memory layout
-------------
The `NumPy documentation
<https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html>`_ defines the
ndarray class very clearly:
*An instance of class ndarray consists of a contiguous one-dimensional segment
of computer memory (owned by the array, or by some other object), combined
with an indexing scheme that maps N integers into the location of an item in
the block.*
Said differently, an array is mostly a contiguous block of memory whose parts
can be accessed using an indexing scheme. Such indexing scheme is in turn
defined by a `shape
<https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.shape.html#numpy.ndarray.shape>`_
and a `data type
<https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html>`_ and this is
precisely what is needed when you define a new array:
.. code:: python
Z = np.arange(9).reshape(3,3).astype(np.int16)
Here, we know that Z itemsize is 2 bytes (`int16`), the shape is (3,3) and
the number of dimensions is 2 (`len(Z.shape)`).
.. code:: pycon
>>> Z.itemsize
2
>>> Z.shape
(3, 3)
>>> Z.ndim
2
Furthermore and because Z is not a view, we can deduce the
`strides <https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.strides.html#numpy.ndarray.strides>`_ of the array that define the number of bytes to step in each dimension when traversing the array.
.. code:: pycon
>>> strides = Z.shape[1]*Z.itemsize, Z.itemsize
>>> strides
(6, 2)
>>> Z.strides
(6, 2)
With all these information, we know how to access a specific item (designed by
an index tuple) and more precisely, how to compute the start and end offsets:
.. code:: python
offset_start = 0
for i in range(Z.ndim):
offset_start += Z.strides[i] * index[i]
offset_end = offset_start + Z.itemsize
offset_end
Let's see if this is correct using the `tobytes
<https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.tobytes.html>`_
conversion method:
.. code:: python
>>> Z = np.arange(9).reshape(3, 3).astype(np.int16)
>>> index = 1, 1
>>> Z[index].tobytes()
b'\x04\x00'
>>> offset_start = 0
>>> for i in range(Z.ndim):
... offset_start += Z.strides[i] * index[i]
...
>>> offset_end = offset_start + Z.itemsize
>>> Z.tobytes()[offset_start:offset_end]
b'\x04\x00'
This array can be actually considered from different perspectives (i.e. layouts):
**Item layout**
.. code::
:class: output
Z.shape[1]
(=3)
┌───────────┐
┌ ┌───┬───┬───┐ ┐
│ │ 0 │ 1 │ 2 │ │
│ ├───┼───┼───┤ │
Z.shape[0] │ │ 3 │ 4 │ 5 │ │ len(Z)
(=3) │ ├───┼───┼───┤ │ (=3)
│ │ 6 │ 7 │ 8 │ │
└ └───┴───┴───┘ ┘
**Flattened item layout**
.. code::
:class: output
┌───┬───┬───┬───┬───┬───┬───┬───┬───┐
│ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ 8 │
└───┴───┴───┴───┴───┴───┴───┴───┴───┘
└───────────────────────────────────┘
Z.size
(=9)
**Memory layout (C order, big endian)**
.. code::
:class: output
Z.strides[1]
(=2)
┌─────────────────────┐
┌ ┌──────────┬──────────┐ ┐
│ p+00: │ 00000000 │ 00000000 │ │
│ ├──────────┼──────────┤ │
│ p+02: │ 00000000 │ 00000001 │ │ Z.strides[0]
│ ├──────────┼──────────┤ │ (=2x3)
│ p+04 │ 00000000 │ 00000010 │ │
│ ├──────────┼──────────┤ ┘
│ p+06 │ 00000000 │ 00000011 │
│ ├──────────┼──────────┤
Z.nbytes │ p+08: │ 00000000 │ 00000100 │
(=3x3x2) │ ├──────────┼──────────┤
│ p+10: │ 00000000 │ 00000101 │
│ ├──────────┼──────────┤
│ p+12: │ 00000000 │ 00000110 │
│ ├──────────┼──────────┤
│ p+14: │ 00000000 │ 00000111 │
│ ├──────────┼──────────┤
│ p+16: │ 00000000 │ 00001000 │
└ └──────────┴──────────┘
└─────────────────────┘
Z.itemsize
Z.dtype.itemsize
(=2)
If we now take a slice of `Z`, the result is a view of the base array `Z`:
.. code-block:: python
V = Z[::2,::2]
Such view is specified using a shape, a dtype **and** strides because strides
cannot be deduced anymore from the dtype and shape only:
**Item layout**
.. code::
:class: output
V.shape[1]
(=2)
┌───────────┐
┌ ┌───┬╌╌╌┬───┐ ┐
│ │ 0 │ │ 2 │ │ ┌───┬───┐
│ ├───┼╌╌╌┼───┤ │ │ 0 │ 2 │
V.shape[0] │ ╎ ╎ ╎ ╎ │ len(V) → ├───┼───┤
(=2) │ ├───┼╌╌╌┼───┤ │ (=2) │ 6 │ 8 │
│ │ 6 │ │ 8 │ │ └───┴───┘
└ └───┴╌╌╌┴───┘ ┘
**Flattened item layout**
.. code::
:class: output
┌───┬╌╌╌┬───┬╌╌╌┬╌╌╌┬╌╌╌┬───┬╌╌╌┬───┐ ┌───┬───┬───┬───┐
│ 0 │ │ 2 │ ╎ ╎ │ 6 │ │ 8 │ → │ 0 │ 2 │ 6 │ 8 │
└───┴╌╌╌┴───┴╌╌╌┴╌╌╌┴╌╌╌┴───┴╌╌╌┴───┘ └───┴───┴───┴───┘
└─┬─┘ └─┬─┘ └─┬─┘ └─┬─┘
└───┬───┘ └───┬───┘
└───────────┬───────────┘
V.size
(=4)
**Memory layout (C order, big endian)**
.. code::
:class: output
┌ ┌──────────┬──────────┐ ┐ ┐
┌─┤ p+00: │ 00000000 │ 00000000 │ │ │
│ └ ├──────────┼──────────┤ │ V.strides[1] │
┌─┤ p+02: │ │ │ │ (=4) │
│ │ ┌ ├──────────┼──────────┤ ┘ │
│ └─┤ p+04 │ 00000000 │ 00000010 │ │
│ └ ├──────────┼──────────┤ │ V.strides[0]
│ p+06: │ │ │ │ (=12)
│ ├──────────┼──────────┤ │
V.nbytes ─┤ p+08: │ │ │ │
(=8) │ ├──────────┼──────────┤ │
│ p+10: │ │ │ │
│ ┌ ├──────────┼──────────┤ ┘
│ ┌─┤ p+12: │ 00000000 │ 00000110 │
│ │ └ ├──────────┼──────────┤
└─┤ p+14: │ │ │
│ ┌ ├──────────┼──────────┤
└─┤ p+16: │ 00000000 │ 00001000 │
└ └──────────┴──────────┘
└─────────────────────┘
V.itemsize
V.dtype.itemsize
(=2)
Views and copies
----------------
Views and copies are important concepts for the optimization of your numerical
computations. Even if we've already manipulated them in the previous section,
the whole story is a bit more complex.
Direct and indirect access
++++++++++++++++++++++++++
First, we have to distinguish between `indexing
<https://docs.scipy.org/doc/numpy/user/basics.indexing.html#>`_ and `fancy
indexing <https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#advanced-indexing>`_. The first will always return a view while the second will return a
copy. This difference is important because in the first case, modifying the view
modifies the base array while this is not true in the second case:
.. code:: pycon
>>> Z = np.zeros(9)
>>> Z_view = Z[:3]
>>> Z_view[...] = 1
>>> Z
array([1., 1., 1., 0., 0., 0., 0., 0., 0.])
>>> Z = np.zeros(9)
>>> Z_copy = Z[[0, 1, 2]]
>>> Z_copy[...] = 1
>>> Z
array([0., 0., 0., 0., 0., 0., 0., 0., 0.])
Thus, if you need fancy indexing, it's better to keep a copy of your fancy index
(especially if it was complex to compute it) and to work with it:
.. code:: pycon
>>> Z = np.zeros(9)
>>> index = [0, 1, 2]
>>> Z[index] = 1
>>> Z
array([1., 1., 1., 0., 0., 0., 0., 0., 0.])
If you are unsure if the result of your indexing is a view or a copy, you can
check what is the `base` of your result. If it is `None`, then you result is a
copy:
.. code:: pycon
>>> Z = np.random.uniform(0, 1, (5, 5))
>>> Z1 = Z[:3, :]
>>> Z2 = Z[[0, 1, 2], :]
>>> np.allclose(Z1, Z2)
True
>>> Z1.base is Z
True
>>> Z2.base is Z
False
>>> Z2.base is None
True
Note that some NumPy functions return a view when possible (e.g. `ravel
<https://docs.scipy.org/doc/numpy/reference/generated/numpy.ravel.html>`_)
while some others always return a copy (e.g. `flatten
<https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.flatten.html#numpy.ndarray.flatten>`_):
.. code:: pycon
>>> Z = np.zeros((5, 5))
>>> Z.ravel().base is Z
True
>>> Z[::2, ::2].ravel().base is Z
False
>>> Z.flatten().base is Z
False
Temporary copy
++++++++++++++
Copies can be made explicitly like in the previous section, but the most
general case is the implicit creation of intermediate copies. This is the case
when you are doing some arithmetic with arrays:
.. code:: pycon
>>> X = np.ones(10, dtype=int)
>>> Y = np.ones(10, dtype=int)
>>> A = 2*X + 2*Y
In the example above, three intermediate arrays have been created. One for
holding the result of `2*X`, one for holding the result of `2*Y` and the last
one for holding the result of `2*X+2*Y`. In this specific case, the arrays are
small enough and this does not really make a difference. However, if your
arrays are big, then you have to be careful with such expressions and wonder if
you can do it differently. For example, if only the final result matters and
you don't need `X` nor `Y` afterwards, an alternate solution would be:
.. code:: pycon
>>> X = np.ones(10, dtype=int)
>>> Y = np.ones(10, dtype=int)
>>> np.multiply(X, 2, out=X)
>>> np.multiply(Y, 2, out=Y)
>>> np.add(X, Y, out=X)
Using this alternate solution, no temporary array has been created. The problem
is that there are many other cases where such copies need to be created and
this impacts the performance like demonstrated in the example below:
.. code:: pycon
>>> X = np.ones(100000000, dtype=int)
>>> Y = np.ones(100000000, dtype=int)
>>> timeit("global X; X = X + 2.0*Y", globals())
100 loops, best of 3: 3.61 ms per loop
>>> timeit("global X; X = X + 2*Y", globals())
100 loops, best of 3: 3.47 ms per loop
>>> timeit("global X; X += 2*Y", globals())
100 loops, best of 3: 2.79 ms per loop
>>> timeit("np.add(X, Y, out=X); np.add(X, Y, out=X)", globals())
1000 loops, best of 3: 1.57 ms per loop
Conclusion
----------
As a conclusion, we'll make an exercise. Given two vectors `Z1` and `Z2`. We
would like to know if `Z2` is a view of `Z1` and if yes, what is this view ?
.. code-block::
>>> Z1 = np.arange(10)
>>> Z2 = Z1[1:-1:2]
.. code-block::
:class: output
╌╌╌┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬╌╌
Z1 │ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ 8 │ 9 │
╌╌╌┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴╌╌
╌╌╌╌╌╌╌┬───┬╌╌╌┬───┬╌╌╌┬───┬╌╌╌┬───┬╌╌╌╌╌╌╌╌╌╌
Z2 │ 1 │ │ 3 │ │ 5 │ │ 7 │
╌╌╌╌╌╌╌┴───┴╌╌╌┴───┴╌╌╌┴───┴╌╌╌┴───┴╌╌╌╌╌╌╌╌╌╌
First, we need to check if `Z1` is the base of `Z2`
.. code-block::
>>> Z2.base is Z1
True
At this point, we know `Z2` is a view of `Z1`, meaning `Z2` can be expressed as
`Z1[start:stop:step]`. The difficulty is to find `start`, `stop` and
`step`. For the `step`, we can use the `strides` property of any array that
gives the number of bytes to go from one element to the other in each
dimension. In our case, and because both arrays are one-dimensional, we can
directly compare the first stride only:
.. code-block::
>>> step = Z2.strides[0] // Z1.strides[0]
>>> step
2
Next difficulty is to find the `start` and the `stop` indices. To do this, we
can take advantage of the `byte_bounds` method that returns a pointer to the
end-points of an array.
.. code-block::
:class: output
byte_bounds(Z1)[0] byte_bounds(Z1)[-1]
↓ ↓
╌╌╌┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬╌╌
Z1 │ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ 8 │ 9 │
╌╌╌┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴╌╌
byte_bounds(Z2)[0] byte_bounds(Z2)[-1]
↓ ↓
╌╌╌╌╌╌╌┬───┬╌╌╌┬───┬╌╌╌┬───┬╌╌╌┬───┬╌╌╌╌╌╌╌╌╌╌
Z2 │ 1 │ │ 3 │ │ 5 │ │ 7 │
╌╌╌╌╌╌╌┴───┴╌╌╌┴───┴╌╌╌┴───┴╌╌╌┴───┴╌╌╌╌╌╌╌╌╌╌
.. code-block::
>>> offset_start = np.byte_bounds(Z2)[0] - np.byte_bounds(Z1)[0]
>>> offset_start # bytes
8
>>> offset_stop = np.byte_bounds(Z2)[-1] - np.byte_bounds(Z1)[-1]
>>> offset_stop # bytes
-16
Converting these offsets into indices is straightforward using the `itemsize`
and taking into account that the `offset_stop` is negative (end-bound of `Z2`
is logically smaller than end-bound of `Z1` array). We thus need to add the
items size of Z1 to get the right end index.
.. code-block::
>>> start = offset_start // Z1.itemsize
>>> stop = Z1.size + offset_stop // Z1.itemsize
>>> start, stop, step
(1, 8, 2)
Last we test our results:
.. code-block::
>>> np.allclose(Z1[start:stop:step], Z2)
True
As an exercise, you can improve this first and very simple implementation by
taking into account:
* Negative steps
* Multi-dimensional arrays
`Solution <code/find_index.py>`_ to the exercise.