small arrays, this has hardly any measurable impact on the performance of array
operations. However, when arrays get large, the story is somewhat different, depending
on the operations to be performed on the arrays.
To illustrate this important point about the memory layout of arrays in science and
finance, consider the following construction of multidimensional numpy.ndarray objects:
In [133]: x = np.random.standard_normal((5, 10000000))
          y = 2 * x + 3  # linear equation y = a * x + b
          C = np.array((x, y), order='C')  # C-like (row-major) memory layout
          F = np.array((x, y), order='F')  # Fortran-like (column-major) memory layout
          x = 0.0; y = 0.0  # memory cleanup
In [134]: C[:2].round(2)
Out[134]: array([[[-0.51, -1.14, -1.07, ...,  0.2 , -0.18,  0.1 ],
                  [-1.22,  0.68,  1.83, ...,  1.23, -0.27, -0.16],
                  [ 0.45,  0.15,  0.01, ..., -0.75,  0.91, -1.12],
                  [-0.16,  1.4 , -0.79, ..., -0.33,  0.54,  1.81],
                  [ 1.07, -1.07, -0.37, ..., -0.76,  0.71,  0.34]],

                 [[ 1.98,  0.72,  0.86, ...,  3.4 ,  2.64,  3.21],
                  [ 0.55,  4.37,  6.66, ...,  5.47,  2.47,  2.68],
                  [ 3.9 ,  3.29,  3.03, ...,  1.5 ,  4.82,  0.76],
                  [ 2.67,  5.8 ,  1.42, ...,  2.34,  4.09,  6.63],
                  [ 5.14,  0.87,  2.27, ...,  1.48,  4.43,  3.67]]])
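The order parameter only determines how the same data is arranged in memory; the values in C and F are identical. This can be verified via the flags attribute of the two objects; the following is a small check that is not part of the session above:

C.flags['C_CONTIGUOUS']  # True: row-major (C-like) layout
F.flags['F_CONTIGUOUS']  # True: column-major (Fortran-like) layout

With the C-like layout, elements along the last axis lie next to each other in memory; with the Fortran-like layout, elements along the first axis do.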
Let's look at some fundamental examples and use cases for both types of ndarray
objects:
In [135]: %timeit C.sum()
Out[135]: 10 loops, best of 3: 123 ms per loop
In [136]: %timeit F.sum()
Out[136]: 10 loops, best of 3: 123 ms per loop
When summing up all elements of the arrays, there is no performance difference between
the two memory layouts. However, consider the following example with the C-like
memory layout:
In [137]: %timeit C[0].sum(axis=0)
Out[137]: 10 loops, best of 3: 102 ms per loop
In [138]: %timeit C[0].sum(axis=1)
Out[138]: 10 loops, best of 3: 61.9 ms per loop
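One way to see where this asymmetry comes from is to inspect the strides of the array, i.e., the number of bytes separating consecutive elements along each axis. The following small check is not part of the session above; the stride values in the comment follow from the shape (5, 10000000) and the 8-byte float64 elements:

C[0].strides  # (80000000, 8): along axis 1 (within a row) neighboring
              # elements are 8 bytes apart; along axis 0 (from row to row)
              # they are 80,000,000 bytes apart

A reduction that moves along the small-stride axis can therefore read memory sequentially.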
Summing over the first axis (In [137]), i.e., adding up the five large row vectors element
by element and getting back a single result vector of 10,000,000 values, is slower in this
case than summing over the second axis (In [138]), which reduces each of the five rows to
a single value. This is due to the fact that, with the C-like layout, the elements of each
row are stored next to each other in memory, so the row-wise reduction sweeps through
contiguous memory, whereas the axis-0 sum has to build up a result vector of 10,000,000
elements. With the Fortran-like memory layout, the relative performance changes
considerably:
In [139]: %timeit F.sum(axis=0)
Out[139]: 1 loops, best of 3: 801 ms per loop
In [140]: %timeit F.sum(axis=1)
Out[140]: 1 loops, best of 3: 2.23 s per loop
In [141]: F = 0.0; C = 0.0  # memory cleanup
In this case, the sum over axis 0 performs considerably better than the sum over axis 1,
the opposite of what we observed with the C-like layout. With the Fortran-like layout it
is the elements along the first axis that are stored next to each other in memory, which
explains the relative performance advantage of the axis-0 reduction. Overall, however,
both operations are much slower in absolute terms than the comparable operations on the
C-ordered data.
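As a practical consequence, it can pay off to choose, or explicitly convert, the memory layout so that it matches the dominant access pattern of the subsequent computations. The following self-contained sketch (the array a and its size are purely illustrative) shows how the layout of an existing array can be inspected and converted:

import numpy as np

a = np.random.standard_normal((5, 1000000))  # C-like (row-major) by default
a.flags['C_CONTIGUOUS']      # True

f = np.asfortranarray(a)     # copy with Fortran-like (column-major) layout
f.flags['F_CONTIGUOUS']      # True

c = np.ascontiguousarray(f)  # copy back to C-like layout
c.flags['C_CONTIGUOUS']      # True

Both conversion functions copy the data, so converting only pays off when the array is subsequently used in many layout-sensitive operations.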