544 High performance computing and parallelism
Bank 1:  a[1]  a[5]  a[9]   a[13]  ...
Bank 2:  a[2]  a[6]  a[10]  a[14]  ...
Bank 3:  a[3]  a[7]  a[11]  a[15]  ...
Bank 4:  a[4]  a[8]  a[12]  a[16]  ...
Figure 16.2. Distribution of a one-dimensional array over four memory banks.
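The layout of Figure 16.2 can be written as a simple index rule. The following is a minimal sketch in C, assuming 1-based element indices as in the figure and banks numbered 1 to 4; the names NUM_BANKS, bank_of and slot_in_bank are hypothetical and chosen only for illustration.

```c
/* Number of memory banks, as in Figure 16.2. */
#define NUM_BANKS 4

/* Bank (numbered 1..NUM_BANKS, an assumed convention) that holds
   element a[i], where i = 1, 2, 3, ... as in the figure. */
int bank_of(int i)
{
    return (i - 1) % NUM_BANKS + 1;
}

/* Position of a[i] inside its bank, counted from 0. */
int slot_in_bank(int i)
{
    return (i - 1) / NUM_BANKS;
}
```

With this rule, consecutive elements a[1], a[2], a[3], a[4] fall into four different banks, whereas a[1], a[5], a[9], a[13] all fall into the same bank.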
Therefore, if a first pipeline using vectors a, b and c is executed, it will cause a
rather large overhead because these arrays have to be loaded from main memory
into the cache. However, a subsequent pipeline using (a subset of) a, b and c will
run much faster because these vectors are still stored in the cache. Obviously the
cache paradigm is based on statistical considerations, and for some programs it
may not improve the performance at all because most data are fetched directly
from main memory (cache thrashing). Sometimes the cache is divided into several
levels; the levels closer to the processor are faster and more expensive, but smaller.
Level-1 cache is built into the processor; level-2 cache is either part of the
processor or external.
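The reuse effect and cache thrashing can be illustrated with a toy model. The sketch below is not a description of any real processor: it assumes a hypothetical direct-mapped cache of CACHE_SLOTS words with one word per slot, and simply counts how many accesses find their data already in the cache. All names (ToyCache, cache_init, cache_access, sweep_hits) are introduced here for illustration only.

```c
#define CACHE_SLOTS 8   /* hypothetical cache capacity, in words */

/* Toy direct-mapped cache: address addr is stored in slot addr % CACHE_SLOTS. */
typedef struct {
    long tag[CACHE_SLOTS];  /* address currently held in each slot, -1 if empty */
    long hits;              /* accesses served from the cache */
    long misses;            /* accesses that had to go to main memory */
} ToyCache;

void cache_init(ToyCache *c)
{
    for (int s = 0; s < CACHE_SLOTS; s++)
        c->tag[s] = -1;
    c->hits = 0;
    c->misses = 0;
}

void cache_access(ToyCache *c, long addr)
{
    int slot = (int)(addr % CACHE_SLOTS);
    if (c->tag[slot] == addr) {
        c->hits++;
    } else {
        c->misses++;
        c->tag[slot] = addr;   /* load from main memory, evicting the old entry */
    }
}

/* Sweep over addresses 0..n-1 'passes' times and return the number of hits. */
long sweep_hits(long n, int passes)
{
    ToyCache c;
    cache_init(&c);
    for (int p = 0; p < passes; p++)
        for (long addr = 0; addr < n; addr++)
            cache_access(&c, addr);
    return c.hits;
}
```

In this model a vector of 8 words swept twice gives 8 misses on the first pass and 8 hits on the second, which corresponds to the faster subsequent pipeline. A vector of 16 words swept twice gives no hits at all, because each element has been evicted by the time it is needed again.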
16.2.2 Implications for programming
For a pipeline process to be possible, it should be allowed to run without interruption.
If the processor receives commands to check whether elements of the vector being
processed are zero, or to check whether the loop must be interrupted, it cannot
continue the pipeline and performance drops dramatically. Pipelines usually work
in vector processing mode, in which the processor receives a command such as
‘calculate the scalar product’ or ‘add two vectors’, whose operands are the memory
locations (addresses) of the first elements of the vectors, the length of the pipeline
and the address of the output value (scalar or vector). How can we tell the processor
to start a pipeline rather than operate in conventional mode? This is usually done
by the compiler, which recognises the parts of our program that can be pipelined.
In the language Fortran 90 the statement
    c = a + b
for one-dimensional arrays a, b and c is equivalent to the vector addition considered
above. This can easily be recognised by a compiler as a statement which can be
executed in pipeline mode.
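As an illustration of these points, the following C sketch shows a loop that plays the role of the Fortran 90 statement c = a + b, together with two versions of a scalar product. The function names are hypothetical. Whether a given compiler actually pipelines or vectorises each loop depends on the compiler and its settings; the sketch only shows the loop shapes discussed above, namely a loop without tests and a loop whose exit depends on the data.

```c
#include <stddef.h>

/* C analogue of the Fortran 90 statement c = a + b for arrays of length n.
   'restrict' states that the arrays do not overlap, which a compiler needs
   to know before it can treat the loop as a single vector operation. */
void vec_add(const double *restrict a, const double *restrict b,
             double *restrict c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Scalar product with no tests inside the loop: the loop can run
   from start to end without interruption. */
double dot_plain(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Scalar product that checks every element of a for zero and stops at the
   first one. The check and the possible early exit are the kind of
   interruption described in the text. */
double dot_until_zero(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] == 0.0)
            break;
        sum += a[i] * b[i];
    }
    return sum;
}
```

The operands of vec_add match those of the vector command described earlier: the addresses of the first elements of the input vectors, the address of the output and the length.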