processor, the speed-up can be much higher than for addition. Often, pipeline
processors are able to run a pipeline for addition and multiplication simultaneously, and
the maximum obtainable speed, measured in floating point operations per second,
is then increased by a factor of two. A pipeline processor with a clock time of 5 ns
can therefore achieve a peak performance of 2/(5 × 10⁻⁹) = 4 × 10⁸ floating point
operations per second, i.e. 400 MFLOPS (1 MFLOPS = 10⁶ FLOPS).
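As a minimal illustration of this arithmetic, the following C fragment computes the peak rate from the clock time and the number of simultaneously running pipelines; the variable names are ours and do not describe any particular machine.

#include <stdio.h>

int main(void)
{
    double clock_time = 5.0e-9;   /* clock cycle time: 5 ns */
    int pipelines = 2;            /* addition and multiplication pipelines in parallel */

    /* at peak, each pipeline delivers one result per clock cycle */
    double peak = pipelines / clock_time;
    printf("peak performance: %.0f MFLOPS\n", peak / 1.0e6);
    return 0;
}

Running this prints "peak performance: 400 MFLOPS", the figure quoted above.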
In practice, a pipeline processor will not reach its peak performance for various
reasons. First, the overheads at the beginning and the end of the pipeline slightly
decrease the speed, as we have seen. More importantly, a typical program does
not exclusively contain simultaneous additions and multiplications of pairs of large
vectors. Finally, for various operations, pipeline machines suffer from slow memory
access. This is a result of the fact that memory chips work at a substantially slower
rate than the processor, simply because faster memory is too expensive to include
in the computer. The typical difference in speed of the two is a factor of four (or
more). This means that the pipeline cannot feed itself with data at the same rate
at which it operates, nor can it get rid of its data as fast as they are produced. In
so-called vector processors, the data are read into a fast processor memory, the
so-called vector registers, from which they can be fed into the pipelines at the processor
clock-cycle rate. This still leaves the problem of feeding the vector registers at a fast
enough rate, but if the same vectors are used more than once, a significant increase
in performance is realised.
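The sketch below shows, in C, how a long vector addition might be split into chunks that fit in a vector register, a technique known as strip mining. The register length VLEN and the explicit chunk loop are illustrative assumptions; on a real vector machine the compiler and hardware perform this division, so this is a sketch of the idea rather than an actual machine interface.

#define VLEN 64   /* assumed vector register length; machine dependent */

/* c = a + b, processed in register-sized chunks ("strip mining") */
void vector_add(int n, const double *a, const double *b, double *c)
{
    for (int i = 0; i < n; i += VLEN) {
        int len = (n - i < VLEN) ? n - i : VLEN;
        /* on a vector processor, this inner loop maps onto a single
           register-to-register pipelined operation */
        for (int j = 0; j < len; j++)
            c[i + j] = a[i + j] + b[i + j];
    }
}

Loading each chunk into the registers is still limited by the speed of the main memory, which is the problem taken up next.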
Because of this problem, which may cause the performance for many operations
to be reduced by a factor of four, various solutions have been developed. First of
all, the memory is often organised into four or morememory banks. Suppose that
we have four banks, and that each bank can be accessed only once in four processor
clock cycles (i.e. the memory access time is four times as long as the processor clock
cycle), but after accessing bank number one, bank number two (or three or four) can
be accessed immediately at the next clock cycle. Therefore, the banks are cycled
through in succession (see Figure 16.2). For this to be possible, vectors are
distributed over the memory banks such that, for four banks, the first bank contains
the elements 1, 5, 9, 13, etc., the second one the elements 2, 6, 10, 14, and so on.
Cycling through the memory banks enables the processor to fetch a vector from memory
at a rate of one element per clock cycle. In the case of the vector-addition code above,
this is not enough, as two numbers are required per clock cycle. Also, if we want to
carry out operations on the elements 1, 5, 9, ... of a vector, the memory access
time increases by a factor of four. Such problems are called memory bank conflicts.
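The bank assignment described above is easy to express in code. The following C sketch uses zero-based element indices, so element i lives in bank i mod 4; it shows why stride-1 access cycles through all four banks, while stride-4 access hits the same bank on every reference and therefore suffers a bank conflict.

#include <stdio.h>

#define NBANKS 4

/* interleaved storage: element i (zero-based) lives in bank i % NBANKS */
static int bank_of(int i) { return i % NBANKS; }

int main(void)
{
    printf("stride 1:");
    for (int i = 0; i < 8; i++)        /* banks 0,1,2,3,0,1,2,3: no waiting */
        printf(" %d", bank_of(i));
    printf("\nstride 4:");
    for (int i = 0; i < 32; i += 4)    /* always bank 0: a bank conflict */
        printf(" %d", bank_of(i));
    printf("\n");
    return 0;
}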
Another device for solving the slow memory problem is based on the observa-
tion that a program will often manipulate a restricted amount of data more than
once. Therefore a relatively small but fast memory, the so-called cache memory, is
included. This memory acts as a buffer between the main memory and the processor,
and it contains the data which have been accessed most recently by the processor.
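A common way to see the effect of the cache is to compare two traversal orders of the same data. In the C sketch below (our own example, not tied to any particular machine) the row-wise loop uses every element of each cache line once it has been loaded, whereas the column-wise loop touches a different line on almost every access and so runs at main-memory speed.

#define N 1024
static double a[N][N];   /* C stores rows contiguously (row-major order) */

/* consecutive addresses: each cache line is fully used once loaded */
double sum_by_rows(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* stride-N addresses: a different cache line on nearly every access */
double sum_by_columns(void)
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}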