7.6 PARALLEL NUMERICAL ALGORITHMS 203
Analysis
This process took seven cycles to multiply a 2 3 matrix by a 3 4 matrix.
There is a general formula that we can develop to determine the number of
cycles that we need to complete this process. We first need to consider the pro-
cess involved. Counting how many cycles are necessary to get the last value in
the last row of the first matrix out of the mesh is one way to determine how
long this process takes. We could also do this analysis based on the columns
because for the process to finish, the last element of the final row of the first
matrix and the last element of the final column of the second matrix must
leave the mesh, which happens at the same time.
The rows of the first matrix shift by one column each cycle of the process. If
our first matrix has I rows and J columns (IJ), how many cycles will it take
before the last number of the first row is inside the mesh? Because each cycle
adds one number to the mesh and the first number enters the mesh on the first
cycle, it will take J 1 more cycles for the last value in the first row to be in
the mesh. How many additional cycles will it take for this number to leave the
mesh when we are done with it? Because this number needs to be multiplied
by a value in each column of the second matrix (JK), it will take K cycles
for this number to leave the matrix. Because we delay each of the successive
rows and columns, we need to consider what happens with the last row to see
what really goes on overall now that we understand how the first row works.
As was said, each row of the first matrix starts one cycle later than the row
above it. So, the second row starts on cycle 2, the third row starts on cycle 3,
and the last row starts on cycle I. We said that it takes J 1 more cycles for the
last value in a row to enter the mesh and K cycles for it to leave the mesh. This
means that the entire process will, in general, take I + J + K 1 cycles to
complete. The run time of our mesh matrix multiplication is O(N), where N
= maximum(I,J,K). The number of processors we need is O(N^2 ), and so the
cost of our parallel version is O(N^3 ), which is the same as the standard matrix
multiplication algorithm. The real value of this mesh algorithm is that it has a
much shorter run time than any of the sequential matrix multiplication meth-
ods we considered in Chapter 4. Any algorithm that relies on a large number of
matrix multiplications can see a dramatic improvement by implementation on
a mesh network. For example, in the introduction to Chapter 4, we discussed a
convolution that multiplies a 5 5 matrix by every 5 5 patch of a 512
512 image. Using the standard sequential matrix multiplication algorithm, we