Computational Physics


High performance computing and parallelism


The density was 1.0 and the temperature was 0.8 (in reduced units). Running this
program for 1000 time steps on an SGI Altix MIMD system took 20±1 seconds per
step using 4, 8 and 16 processors.
A second approach focuses on the evaluation of the double sum in the force
calculation. The p nodes available are connected in a ring topology: each node is
connected to a left and a right neighbour, and the leftmost node is connected to the
rightmost node and vice versa. The particles are again divided over the nodes, but
not in a way depending on their positions in the physical system: of the N particles,
the first N/p are allocated to node number 1, the next N/p to node number 2 and so
on (we suppose that N/p is an integer for simplicity). Each node therefore contains
the positions and momenta of the particles allocated to it (we shall use the leap-frog
algorithm for integrating the equations of motion) – these are called the local
particles. In addition to these data, each node contains memory to store the forces
acting on its local particles, and sufficient memory to contain positions and momenta
of ‘travelling particles’, which are to be sent by the other nodes. The number of
travelling particles at a node equals the number of local particles, and therefore we
need to keep free memory available for 6N/p real numbers in order to be able to
receive these travelling particle data.
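This partitioning can be sketched as follows on a single machine; the values of N and p and the random initial state are arbitrary choices for illustration, not taken from the text:

```python
import numpy as np

# Illustrative sizes (assumptions); the text requires N/p to be an integer.
N, p = 12, 4
assert N % p == 0
n_local = N // p

rng = np.random.default_rng(0)
positions = rng.random((N, 3))   # 3 coordinates per particle
momenta = rng.random((N, 3))     # 3 momentum components per particle

# Node k (k = 0, ..., p-1) holds particles k*N/p .. (k+1)*N/p - 1
# as its local particles, independently of where they sit in space.
local_pos = [positions[k*n_local:(k+1)*n_local] for k in range(p)]
local_mom = [momenta[k*n_local:(k+1)*n_local] for k in range(p)]

# Each node also reserves a travelling buffer of the same size:
# (3 positions + 3 momenta) per particle = 6N/p real numbers.
travel_buffer = np.empty((n_local, 6))
assert travel_buffer.size == 6 * N // p
```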
In the first step of the force calculation, all the interactions of local particles with
other local particles are calculated, as described in Chapter 8. Then each node copies
its local positions and momenta to those of the travelling particles. Each node then
sends the positions and momenta of its travelling particles to the left, and it receives
the data sent by its right neighbour into its travelling particle memory area. At this
point, node 1, which still contains the momenta and positions of its local particles
in its local particle memory area, contains the positions and momenta of the local
particles of node 2 in its travelling memory area. The travelling particles of node 1
have been sent to node p.
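A single such shift can be pictured as a rotation of the travelling segments around the ring; the following toy sketch uses hypothetical placeholder labels to track which node's data sits where:

```python
p = 4
# Travelling segments, indexed by the node currently holding them;
# initially each node's travelling data is a copy of its own particles.
travelling = [f"data_from_node_{k + 1}" for k in range(p)]

# One systolic step: every node sends its travelling data to the left
# neighbour and receives from its right neighbour (the ring wraps around),
# which amounts to a cyclic rotation of the list by one position.
travelling = travelling[1:] + travelling[:1]

# As in the text: node 1 now holds node 2's particle data, while
# node 1's original travelling data has arrived at node p.
assert travelling[0] == "data_from_node_2"
assert travelling[p - 1] == "data_from_node_1"
```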
Now the contributions to the forces acting on the local particles of each node
resulting from the interactions with the travelling particles are calculated and added
to the forces. Then the travelling particles are again sent to the left neighbour and
received from the right neighbour (modulo PBC), and the calculation proceeds in
the same fashion. When each segment of travelling particles has visited all the
nodes in this way, all the forces have been calculated, and the new local positions
and momenta are calculated according to the leap-frog scheme in each processor.
Then the procedure starts over again, and so on. The motion of the particles over the
nodes in the force calculation is depicted in Figure 16.7. This algorithm is called the
systolic algorithm [3], because of the similarity between the intermittent data traffic
and the heartbeat.
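A serial sketch of the complete systolic force calculation is given below; the p "nodes" are simulated with list segments on one machine, and `pair_force` is a stand-in soft repulsion rather than the Lennard-Jones force (only positions are shipped here, although the buffers in the text also carry momenta). Checking the result against the direct double sum confirms that every pair interaction is counted exactly once:

```python
import numpy as np

def pair_force(ri, rj):
    # Stand-in repulsive pair force (an assumption for this sketch).
    d = ri - rj
    return d / np.dot(d, d) ** 2

def systolic_forces(positions, p):
    """All forces via the systolic algorithm, with p ring 'nodes' simulated."""
    N = len(positions)
    n = N // p
    segments = [positions[k * n:(k + 1) * n] for k in range(p)]
    forces = [np.zeros((n, 3)) for _ in range(p)]

    # Step 1: each node computes its local-local interactions
    # (Newton's third law halves the work within a node).
    for k in range(p):
        for i in range(n):
            for j in range(i + 1, n):
                f = pair_force(segments[k][i], segments[k][j])
                forces[k][i] += f
                forces[k][j] -= f

    # Each node copies its local particles into its travelling buffer.
    travelling = [seg.copy() for seg in segments]

    # p-1 systolic steps: send travelling data left, receive from the
    # right, then add the local-travelling contributions on every node.
    for _ in range(p - 1):
        travelling = travelling[1:] + travelling[:1]  # ring shift
        for k in range(p):
            for i in range(n):
                for j in range(n):
                    forces[k][i] += pair_force(segments[k][i],
                                               travelling[k][j])
    return np.concatenate(forces)

def direct_forces(positions):
    # Reference: the plain double sum over all distinct pairs.
    N = len(positions)
    F = np.zeros((N, 3))
    for i in range(N):
        for j in range(N):
            if i != j:
                F[i] += pair_force(positions[i], positions[j])
    return F

rng = np.random.default_rng(1)
pos = rng.random((8, 3))
assert np.allclose(systolic_forces(pos, 4), direct_forces(pos))
```

After the forces are complete, each node would advance its local positions and momenta with the leap-frog scheme and the whole procedure would repeat for the next time step.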
We now give the systolic algorithm in schematic form, using a data-parallel
programming model. This means that at each node, memory is allocated for the local
particles, forces and travelling particles, but the data contained in this memory are
