setting, one thread finishes its last value ofiearly, while another thread still
has two values ofileft to do. This would mean our program would finish
somewhat later than it could. In parallel-processing parlance, we would have
aload balanceproblem. With dynamic assignment, the thread that finished
when there were two values ofileft to handle could have taken up one of
those values itself. We would have better balance and theoretically less over-
all runtime.
But don’t jump to conclusions. As always, we have the overhead issue to
reckon with. Recall that acriticalpragma, used in the dynamic version of
the code above, has the effect of temporarily rendering the program serial
rather than parallel, thus causing a slowdown. In addition, for reasons too
technical to discuss here, these pragmas may cause considerable cache activ-
ity overhead. So in the end, the dynamic code could actually be substantially
slower than the static version.
Various solutions to this problem have been developed, such as an
OpenMP construct namedguided. But rather than present these, the point
I wish to make is that they are unnecessary. In most situations, static assign-
ment is just fine. Why is this the case?
You may recall that the standard deviation of the sum of independent,
identically distributed random variables, divided by the mean of that sum,
goes to zero as the number of terms goes to infinity. In other words, sums
are approximately constant. This has a direct implication for our load-
balancing concerns: Since the total work time for a thread in static assign-
ment is the sum of its individual task times, that total work time will be
approximately constant; there will be very little variation from thread to
thread. Thus, they will all finish at pretty close to the same time, and we do
not need to worry about load imbalance. Dynamic scheduling will not be
necessary.
This reasoning does depend on a statistical assumption, but in practice,
the assumption will typically be met sufficiently well for the outcome: Static
scheduling does as well as dynamic in terms of uniformity of total work times
across threads. And since static scheduling doesn’t have the overhead prob-
lems of the dynamic kind, in most cases the static approach will give better
performance.
There is one more aspect of this to discuss. To illustrate the issue,
consider again the mutual outlinks example. Let’s review the outline of
the algorithm:
1 sum=0
2 for i = 0...n-1
3 for j = i+1...n-1
4 for k = 0...n-1 sum = sum + a[i][k]a[j][k]
5 mean = sum / (n(n-1)/2)
Saynis 10000 and we have four threads, and consider ways to partition
thefor iloop. Naively, we might at first decide to have thread 0 handle the
ivalues 0 through 2499, thread 1 handle 2500 through 4999, and so on.
However, this would produce a severe load imbalance, since the thread that
Parallel R 349