#include <string.h>   /* declares memcpy */
#define SMARTCOPY memcpy(destination, source, 65536)
int main(void)
{
    char source[65536], destination[65536];
    int i, j;   /* i is unused by SMARTCOPY */
    for (j = 0; j < 100; j++)
        SMARTCOPY;
    return 0;
}
% cc -O cache.c
% time a.out
1.0 seconds user time
# change to DUMBCOPY and recompile
% time a.out
7.0 seconds user time
Compile and time the run of the above program two different ways, first as it is, and then
with the macro call changed to DUMBCOPY. We measured this on a SPARCstation 2, and
there was a consistent large performance degradation with the dumb copy.
The slowdown happens because the source and destination are an exact multiple of the
cache size apart. Cache lines on the SS2 aren't filled sequentially. Instead, the particular
algorithm used fills the same line for main memory addresses that are exact multiples of the
cache size apart. This arises from optimized storage of tags: the low-order bits of an address
select the line, and only the high-order bits are put in the tag in this design.
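The mapping described above can be sketched in a few lines of C. This is only an illustration under assumed parameters: the function names are hypothetical, and the cache size and line size are passed in by the caller rather than taken from any particular machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Which cache line an address lands in: the low-order part of the
   address, divided into line-sized chunks. (Hypothetical helper.) */
static size_t cache_line_index(uintptr_t addr, size_t cache_size,
                               size_t line_size)
{
    assert(line_size > 0 && cache_size % line_size == 0);
    return (size_t)(addr % cache_size) / line_size;
}

/* The tag: the high-order part of the address kept alongside the line. */
static uintptr_t cache_tag(uintptr_t addr, size_t cache_size)
{
    assert(cache_size > 0);
    return addr / cache_size;
}

/* Two addresses fight over one line when they select the same line
   but carry different tags. */
static int addresses_conflict(uintptr_t a, uintptr_t b,
                              size_t cache_size, size_t line_size)
{
    return cache_line_index(a, cache_size, line_size) ==
           cache_line_index(b, cache_size, line_size) &&
           cache_tag(a, cache_size) != cache_tag(b, cache_size);
}
```

With an assumed 65536-byte cache and 32-byte lines, addresses_conflict(0x1000, 0x1000 + 65536, 65536, 32) is true, while two addresses only 32 bytes apart select neighboring lines and do not conflict.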
All machines that use a cache (including supercomputers, modern PCs, and everything in
between) are subject to performance hits from pathological cases like this one. Your
mileage will vary on different machines and different cache implementations.
In this particular case both the source and destination use the same cache line, causing every
memory reference to miss the cache and stall the processor while it waits for regular
memory to deliver. The library memcpy() routine is especially tuned for high performance.
It unrolls the loop to read one full cache line and then write it, which avoids the problem.
Using the smart copy, we were able to get a huge performance improvement. This also
shows the folly of drawing conclusions from simple-minded benchmark programs.
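The read-a-line-then-write-a-line idea can be sketched as follows. This is not the library's memcpy(), which is typically hand-tuned and holds the data in registers. The name line_copy and the 32-byte LINE_BYTES value are assumptions made for illustration.

```c
#include <stddef.h>

#define LINE_BYTES 32   /* assumed cache line size, for illustration only */

/* Copy n bytes by reading a whole line's worth from the source before
   writing any of it to the destination. (Hypothetical sketch; a tuned
   routine would keep the staged data in registers, not a local array.) */
void *line_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    unsigned char line[LINE_BYTES];
    size_t k;

    while (n >= LINE_BYTES) {
        for (k = 0; k < LINE_BYTES; k++)    /* read phase */
            line[k] = s[k];
        for (k = 0; k < LINE_BYTES; k++)    /* write phase */
            d[k] = line[k];
        s += LINE_BYTES;
        d += LINE_BYTES;
        n -= LINE_BYTES;
    }
    for (k = 0; k < n; k++)                 /* leftover tail bytes */
        d[k] = s[k];
    return dst;
}
```

Grouping the reads and writes this way means a conflicting pair of lines is swapped once per line copied, not once per byte. Whether that helps on a given machine depends on its cache design and on what the compiler does with the loops.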
The Data Segment and Heap
We have covered the background on system-related memory issues, so it's time to revisit the layout of
memory inside an individual process. Now that you know the system issues, the process issues will
start making a lot more sense. Specifically, we'll begin by taking a closer look at the data segment
within a process.
Just as the stack segment grows dynamically on demand, so the data segment contains an object that
can do this, namely, the heap, shown in Figure 7-5. The heap area is for dynamically allocated storage,
that is, storage obtained through malloc (memory allocate) and accessed through a pointer.
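As a minimal sketch of heap storage obtained through malloc and reached only through a pointer, consider the following. The helper name make_squares is hypothetical and exists only for this example.

```c
#include <stdlib.h>

/* Allocate an array of n ints on the heap and fill it with squares.
   Returns NULL if the allocation fails; the caller must free() the
   result when done. (Hypothetical helper for illustration.) */
int *make_squares(size_t n)
{
    int *p = malloc(n * sizeof *p);
    size_t i;

    if (p == NULL)
        return NULL;
    for (i = 0; i < n; i++)
        p[i] = (int)(i * i);
    return p;
}
```

A caller would write int *sq = make_squares(5); and, after checking for NULL, use sq[4] like any array element before handing the storage back with free(sq).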