The Art of R Programming


the counts or totals for that chunk and record them. After reading all the
chunks, we add up all the counts or totals in order to calculate our grand
means or proportions.
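
The chunk-by-chunk bookkeeping just described can be sketched in base R.
This is only a minimal illustration: the helper name chunked_mean(), the
chunk size, and the assumption that the file holds one numeric value per
line are all hypothetical choices made for the example.

```r
# Hypothetical helper: mean of a one-number-per-line file, read in chunks.
chunked_mean <- function(path, chunk_size = 1000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0   # running sum over all chunks
  n <- 0       # running count over all chunks
  repeat {
    lines <- readLines(con, n = chunk_size)  # read the next chunk
    if (length(lines) == 0) break            # end of file reached
    x <- as.numeric(lines)
    total <- total + sum(x)                  # record this chunk's total
    n <- n + length(x)                       # and its count
  }
  total / n                                  # grand mean
}

# Small demonstration on a temporary file holding the numbers 1 to 10.
tmp <- tempfile()
writeLines(as.character(1:10), tmp)
chunked_mean(tmp, chunk_size = 3)
```

Only one chunk is held in memory at a time, so the chunk size can be
tuned to whatever the machine can comfortably hold.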
As another example, suppose we are performing a statistical operation,
say calculating principal components, in which we have a huge number of
rows (that is, a huge number of observations) but the number of variables
is manageable. Again, chunking could be the solution. We apply the
statistical operation to each chunk and then average the results over all
the chunks. My mathematical research shows that the resulting estimators
are statistically efficient in a wide class of statistical methods.
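
As a rough sketch of the apply-then-average idea, the code below uses
linear regression rather than principal components, simply because
regression coefficients are easy to average. The simulated data, the
choice of 10 chunks, and all variable names are assumptions made for the
illustration. In a real application each chunk would be read from disk
in turn rather than split from a data frame already in memory.

```r
set.seed(1)
n <- 10000
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
dat <- data.frame(x = x, y = y)

# Split the rows into 10 equal-size chunks (kept in memory here only to
# keep the demonstration short).
chunk_id <- rep(1:10, each = n / 10)
chunks <- split(dat, chunk_id)

# Apply the statistical operation to each chunk; the result is a matrix
# with one column of coefficients per chunk.
chunk_coefs <- sapply(chunks, function(d) coef(lm(y ~ x, data = d)))

# Average the per-chunk estimates.
avg_coef <- rowMeans(chunk_coefs)

# For comparison only: the fit on the full data set.
full_coef <- coef(lm(y ~ x, data = dat))
```

Comparing avg_coef with full_coef gives a quick sense of how little is
lost by averaging in this particular setting.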


14.6.2 Using R Packages for Memory Management


At a somewhat higher level of sophistication, some specialized R packages
offer alternatives for accommodating large memory requirements.
One such package is RMySQL, an R interface to MySQL databases. Using it
requires some database expertise, but this package provides a much more
efficient and convenient way to handle large data sets. The idea is to have
the database perform the variable/case selection operations at its end
and then read into R only the selected data that the query returns.
Since the selected data will typically be much smaller than the overall
data set, you will likely be able to circumvent R’s memory restriction.
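
A minimal sketch of this pattern follows. It assumes the DBI and RMySQL
packages are installed and a MySQL server is reachable. The database
name, credentials, table, and column names are all hypothetical
placeholders, and the function name fetch_selected() is invented for the
example.

```r
# Sketch only: needs the DBI and RMySQL packages and a running MySQL
# server. Every connection detail and name below is hypothetical.
fetch_selected <- function() {
  con <- DBI::dbConnect(RMySQL::MySQL(),
                        dbname = "survey", host = "localhost",
                        user = "ruser", password = "secret")
  on.exit(DBI::dbDisconnect(con))
  # The database selects the variables and cases; only the result set
  # is brought into R's memory.
  DBI::dbGetQuery(con,
    "SELECT age, income FROM respondents WHERE region = 'West'")
}
```

The query result comes back as an ordinary data frame, so the rest of
the analysis proceeds as usual.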
Another useful package is biglm, which does regression and generalized
linear-model analysis on very large data sets. It also uses chunking but in a
different manner: Each chunk is used to update the running totals of sums
needed for the regression analysis and then discarded.
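
The update-and-discard idea can be illustrated in base R by accumulating
the cross-product sums X'X and X'y. This is a sketch of the general
principle only, not of biglm's internals, which use a more numerically
careful incremental QR update. The simulated chunks and all variable
names are assumptions made for the example.

```r
set.seed(2)
p <- 3                          # intercept plus two predictors
xtx <- matrix(0, p, p)          # running total of X'X
xty <- numeric(p)               # running total of X'y

for (i in 1:20) {
  # Stand-in for reading the next chunk of 500 rows from disk.
  x1 <- rnorm(500)
  x2 <- rnorm(500)
  y <- 1 + 2 * x1 - x2 + rnorm(500)
  X <- cbind(1, x1, x2)
  xtx <- xtx + crossprod(X)            # update the running sums
  xty <- xty + drop(crossprod(X, y))
  rm(x1, x2, y, X)                     # discard the chunk
}

# Solve the normal equations using only the accumulated sums.
beta_hat <- solve(xtx, xty)
```

Memory use depends on the number of variables, not on the number of
rows, since only the p-by-p and length-p totals persist between chunks.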
Finally, some packages do their own storage management independently
of R and thus can deal with very large data sets. The two most commonly
used today are ff and bigmemory. The former sidesteps memory constraints
by storing data on disk instead of in memory, essentially transparently
to the programmer. The highly versatile bigmemory package does the same,
but it can store data not only on disk but also in the machine’s main
memory, which is ideal for multicore machines.
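
As a small, hedged illustration of disk-backed storage, the sketch below
creates a file-backed matrix with bigmemory. It runs only if that
package is installed. The matrix dimensions and the backing file names
are arbitrary choices for the example.

```r
if (requireNamespace("bigmemory", quietly = TRUE)) {
  # File-backed matrix: the data live in a file under tempdir(), not in
  # R's own memory.
  x <- bigmemory::filebacked.big.matrix(
    nrow = 1000, ncol = 3, type = "double",
    backingfile = "demo.bin", descriptorfile = "demo.desc",
    backingpath = tempdir())
  # Ordinary matrix-style indexing works for reading and writing.
  for (j in 1:3) x[, j] <- rnorm(1000)
  # x[, ] copies the values into a regular R matrix, which is fine for
  # a matrix this small.
  col_means <- colMeans(x[, ])
}
```

For a matrix too large for memory, one would index a block of rows at a
time instead of extracting everything with x[, ].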


Performance Enhancement: Speed and Memory