The Art of R Programming

The list consists of one component per word in the file, with a word’s
component showing the positions within the file where that word occurs.
Sure enough, the word item is shown as occurring at positions 7, 14, and 27.
Before looking at the code, let’s talk a bit about our choice of a list
structure here. One alternative would be to use a matrix, with one row per word
in the text. We could use rownames() to name the rows, with the entries within
a row showing the positions of that word. For instance, row item would
consist of 7, 14, 27, and then 0s in the remainder of the row. But the matrix
approach has a couple of major drawbacks:


  • There is a problem in terms of the columns to allocate for our matrix.
    If the maximum frequency with which a word appears in our text is, say,
    10, then we would need 10 columns. But we would not know that ahead
    of time. We could add a new column each time we encountered a new
    word, using cbind() (in addition to using rbind() to add a row for the
    word itself). Or we could write code to do a preliminary run through
    the input file to determine the maximum word frequency. Either of
    these would come at the expense of increased code complexity and
    possibly increased runtime.

  • Such a storage scheme would be quite wasteful of memory, since most
    rows would probably consist of a lot of zeros. In other words, the matrix
    would be sparse, a situation that also often occurs in numerical analysis
    contexts.
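
To make the two drawbacks concrete, here is a minimal sketch of the matrix alternative. The toy position data and the names pos, maxfreq, and m are made up for illustration; they are not part of the program discussed here.

```r
# Hypothetical toy data: positions at which three words occur in a text.
pos <- list(item = c(7, 14, 27), the = c(1, 5), first = 6)

# The matrix needs as many columns as the highest word frequency,
# which is only known after every word has been seen.
maxfreq <- max(lengths(pos))

# Pad each word's positions with 0s out to maxfreq, one row per word.
m <- t(sapply(pos, function(p) c(p, rep(0, maxfreq - length(p)))))
m

# Count the cells that hold padding rather than real positions.
sum(m == 0)
```

Even in this tiny case, a third of the cells are padding. The list pos stores exactly the same information with no padding and no need to know maxfreq in advance.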


Thus, the list structure really makes sense. Let’s see how to code it.

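findwords <- function(tf) {
   # read in the words from the file, into a vector of mode character
   txt <- scan(tf,"")
   wl <- list()  # word list: one component per distinct word
   # seq_along() also copes with an empty file, where 1:length(txt) would not
   for (i in seq_along(txt)) {
      wrd <- txt[i]  # ith word in input file
      # append position i to this word's component (NULL if word is new)
      wl[[wrd]] <- c(wl[[wrd]],i)
   }
   return(wl)
}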
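
As a rough usage sketch, the function can be tried on a small temporary file. The file contents below are just the first 20 words of the sample output shown in this section, so the positions differ from those for the full file. The names tf and wl at top level, and the quiet = TRUE argument (which only suppresses the item count that scan() prints), are choices made here for illustration.

```r
findwords <- function(tf) {
   # read in the words from the file, into a vector of mode character
   txt <- scan(tf, "", quiet = TRUE)
   wl <- list()
   for (i in seq_along(txt)) {
      wrd <- txt[i]  # ith word in input file
      # wl[[wrd]] is NULL for a word not seen before, and c(NULL, i) is just i
      wl[[wrd]] <- c(wl[[wrd]], i)
   }
   return(wl)
}

# Write a small sample text to a temporary file (hypothetical test input).
tf <- tempfile()
writeLines(c("the here means that the",
             "first item in this line",
             "of output is item in",
             "this case our output consists"), tf)

wl <- findwords(tf)
wl[["item"]]   # positions of the word "item" in this short sample
wl[["the"]]    # positions of the word "the"
length(wl)     # number of distinct words
unlink(tf)     # remove the temporary file
```

The loop needs no special case for the first occurrence of a word, because indexing a list by a name it does not yet contain returns NULL, and concatenating NULL with i yields i alone.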

We read in the words of the file (words simply meaning any groups of
letters separated by spaces) by calling scan(). The details of reading and writing
files are covered in Chapter 10, but the important point here is that txt will
now be a vector of strings: one string per instance of a word in the file. Here
is what txt looks like after the read:

> txt
[1] "the" "here" "means" "that" "the"
[6] "first" "item" "in" "this" "line"
[11] "of" "output" "is" "item" "in"
[16] "this" "case" "our" "output" "consists"
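
The shape of txt can also be checked without a file. The sketch below assumes the text argument of scan(), which reads from a character string instead of a file; the seven-word input is simply the start of the output above.

```r
# what = "" asks scan() for character strings, split on whitespace;
# quiet = TRUE suppresses the message reporting how many items were read.
txt <- scan(text = "the here means that the first item", what = "",
            quiet = TRUE)

mode(txt)            # the storage mode of the vector
length(txt)          # one element per word occurrence, repeats included
which(txt == "the")  # positions at which "the" occurs
```

Because repeated words stay in the vector as separate elements, the index of each element is exactly the word position that findwords() records.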
