The Art of R Programming

The list consists of one component per word in the file, with a word’s component showing the positions within the file where that word occurs. Sure enough, the word item is shown as occurring at positions 7, 14, and 27. Before looking at the code, let’s talk a bit about our choice of a list structure here. One alternative would be to use a matrix, with one row per word in the text. We could use rownames() to name the rows, with the entries within a row showing the positions of that word. For instance, row item would consist of 7, 14, 27, and then 0s in the remainder of the row. But the matrix approach has a couple of major drawbacks:

There is a problem in terms of the columns to allocate for our matrix.
If the maximum frequency with which a word appears in our text is, say,
10, then we would need 10 columns. But we would not know that ahead
of time. We could add a new column each time we encountered a new
word, using cbind() (in addition to using rbind() to add a row for the
word itself). Or we could write code to do a preliminary run through
the input file to determine the maximum word frequency. Either of
these would come at the expense of increased code complexity and
possibly increased runtime.

Such a storage scheme would be quite wasteful of memory, since most
rows would probably consist of a lot of zeros. In other words, the matrix
would be sparse, a situation that also often occurs in numerical analysis
contexts.
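To make the two drawbacks concrete, here is a minimal sketch of the matrix alternative. The function name findwords_matrix and the sample word vector are hypothetical, not taken from the text, and the sketch assumes the "preliminary pass" route to find the maximum word frequency before allocating columns.

```r
# Hypothetical sketch of the matrix alternative (not the author's code).
# 'txt' is a character vector of words, in file order.
findwords_matrix <- function(txt) {
  # preliminary pass: the maximum word frequency fixes the column count
  maxfreq <- max(table(txt))
  words <- unique(txt)
  m <- matrix(0, nrow = length(words), ncol = maxfreq)
  rownames(m) <- words
  for (w in words) {
    pos <- which(txt == w)        # positions where word w occurs
    m[w, seq_along(pos)] <- pos   # remaining entries in the row stay 0
  }
  m
}

txt <- c("the", "item", "is", "the", "item", "the")
m <- findwords_matrix(txt)
print(m)
# fraction of entries that are zero padding
print(mean(m == 0))
```

Even in this tiny example a third of the entries are padding, and the extra pass over the words is needed only to size the matrix.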

Thus, the list structure really makes sense. Let’s see how to code it.

findwords <- function(tf) {
   # read in the words from the file, into a vector of mode character
   txt <- scan(tf,"")
   wl <- list()
   for (i in 1:length(txt)) {
      wrd <- txt[i]  # ith word in input file
      wl[[wrd]] <- c(wl[[wrd]],i)
   }
   return(wl)
}
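The statement wl[[wrd]] <- c(wl[[wrd]],i) handles a word's first occurrence without any special case. Indexing a list by a name it does not yet contain returns NULL, and c(NULL,i) is simply i. A short sketch of that behavior, using made-up positions for illustration:

```r
wl <- list()
# a name not yet in the list yields NULL rather than an error
print(is.null(wl[["item"]]))
# first occurrence: c(NULL, 7) is just 7, so the component is created
wl[["item"]] <- c(wl[["item"]], 7)
# later occurrences append to the existing position vector
wl[["item"]] <- c(wl[["item"]], 14)
wl[["item"]] <- c(wl[["item"]], 27)
print(wl)
```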

We read in the words of the file (words simply meaning any groups of letters separated by spaces) by calling scan(). The details of reading and writing files are covered in Chapter 10, but the important point here is that txt will now be a vector of strings: one string per instance of a word in the file. Here is what txt looks like after the read:

> txt
 [1] "the"      "here"     "means"    "that"     "the"
 [6] "first"    "item"     "in"       "this"     "line"
[11] "of"       "output"   "is"       "item"     "in"
[16] "this"     "case"     "our"      "output"   "consists"
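To try the read step in isolation, the following sketch writes a small temporary file and scans it. The file contents are an assumption made for illustration (only the first ten words shown above), and quiet = TRUE is added just to suppress the item count that scan() otherwise reports.

```r
# Hypothetical example file; the text's actual input file is not reproduced here.
tf <- tempfile(fileext = ".txt")
writeLines(c("the here means that the", "first item in this line"), tf)

# what = "" tells scan() to read character strings;
# tokens are split on whitespace, including line breaks
txt <- scan(tf, what = "", quiet = TRUE)
print(txt)
print(mode(txt))

unlink(tf)  # remove the temporary file
```

Because scan() splits on any whitespace, the line structure of the file is lost, and a word's position is simply its index in txt.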
