The findWithinHorizon methods operate similarly to findInLine except that they take an additional
int parameter that specifies the maximum number of characters to look ahead through. This "horizon" value
is treated as a transparent, non-anchoring bound (see the Matcher class for details, Section 13.3.4 on page
329). A horizon of zero means there is no look-ahead limit.
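As a minimal sketch of how the horizon bounds the search, assuming a string source: the class name HorizonDemo, the helper firstNumber, and the sample text are illustrative only and do not come from the text above.

```java
import java.util.Scanner;

class HorizonDemo {
    // Hypothetical helper: returns the first run of digits found within
    // `horizon` characters of the start of `input`, or null if none is found.
    // A horizon of 0 means the search is unbounded.
    static String firstNumber(String input, int horizon) {
        Scanner in = new Scanner(input);
        return in.findWithinHorizon("\\d+", horizon);
    }

    public static void main(String[] args) {
        String text = "id: 42\nnext line";
        System.out.println(firstNumber(text, 3)); // null: only "id:" is searched
        System.out.println(firstNumber(text, 0)); // 42: no look-ahead limit
    }
}
```

With a horizon of 3 the digits lie beyond the bound, so the search fails and null is returned; with a horizon of zero the whole input is available to the search.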
The skip method can be used to skip over the input that matches a given pattern. As with findInLine and
findWithinHorizon, it ignores the scanner's delimiters when looking for the pattern. The skipped input
is not returned; rather, skip returns the scanner itself so that invocations can be chained together.
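The chaining of skip invocations might look like the following sketch. The input form "name = integer", the class name SkipDemo, and the helper valueOf are assumptions made for illustration.

```java
import java.util.Scanner;

class SkipDemo {
    // Hypothetical helper: extracts the integer from input of the assumed
    // form "name = <int>", such as "  width = 80".
    static int valueOf(String input) {
        Scanner in = new Scanner(input);
        // Each skip must match at the scanner's current position; it ignores
        // the delimiters and returns the scanner, so the calls can be chained.
        return in.skip("\\s*\\w+")     // leading whitespace and the name
                 .skip("\\s*=\\s*")    // the equals sign and surrounding whitespace
                 .nextInt();           // the value itself
    }

    public static void main(String[] args) {
        System.out.println(valueOf("  width = 80")); // 80
    }
}
```

If a pattern does not match at the current position, skip throws NoSuchElementException, so a production version would probably want to handle malformed input explicitly.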
Exercise 22.7: Rewrite readCSVTable so that the number of cells of data expected is passed as an
argument.
Exercise 22.8: As it stands, readCSVTable is both too strict and too lenient on the input format it expects.
It is too strict because an empty line at the end of the input will cause the IOException to be thrown. It is
too lenient because a line of input with more than three commas will not cause an exception. Rectify both of
these problems.
Exercise 22.9: Referring back to the discussion of efficiency of regular expressions on page 329, devise at
least four patterns that will parse a line of comma-separated values. (Hint: In addition to the suggestion on
page 329 also consider the use of greedy versus non-greedy quantifiers.) Write a benchmark program that
compares the efficiency of each pattern, and be sure that you test with both short strings between commas and
very long strings.
22.5.3. Using Scanner
Scanner and StreamTokenizer have some overlap in functionality, but they have quite different
operational models. Scanner is based on regular expressions and so can match tokens based on whatever
regular expressions you devise. However, some seemingly simple tasks can be difficult to express in terms of
regular expression patterns. On the other hand, StreamTokenizer basically processes input a character at
a time and uses the defined character classes to identify words, numbers, whitespace, and ordinary characters.
You can control the character classes to some extent, but you don't have as much flexibility as with regular
expressions. So some things easily expressed with one class are difficult, or at best awkward, to express with
the other. For instance, the built-in ability to handle comment lines is a boon for StreamTokenizer, while
using a scanner on commented text requires explicit, unobvious handling. For example:
    Scanner in = new Scanner(source);
    Pattern COMMENT = Pattern.compile("#.*");
    String comment;
    // ...
    while (in.hasNext()) {
        if (in.hasNext(COMMENT)) {
            comment = in.nextLine();
        }
        else {
            // process other tokens
        }
    }
This mostly works. The intent is that if we find that the next token matches a comment, then we skip the rest
of the line by using nextLine. Note that you can't, for example, use a pattern of "#.*$" to try to match
from the comment character to the end of the line, unless you are guaranteed there are no delimiters in the
comment itself.
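To make the contrast between the two classes concrete, here is a self-contained sketch that strips '#' comments both ways. The class name CommentDemo, the method names, and the sample input are assumptions for illustration; the Scanner loop has the same shape as the fragment above, except that it simply discards the line returned by nextLine.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

class CommentDemo {
    static final Pattern COMMENT = Pattern.compile("#.*");

    // Scanner version: the comment handling is explicit.
    static List<String> scannerTokens(String source) {
        List<String> tokens = new ArrayList<>();
        Scanner in = new Scanner(source);
        while (in.hasNext()) {
            if (in.hasNext(COMMENT)) {
                in.nextLine();          // discard the rest of the current line
            } else {
                tokens.add(in.next());  // process other tokens
            }
        }
        return tokens;
    }

    // StreamTokenizer version: the comment handling is built in.
    static List<String> tokenizerWords(String source) {
        List<String> words = new ArrayList<>();
        StreamTokenizer st = new StreamTokenizer(new StringReader(source));
        st.commentChar('#');            // '#' starts a comment that runs to end of line
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    words.add(st.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        String source = "alpha beta\n# a comment here\ngamma # trailing note\ndelta";
        System.out.println(scannerTokens(source));   // [alpha, beta, gamma, delta]
        System.out.println(tokenizerWords(source));  // [alpha, beta, gamma, delta]
    }
}
```

On this sample input both approaches yield the same four words, but the tokenizer needs only the single commentChar call, whereas the scanner loop must test for the comment token and consume the line itself.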