strong enough to differentiate authors from each other and
cluster documents by author.^49
Finally, due to the brevity of the texts, a realistic forensic
author identification method needs a way of measuring the texts
to get as much information as possible out of them. Counting
syntactic structures rather than words yields higher counts per
category and makes statistical analysis possible. If a method counts
only words, the result is a long list of words whose frequencies are
mostly one, plus a few function words like [the, a, of, with] with
slightly higher frequencies. But if the syntactic structures are
counted, all the nouns in a sentence contribute to the noun
category, all the determiners to the determiner category, and so
forth. Likewise, subcategorizing the noun phrases into
marked and unmarked types divides the frequency counts
into two separate measures: the marked and the unmarked
frequency of each syntactic category. The marked and unmarked
subcategorization is a way to compare different authors’ patterns
of use for what is salient on the one hand (as marked patterns
are salient by definition) but hard to imitate on the other (as
syntactic structures are fragile in memory).
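The contrast between word counts and category counts can be illustrated with a minimal sketch. It assumes a short text that has already been tagged by hand; the sentence, the category labels (DET, NOUN, and so on), the markedness flags, and the function names are all hypothetical placeholders chosen for illustration and are not drawn from the method or data described here.

```python
from collections import Counter

# Hand-tagged toy text: (word, category, is_marked).
# The categories and the marked/unmarked flags are arbitrary
# placeholders for illustration, not the output of a real parser.
TAGGED = [
    ("the", "DET", False),
    ("angry", "ADJ", False),
    ("neighbor", "NOUN", False),
    ("left", "VERB", False),
    ("a", "DET", False),
    ("note", "NOUN", False),
    ("on", "PREP", False),
    ("my", "DET", True),
    ("door", "NOUN", True),
]


def word_counts(tagged):
    """Count each distinct word form (case-insensitive)."""
    return Counter(word.lower() for word, _, _ in tagged)


def category_counts(tagged):
    """Count tokens per syntactic category, pooling all words in it."""
    return Counter(category for _, category, _ in tagged)


def marked_unmarked_counts(tagged):
    """Split each category count into marked and unmarked measures."""
    return Counter(
        (category, "marked" if marked else "unmarked")
        for _, category, marked in tagged
    )


if __name__ == "__main__":
    print(word_counts(TAGGED))
    print(category_counts(TAGGED))
    print(marked_unmarked_counts(TAGGED))
```

In this toy text every word occurs once, so the word list offers nothing to analyze statistically, while the pooled categories already reach counts of three for nouns and determiners; the third function splits each of those counts into its marked and unmarked parts.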
B. Ground-Truth Data
The Chaski Writing Sample Database includes ten topics,
listed in Table 1. The database makes cross-genre/register
comparison possible for known authors who are not professional
writers and produce unedited texts. With funding from the U.S.
Department of Justice’s National Institute of Justice, data were
collected from students at a community college and a four-year
college with a student body of both traditional students and
returning adult students. The population provided a wide age
range, both males and females, and several races. Table 2 shows the
demographics of an experiment that contrasted gender and
controlled for race because race is highly correlated with some
American English dialects.
(^49) Chaski, Who Wrote It?, supra note 1, at 17.