that Joe and Roy are distinct people, and the method cannot
clearly recognize the difference between Joe’s and Roy’s
documents. We will never know which explanation is correct
because a ground-truth dataset was not used. If a ground-truth
dataset had been used, that is, if known authors had been
attached to one or more screennames before validation testing
began, the accuracy of the method could have been legitimately
tested.
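To make the point concrete, the following is a minimal sketch, in Python, of what testing against ground truth amounts to: each document carries a verified author, and the method's answers are scored against those known authors. The function name, the document labels, and the sample predictions are hypothetical illustrations and are not drawn from any study discussed here.

```python
def attribution_accuracy(ground_truth, predictions):
    """Return the fraction of documents attributed to their verified author.

    ground_truth: dict mapping document id -> verified (known) author
    predictions:  dict mapping document id -> author named by the method
    """
    if not ground_truth:
        raise ValueError("ground-truth dataset is empty")
    # A document counts as correct only if the method names the verified author;
    # a document the method did not attribute at all counts as an error.
    correct = sum(
        1
        for doc_id, author in ground_truth.items()
        if predictions.get(doc_id) == author
    )
    return correct / len(ground_truth)


# Hypothetical ground truth: four documents whose authorship was verified
# before any testing began.
ground_truth = {"doc1": "Joe", "doc2": "Joe", "doc3": "Roy", "doc4": "Roy"}

# Hypothetical output of the attribution method under test.
predictions = {"doc1": "Joe", "doc2": "Roy", "doc3": "Roy", "doc4": "Roy"}

accuracy = attribution_accuracy(ground_truth, predictions)
print(f"accuracy: {accuracy:.2f}")
```

Without the verified authors in `ground_truth`, there is nothing to score the predictions against, and an error by the method cannot be distinguished from a mistaken assumption about who wrote the documents.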
Ground-truth data must be verified. Scraping data from the
web is a fast way of collecting a lot of data, but the data are not
at all easily verifiable. Koppel and his colleagues harvested a
dataset of blog posts from approximately 19,000 bloggers, which
is available for research.^110 The bloggers are identified by a
numerical identifier, gender, age, industry, and zodiacal sign.
As with any data collected from the web, there is an assumption
that the screenname belongs to one person at the keyboard, but
this assumption is not trustworthy, since most web-based author
identification disputes turn on the fact that screennames are
not reliable indicators of textual ownership. Further, age and
gender can be falsely reported and are typically not verified in
any way for blog postings, or even for blog ownership.
B. Forensically Feasible Data
Traditional literary and recent computer-science-based
stylometry have focused on literary texts, religious texts, and
scholarly publications in science for electronic librarianship. All
of these text types contain edited, rhetorically sophisticated, and
highly stylized or formulaic language. These texts are also
typically long, with tens of thousands of words.
In fact, the success of a technique on tens of thousands of
words is no guarantee that it will work on a few thousand (or
a few hundred) words in an actual case of forensic author
identification. Even computer tools for part-of-speech tagging
that have been built on traditional “novels and newspaper”
(^110) Jonathan Schler et al., Effects of Age and Gender on Blogging, AAAI
SPRING SYMPOSIUM: COMPUTATIONAL APPROACHES TO ANALYZING WEBLOGS
(2006).