ON THE ROBUSTNESS OF AUTHORSHIP
ATTRIBUTION BASED ON CHARACTER
N-GRAM FEATURES
Efstathios Stamatatos*
ABSTRACT
A number of independent authorship attribution studies have
demonstrated the effectiveness of character n-gram features for
representing the stylistic properties of text. However, the vast
majority of these studies examined the simple case where the
training and test corpora are similar in terms of genre, topic,
and distribution of the texts. Hence, it remains doubtful whether
such a simple, low-level representation is equally effective under
realistic conditions where some of these factors cannot be
held stable. In this study, the robustness of
authorship attribution based on character n-gram features is
tested under cross-genre and cross-topic conditions. In addition,
the distribution of texts over the candidate authors is varied between
the training and test corpora to imitate real cases. Comparative
results with another competitive text representation approach
based on very frequent words show that character n-grams are
better able to capture stylistic properties of text when there are
significant differences between the training and test corpora.
Moreover, a set of guidelines to tune an authorship attribution
model according to the properties of training and test corpora is
provided.
* Assistant Professor, University of the Aegean.