of the training corpus.^18 In the latter case, the top words with
respect to their frequency correspond to function words. As we
descend the ranked list, we encounter more and more nouns,
verbs, and adjectives (possibly related to thematic choices).
One disadvantage of lexical features is that they fail to capture
any similarity in cases of noisy word forms (probably the result
of errors in language use). For example, “stylometric” and
“stilometric” are considered two different words. Another
shortcoming is that in some languages, mostly East Asian ones,
it is not easy to define what a word is.
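The frequency ranking and the noise problem described above can be illustrated with a minimal sketch. It assumes a deliberately simple tokenizer (lowercased runs of letters and apostrophes) and a made-up sample text; the function name word_frequencies and the sample are hypothetical and are not part of any method discussed in this article.

```python
from collections import Counter
import re


def word_frequencies(text):
    """Return a Counter mapping each lowercase word token to its count."""
    # Assumption: a "word" is a maximal run of letters or apostrophes.
    # This rule is only reasonable for languages that separate words
    # with spaces or punctuation.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Hypothetical sample text, invented for illustration only.
sample = (
    "The study of style is the study of the author. "
    "A stylometric test and a stilometric test are the same test."
)

freq = word_frequencies(sample)

# Rank the words by descending frequency.
for word, count in freq.most_common(5):
    print(word, count)

# The misspelled form is counted as a separate, unrelated feature.
print(freq["stylometric"], freq["stilometric"])
```

In this toy sample the most frequent word is the function word "the", while "stylometric" and "stilometric" each receive their own count of one, so a purely lexical representation records no similarity between them.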
Nowadays, character n-grams provide a standard approach to
representing texts. Each text is treated as a mere sequence of
characters. Then, all the overlapping sequences of n consecutive
characters are extracted. For example, the character 3-grams of
the beginning of this sentence would be “For,” “or_,” “r_e,”
“_ex,” etc. (with “_” denoting a space). Character n-gram features have several important
advantages: simplicity of measurement; language independence;
tolerance to noise (“stylometric” and “stilometric” have many
(^18) J.F. Burrows, Not Unless You Ask Nicely: The Interpretative Nexus
Between Analysis and Information, 7 LITERARY & LINGUISTIC COMPUTING
91, 91–109 (1992).
Figure 1: An example of an online article and the extracted main text.