Communities may then be examined for differences in characteristics to give insights.
For example, if we find that stocks in more connected communities tend to be more
volatile, then we may want to limit the number of stocks chosen from these communities
in a portfolio.
2.4 Metrics
Developing analytics without metrics is insufficient. It is important to build measures
that examine whether the analytics are generating classifications that are statistically
significant, economically useful, and stable. For an analytic to be statistically valid,
it should meet some criterion that signifies classification accuracy and power. Being
economically useful sets a different bar: does it make money? And stability is a
double-edged quality: one, does it perform well in sample and out of sample? And, two,
is the behavior of the algorithm stable across training corpora?
Here, we explore some of the metrics that have been developed and propose others.
No doubt, as the range of analytics grows, so will the range of metrics.
2.4.1 Confusion matrix
The confusion matrix is the classic tool for assessing classification accuracy. Given $n$
categories, the matrix is of dimension $n \times n$. The rows relate to the category assigned by
the analytic algorithm and the columns refer to the correct category in which the text
resides. Each cell $(i, j)$ of the matrix contains the number of text messages that were of
type $j$ and were classified as type $i$. The cells on the diagonal of the confusion matrix state
the number of times the algorithm got the classification right. All other cells are
instances of classification error. If an algorithm has no classification ability, then the
rows and columns of the matrix will be independent of each other. Under this null
hypothesis, the statistic that is examined for rejection is as follows:
\[
\chi^2\left[\mathrm{dof} = (n-1)^2\right] = \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{\left[A(i,j) - E(i,j)\right]^2}{E(i,j)}
\]
where $A(i, j)$ are the actual numbers observed in the confusion matrix, and $E(i, j)$ are the
expected numbers, assuming no classification ability under the null. If $T(i)$ represents
the total across row $i$ of the confusion matrix, and $T(j)$ the column total, then
\[
E(i,j) = \frac{T(i)\,T(j)}{\sum_{i=1}^{n} T(i)} = \frac{T(i)\,T(j)}{\sum_{j=1}^{n} T(j)}
\]
The degrees of freedom of the $\chi^2$ statistic is $(n-1)^2$. This statistic is very easy to
implement and may be applied to models for any $n$. A highly significant statistic is
evidence of classification ability.
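
As an illustration, the confusion matrix and its $\chi^2$ statistic might be computed along the following lines. This is a minimal sketch and not the implementation used in the text: the function names `confusion_matrix` and `chi_square_statistic`, the integer coding of categories as 0 to n-1, and the choice to skip cells whose expected count is zero are all assumptions made here for the example. The toy matrix at the end is invented purely to exercise the code.

```python
from collections import Counter


def confusion_matrix(assigned, correct, n):
    """Build an n x n matrix of counts.

    Rows index the category assigned by the algorithm and columns index
    the correct category, so cell (i, j) counts messages of type j that
    were classified as type i. Categories are assumed to be coded 0..n-1.
    """
    counts = Counter(zip(assigned, correct))
    return [[counts[(i, j)] for j in range(n)] for i in range(n)]


def chi_square_statistic(A):
    """Return (statistic, degrees of freedom) for a square matrix A.

    E(i, j) = T(i) * T(j) / total, where T(i) is the total of row i,
    T(j) is the total of column j, and total is the sum of all cells.
    """
    n = len(A)
    row_totals = [sum(row) for row in A]
    col_totals = [sum(A[i][j] for i in range(n)) for j in range(n)]
    total = sum(row_totals)

    stat = 0.0
    for i in range(n):
        for j in range(n):
            expected = row_totals[i] * col_totals[j] / total
            # Assumption: a cell with zero expected count (an empty row
            # or column) contributes nothing rather than dividing by zero.
            if expected > 0:
                stat += (A[i][j] - expected) ** 2 / expected
    return stat, (n - 1) ** 2


# Hypothetical three-category example with most mass on the diagonal.
toy = [[30, 5, 5],
       [5, 30, 5],
       [5, 5, 30]]
stat, dof = chi_square_statistic(toy)
print(f"chi-square = {stat:.2f} with {dof} degrees of freedom")
```

For this toy matrix the statistic is about 93.75 with 4 degrees of freedom, well above the 5% critical value of roughly 9.49, so the null of no classification ability would be rejected. A matrix with identical rows would give a statistic of zero.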
2.4.2 Accuracy
Algorithm accuracy over a classification scheme is the percentage of text that is correctly