Pattern Recognition and Machine Learning

Appendix A Data Sets


In this appendix, we give a brief introduction to the data sets used to illustrate some
of the algorithms described in this book. Detailed information on file formats for
these data sets, as well as the data files themselves, can be obtained from the book
web site:


http://research.microsoft.com/~cmbishop/PRML

Handwritten Digits


The digits data used in this book is taken from the MNIST data set (LeCun et al.,
1998), which itself was constructed by modifying a subset of the much larger data
set produced by NIST (the National Institute of Standards and Technology). It
comprises a training set of 60,000 examples and a test set of 10,000 examples. Some
of the data was collected from Census Bureau employees and the rest was collected
from high-school children, and care was taken to ensure that the test examples were
written by different individuals to the training examples.
The original NIST data had binary (black or white) pixels. To create MNIST,
these images were size normalized to fit in a 20 × 20 pixel box while preserving their
aspect ratio. As a consequence of the anti-aliasing used to change the resolution of
the images, the resulting MNIST digits are grey scale. These images were then
centred in a 28 × 28 box. Examples of the MNIST digits are shown in Figure A.1.
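
The preprocessing just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used to build MNIST: the function names (area_resize, mnist_style) are hypothetical, exact area averaging stands in for the unspecified anti-aliasing filter, and "centred" is taken to mean that the resized bounding box sits in the middle of the 28 × 28 canvas.

```python
import numpy as np


def area_resize(img, new_h, new_w):
    """Resize a 2-D array by exact area averaging (a simple anti-aliasing filter)."""
    h, w = img.shape
    # Replicate every pixel into a new_h x new_w block, so the enlarged image
    # has shape (h * new_h, w * new_w) and divides evenly into output cells.
    big = np.kron(img, np.ones((new_h, new_w)))
    # Each output pixel is the mean of one h x w cell of the enlarged image.
    return big.reshape(new_h, h, new_w, w).mean(axis=(1, 3))


def mnist_style(binary_img, box=20, canvas=28):
    """Fit a non-empty binary digit image into a box x box area, keeping its
    aspect ratio, and place the result in the middle of a canvas x canvas image."""
    img = np.asarray(binary_img, dtype=float)

    # Crop to the bounding box of the ink (non-zero pixels).
    rows = np.flatnonzero(img.any(axis=1))
    cols = np.flatnonzero(img.any(axis=0))
    crop = img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

    # Scale so that the longer side becomes `box` pixels.
    h, w = crop.shape
    scale = box / max(h, w)
    new_h = max(1, round(h * scale))
    new_w = max(1, round(w * scale))
    small = area_resize(crop, new_h, new_w)  # averaging yields grey levels

    # Paste into the centre of an empty canvas (assumed centring rule).
    out = np.zeros((canvas, canvas))
    top = (canvas - new_h) // 2
    left = (canvas - new_w) // 2
    out[top:top + new_h, left:left + new_w] = small
    return out
```

Applied to a binary image whose edges do not line up with the new pixel grid, for example mnist_style(np.tri(30)), the output contains values strictly between 0 and 1, which is the grey-scale effect of anti-aliasing noted above.
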
Error rates for classifying the digits range from 12% for a simple linear
classifier, through 0.56% for a carefully designed support vector machine, to 0.4% for a
convolutional neural network (LeCun et al., 1998).
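
To make these percentages concrete, they can be converted into counts of misclassified test digits. The short sketch below assumes the rates are measured on the full 10,000-example test set; the dictionary keys and the helper name misclassified are illustrative only.

```python
TEST_SET_SIZE = 10_000  # size of the MNIST test set

# Error rates quoted in the text, in percent.
error_rates_percent = {
    "linear classifier": 12.0,
    "support vector machine": 0.56,
    "convolutional neural network": 0.4,
}


def misclassified(rate_percent, n=TEST_SET_SIZE):
    """Number of test examples misclassified at the given percentage error rate."""
    # round() guards against floating-point noise such as 56.00000000000001.
    return round(rate_percent / 100 * n)


for name, rate in error_rates_percent.items():
    print(f"{name}: {misclassified(rate)} of {TEST_SET_SIZE} test digits misclassified")
```

Under that assumption the three rates correspond to 1,200, 56 and 40 misclassified digits respectively.
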

