Understanding Machine Learning: From Theory to Algorithms

(Jeff_L) #1
1.6 Notation 27


  1. Chapter 30.

  2. Chapters 12, 13.

  3. Chapter 14.

  4. Chapter 8.

  5. Chapter 17.

  6. Chapter 29.

  7. Chapter 19.

  8. Chapter 20.

  9. Chapter 21.


1.6 Notation


Most of the notation we use throughout the book is either standard or defined
on the spot. In this section we describe our main conventions and provide a
table summarizing our notation (Table 1.1). The reader is encouraged to skip
this section and return to it if during the reading of the book some notation is
unclear.
We denote scalars and abstract objects with lowercase letters (e.g.xandλ).
Often, we would like to emphasize that some object is a vector and then we
use boldface letters (e.g.xandλ). Theith element of a vectorxis denoted
byxi. We use uppercase letters to denote matrices, sets, and sequences. The
meaning should be clear from the context. As we will see momentarily, the input
of a learning algorithm is a sequence of training examples. We denote byzan
abstract example and byS=z 1 ,...,zma sequence ofmexamples. Historically,
Sis often referred to as a trainingset; however, we will always assume thatSis
asequencerather than a set. A sequence ofmvectors is denoted byx 1 ,...,xm.
Theith element ofxtis denoted byxt,i.
Throughout the book, we make use of basic notions from probability. We
denote byDa distribution over some set,^2 for example,Z. We use the notation
z∼ Dto denote thatzis sampled according toD. Given a random variable
f:Z→R, its expected value is denoted byEz∼D[f(z)]. We sometimes use the
shorthandE[f] when the dependence onzis clear from the context. Forf:Z→
{true,false}we also usePz∼D[f(z)] to denoteD({z:f(z) =true}). In the
next chapter we will also introduce the notationDmto denote the probability
overZminduced by sampling (z 1 ,...,zm) where each pointziis sampled from
Dindependently of the other points.
In general, we have made an effort to avoid asymptotic notation. However, we
occasionally use it to clarify the main results. In particular, givenf:R→R+
andg:R→R+we writef=O(g) if there existx 0 ,α∈R+such that for all
x > x 0 we havef(x)≤αg(x). We writef=o(g) if for everyα >0 there exists

(^2) To be mathematically precise,Dshould be defined over someσ-algebra of subsets ofZ.
The user who is not familiar with measure theory can skip the few footnotes and remarks
regarding more formal measurability definitions and assumptions.

Free download pdf