6.3 EXTENDING LINEAR MODELS
Gradient descent exploits information given by the derivative of the function
that is to be minimized—in this case, the error function. As an example, con-
sider a hypothetical error function that happens to be identical to x^2 + 1, shown
in Figure 6.12. The X-axis represents a hypothetical parameter that is to be opti-
mized. The derivative of x^2 + 1 is simply 2x. The crucial observation is that,
based on the derivative, we can figure out the slope of the function at any par-
ticular point. If the derivative is negative the function slopes downward to the
right; if it is positive, it slopes downward to the left; and the size of the deriva-
tive determines how steep the decline is. Gradient descent is an iterative
optimization procedure that uses this information to adjust a function’s
parameters. It takes the value of the derivative, multiplies it by a small constant
called the learning rate, and subtracts the result from the current parameter
value. This is repeated for the new parameter value, and so on, until a minimum
is reached.
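The update rule described above can be sketched as a short loop. This is a minimal illustration, not the book's code; the function name, the tolerance-based stopping criterion, and the default values are assumptions made for the example.

```python
# A minimal sketch of gradient descent for a single parameter.
# The caller supplies the derivative of the function to be minimized.
def gradient_descent(derivative, x, learning_rate=0.1, tolerance=1e-6):
    while True:
        # Step size: derivative at the current point times the learning rate.
        step = learning_rate * derivative(x)
        x -= step  # move downhill
        # Stop once the change in parameter value becomes too small.
        if abs(step) < tolerance:
            return x

# Minimize x^2 + 1, whose derivative is 2x, starting from x = 4.
minimum = gradient_descent(lambda x: 2 * x, 4.0)
```

Starting from x = 4, the returned value is very close to 0, the location of the minimum.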
Returning to the example, assume that the learning rate is set to 0.1 and the
current parameter value x is 4. The derivative at this point is twice this value, namely 8.
Multiplying by the learning rate yields 0.8, and subtracting this from 4 gives 3.2,
which becomes the new parameter value. Repeating the process for 3.2, we get
2.56, then 2.048, and so on. The little crosses in Figure 6.12 show the values
encountered in this process. The process stops once the change in parameter
value becomes too small. In the example this happens when the value
approaches 0, the value corresponding to the location on the X-axis where the
minimum of the hypothetical error function is located.
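The sequence of values in the worked example can be traced with a few lines of code; this is just an illustrative sketch of the arithmetic above.

```python
# Trace the example: learning rate 0.1, derivative 2x, starting at x = 4.
x = 4.0
values = []
for _ in range(4):
    x = x - 0.1 * (2 * x)  # subtract learning_rate * derivative
    values.append(x)
# values is now approximately [3.2, 2.56, 2.048, 1.6384]
```

Each iteration shrinks x by a factor of 0.8, so the sequence converges geometrically toward the minimum at x = 0.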
Figure 6.12 Gradient descent using the error function x^2 + 1.