ML: Chapter 3 - Gradient Descent

In this blog post, I’m going to cover gradient descent, which is a fancy term for a fairly straightforward idea once you can visualize it. Calculating the gradient in general requires some involved math which we’re going to avoid for the time being. For this blog post, the goal is to give readers an intuitive and visual sense of gradient descent without getting too bogged down in mathematical notation. But before we get into that, a quick recap from last time. We have our parametric model (linear regression) and a visualization of the loss.

If you recall, we chose our loss function so that it outputs small values when our line fits the data well, so we want to pick the parameters (slope and y-intercept) for our model such that the output of this function is minimized.

In order to simplify things, we’re going to restrict our linear model to just picking the slope and assume that the y-intercept is zero. Now we have a single input number (the slope) and our loss function outputs a single number, so we can plot this relationship. The x coordinate is the slope and the y coordinate is the loss for that slope.
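To make this concrete, here’s a minimal sketch in Python of what “plotting the loss for each slope” looks like. It assumes the mean squared error loss from the previous post and uses a small made-up dataset; the numbers are only for illustration and aren’t the data from the demo.

```python
import numpy as np

# Toy data, roughly following y = 2x (made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

def loss(slope):
    """Mean squared error of the line y = slope * x (y-intercept fixed at 0)."""
    predictions = slope * x
    return np.mean((predictions - y) ** 2)

# Sample the loss at many candidate slopes to trace out the curve described above.
slopes = np.linspace(-1.0, 5.0, 101)
losses = [loss(m) for m in slopes]
```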

Notice that as you move the slope around, the little point slides along the curve.

Ultimately we want to be at the lowest point, which corresponds to the slope that best fits the data. If we’re at a particular point on this graph, we need a way to know which direction to step the slope in so that the loss decreases the most.

Enter the gradient

The gradient is the direction in which a surface is most sharply increasing. Since our goal is to find the minimum value of the loss, we use the negative of the gradient to indicate the most effective direction to walk. This is where the name comes from: we’re using the gradient to descend to the lowest point on the loss function (hopefully!). Importantly, this approach will converge on a local minimum, but it’s possible that there exists a better global minimum that this approach won’t find.

So in this demo we are now drawing the “gradient”, and we have added a “step” button which allows you to take a step in the direction that the gradient is pointing.

[Aside: in this case, since we’re only finding the gradient of a function of one variable, you may be used to calling it the derivative, and you’d be absolutely right. The gradient is a generalization of the derivative that applies to functions of more than one variable.]
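For our slope-only model, the gradient is a single number that says which way (and how sharply) the loss increases as the slope changes. Assuming the mean squared error loss from the sketch above, it can be computed like this:

```python
def gradient(slope):
    """Derivative of the mean squared error loss with respect to the slope."""
    predictions = slope * x
    # d/d(slope) of mean((slope * x - y)^2) is mean(2 * (slope * x - y) * x)
    return np.mean(2.0 * (predictions - y) * x)
```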

As you can see, it only takes a few steps until we’ve found the slope that best fits the data.
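Under the hood, the “step” button is doing something like the following loop: evaluate the gradient at the current slope, then nudge the slope a small amount in the opposite direction. This is only a sketch built on the snippets above, not the demo’s actual code; the starting slope and step size are arbitrary choices.

```python
slope = 0.0        # arbitrary starting guess
step_size = 0.05   # how far to move on each step (more on this below)

for step in range(10):
    slope -= step_size * gradient(slope)   # walk against the gradient
    print(f"step {step}: slope = {slope:.3f}, loss = {loss(slope):.3f}")
```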

Now that we know which direction to step in, the next question is how big a step to take. This might seem like a simple question given our simplified example, but keep in mind that for more complicated machine learning problems it’s not possible to draw the loss at every possible combination of parameters.

So from the point of view of the computer, we really only have the current slope, the loss at that slope, and the gradient at that point.

Given that we can’t see what’s ahead of us, it isn’t clear what step size we should choose. If we pick a very small step size, we’ll get to the bottom, but it might take longer than we’d like to arrive at the solution.

Worse, though, is using too big a step size. If we tell our model to take too large a step, we could overshoot the minimum and wind up with a worse model than before we took the step!

In the example below you can choose the size of the step to take. Try out different step sizes to see the problems with taking too small or too large a step.

Picking the right step size is a hard problem, and the answer depends on the particular problem you’re solving. Some approaches use a fixed step size; others vary the step size depending on how steep the gradient is.
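To see both failure modes side by side, here’s a sketch that runs the same loop with a few fixed step sizes, reusing the toy data and functions from above (the specific values are made up): a tiny step size crawls towards the answer, a moderate one converges quickly, and an overly large one overshoots and diverges.

```python
def descend(step_size, steps=15, start=0.0):
    """Run gradient descent with a fixed step size and return the final slope."""
    slope = start
    for _ in range(steps):
        slope -= step_size * gradient(slope)
    return slope

for step_size in (0.001, 0.05, 0.2):
    final = descend(step_size)
    print(f"step size {step_size}: slope = {final:.3f}, loss = {loss(final):.3f}")
```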

Generally, though, it’s better to err on the side of a small step size.


Hopefully, this was able to demystify the idea of gradient descent a bit. The main things I’d want you to take away are:

  • The gradient of a function at a particular point is the direction in which that function is most steeply increasing.

  • Gradient descent is the process of evaluating the gradient at a particular point and using it to take a step that reduces the loss.

  • The step size you choose affects how quickly the model converges towards the minimum and whether that solution is stable.