Machine Learning: Chapter 5 - The Math of Gradient Descent

In the last blog post, we extended the simple linear model to include more predictors. At the same time, we moved to thinking about the model as a neural network with nodes and weights as opposed to a conventional linear model.

In this post, we’re going to get a bit more into the nitty-gritty of the math that’s going on under the hood. I’ll be the first to acknowledge that when I see a lot of equations my eyes start to glaze over, but I’d recommend spending the time to understand the equations below and what they mean.

Let’s start off in familiar territory, though. Here’s the model that we had at the end of the last post:

To recap, every time you hit the step button, this model is calculating the gradient, and taking a step in that direction. The size of the step is configured by the step slider.

Let’s take another look at the model in equation form:

\[ \hat{y} = w_0x_0 + w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 + w_5x_5 + b \]

and our loss function:

\[ \sum_i (y_i - \hat{y}_i)^2 \]

One important part of this that isn’t depicted in the model above is the data. The gradient is calculated from the data, so it’s important to remember that it’s playing a role in this process as well.

One thing that was a bit hard for me to wrap my head around was what’s variable and what’s constant.

Typically, when you’re thinking about linear equations, the slope, or more appropriately the weights, are thought of as constants and the inputs to the equation are variable.

Finding the gradient kind of flips that around. Now it’s the data that’s a series of constants, and the weights are variables that we’re tweaking. Specifically, we’re tweaking them to minimize the loss.

Calculating the Gradient

Let’s work through an example of what’s happening when we’re looking at a single data point while calculating the gradient.

As described before, the gradient is a nudge to the different weights. When the model was a function of a single variable, it was an arrow that pointed in one of two directions. We’ve given up drawing it since it’s an arrow in 7 dimensions, but I still find that to be a useful metaphor for thinking about the gradient, even if I can’t actually picture it in my head.

For our purposes, though, we’ll work with it like any other vector in 2- or 3-dimensional space that you might be more familiar with.

\[ \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_5 \\ b \end{bmatrix} \]

So for each one of those values we need to calculate the partial derivative of the loss function with respect to that element. In other words: what is the rate of change of the loss function in response to a change in the corresponding parameter?

That might sound a bit intimidating, but the math is actually fairly straightforward. We’ll be using a couple of rules from differential calculus, but I’ll make sure to point them out when I’m using them.

Let’s start with the first element in the gradient which is the partial derivative of the loss with respect to the weight of the espresso input:

\[ \frac{\partial l}{\partial w_0} \]

This is the data point we’ll be using to calculate the gradient:

\[ \left( \begin{matrix} \textrm{Espressos:} \\ \textrm{Hours of sleep:} \\ \textrm{…} \\ \textrm{Fat (grams):} \\ \textrm{Carbs (grams):} \\ \textrm{Protein (grams):} \end{matrix} \begin{bmatrix} 4 \\ 5 \\ 1 \\ 12.7 \\ 39.5 \\ 26.6 \end{bmatrix}, \textrm{Productivity: } 7 \right) \]

Fig 2: the anatomy of a sample; the array of input predictors is on the left and the observed result is on the right.

This is one example tuple with the values of the predictors as the first element and the observed result as the second. I’ve been explicit here about what the values mean, but for the sake of brevity, elsewhere in the post I’ll show samples as:

\[ \left( \begin{bmatrix} 4 \\ 5 \\ 1 \\ 12.7 \\ 39.5 \\ 26.6 \end{bmatrix}, 7 \right) \]

Fig 3: the terse format of the sample.

So recall, for a single data point, our loss function is:

\[ l = (\hat{y} - y)^2 \]

And the loss function is dependent on \(w_0\) (the espresso weight) by way of our linear model:

\[ \hat{y} = w_0x_0 + w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 + w_5x_5 + b \]

Now, we could substitute the full linear equation into the loss function and expand it out to calculate the derivative but that would be tedious and error-prone. Luckily, we can use the chain rule from calculus to make our lives much simpler.

The chain rule breaks our problem down into:

\[ \frac{\partial l}{\partial w_0} = \frac{\partial l}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial w_0} \]

So first we need to find the partial derivative of the loss with respect to the prediction. For the sake of thoroughness, let’s expand the quadratic out:

\[ l = \hat{y}^2 - 2\hat{y}y + y^2 \]

The final term doesn’t include \(\hat{y}\), so it drops out. Applying the power rule to the first and second terms, we get:

\[ \frac{\partial l}{\partial \hat{y}} = 2\hat{y} - 2y = 2(\hat{y} - y) \]

Now we take the derivative of the model function with respect to \(w_0\). Only one term includes \(w_0\), so it is just:

\[ \frac{\partial \hat{y}}{\partial w_0} = x_0 \]

Finally, for the bias: it’s just \(b\) by itself in the model, so the derivative of the prediction with respect to the bias is:

\[ \frac{\partial \hat{y}}{\partial b} = 1 \]

Putting it all together with the chain rule, and noting that the same steps work for every weight, the gradient is:

\[ \nabla l = \begin{bmatrix} \frac{\partial l}{\partial w_0} \\[3mm] \frac{\partial l}{\partial w_1} \\[3mm] \frac{\partial l}{\partial w_2} \\[3mm] \frac{\partial l}{\partial w_3} \\[3mm] \frac{\partial l}{\partial w_4} \\[3mm] \frac{\partial l}{\partial w_5} \\[3mm] \frac{\partial l}{\partial b} \end{bmatrix} = \begin{bmatrix} 2(\hat{y} - y) \times x_0 \\[3mm] 2(\hat{y} - y) \times x_1 \\[3mm] 2(\hat{y} - y) \times x_2 \\[3mm] 2(\hat{y} - y) \times x_3 \\[3mm] 2(\hat{y} - y) \times x_4 \\[3mm] 2(\hat{y} - y) \times x_5 \\[3mm] 2(\hat{y} - y) \times 1 \end{bmatrix} \]

Now all we need to do is substitute the values from our data into the equations to get the literal numbers. Let’s assume that all the weights and the bias start at 0. \(\hat{y}\) is the prediction that our model makes for the productivity; it will be 0 since all of the weights and the bias are zeroed out. \(y\) is the ground truth, and finally \(x_i\) is the corresponding value in the array of predictors from above for that sample. So, taking our sample from above:

\[ (x, y) = \left( \begin{bmatrix} 4 \\ 5 \\ 1 \\ 12.7 \\ 39.5 \\ 26.6 \end{bmatrix}, 7 \right) \]

\[ \hat{y} = 0 \]

We get:

\[ \nabla l = \begin{bmatrix} 2(\hat{y} - y) \times x_0 \\[3mm] 2(\hat{y} - y) \times x_1 \\[3mm] 2(\hat{y} - y) \times x_2 \\[3mm] 2(\hat{y} - y) \times x_3 \\[3mm] 2(\hat{y} - y) \times x_4 \\[3mm] 2(\hat{y} - y) \times x_5 \\[3mm] 2(\hat{y} - y) \times 1 \end{bmatrix} = \begin{bmatrix} -14 \times 4 \\[3mm] -14 \times 5 \\[3mm] -14 \times 1 \\[3mm] -14 \times 12.7 \\[3mm] -14 \times 39.5 \\[3mm] -14 \times 26.6 \\[3mm] -14 \end{bmatrix} = \begin{bmatrix} -56 \\[3mm] -70 \\[3mm] -14 \\[3mm] -177.8 \\[3mm] -553 \\[3mm] -372.4 \\[3mm] -14 \end{bmatrix} \]

So there we have it. That’s the gradient! For one of the examples... Now we need to calculate this for all the examples we have and average them together to get the true gradient.

An important reminder here is that the gradient points in the direction of steepest ascent, but we want to do gradient descent, so we need to remember to subtract these values from the weights. Otherwise we’d push the weights infinitely far in the wrong direction, generating ever worse models.

As you can see, these numbers are pretty large relative to the expected output of 7. If you recall from the earlier examples, step size plays a pretty important role in whether the model converges towards a good solution. If you were to apply this gradient directly to the model, you’d find that it would shoot way past the optimal loss.

So at this point we have two options:

  1. we can scale the gradient down by a large factor, or

  2. we can normalize the vector and then scale it by the step size.

Depending on the problem, you may want to do one or the other. Generally, the first option is good if you want the gradient descent to take into account how steep the slope is and take smaller steps if the slope seems to be getting less steep. The second one is good if you want to take consistently sized steps the whole way along.

To recap:

  • The gradient is an n-dimensional vector where n is the number of tunable parameters in the model (7 in our case: 6 weights and 1 bias).

  • The components of the gradient are the partial derivatives of the loss function w.r.t. the corresponding model parameters.

  • We can use the chain rule to decompose each partial derivative into the derivative of the loss function with respect to the prediction times the derivative of the prediction with respect to the parameter, which simplifies the derivatives we need to calculate.

  • We average all of the results from each of the samples to get the final gradient.

  • Once we have the gradient, we use the desired step size to scale the gradient. This can be done with or without normalizing the gradient first.

  • Finally, we subtract the components of the gradient from the corresponding model parameters to get the new value of each parameter, at which point we can repeat the process (see the sketch after this list).

I have to give a lot of credit to 3blue1brown, an awesome YouTube channel with lots of surprising and wonderful mathematics. His video about the calculus of backpropagation (and the whole neural networks series) is an invaluable resource for anyone wishing to understand neural networks more deeply.