Machine Learning: Chapter 8 - Nonlinearity + Activation Functions

If you’ve been reading other ML sources, you’ve probably come across the concept of “activation functions”. While these resources typically explain the effect these functions have on the output of neurons, in my experience they don’t do a good job of explaining why we use them in the first place. The goal of this post is to explain what activation functions are doing for our models.

In chapter seven we started working with something we might actually consider a neural network. Let’s see how activation functions apply to that existing example, starting with a close-up look at one neuron in the network.

[Figure: a close-up of a single neuron, with its incoming weighted edges, its bias, and its activation function]


We have a single neuron with incoming edges, each carrying the previous node’s value times the weight of the edge. In addition, we have the node’s bias, which doesn’t have any inputs but still contributes to the value. We sum all of these up to get an intermediate value for the node. This is where the activation function comes in: we pipe the intermediate value through the activation function to arrive at the final output value that gets sent along to the next layer of nodes.

Strictly speaking, we were already using an activation function before; it was just the linear activation function, f(x) = x. But there are other options!
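To make that concrete, here’s a minimal Python sketch of one neuron’s forward pass. The function names and the specific inputs, weights, and bias are purely illustrative, not values from the earlier chapters:

```python
def neuron_output(inputs, weights, bias, activation):
    # Each incoming edge contributes (previous node's value * edge weight).
    intermediate = sum(x * w for x, w in zip(inputs, weights)) + bias
    # The activation function maps the intermediate value to the final output.
    return activation(intermediate)

# With the linear activation f(x) = x, the output is just the weighted sum plus the bias.
identity = lambda x: x
print(neuron_output([1.0, 2.0], [0.5, -1.0], 0.25, identity))  # 0.5*1 + (-1)*2 + 0.25 = -1.25
```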

One such non-linear activation function is called ReLU (which stands for Rectified Linear Unit). Essentially, ReLU passes any positive neuron output through unchanged and outputs zero otherwise. Kind of goofy, but it’s one of the more effective activation functions for deep neural networks.

Side Note: I’ve looked around for a more in-depth explanation of why this is the case but haven’t yet found a really satisfying answer. Two good additional resources are this snippet from 3Blue1Brown and this blog post, which talks about different activation functions and the circumstances under which you might want to use them.

To be more explicit: if the sum of all the weighted inputs plus the bias is -55, our neuron will output 0; but if it’s 39, it will output 39. It’s that simple.
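In code, ReLU is a one-liner. Here’s a quick sketch:

```python
def relu(x):
    # Pass positive values through unchanged; everything else becomes zero.
    return max(0, x)

print(relu(-55))  # 0
print(relu(39))   # 39
```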

So why might we want to include these activation functions in our model? Well, let’s take another look at our equation from chapter seven. The keen-eyed among you might have noticed that you could combine terms across the two hidden nodes. Essentially, even though we were adding a lot of complexity to our model, we weren’t fundamentally improving its flexibility: the same model could be represented by the simple linear model shown in chapter 4. Let’s see that for ourselves.

Here’s an equation that represents the model from the previous blog post, with some intentionally nice numbers:

$$ 2 * (0.5x_E + 3x_H + 4x_G + 0.5) + 5 * (4x_E + x_H + 2x_G + 1) + 5 $$

$$ = x_E + 6x_H + 8x_G + 1 + 20x_E + 5x_H + 10x_G + 5 + 5 $$

$$ = 21x_E + 11x_H + 18x_G + 11 $$

As you can see, the initially complicated expression winds up being equivalent to a linear combination of the input predictors.
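If you’d rather check this numerically than algebraically, here’s a small sketch that evaluates both forms on a couple of arbitrary, made-up inputs. The weights are the “nice numbers” from the equation above; the function names are mine:

```python
def original_model(x_e, x_h, x_g):
    h1 = 0.5 * x_e + 3 * x_h + 4 * x_g + 0.5   # first hidden node
    h2 = 4 * x_e + x_h + 2 * x_g + 1           # second hidden node
    return 2 * h1 + 5 * h2 + 5                 # output node

def collapsed_model(x_e, x_h, x_g):
    # The single linear combination we derived above.
    return 21 * x_e + 11 * x_h + 18 * x_g + 11

for x_e, x_h, x_g in [(1, 2, 3), (-4, 0.5, 7)]:
    assert original_model(x_e, x_h, x_g) == collapsed_model(x_e, x_h, x_g)
```

The assertions pass no matter which inputs you pick, which is exactly the problem: the hidden layer isn’t buying us anything.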

If we wrap the outputs of our hidden nodes in a non-linear activation function, though, we can no longer combine terms like that. That means we’re actually extending the model’s flexibility: it can now express non-linear relationships between the predictors and the prediction.
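Here’s the same sketch with ReLU wrapped around the hidden nodes (again, purely illustrative weights and function names). Once a hidden node’s sum goes negative, the term combination breaks down:

```python
def relu(x):
    return max(0, x)

def relu_model(x_e, x_h, x_g):
    # Same weights as before, but each hidden node's output passes through ReLU.
    h1 = relu(0.5 * x_e + 3 * x_h + 4 * x_g + 0.5)
    h2 = relu(4 * x_e + x_h + 2 * x_g + 1)
    return 2 * h1 + 5 * h2 + 5

def collapsed_model(x_e, x_h, x_g):
    # The linear combination we derived above.
    return 21 * x_e + 11 * x_h + 18 * x_g + 11

# The second hidden node's sum is negative here, so ReLU clamps it to zero
# and the two models no longer agree.
print(relu_model(-4, 0.5, 7))       # 61.0
print(collapsed_model(-4, 0.5, 7))  # 58.5
```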

So this is ultimately why activation functions were introduced. Without them, even the most complicated neural networks could be equivalently expressed with our simple linear model.

In upcoming blog posts, I intend to revisit backpropagation now that we’ve added activation functions and hidden layers into the mix.