Activation Functions in Neural Networks

The main purpose of activation functions is to convert the input signal of a node in an ANN into an output signal by applying a transformation. That output signal is then used as an input to the next layer in the stack.
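As a rough illustration of the idea above, here is a minimal sketch of a single node. The function name `node_output` and the sample weights, bias, and inputs are made up for this example; they are not from any particular library.

```python
import math

def node_output(inputs, weights, bias, activation):
    """Compute a = g(w . x + b) for a single node."""
    # Weighted sum of the inputs plus the bias (the pre-activation z).
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # The activation function turns z into the node's output signal.
    return activation(z)

# Hypothetical values, chosen only for illustration.
x = [0.5, -1.0]
w = [2.0, 1.0]
b = 0.5

# Here z = 2.0*0.5 + 1.0*(-1.0) + 0.5 = 0.5, so a = tanh(0.5).
a = node_output(x, w, b, math.tanh)
print(a)
```

The value `a` would then be passed on as one of the inputs to the nodes of the next layer.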

| Function | Description |
| --- | --- |
| Linear Activation Function | Inefficient, used in regression |
| Sigmoid Function | Good for output layer in binary classification problems |
| Tanh Function | Better than sigmoid |
| Relu Function ✨ | Default choice for hidden layers |
| Leaky Relu Function | Little bit better than Relu, Relu is more popular |

**Formula:**

$linear(x)=x$

**Graph:**

It can be used in the output layer for regression problems

**Formula:**

$sigmoid(x)=\frac{1}{1+e^{-x}}$
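A minimal sketch of the sigmoid formula in plain Python. Splitting into two branches is an implementation choice made here (not part of the formula) so that `math.exp` is never called with a large positive argument, which would overflow.

```python
import math

def sigmoid(x):
    """sigmoid(x) = 1 / (1 + e^(-x)), computed without overflow."""
    if x >= 0:
        # e^(-x) is at most 1 here, so this is safe.
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, e^(-x) could overflow, so use the
    # algebraically equivalent form e^x / (1 + e^x).
    e = math.exp(x)
    return e / (1.0 + e)

for x in [-5.0, 0.0, 5.0]:
    print(x, sigmoid(x))
```

The output is always between 0 and 1, which is why it can be read as a probability in a binary classification output layer.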

**Graph:**

Tanh is almost always superior to the sigmoid function

**Formula:**

$tanh(x)=\frac{2}{1+e^{-2x}}-1$

A rescaled and shifted version of the sigmoid function 🤔
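One way to make the relation to sigmoid concrete is the identity tanh(x) = 2·sigmoid(2x) − 1, which follows directly from the two formulas. The sketch below checks it numerically; the helper names are chosen only for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_formula(x):
    # The formula from these notes: 2 / (1 + e^(-2x)) - 1
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

def tanh_from_sigmoid(x):
    # Sigmoid stretched to the range (-1, 1) and centred at 0.
    return 2.0 * sigmoid(2.0 * x) - 1.0

for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    # All three columns should agree up to floating-point rounding.
    print(x, tanh_from_formula(x), tanh_from_sigmoid(x), math.tanh(x))
```

So tanh has the same S-shape as sigmoid, but its outputs are centred around 0 instead of 0.5.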

**Graph:**

Activation functions can be different for different layers; for example, we may use *tanh* for a hidden layer and *sigmoid* for the output layer

If z is very large or very small (very negative), then the derivative *(or the slope)* of sigmoid and tanh becomes very small (ends up being close to 0), and so this can slow down gradient descent 🐢
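To see how quickly the slope shrinks, here is a small sketch that evaluates the derivatives sigmoid'(z) = sigmoid(z)(1 − sigmoid(z)) and tanh'(z) = 1 − tanh(z)² at a few points. The sample values of z are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z)); the maximum is 0.25 at z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    # d/dz tanh(z) = 1 - tanh(z)^2; the maximum is 1 at z = 0.
    return 1.0 - math.tanh(z) ** 2

for z in [0.0, 2.0, 5.0, 10.0]:
    print(z, sigmoid_grad(z), tanh_grad(z))
```

Both derivatives are largest at z = 0 and fall towards 0 as |z| grows, so the weight updates computed from them become tiny.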

Relu is another, very popular choice

**Formula:**

$relu(x)=\left\{\begin{matrix}
0, & \text{if } x<0
\\
x, & \text{if } x\geq0
\end{matrix}\right.$

**Graph:**

So the derivative is 1 when z is positive and 0 when z is negative

Disadvantage: the derivative is 0 when z is negative 😐
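A minimal sketch of Relu and its derivative in plain Python. The derivative is not defined at exactly z = 0; returning 0 there is just a convention assumed for this example.

```python
def relu(z):
    """relu(z) = 0 for z < 0, z for z >= 0."""
    return max(0.0, z)

def relu_grad(z):
    # Slope is 1 on the positive side and 0 on the negative side.
    # At z == 0 the derivative is undefined; 0 is used here by convention.
    return 1.0 if z > 0 else 0.0

for z in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    print(z, relu(z), relu_grad(z))
```

For every negative z both the output and the gradient are 0, so a unit that only ever sees negative z stops updating.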

**Formula:**

$leaky\_relu(x)=\left\{\begin{matrix}
0.01x, & \text{if } x<0
\\
x, & \text{if } x\geq0
\end{matrix}\right.$
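A minimal sketch of Leaky Relu, using the slope 0.01 from the formula above. The parameter name `alpha` is chosen for this example, and the derivative at exactly z = 0 is again set by convention.

```python
def leaky_relu(z, alpha=0.01):
    """leaky_relu(z) = alpha * z for z < 0, z for z >= 0."""
    return z if z >= 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    # Slope is 1 on the positive side and alpha (small, but not 0)
    # on the negative side. At z == 0 alpha is used by convention.
    return 1.0 if z > 0 else alpha

for z in [-3.0, -0.5, 0.0, 0.5, 3.0]:
    print(z, leaky_relu(z), leaky_relu_grad(z))
```

For 0 < alpha < 1 the same function can also be written as `max(alpha * z, z)`. The gradient never becomes exactly 0 on the negative side.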

**Graph:**

For much of the range of z, the derivative of Relu (and Leaky Relu) is far from 0

So the NN will usually learn much faster than when using tanh or sigmoid

Well, if we use a linear activation function in every layer, then the NN is just outputting a linear function of the input, so no matter how many layers our NN has 🙄, all it is doing is just computing a linear function 😕

❗ Remember that the composition of two linear functions is itself a linear function
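The point above can be checked on a tiny example. With linear activations, two layers W2(W1·x + b1) + b2 collapse into one layer with W = W2·W1 and b = W2·b1 + b2. The weights, biases, and helper functions below are hypothetical and only serve the illustration.

```python
def matvec(W, x):
    """Matrix-vector product for lists of lists."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for lists of lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Hypothetical parameters of a 2-layer network with linear activations.
W1 = [[1.0, 2.0], [0.0, -1.0]]
b1 = [0.5, 1.0]
W2 = [[3.0, 1.0]]
b2 = [-2.0]

def two_linear_layers(x):
    h = add(matvec(W1, x), b1)      # layer 1, linear activation
    return add(matvec(W2, h), b2)   # layer 2, linear activation

# The equivalent single linear layer.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def one_linear_layer(x):
    return add(matvec(W, x), b)

x = [1.0, 2.0]
print(two_linear_layers(x), one_linear_layer(x))  # both give the same output
```

Adding more linear layers only produces another pair (W, b), so the extra depth adds nothing without a non-linear activation in between.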

If the output is 0 or 1 (binary classification) ➡ *sigmoid* is good for the output layer

For all other units ➡ *Relu* ✨

We can say that Relu is the default choice of activation function for hidden layers

Note:

If you are not sure which one of these functions works best 😵, try them all 🤕, evaluate each on a validation set, see which one works better and go with that 🤓😇