RetroSearch Browse

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Showing content from https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html below:

Activation Functions — ML Glossary documentation

ML Glossary Activation FunctionsÂ¶ Linear Â¶

A straight line function where activation is proportional to input ( which is the weighted sum from neuron ).

Function Derivative

\[\begin{split}R(z,m) = \begin{Bmatrix} z*m \\ \end{Bmatrix}\end{split}\]

\[\begin{split}R'(z,m) = \begin{Bmatrix} m \\ \end{Bmatrix}\end{split}\]

def linear(z,m):
	return m*z

def linear_prime(z,m):
	return m

Pros

It gives a range of activations, so it is not binary activation.
We can definitely connect a few neurons together and if more than 1 fires, we could take the max ( or softmax) and decide based on that.

Cons

For this function, derivative is a constant. That means, the gradient has no relationship with X.
It is a constant gradient and the descent is going to be on constant gradient.
If there is an error in prediction, the changes made by back propagation is constant and not depending on the change in input delta(x) !

ELU Â¶

Exponential Linear Unit or its widely known name ELU is a function that tend to converge cost to zero faster and produce more accurate results. Different to other activation functions, ELU has a extra alpha constant which should be positive number.

ELU is very similiar to RELU except negative inputs. They are both in identity function form for non-negative inputs. On the other hand, ELU becomes smooth slowly until its output equal to -Î± whereas RELU sharply smoothes.

Function Derivative

\[\begin{split}R(z) = \begin{Bmatrix} z & z > 0 \\ Î±.( e^z â 1) & z <= 0 \end{Bmatrix}\end{split}\]

\[\begin{split}R'(z) = \begin{Bmatrix} 1 & z>0 \\ Î±.e^z & z<0 \end{Bmatrix}\end{split}\]

def elu(z,alpha):
	return z if z >= 0 else alpha*(e^z -1)

def elu_prime(z,alpha):
	return 1 if z > 0 else alpha*np.exp(z)

Pros

ELU becomes smooth slowly until its output equal to -Î± whereas RELU sharply smoothes.
ELU is a strong alternative to ReLU.
Unlike to ReLU, ELU can produce negative outputs.

Cons

For x > 0, it can blow up the activation with the output range of [0, inf].

ReLU Â¶

A recent invention which stands for Rectified Linear Units. The formula is deceptively simple: \(max(0,z)\). Despite its name and appearance, itâs not linear and provides the same benefits as Sigmoid (i.e. the ability to learn nonlinear functions), but with better performance.

Function Derivative

\[\begin{split}R(z) = \begin{Bmatrix} z & z > 0 \\ 0 & z <= 0 \end{Bmatrix}\end{split}\]

\[\begin{split}R'(z) = \begin{Bmatrix} 1 & z>0 \\ 0 & z<0 \end{Bmatrix}\end{split}\]

def relu(z):
  return max(0, z)

def relu_prime(z):
  return 1 if z > 0 else 0

Pros

It avoids and rectifies vanishing gradient problem.
ReLu is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations.

Cons

One of its limitations is that it should only be used within hidden layers of a neural network model.
Some gradients can be fragile during training and can die. It can cause a weight update which will makes it never activate on any data point again. In other words, ReLu can result in dead neurons.
In another words, For activations in the region (x<0) of ReLu, gradient will be 0 because of which the weights will not get adjusted during descent. That means, those neurons which go into that state will stop responding to variations in error/ input (simply because gradient is 0, nothing changes). This is called the dying ReLu problem.
The range of ReLu is \([0, \infty)\). This means it can blow up the activation.

Further reading

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, Kaiming He et al. (2015)

Sigmoid Â¶

Sigmoid takes a real value as input and outputs another value between 0 and 1. Itâs easy to work with and has all the nice properties of activation functions: itâs non-linear, continuously differentiable, monotonic, and has a fixed output range.

Function Derivative

\[S(z) = \frac{1} {1 + e^{-z}}\]

\[S'(z) = S(z) \cdot (1 - S(z))\]

def sigmoid(z):
  return 1.0 / (1 + np.exp(-z))

def sigmoid_prime(z):
  return sigmoid(z) * (1-sigmoid(z))

Pros

It is nonlinear in nature. Combinations of this function are also nonlinear!
It will give an analog activation unlike step function.
It has a smooth gradient too.
Itâs good for a classifier.
The output of the activation function is always going to be in range (0,1) compared to (-inf, inf) of linear function. So we have our activations bound in a range. Nice, it wonât blow up the activations then.

Cons

Towards either end of the sigmoid function, the Y values tend to respond very less to changes in X.
It gives rise to a problem of âvanishing gradientsâ.
Its output isnât zero centered. It makes the gradient updates go too far in different directions. 0 < output < 1, and it makes optimization harder.
Sigmoids saturate and kill gradients.
The network refuses to learn further or is drastically slow ( depending on use case and until gradient /computation gets hit by floating point value limits ).