The assertion that the activation function in neural networks can only be implemented by a step function, which results in outputs of either 0 or 1, is a common misconception. While step functions, such as the Heaviside step function, were among the earliest activation functions used in neural networks, modern deep learning frameworks, including those built with Python and PyTorch, employ a variety of activation functions that offer continuous, differentiable outputs. These functions are important for enabling the training of deep neural networks through gradient-based optimization methods such as backpropagation.
Step Functions and Their Limitations
A step function, specifically the binary step function, is defined mathematically as follows:
![Rendered by QuickLaTeX.com \[ f(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x \geq 0 \end{cases} \]](https://eitca.org/wp-content/ql-cache/quicklatex.com-368dbe6457b74ad677514f8302c5bb51_l3.png)
The binary step function maps input values to either 0 or 1, depending on whether the input is below or above a certain threshold (typically zero). This function is non-linear and can be used to create a simple model of a neuron that either "fires" (output 1) or does not "fire" (output 0).
However, the binary step function has significant limitations:
1. Non-Differentiability: The binary step function is not differentiable at the threshold point (x = 0), and its derivative is zero everywhere else. Gradient-based training relies on gradients to update the weights of the network, so a function whose gradient is either undefined or zero gives gradient descent and backpropagation no signal to work with.
2. Limited Expressiveness: The binary step function's output is binary, which limits the function's ability to model complex relationships in the data. More nuanced and continuous activation functions allow for the representation of more complex patterns and interactions.
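To make the first limitation concrete, here is a minimal sketch that estimates the slope of the binary step function with a central difference and compares it with a smooth logistic curve. The helper names `binary_step`, `sigmoid`, and `numeric_derivative` are illustrative and not taken from any library.

```python
import math

def binary_step(x: float) -> float:
    """Binary (Heaviside) step: 0 for x < 0, 1 for x >= 0."""
    return 1.0 if x >= 0 else 0.0

def sigmoid(x: float) -> float:
    """Smooth logistic curve, used here only for comparison."""
    return 1.0 / (1.0 + math.exp(-x))

def numeric_derivative(f, x: float, h: float = 1e-4) -> float:
    """Central-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Away from the threshold the step function is flat, so its slope is zero.
for x in (-2.0, -0.5, 0.5, 2.0):
    step_slope = numeric_derivative(binary_step, x)
    smooth_slope = numeric_derivative(sigmoid, x)
    print(f"x={x:5.1f}  step slope={step_slope:.4f}  sigmoid slope={smooth_slope:.4f}")
```

At every sampled point the step function's estimated slope is zero, so a weight update proportional to it would leave the weights unchanged, while the smooth curve yields a usable non-zero slope.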
Modern Activation Functions
To address the limitations of the step function, a variety of continuous and differentiable activation functions have been developed. These functions are designed to introduce non-linearity into the network while being amenable to gradient-based optimization. Some of the most commonly used activation functions include:
1. Sigmoid Function:
The sigmoid function maps any real-valued number to a value between 0 and 1, which can be interpreted as a probability. It is defined as:
\[ f(x) = \frac{1}{1 + e^{-x}} \]
The sigmoid function is differentiable and has a smooth gradient, which makes it suitable for training with gradient descent. However, it suffers from the vanishing gradient problem, where the gradients become very small for extreme values of x, slowing down the training process.
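The vanishing gradient effect can be seen numerically. The sketch below uses the standard identity that the derivative of the sigmoid equals `s * (1 - s)`, where `s` is the sigmoid output; the function names are illustrative.

```python
import math

def sigmoid(x: float) -> float:
    """Sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x: float) -> float:
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 and shrinks quickly as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.5f}  gradient={sigmoid_grad(x):.6f}")
```

The gradient is at most 0.25 (at x = 0) and falls below 0.0001 by x = 10, which is why saturated sigmoid units pass very little gradient back to earlier layers.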
2. Hyperbolic Tangent (Tanh) Function:
The tanh function is similar to the sigmoid function but maps input values to a range between -1 and 1. It is defined as:
\[ f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \]
The tanh function has a steeper gradient than the sigmoid function and is zero-centered, which keeps activations balanced around zero and can make the optimization process more efficient. However, it also suffers from the vanishing gradient problem for inputs of large magnitude.
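A short sketch of both properties, using only the standard library. It relies on the identity that the derivative of tanh is `1 - tanh(x)^2`; the sample inputs are arbitrary.

```python
import math

def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

inputs = [-3.0, -1.0, 0.0, 1.0, 3.0]
outputs = [math.tanh(x) for x in inputs]

# Symmetric inputs produce outputs that average to zero (zero-centered).
mean_output = sum(outputs) / len(outputs)

print("outputs:", [round(y, 4) for y in outputs])
print("mean output:", mean_output)
print("gradient at 0:", tanh_grad(0.0))  # peak slope of tanh is 1
print("gradient at 5:", tanh_grad(5.0))  # small again for large |x|
```

The peak slope of 1 at the origin is four times the sigmoid's peak of 0.25, while the slope at x = 5 is already close to zero.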
3. Rectified Linear Unit (ReLU):
The ReLU function is one of the most popular activation functions in modern neural networks. It is defined as:
\[ f(x) = \max(0, x) \]
ReLU is computationally efficient and helps mitigate the vanishing gradient problem by providing a constant gradient for positive input values. However, it can suffer from the "dying ReLU" problem, where neurons can become inactive and stop learning if they consistently output zero.
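The following sketch illustrates both the constant positive-side gradient and the dying ReLU situation. The `dead_inputs` values are made up for illustration, and the slope used at exactly x = 0 is a convention chosen here.

```python
def relu(x: float) -> float:
    """ReLU: max(0, x)."""
    return max(0.0, x)

def relu_grad(x: float) -> float:
    """Slope of ReLU: 1 for x > 0, otherwise 0 (the value at x == 0 is a convention)."""
    return 1.0 if x > 0 else 0.0

# Positive inputs pass through with a constant gradient of 1.
print([relu(x) for x in (0.5, 2.0, 50.0)])
print([relu_grad(x) for x in (0.5, 2.0, 50.0)])

# Hypothetical pre-activations of a "dead" neuron: every input is negative,
# so both the output and the gradient are zero and the weights stop updating.
dead_inputs = [-3.2, -0.7, -1.5]
dead_outputs = [relu(x) for x in dead_inputs]
dead_grads = [relu_grad(x) for x in dead_inputs]
print(dead_outputs, dead_grads)
```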
4. Leaky ReLU:
Leaky ReLU is a variant of ReLU that allows a small, non-zero gradient when the input is negative. It is defined as:
![Rendered by QuickLaTeX.com \[ f(x) = \begin{cases} x & \text{if } x \geq 0 \\ \alpha x & \text{if } x < 0 \end{cases} \]](https://eitca.org/wp-content/ql-cache/quicklatex.com-780d3836149b601275f479db9d151025_l3.png)
where \(\alpha\) is a small constant (e.g., 0.01). Leaky ReLU helps address the dying ReLU problem by ensuring that neurons continue to learn even when the input is negative.
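As a minimal sketch, the definition can be written directly in Python. The default slope of 0.01 follows the example value in the text; the function names are illustrative.

```python
def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Leaky ReLU: x for x >= 0, alpha * x otherwise."""
    return x if x >= 0 else alpha * x

def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    """Slope of Leaky ReLU: 1 for x >= 0, alpha otherwise."""
    return 1.0 if x >= 0 else alpha

for x in (-5.0, -1.0, 0.0, 3.0):
    print(f"x={x:5.1f}  output={leaky_relu(x):7.3f}  gradient={leaky_relu_grad(x):.2f}")
```

Unlike plain ReLU, the gradient for negative inputs is small but never zero, so a neuron stuck in the negative region can still receive weight updates.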
5. Parametric ReLU (PReLU):
PReLU is an extension of Leaky ReLU where the slope of the negative part of the function is learned during training. It is defined as:
![Rendered by QuickLaTeX.com \[ f(x) = \begin{cases} x & \text{if } x \geq 0 \\ \alpha x & \text{if } x < 0 \end{cases} \]](https://eitca.org/wp-content/ql-cache/quicklatex.com-780d3836149b601275f479db9d151025_l3.png)
where \(\alpha\) is a learnable parameter. PReLU can adapt to the data during training, potentially improving model performance.
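A brief sketch of what "learnable" means in practice, using PyTorch's `nn.PReLU`, whose slope is stored in the module's `weight` parameter. The input values and the use of a plain sum as a stand-in loss are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# One shared slope for all inputs; 0.25 is PyTorch's default initial value.
prelu = nn.PReLU(num_parameters=1, init=0.25)

x = torch.tensor([-2.0, -1.0, 0.5, 3.0])
y = prelu(x)

# Use the sum of outputs as a stand-in loss and backpropagate.
y.sum().backward()

print(y)                  # negative inputs are scaled by the current slope
print(prelu.weight)       # the slope is a trainable parameter
print(prelu.weight.grad)  # its gradient is the sum of the negative inputs
```

Because the slope receives a gradient like any other weight, an optimizer updates it alongside the rest of the model's parameters.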
6. Exponential Linear Unit (ELU):
ELU is another activation function designed to improve learning by addressing the vanishing gradient problem. It is defined as:
![Rendered by QuickLaTeX.com \[ f(x) = \begin{cases} x & \text{if } x \geq 0 \\ \alpha (e^x - 1) & \text{if } x < 0 \end{cases} \]](https://eitca.org/wp-content/ql-cache/quicklatex.com-16484ae0cc0fecdd3fc257013409b607_l3.png)
where \(\alpha\) is a positive constant. ELU has a smoother gradient than ReLU and can produce negative outputs, which helps push mean activations toward zero.
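The definition translates directly into a few lines of Python. This is a sketch with `alpha = 1.0` assumed as the default; the function names are illustrative.

```python
import math

def elu(x: float, alpha: float = 1.0) -> float:
    """ELU: x for x >= 0, alpha * (exp(x) - 1) otherwise."""
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)

def elu_grad(x: float, alpha: float = 1.0) -> float:
    """Slope of ELU: 1 for x >= 0, alpha * exp(x) otherwise."""
    return 1.0 if x >= 0 else alpha * math.exp(x)

for x in (-10.0, -1.0, -0.001, 0.0, 2.0):
    print(f"x={x:7.3f}  output={elu(x):8.5f}  gradient={elu_grad(x):.5f}")
```

For large negative inputs the output saturates near `-alpha`, and with `alpha = 1.0` the slope approaches 1 as x approaches zero from below, so the gradient has no jump at the origin.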
7. Softmax Function:
The softmax function is commonly used in the output layer of classification networks. It converts logits (raw prediction scores) into probabilities by exponentiating the logits and normalizing them. It is defined as:
\[ \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}} \]
where \(x_i\) is the input to the i-th neuron, and the denominator is the sum of exponentials of all inputs. The softmax function ensures that the output is a valid probability distribution, with values between 0 and 1 that sum to 1.
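A minimal sketch of the computation in plain Python. Subtracting the largest logit before exponentiating is a common numerical-stability step that is not part of the definition above; it leaves the result unchanged because the same factor cancels in the numerator and denominator. The example logits are arbitrary.

```python
import math

def softmax(logits):
    """Exponentiate each logit and normalize so the results sum to 1."""
    largest = max(logits)  # shift for numerical stability; does not change the result
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 4) for p in probs])
print(sum(probs))

# Large logits would overflow math.exp without the shift.
print(softmax([1000.0, 1000.0]))
```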
Implementation in PyTorch
In PyTorch, these activation functions are readily available and can be easily integrated into neural network models. Here are examples of how to implement some of these activation functions in PyTorch:
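The sketch below is not the original listing for this section; it is an illustrative example that applies PyTorch's built-in activation modules and functions to a small tensor and places one of them inside a model. The input values and layer sizes are arbitrary choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

# Module form: activation layers that can be placed inside nn.Sequential.
activations = {
    "sigmoid": nn.Sigmoid(),
    "tanh": nn.Tanh(),
    "relu": nn.ReLU(),
    "leaky_relu": nn.LeakyReLU(negative_slope=0.01),
    "prelu": nn.PReLU(),
    "elu": nn.ELU(alpha=1.0),
}
for name, layer in activations.items():
    print(name, layer(x))

# Functional form: the same operations applied directly to tensors.
relu_out = F.relu(x)
probs = F.softmax(x, dim=0)  # dim selects the axis that should sum to 1
print(torch.sigmoid(x), torch.tanh(x), relu_out, probs)

# A small illustrative model: ReLU in the hidden layer, softmax on the output.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
logits = model(torch.randn(2, 4))
class_probs = F.softmax(logits, dim=1)
print(class_probs)
```

The module form is convenient when building models layer by layer, while the functional form fits naturally inside a custom `forward` method.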
Choosing the Right Activation Function
The choice of activation function depends on various factors, including the specific problem being addressed, the architecture of the neural network, and empirical performance. Here are some guidelines for choosing activation functions:
1. Hidden Layers: ReLU and its variants (Leaky ReLU, PReLU, ELU) are commonly used in hidden layers due to their computational efficiency and ability to mitigate the vanishing gradient problem.
2. Output Layer: The activation function for the output layer depends on the type of task:
- For binary classification, the sigmoid function is often used to produce a probability.
- For multi-class classification, the softmax function is used to produce a probability distribution over classes.
- For regression tasks, a linear activation function (or no activation function) is typically used to produce continuous output values.
3. Experimental Validation: It is often beneficial to experiment with different activation functions and evaluate their performance on the specific task. Empirical results can provide insights into which activation function works best for the given data and model architecture.
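These guidelines can be sketched as a small model builder. `build_model` is a hypothetical helper written for this example, and the feature, hidden, and class sizes are arbitrary.

```python
import torch
import torch.nn as nn

def build_model(task: str, in_features: int = 10, hidden: int = 16,
                num_classes: int = 3) -> nn.Sequential:
    """Hypothetical helper: ReLU hidden layer plus a task-specific output activation."""
    if task == "binary":
        head = [nn.Linear(hidden, 1), nn.Sigmoid()]
    elif task == "multiclass":
        head = [nn.Linear(hidden, num_classes), nn.Softmax(dim=1)]
    elif task == "regression":
        head = [nn.Linear(hidden, 1)]  # linear output: no activation
    else:
        raise ValueError(f"unknown task: {task}")
    return nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU(), *head)

batch = torch.randn(4, 10)
binary_out = build_model("binary")(batch)      # one probability per sample
multi_out = build_model("multiclass")(batch)   # one distribution per sample
reg_out = build_model("regression")(batch)     # unbounded real values
print(binary_out.shape, multi_out.shape, reg_out.shape)
```

One practical caveat: some PyTorch loss functions, such as `nn.CrossEntropyLoss` and `nn.BCEWithLogitsLoss`, expect raw logits and apply the corresponding activation internally, so the output activation is usually left out of the model when those losses are used.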
In the field of deep learning, activation functions play an important role in enabling neural networks to learn and model complex patterns in data. While step functions were used in early neural networks, modern deep learning frameworks employ a variety of continuous, differentiable activation functions that address the limitations of step functions and enhance the training process. By understanding the properties and applications of different activation functions, practitioners can make informed decisions about which functions to use in their models, ultimately leading to better performance and more accurate predictions.

