Generative Adversarial Networks (GANs) have revolutionized the field of generative modeling by enabling the creation of highly realistic synthetic data. However, training GANs is notoriously difficult, primarily because of stability and convergence issues. Traditional divergence measures such as the Kullback-Leibler (KL) divergence and the Jensen-Shannon (JS) divergence have commonly been used to guide the training process, but they often fall short in providing the necessary stability and sample quality. The introduction of the Wasserstein distance, also known as the Earth Mover's distance, has significantly improved the training dynamics of GANs. To understand why the Wasserstein distance offers such advantages, it is essential to consider the mathematical and practical properties of these divergence measures.
Traditional Divergence Measures: KL and JS Divergence
The Kullback-Leibler (KL) divergence is a measure of how one probability distribution diverges from a second, expected probability distribution. Mathematically, for two distributions P and Q, the KL divergence is given by:

D_{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
KL divergence is asymmetric: D_{KL}(P \| Q) measures the information lost when Q is used to approximate P. In the context of GANs, the generator aims to produce a distribution P_g that approximates the real data distribution P_r, and the two directions of the divergence behave very differently. The forward direction D_{KL}(P_r \| P_g) blows up wherever P_r is non-zero but P_g is zero, whereas the reverse direction D_{KL}(P_g \| P_r) focuses on areas where P_g is non-zero and P_r is zero and barely penalizes modes of the real data that the generator misses entirely. Optimizing the reverse direction therefore encourages mode collapse, where the generator produces limited diversity in its outputs.
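To make the asymmetry concrete, the following minimal NumPy sketch compares the two directions of KL divergence on a toy discrete example. The distributions and the epsilon guard are purely illustrative; epsilon simply stands in for the infinities that arise when one distribution assigns zero mass where the other does not.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Discrete KL divergence D_KL(p || q) in nats."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # eps guards against log(0); the exact KL is infinite where p > 0 and q == 0
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Real data spreads mass over four modes; the "generator" covers only two of them.
p_real = np.array([0.25, 0.25, 0.25, 0.25])
p_gen  = np.array([0.50, 0.50, 0.00, 0.00])

print(kl_divergence(p_real, p_gen))  # forward KL: very large, punishes the missing modes
print(kl_divergence(p_gen, p_real))  # reverse KL: small, tolerates dropping modes
```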
Jensen-Shannon (JS) divergence, on the other hand, is a symmetrized and smoothed version of KL divergence. It is defined as:

D_{JS}(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M)
where M = (P + Q) / 2. JS divergence mitigates some of the issues of KL divergence by being symmetric and bounded (by log 2 when natural logarithms are used). However, it still suffers from the problem of vanishing gradients: when the distributions P and Q do not overlap significantly, the JS divergence saturates near its maximum value, its gradient becomes very small, and the generator struggles to learn effectively.
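This saturation can be seen directly in a small sketch: when the real and generated distributions occupy disjoint bins, the JS divergence equals log 2 no matter how far apart the mass sits, so the objective gives the generator no sense of direction. The NumPy snippet below is a minimal illustration; the five-bin distributions are invented for the example.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    # Discrete KL divergence with a small epsilon guard against log(0)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def js_divergence(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Real data lives entirely in the first bin; the generated mass sits in some
# disjoint bin. Whichever bin it occupies, JS stays saturated at log 2, so
# moving the generated mass "closer" changes nothing in the objective.
p_real = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
for k in range(1, 5):
    p_gen = np.zeros(5)
    p_gen[k] = 1.0
    print(k, js_divergence(p_real, p_gen), np.log(2))
```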
Wasserstein Distance: A Robust Alternative
The Wasserstein distance, also known as the Earth Mover's distance, provides a more robust measure of the difference between two probability distributions. It is defined as the minimum cost of transporting probability mass to transform one distribution into the other. Mathematically, for two distributions P and Q, the (first-order) Wasserstein distance is given by:

W(P, Q) = \inf_{\gamma \in \Pi(P, Q)} \mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|]
where \Pi(P, Q) is the set of all joint distributions \gamma(x, y) whose marginals are P and Q respectively. Unlike KL and JS divergence, the Wasserstein distance provides meaningful gradients even when the distributions do not overlap, which is important for stable GAN training.
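A minimal one-dimensional sketch using SciPy's wasserstein_distance illustrates the contrast: as a stand-in generated distribution is shifted further away from the real one, the estimated distance keeps growing smoothly instead of saturating, so it continues to signal how far off the generator is. The Gaussian samples and shift values here are purely illustrative.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=10_000)   # stand-in for the real data

# Shift the "generated" distribution progressively further from the real one.
# JS divergence would saturate once the two barely overlap; the Wasserstein
# distance keeps growing roughly linearly with the shift, so it still tells
# the generator which direction to move and by how much.
for shift in [0.0, 1.0, 5.0, 10.0]:
    fake = rng.normal(loc=shift, scale=1.0, size=10_000)
    print(f"shift={shift:5.1f}  W1 ~ {wasserstein_distance(real, fake):.2f}")
```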
Advantages of Wasserstein Distance in GAN Training
1. Meaningful Gradients: One of the most significant advantages of the Wasserstein distance is that it provides meaningful gradients even when the generated distribution and the real distribution are disjoint. This property ensures that the generator receives informative feedback throughout the training process, reducing the risk of vanishing gradients that can stall learning.
2. Improved Stability: The Wasserstein distance leads to more stable training dynamics. By providing a smoother and more continuous measure of the difference between distributions, it mitigates the oscillations and instability often observed with KL and JS divergence. This stability is achieved because the Wasserstein distance is a weaker metric, focusing on the overall shape and support of the distributions rather than their pointwise differences.
3. Better Mode Coverage: The Wasserstein distance encourages the generator to cover the entire support of the real data distribution, addressing the mode collapse issue prevalent with KL and JS divergence. By considering the cost of transporting mass, the Wasserstein distance inherently penalizes distributions that do not cover all modes of the real distribution.
4. Lipschitz Continuity: The Wasserstein GAN (WGAN) framework imposes a Lipschitz continuity constraint on the critic (formerly the discriminator) by clipping its weights or using gradient penalty. This constraint ensures that the critic function is smooth and bounded, further contributing to the stability of the training process.
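As a concrete illustration of point 4, the sketch below shows the gradient-penalty way of encouraging the Lipschitz constraint, in the style of WGAN-GP, written in PyTorch. The names critic, real, and fake are placeholders for the critic network and batches of real and generated samples; in practice the returned penalty is added to the critic's loss, scaled by a coefficient (commonly 10).

```python
import torch

def gradient_penalty(critic, real, fake, device="cpu"):
    """Gradient-penalty term: push the critic's gradient norm toward 1
    on random interpolates between real and generated samples."""
    batch_size = real.size(0)
    # Random interpolation coefficients, broadcast over all non-batch dimensions
    alpha = torch.rand(batch_size, *([1] * (real.dim() - 1)), device=device)
    interpolates = (alpha * real + (1 - alpha) * fake).requires_grad_(True)

    scores = critic(interpolates)
    grads = torch.autograd.grad(
        outputs=scores,
        inputs=interpolates,
        grad_outputs=torch.ones_like(scores),
        create_graph=True,   # keep the graph so the penalty itself is differentiable
    )[0]
    grad_norm = grads.reshape(batch_size, -1).norm(2, dim=1)
    return ((grad_norm - 1) ** 2).mean()
```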
Practical Implementation: Wasserstein GAN (WGAN)
The practical implementation of the Wasserstein distance in GANs is realized through the Wasserstein GAN (WGAN) framework. The key modifications in WGAN compared to traditional GANs include:
1. Critic Instead of Discriminator: In WGAN, the discriminator is referred to as the critic because it no longer classifies samples as real or fake; instead, it assigns real-valued scores, and the difference between its expected score on real and generated samples estimates the Wasserstein distance between the two distributions.
2. Weight Clipping or Gradient Penalty: To enforce the Lipschitz constraint, the weights of the critic are clipped to a small range (e.g., [-0.01, 0.01]). Alternatively, a gradient penalty term can be added to the loss function to keep the gradient norm of the critic close to 1.
3. Loss Function: The loss function in WGAN is based on the dual form of the Wasserstein distance and is given by:

L = \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{\tilde{x} \sim P_g}[f(\tilde{x})]
where f is the critic function, which is constrained to be (approximately) 1-Lipschitz. The critic aims to maximize this quantity, while the generator aims to minimize it, which amounts to maximizing the critic's expected score on generated samples.
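Putting the loss and the weight-clipping constraint together, the following PyTorch sketch outlines one WGAN training step. It assumes that critic, generator, their optimizers, a batch of real samples, and the latent dimensionality are defined elsewhere; the clip value of 0.01 and the five critic updates per generator update follow the original WGAN recipe, but the names and structure are otherwise illustrative.

```python
import torch

def train_step(critic, generator, opt_c, opt_g, real_batch,
               latent_dim, clip_value=0.01, n_critic=5):
    batch_size = real_batch.size(0)

    # --- Critic: maximize E[f(real)] - E[f(fake)] (minimize the negative) ---
    for _ in range(n_critic):
        z = torch.randn(batch_size, latent_dim)
        fake = generator(z).detach()
        loss_c = critic(fake).mean() - critic(real_batch).mean()
        opt_c.zero_grad()
        loss_c.backward()
        opt_c.step()
        # Enforce the Lipschitz constraint by clipping the critic's weights
        for p in critic.parameters():
            p.data.clamp_(-clip_value, clip_value)

    # --- Generator: minimize -E[f(fake)], i.e. make the critic score fakes highly ---
    z = torch.randn(batch_size, latent_dim)
    loss_g = -critic(generator(z)).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_c.item(), loss_g.item()
```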
Empirical Results and Examples
Empirical results have demonstrated that WGANs outperform traditional GANs in terms of stability and the quality of generated samples. For example, in image generation tasks, WGANs tend to produce more diverse and realistic images than standard GANs. The improved stability of WGANs also allows for longer training runs with a much lower risk of mode collapse or outright training failure.
In a practical example, consider the task of generating high-resolution images of human faces. Traditional GANs might produce a limited variety of faces, often failing to capture the full diversity of human features. In contrast, a WGAN can generate a wide range of faces with different attributes, such as age, gender, and ethnicity, due to its ability to cover the entire support of the real data distribution.
Conclusion
The Wasserstein distance has significantly advanced the field of GANs by addressing the limitations of traditional divergence measures like KL and JS divergence. Its ability to provide meaningful gradients, improve training stability, and encourage better mode coverage has made it a preferred choice for many generative modeling tasks. The WGAN framework, with its modifications to the critic and loss function, exemplifies the practical benefits of using the Wasserstein distance in GAN training.