Wasserstein GAN

In this blog, we focus on the architecture of W-GAN. In the next blog, we will concentrate on the mathematics, as in our earlier posts, since that is the topic of prime importance to us. Here we give only an introduction to the W-GAN paper.

Trust me, this is one of the great, path-breaking, classic papers of the GAN literature. It introduced a very new loss function. Much of the trouble with GANs stems from their loss function. Recall that the generator is trained using the discriminator's gradients: in theory, we first optimise the discriminator, and then, keeping the discriminator's weights constant, we train the generator with the help of the discriminator's gradients. So, at every step, we are trying to push the generator to fool an optimal discriminator.

But the problem is that in practice, as the discriminator gets better, the updates to the generator get consistently worse. The original GAN paper argued that this issue arose from saturation, and switched to another, similar cost function that doesn't have this problem. However, even with this new cost function, updates tend to get worse, and optimisation gets massively unstable (we will explain this in detail in further posts). In the regular GAN loss, we train the discriminator using the JS divergence, which is implicitly a sum of two KL divergences. The KL divergence is only well behaved when the two distributions overlap. But our distributions, Pg and Pr, are actually defined on low-dimensional manifolds: they are not continuous, and they have disjoint supports.

For this reason, the regular GAN loss (the one involving the cross-entropy loss) was replaced by a new loss inspired by the Earth-Mover distance.

What is earth mover distance?

The mathematical expression for this is:

                   W(Pr, Pg) = inf_{γ ∈ Π(Pr, Pg)} E_{(x,y)∼γ}[ ||x − y|| ]

where Π(Pr, Pg) denotes the set of all joint distributions γ(x, y) whose marginals are respectively Pr and Pg. Intuitively, it is the minimum cost of moving mass to transform one distribution into the other.
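To build some intuition, here is a minimal numerical sketch (plain NumPy; the function name `emd_1d` is our own) of the Earth-Mover distance in the 1-D discrete case, where it reduces to the area between the two CDFs:

```python
import numpy as np

def emd_1d(p, q, grid):
    """Earth-Mover (W1) distance between two discrete 1-D
    distributions p and q defined on the same uniform grid.
    In 1-D, W1 equals the integral of |CDF_p - CDF_q|."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    step = grid[1] - grid[0]               # uniform bin width
    return np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * step

grid = np.linspace(0.0, 3.0, 4)            # points 0, 1, 2, 3
p = np.array([1.0, 0.0, 0.0, 0.0])         # all mass at 0
q = np.array([0.0, 0.0, 1.0, 0.0])         # all mass at 2
print(emd_1d(p, q, grid))                  # → 2.0: move mass a distance of 2
```

Note that the answer is exactly the distance the mass has to travel, which is why the name "earth mover" fits.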

How can we establish that this measure is better than KL divergence and JS divergence?

We explain with the same example as in the original paper.
Let Z ∼ U[0, 1] be the uniform distribution on the unit interval. Let P0 be the distribution of (0, Z) ∈ R × R, uniform on a straight vertical line segment through the origin. Now let gθ(z) = (θ, z), with θ a single real parameter, and let Pθ be the distribution of gθ(Z).


Then,

        W(P0, Pθ)  = |θ|

        JS(P0, Pθ) = log 2   if θ ≠ 0
        JS(P0, Pθ) = 0       if θ = 0
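As a quick numerical check (a plain-NumPy sketch; the function name is ours), whenever two discrete distributions have disjoint supports the JS divergence is exactly log 2, no matter how far apart the supports are, mirroring the jump from 0 to log 2 above:

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two
    discrete distributions p and q on the same support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)                      # the mixture distribution
    def kl(a, b):
        mask = a > 0                       # 0 * log 0 = 0 by convention
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [1.0, 0.0]                             # mass at theta = 0
q = [0.0, 1.0]                             # mass at theta != 0 (disjoint support)
print(js_divergence(p, q))                 # ≈ 0.6931, i.e. log 2
print(js_divergence(p, p))                 # 0.0 for identical distributions
```

This is precisely why the JS divergence gives the discriminator no useful gradient when the real and generated supports do not overlap: its value is a constant log 2 regardless of θ.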

We can clearly observe that the Earth-Mover distance is continuous in θ throughout R, but the JS divergence is discontinuous at 0. Hence, through this simple example, we can say that the JS divergence is not a useful metric for our case. From the mathematical expression for the Earth-Mover distance, we see that we have to take the infimum over the set of all joint distributions γ(x, y) whose marginals are respectively Pr and Pg. This infimum is highly intractable (this will be proved in the next blog). Hence we use the Kantorovich-Rubinstein duality, which states that:

                   W(Pr, Pg) = sup_{||f||_L ≤ 1} E_{x∼Pr}[f(x)] − E_{x∼Pθ}[f(x)]

where the supremum is over all 1-Lipschitz functions f: 𝛘 → R. In practice we replace these with K-Lipschitz functions f: 𝛘 → R, i.e. with the condition ||f||_L ≤ K (which only scales the value by K). Therefore, if we have a parameterized family of functions {fw}, w ∈ W, that are all K-Lipschitz for some K, we can consider solving the problem:

                   max_{w∈W}  E_{x∼Pr}[fw(x)] − E_{z∼p(z)}[fw(gθ(z))]

So for the discriminator (the critic), our aim is to find max_{w∈W} E_{x∼Pr}[fw(x)] − E_{z∼p(z)}[fw(gθ(z))]. In the case of the generator, we try to find min_θ −E_{z∼p(z)}[fw(gθ(z))]. To reach this maximum and minimum, we can use gradient ascent/descent. Here we have constructed the problem, and we use neural networks to approximate the functions fw.
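The two objectives above can be sketched as loss functions (a minimal NumPy sketch with our own names; in a real model, `f_real` and `f_fake` would be the critic's outputs on a batch of real and generated samples):

```python
import numpy as np

def critic_loss(f_real, f_fake):
    """Critic maximises E[f(x)] - E[f(g(z))]; written as a
    loss to minimise, we flip the sign."""
    return -(np.mean(f_real) - np.mean(f_fake))

def generator_loss(f_fake):
    """Generator minimises -E[f(g(z))], i.e. it pushes the
    critic's score on generated samples up."""
    return -np.mean(f_fake)

f_real = np.array([0.9, 1.1, 1.0])   # critic scores on real samples
f_fake = np.array([0.1, -0.1, 0.0])  # critic scores on generated samples
print(critic_loss(f_real, f_fake))   # more negative = critic separates better
print(generator_loss(f_fake))        # near 0 here: fakes score low
```

Note that, unlike the regular GAN loss, there is no log and no sigmoid here: the critic outputs an unbounded score rather than a probability.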

Here we use a few computational tricks for better results:
  • To keep the function K-Lipschitz, the critic's weights are clipped to [-0.01, 0.01] after each gradient update.
  • We update the critic a minimum of 5 times for every generator update, to avoid near-zero gradients (this trick was used in our DC-GAN model as well).
  • It was observed that RMSProp showed better results than the Adam optimiser.
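The first trick can be sketched as follows (plain NumPy, all names are ours; a real implementation would clip the critic's actual parameter tensors inside the training loop, with 5 such critic steps per generator step):

```python
import numpy as np

CLIP = 0.01      # clipping bound used in the paper

def clip_weights(weights, c=CLIP):
    """Clip every weight tensor to [-c, c] after a gradient step,
    crudely enforcing a Lipschitz constraint on the critic."""
    return [np.clip(w, -c, c) for w in weights]

weights = [np.array([[0.5, -0.002],
                     [0.03, -0.5]])]
weights = clip_weights(weights)
print(weights[0])     # large entries squashed to +/-0.01, small ones untouched
```

Clipping is admittedly a blunt instrument (the paper itself calls it "clearly terrible"), but it is enough to keep the critic within some K-Lipschitz family.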
Feel free to contact us and leave a comment if you have any corrections to add to this post, and you can find the code here.

