Project Wrap Up

Objectives : 

The day has come to finalise this project. Overall, I'm quite satisfied with the results I got. My main objectives for this project were :

  • Get familiar and efficient with a Deep Learning library
  • Get better at reading/understanding recent papers and reproduce results
  • Understand GANs and successfully implement one
  • Implement conditional GANs

Overall I think I managed to achieve all of the above. I started this semester having no clue what Theano and Lasagne were, and now they are my go-to libraries for any DL stuff. After countless failed tries, I managed to train a GAN, which is a victory in itself. Can’t wait to try GANs on other cool stuff.

Comments :  

  1. Using the captions : even though there have been really cool papers on generating images from captions (namely this one), I really don’t think it’s feasible to expect similar results with our dataset. MSCOCO is arguably the most complicated/diverse public image dataset, so there are only very few images of each object (e.g. banana, oven, surfer), and it’s hard to imagine a model successfully capturing all the complexities of the dataset. This is why I decided not to go down that path and focus only on the images.
  2. GPUs : it’s worth mentioning that having access to good GPUs is a key ingredient for training big and complex models like GANs. I made more progress in 2 weeks on TITAN X GPUs than with what was offered on the Hades cluster. I think the TAs should keep that in mind when going over the projects.
  3. Class blog. It was of great help 🙂

Further work : 

I would have liked to try a WGAN for this project, especially the one from Improved Training of Wasserstein GANs. The main issue I had with WGANs was how they enforced Lipschitz continuity by clipping weights. The paper mentioned above instead enforces the Lipschitz constraint by penalizing the norm of the critic’s gradient. I think WGANs may just be the way to go, and this improved approach is a step in the right direction.
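To make this concrete, here is a minimal Theano sketch of what that gradient penalty could look like. This is an assumed implementation, not code from the paper : critic_output_fn, lambda_gp and the interpolation shapes are placeholders.

import theano.tensor as T
from theano.sandbox.rng_mrg import MRG_RandomStreams

srng = MRG_RandomStreams(seed=42)

def gradient_penalty(critic_output_fn, real, fake, lambda_gp=10.0):
    # Random interpolation between real and generated samples
    alpha = srng.uniform((real.shape[0], 1, 1, 1))
    interp = alpha * real + (1.0 - alpha) * fake
    # Gradient of the critic's output w.r.t. the interpolated inputs
    grads = T.grad(critic_output_fn(interp).sum(), interp)
    norms = T.sqrt(T.sum(T.sqr(grads), axis=[1, 2, 3]))
    # Penalize deviation of the gradient norm from 1 (the Lipschitz constraint)
    return lambda_gp * T.mean(T.sqr(norms - 1.0))

This penalty would simply be added to the critic’s loss, replacing weight clipping.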

All.jpg

Conditional GAN v3

I tried mixing 2 of my previous approaches : resetting borders at each deconvolution, and the adversarial autoencoder. The main difference is that no pixel-wise reconstruction loss (i.e. L1 or L2) is used.

Also, I preprocessed the input differently : I took a look at the famous DCGAN torch implementation to see how they process their images. They first divide by 255, then normalize with a mean and standard deviation of 0.5 (so pixels lie in [-1, 1]), and use a tanh output activation. So I decided to do the same.
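Roughly, that preprocessing looks like this (shapes are assumed ; this just restates the steps above) :

import numpy as np

def preprocess(batch):
    # batch : uint8 images of shape (N, 3, 64, 64)
    x = batch.astype(np.float32) / 255.0   # scale to [0, 1]
    x = (x - 0.5) / 0.5                    # normalize to [-1, 1], matching the tanh output range
    return x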

Architecture : 

Generator :

InputLayer      (64, 4, 64, 64)
ConvLayer       (64, 64, 32, 32)
ConvLayer       (64, 128, 16, 16)
ConvLayer       (64, 256, 8, 8)
ConvLayer       (64, 512, 4, 4)
ConvLayer       (64, 4000, 1, 1)
DenseLayer      (64, 16384)
ReshapeLayer    (64, 1024, 4, 4)
DeconvLayer     (64, 256, 8, 8)
ResetDeconv     (64, 256, 8, 8)       25% cropped, 75% generated
DeconvLayer     (64, 128, 16, 16)
ResetDeconv     (64, 128, 16, 16)     25% cropped, 75% generated
DeconvLayer     (64, 64, 32, 32)
ResetDeconv     (64, 64, 32, 32)      25% cropped, 75% generated
DeconvLayer     (64, 3, 64, 64)
ResetDeconv     (64, 3, 64, 64)       100% cropped, 0% generated

Discriminator/Critic

InputLayer        (64, 3, 64, 64)
ConvLayer         (64, 128, 32, 32)
ConvLayer         (64, 256, 16, 16)
ConvLayer         (64, 512, 8, 8)
ConvLayer         (64, 1024, 4, 4) (*)
DenseLayer       (64, 1)

Loss for generator : 50% L2 loss between the real and generated tensors at layer (*) of the discriminator, and 50% regular LSGAN loss.
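A minimal Theano sketch of that generator loss (variable names are placeholders for the discriminator’s layer-(*) features and its final output ; this is an assumed formulation) :

import theano.tensor as T

def generator_loss(feat_real, feat_fake, disc_out_fake):
    # 50% : L2 between the discriminator's layer-(*) features on real vs. generated images
    feat_loss = T.mean(T.sqr(feat_real - feat_fake))
    # 50% : regular LSGAN generator loss, pushing the critic's output on fakes towards 1
    lsgan_loss = T.mean(T.sqr(disc_out_fake - 1.0))
    return 0.5 * feat_loss + 0.5 * lsgan_loss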

ResetDeconv is as described in this previous post.

Results : 

gan f3
images from validation set

Even though these might not be the “safest” results I’ve had so far, I like what the GAN did. It actually managed to draw bird legs (leftmost image, 3rd row), which I find pretty cool.

Conditional GAN v2

Main differences with previous conditional GAN :

  • Instead of partially resetting each deconvolution with a fixed factor of p = 0.5 (meaning output = p * generated input + (1-p) * resized cropped image), I let the model learn this p value, and do so for every layer (a minimal sketch of such a layer follows this list). Interestingly enough, after being initialized at 0.5, the final values are [0.47; 0.48; 0.51].
  • Sigmoid activation instead of tanh ; the input is squashed into [0, 1] instead of normalized.
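Such a learnable mixing coefficient can be written as a small Lasagne merge layer. The sketch below is an assumed implementation (the clipping of p to [0, 1] is an addition for safety), not necessarily the exact layer used here :

import theano.tensor as T
import lasagne
from lasagne.layers import MergeLayer

class LearnedResetLayer(MergeLayer):
    """output = p * generated + (1 - p) * resized cropped image, with p a trainable scalar."""
    def __init__(self, generated, cropped, p=lasagne.init.Constant(0.5), **kwargs):
        super(LearnedResetLayer, self).__init__([generated, cropped], **kwargs)
        self.p = self.add_param(p, (1,), name='p')

    def get_output_shape_for(self, input_shapes):
        return input_shapes[0]

    def get_output_for(self, inputs, **kwargs):
        generated, cropped = inputs
        p = T.clip(self.p[0], 0.0, 1.0)  # keep the mixing coefficient in [0, 1]
        return p * generated + (1.0 - p) * cropped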

Similar to the previous model, the weights were initialized with those of a pretrained GAN run on 150 * 50 minibatches of size 64.

1251_pretrain_300
minibatch from validation set

Again, we see some mode collapse in the model, where different filled squares share similar features. Nonetheless, the model is able to replicate the immediate surrounding colors, which is nice to see. It’s still unclear, however, which model performs best.

AutoEncoder with Adversarial Loss v2

Main differences with the previous model :

  • Discriminator looks at the entire image, not just the reconstructed part
  • Added a bigger reconstruction penalty to fetch colors
  • Used sigmoid activation instead of tanh : input is divided by 255, instead of being normalized.

gan_ae_reset_from_scratch_1760

Thoughts :
Images are less sharp than before, but colors seem better (a sharp-edges / colour tradeoff) ; still, this seems to give worse results than here.

In general, it seems easier to train without normalizing the input (only compressing it into [0, 1]), and sigmoid is a good output activation for that. Will keep using it.

Next : 

I’d like to -try- and get decent results with a conditional GAN (i.e. not an AE with adversarial loss) that does not suffer from mode collapse.

Context Autoencoder with Adversarial Loss : the Redemption

After playing a lot with GANs and getting results, I found that GANs have one weakness when it comes to inpainting : they try too hard. For example, if you give one a blue sky with a hole, it will prefer adding sharp lines rather than just putting in a simple blue patch. OTOH, autoencoders are the exact opposite : they lack detail and take the lazy way out by outputting a blended, blurry patch. This was my motivation for trying the adversarial autoencoder… again.

Without further ado, here are some results on images from the validation set :

samples2_lsgancond_ae_2_epoch1700

Have a look also at the output of the generator (before the real image border is merged with the prediction) :

samples2_raw_lsgancond_ae_2_epoch1700

As you may have guessed, the discriminator only looks at the center part of the image. We can really see the contributions of both the autoencoder and the GAN : the color and the sharpness of the images, respectively.

Methodology : 

I had to change a few things from the original GAN paper to make it work (described in previous post(s)) :

  • LSGAN’s loss is used instead of BCE, with Rmsprop instead of Adam.
  • Reconstruction/Adversarial loss ratio is the same (100) ; however, the learning rate for both models is 2e-4 (instead of 2e-3 for the generator). A sketch of this setup follows this list.
  • Architecture used : regular encoder before architecture described here. Bottleneck of 4000 units.  Discriminator only looks at center 32×32 part.
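A minimal Lasagne/Theano sketch of that setup. Here generator_net and critic_net stand for the networks described above and are assumed to be built elsewhere ; all names are placeholders, not the actual code :

import theano
import theano.tensor as T
import lasagne

x = T.tensor4('x')   # image with its center cropped out
y = T.tensor4('y')   # ground-truth 32x32 center

recon = lasagne.layers.get_output(generator_net, x)
adv_out = lasagne.layers.get_output(critic_net, recon)

rec_loss = T.mean(T.sqr(recon - y))        # pixel-wise reconstruction on the center
adv_loss = T.mean(T.sqr(adv_out - 1.0))    # LSGAN generator term, instead of BCE
g_loss = 100.0 * rec_loss + adv_loss       # reconstruction/adversarial ratio of 100

g_params = lasagne.layers.get_all_params(generator_net, trainable=True)
g_updates = lasagne.updates.rmsprop(g_loss, g_params, learning_rate=2e-4)  # RMSprop, lr 2e-4
train_g = theano.function([x, y], g_loss, updates=g_updates)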

Comments : 

  • I’m not a big fan of this approach. Compared to regular GANs where you are actually learning the true data distribution, here the adversarial part only makes predictions sharper. This is good for inpainting however.
  • Very unstable to train : it’s hard to have “just the right amount” of adversarial loss. I tried putting more weight on the adversarial gradient ; the result is sharper edges, but less color. I emailed the author about how he got his Reconstruction/Adversarial loss ratio. His response was very insightful, so I figured I would post it here :

    The ideology for doing that was to search for higher lambda_rec than lambda_adv. This is so because reconstruction loss plays a crucial role in deciding what should be filled in the context,  and even without adversarial loss, the output is almost correct, but just blurry. Thus, the aim was to shoot for higher weight of reconstruction loss

higher adv
What happens when you put too much adversarial loss.

Next step : figure out a way to use a discriminator that looks at the entire image. This way, putting more adversarial weight in the generator’s loss will help improve context. Also, trying Poisson blending could be a good idea.

First -presentable- conditional GAN results

Architecture :

LSGAN model previously described, with the following differences :

  • At every deconvolution step, the original cropped image is averaged with the generator’s -intermediate- output. For example, the first tensor produced has shape b_s x 1024 x 8 x 8 ; the image is first pooled to have height, width = 8, and then the 3 color channels are tiled to a total of 1024 channels. The idea was originally proposed by Philip Paquette. However, I had to make a few key changes in order to get some results.
    • The center of the generated image is never reset with the -empty- cropped image  center.
    • The 4 deconvolution layers take x = 40, 50, 70, 100 % of the cropped image tensor, and (1-x) of the generated tensor.
    • The generator and discriminator are first pretrained as a regular GAN. Probably the most important part. You need a generator that can already produce sharp edges otherwise no learning happens.
  • RMSprop seems to perform better for inpainting, as Adam suffers from mode collapse too quickly.
  • The mask tensor is concatenated with the input (after the RGB channels, as a 1 x 64 x 64 tensor) ; a small sketch of this is given after this list. This seems to help the generator know which parts are more important.
  • No labels are used whatsoever.
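A minimal sketch of how that 4-channel generator input can be built. The mask convention (0 inside a centered 32×32 hole, 1 elsewhere) is an assumption :

import numpy as np

def make_gen_input(images):
    # images : float32 array of shape (N, 3, 64, 64), already preprocessed
    n = images.shape[0]
    mask = np.ones((n, 1, 64, 64), dtype=np.float32)
    mask[:, :, 16:48, 16:48] = 0.0                    # 0 where the center was cropped out
    images = images.copy()
    images[:, :, 16:48, 16:48] = 0.0                  # blank out the 32x32 center
    return np.concatenate([images, mask], axis=1)     # (N, 4, 64, 64) : RGB + mask channel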

Problems : Finding the right amount of pretraining is hard : if you don’t pretrain enough, you get bad results ; if you pretrain too much, you get mode collapse. Here are some results I got :

mode coll.png

We can see that the GAN is outputting roughly 2 centers that are adapted to the context, which I find pretty cool. That being said, this is still a major flaw.

A good question to ask is why mode collapse is much more common when conditioning on the borders. I was actually expecting the opposite, to be honest. I think it’s because the Minibatch Discrimination layer can’t do its work properly, since more than 50% of the statistics of the “fake” image actually come from “real” images (i.e. the borders).

How -not- to condition on border with GANs

2 unsuccessful approaches were tried, both using variants of the LSGAN architecture previously described.

  1. Encoding the input (same encoder architecture as previously described), outputting a 64×64 image with the original borders.
  2. Partially resetting every deconvolution with the original -resized- cropped image.

Comments : 

The first approach (picture on the right) gives relatively crisp results, but the inner and outer parts don’t match. A better approach would be to allow the model to see the original image during the deconvolution, to get better context matching.

The second approach (picture on left) just isn’t able to get training going. Maybe better hyperparameters could solve this issue.

Next : 

  • Go back to Adversarial Autoencoder, see if I can get it to work.
  • Try different conditioning techniques : maybe taking a pretrained GAN for approach 2.

Stable unconditional LSGAN architecture

LSGAN.png
images after 15 epochs with Adam optimizer.

Architecture : 

Generator :

InputLayer         (64, 100)
DenseLayer        (64, 16384)
ReshapeLayer    (64, 1024, 4, 4)
DeconvLayer     (64, 256, 8, 8)
DeconvLayer     (64, 128, 16, 16)
DeconvLayer     (64, 64, 32, 32)
DeconvLayer     (64, 3, 64, 64)

All but last layer : ReLU and BatchNorm. Last layer : tanh and no BatchNorm.

Discriminator/Critic

InputLayer        (64, 3, 64, 64)
ConvLayer         (64, 128, 32, 32)
ConvLayer         (64, 256, 16, 16)
ConvLayer         (64, 512, 8, 8)
ConvLayer         (64, 1024, 4, 4)
GlobalPool         (64, 1024)
MinibatchDisc  (64, 1124)
DenseLayer       (64, 1)

All but last layer : leaky ReLU and BatchNorm. Last layer : no activation.
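Since the critic’s output is left unsquashed, the LSGAN objectives are plain least-squares losses. A minimal sketch in Theano ; the 0/1 target values follow the LSGAN paper and are an assumption, not necessarily the exact formulation used here :

import theano.tensor as T

def lsgan_losses(d_real, d_fake):
    # Critic : push outputs on real images towards 1 and on generated images towards 0
    d_loss = 0.5 * (T.mean(T.sqr(d_real - 1.0)) + T.mean(T.sqr(d_fake)))
    # Generator : push the critic's output on generated images towards 1
    g_loss = 0.5 * T.mean(T.sqr(d_fake - 1.0))
    return d_loss, g_loss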

  • Both RMSprop and Adam work. That being said, Adam gives sharper edges, but eventually suffers from mode collapse, unlike RMSprop. This validates the results from the original LSGAN paper.
  • Input was normalized ; no training happens if the input is left as-is.

Comments : 

  • Training is much more stable than vanilla DCGAN : after ~1000 iterations, you get constant, very-low-variance gradients for both the generator and the discriminator.

  • Minibatch Discrimination is essential to delay mode collapse with Adam.

If you train forever, the training dynamics stay stable ; however, you still get mode collapse. We can see that half of the examples contain the same “white kitchen counter” on the bottom right.

modec.png
mode collapse with Adam optimizer  (epoch 25)

Next :

Try conditioning on the border of the image.  

Experimenting with GANs

Before diving head first into GANs, I thought a proper literature review was necessary. Starting from Goodfellow’s vanilla GAN in 2014, here are a few important papers I’ve come across that seem relevant.

  1. DCGANs : the authors of this paper propose a new GAN architecture that seems more stable for training.
  • Batch normalization is used in both networks.
  • Fully connected hidden layers are not a good idea : better to stick to convolutions.
  • Avoid pooling :  stride your convolutions. This allows the model to learn its own spatial downsampling.
  • Use ReLU and leaky ReLU for the generator and discriminator respectively (a minimal Lasagne sketch combining these guidelines follows this list).
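A minimal Lasagne sketch of a discriminator block following these guidelines ; the filter size of 5 and the 0.2 leak are assumptions, not values from the paper :

from lasagne.layers import InputLayer, Conv2DLayer, batch_norm
from lasagne.nonlinearities import LeakyRectify

def disc_block(incoming, num_filters):
    # Downsample by 2 with a strided convolution (no pooling), then batch norm + leaky ReLU
    return batch_norm(Conv2DLayer(incoming, num_filters, 5, stride=2, pad=2,
                                  nonlinearity=LeakyRectify(0.2)))

net = InputLayer((None, 3, 64, 64))
net = disc_block(net, 128)   # -> (None, 128, 32, 32)
net = disc_block(net, 256)   # -> (None, 256, 16, 16)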

Here are some results I got with the DCGAN architecture on MSCOCO : I had to resize the images to 28 x 28, as I could not get anything nice with the original size. I adapted f0k’s DCGAN MNIST code.

Hey, if you’re trying to do a Bob Ross painting and draw happy little trees, this just might be the best way to start your canvas. However, for the task at hand, we’ll need to do better than that.

2 common failure modes I’ve come across are

  • Mode collapse : since images come from a multimodal distribution, you’d like your generator to show good variance and capture many of these modes. However, in the game-theoretic GAN setting, it might happen that your generator sticks to producing very few modes that fool the discriminator. This results in poor image synthesis and halts training.
  • Vanishing gradient : most of the time your discriminator gets really good and reaches near-perfect classification accuracy. This means it can easily classify the generator’s images and therefore has very little loss. This results in poor gradient flow.

2. Improved Techniques for Training GANs : this paper proposes a couple of techniques to resolve these issues, namely :

  • To solve mode collapse, OpenAI’s paper proposes minibatch discrimination. This is loosely described as any technique allowing the discriminator to discriminate over an entire batch. It’s then much easier to detect mode collapse and rectify it (a minimal sketch of such a layer follows this list).
  • To help with vanishing gradients, one can apply feature matching : in order for the statistics of the generated images to match the real dataset’s, you minimize the L2 loss between the expectations of the discriminator’s features on real and generated data, || E_x[f(x)] − E_z[f(G(z))] ||².
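A minimal Lasagne sketch of such a minibatch-discrimination layer. This is an assumed implementation of the paper’s idea ; num_kernels=100 matches the 1024 → 1124 jump in the LSGAN discriminator described above, and dim_per_kernel=5 is a placeholder value :

import theano.tensor as T
import lasagne
from lasagne.layers import Layer

class MinibatchDiscLayer(Layer):
    """Appends, to each example's features, statistics of its distance to the rest of the batch."""
    def __init__(self, incoming, num_kernels=100, dim_per_kernel=5,
                 W=lasagne.init.Normal(0.05), **kwargs):
        super(MinibatchDiscLayer, self).__init__(incoming, **kwargs)
        num_inputs = self.input_shape[1]
        self.num_kernels = num_kernels
        self.dim_per_kernel = dim_per_kernel
        # Projection tensor, flattened into a matrix of shape (A, B*C)
        self.W = self.add_param(W, (num_inputs, num_kernels * dim_per_kernel), name='W')

    def get_output_shape_for(self, input_shape):
        return (input_shape[0], input_shape[1] + self.num_kernels)

    def get_output_for(self, input, **kwargs):
        # Project each example to a (num_kernels, dim_per_kernel) matrix M_i
        M = T.dot(input, self.W).reshape((-1, self.num_kernels, self.dim_per_kernel))
        # Pairwise L1 distances between examples, per kernel : shape (N, num_kernels, N)
        diffs = T.abs_(M.dimshuffle(0, 1, 2, 'x') - M.dimshuffle('x', 1, 2, 0)).sum(axis=2)
        # Negative-exponential similarity, summed over the batch
        o = T.exp(-diffs).sum(axis=2)
        # Concatenate the batch statistics to the original features (e.g. 1024 -> 1124)
        return T.concatenate([input, o], axis=1)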

Next step : 

get a stable GAN training environment. I saw a couple of new setups that use a critic instead of a discriminator (namely, the output is not squashed by a sigmoid). I’m referring to WGAN and LSGAN.

Problems of Adversarial loss

My initial goal was to use my previous autoencoder as a generator, and complement the MSE loss with gradient flowing from a discriminator. 2 problems emerged :

    1. Balancing MSE with adversarial loss : the paper uses a weighted sum, L = λ_rec · L_rec + λ_adv · L_adv. The suggested lambdas are 0.999 and 0.001 for the MSE and discriminator losses respectively. Trying this setup, I got very similar results as using only MSE. My guess is that these hyperparameters are highly dependent on the size of your images, the complexity of your dataset, and the discriminator’s architecture. I tried different lambda values, but I quickly realised that it’s hard to find good hyperparameters without having a good understanding of what you are adjusting (GANs in this case).
    2. Training the discriminator : the gradient for the discriminator quickly went to 0 with the suggested learning rates. Having little intuition on how to solve this problem, I figured I should play around with GANs first and come back to this.

Next : play with GANs to gain intuition on how they actually work.