🎭 Compressing Human Faces with VAE vs VQ-VAE — A Deep Dive into Autoencoder Design

"Can neural networks really compress faces efficiently, without losing identity?"

In this post, I explore this question by building and comparing two popular generative compression architectures: Variational Autoencoder (VAE) and Vector Quantized VAE (VQ-VAE) — trained on passport-style human face images.

🔗 [GitHub Repository](https://github.com/Ertugrulmutlu/VQVAE-and-VAE)
📂 Dataset Source (Kaggle)


📦 Why Autoencoders for Image Compression?

Autoencoders learn to reconstruct input data from a compact representation (latent space). This enables lossy compression by:

  • Removing irrelevant pixel-level noise
  • Learning semantic structure (e.g., eyes, nose, face contour)
  • Outputting reconstructions that are visually close to the original but much smaller in size
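
To make the idea concrete, here is a bare-bones convolutional autoencoder in PyTorch. This is an illustrative sketch only (the layer sizes are arbitrary choices of mine, not the repo's model):

```python
import torch.nn as nn

# Bare-bones convolutional autoencoder (illustrative sketch, not the repo's model).
class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),   # 3x64x64 -> 32x32x32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),  # -> 64x16x16 latent map
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),  # -> 32x32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),   # -> 3x64x64
            nn.Sigmoid(),  # pixel values back in [0, 1]
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```

Both architectures below keep this encode-compress-decode skeleton; what differs is how the bottleneck in the middle is constrained.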

But not all autoencoders are created equal. Let’s break down how VAE and VQ-VAE differ — and which one works best for face images.


🔧 Project Setup

  • Dataset: 3000+ frontal face images from Kaggle (balanced by lighting, expression, and gender)
  • All images resized to 64×64 or 128×128 (see the data-loading sketch below)
  • Trained on CPU with PyTorch
  • Output format: JPEG (quality=85)
```bash
# Install dependencies
pip install -r requirements.txt
```
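
For context, a minimal data-loading sketch; the `data/faces` path, the `ImageFolder` layout, and the batch size are assumptions for illustration, not necessarily the repo's exact pipeline:

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Resize to the target resolution and convert to [0, 1] tensors
transform = transforms.Compose([
    transforms.Resize((64, 64)),  # or (128, 128) for the larger runs
    transforms.ToTensor(),
])

dataset = datasets.ImageFolder("data/faces", transform=transform)  # hypothetical path
loader = DataLoader(dataset, batch_size=32, shuffle=True)
```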

🧠 Architecture 1: Variational Autoencoder (VAE)

VAE is a probabilistic generative model that learns a continuous latent space:

  • Encoder outputs mean (μ) and log variance (logσ²)
  • Latent vector sampled as: z = μ + σ * ε where ε ~ N(0,1)
  • Decoder reconstructs image from z
```python
h = encoder(x)                   # shared encoder features (computed once)
mu = fc_mu(h)                    # mean of q(z|x)
logvar = fc_logvar(h)            # log-variance of q(z|x)
z = reparameterize(mu, logvar)   # z = mu + sigma * eps
x_hat = decoder(z)               # reconstruction
```

Loss = MSE reconstruction + KL divergence (to pull the approximate posterior toward the standard Gaussian prior)
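
Here is a minimal sketch of those two terms, including the `reparameterize` helper used in the snippet above. The `beta` weight on the KL term is an assumption for illustration; the repo may weight it differently:

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I); keeps sampling differentiable
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def vae_loss(x_hat, x, mu, logvar, beta=1.0):
    recon = F.mse_loss(x_hat, x, reduction="sum")  # MSE reconstruction term
    # closed-form KL( N(mu, sigma^2) || N(0, I) )
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```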

✅ Pros:

  • Smooth latent space, good for interpolation
  • Easy to implement

❌ Cons:

  • Blurry outputs due to probabilistic sampling
  • Gaussian prior limits representation precision

📸 Sample Result (64×64, 50 epochs)

```
🖼️ Original: 93.71 KB
🔁 Reconstructed: 1.62 KB
📉 Compression Rate: 57.84x
```
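
For reference, numbers like these can be computed by encoding the reconstruction as JPEG at quality 85 (per the setup above) and comparing byte counts. A sketch, where `face.jpg` is a hypothetical input path and `x_hat` is the decoder output from the earlier snippet; the repo may measure this differently:

```python
import io
import os
from PIL import Image
from torchvision.transforms.functional import to_pil_image

def jpeg_kb(img, quality=85):
    # encode a PIL image to JPEG in memory and return its size in KB
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return buf.tell() / 1024

orig_kb = os.path.getsize("face.jpg") / 1024                    # hypothetical input file
recon_kb = jpeg_kb(to_pil_image(x_hat.squeeze(0).clamp(0, 1)))  # decoder output -> PIL
print(f"{orig_kb:.2f} KB -> {recon_kb:.2f} KB ({orig_kb / recon_kb:.2f}x)")
```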

🧠 Architecture 2: Vector Quantized VAE (VQ-VAE)

VQ-VAE replaces the continuous latent space with discrete codebook vectors:

  • Encoder outputs feature map → quantized to nearest embedding
  • Decoder reconstructs image from quantized features
```python
z = encoder(x)                             # continuous feature map
quantized, vq_loss = vector_quantizer(z)   # snap each position to its nearest code
x_hat = decoder(quantized)                 # reconstruct from discrete codes
```

Loss = MSE reconstruction + codebook loss + VQ commitment loss
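
To show what `vector_quantizer` might look like internally, here is a minimal sketch with an explicit codebook and the standard straight-through gradient trick. The codebook size (512), embedding dimension (64), and commitment cost (0.25) are illustrative assumptions, not necessarily the repo's settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with straight-through gradients (sketch)."""

    def __init__(self, num_embeddings=512, embedding_dim=64, commitment_cost=0.25):
        super().__init__()
        self.commitment_cost = commitment_cost
        self.codebook = nn.Embedding(num_embeddings, embedding_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_embeddings, 1.0 / num_embeddings)

    def forward(self, z):
        # z: (B, C, H, W) with C == embedding_dim
        B, C, H, W = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, C)               # (B*H*W, C)
        indices = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        quantized = self.codebook(indices).view(B, H, W, C).permute(0, 3, 1, 2)
        # codebook term moves embeddings toward encoder outputs;
        # commitment term keeps encoder outputs near their chosen codes
        vq_loss = F.mse_loss(quantized, z.detach()) \
                + self.commitment_cost * F.mse_loss(quantized.detach(), z)
        # straight-through estimator: gradients flow to the encoder
        # as if quantization were the identity
        quantized = z + (quantized - z).detach()
        return quantized, vq_loss
```

Because gradients cannot flow through the `argmin`, the straight-through line copies them from the decoder input directly back to the encoder output.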

✅ Pros:

  • Sharper and more detailed reconstructions
  • Discrete representations better for downstream tasks

❌ Cons:

  • Slightly harder to train
  • Requires codebook tuning (size, commitment cost)

📸 Sample Result (128×128, 50 epochs)

```
🖼️ Original: 93.71 KB
🔁 Reconstructed: 3.66 KB
📉 Compression Rate: 25.58x
```

⚙️ Why These Architectures?

I chose VAE and VQ-VAE because they represent two fundamentally different approaches to learning compressed representations:

|              | VAE                       | VQ-VAE                  |
| ------------ | ------------------------- | ----------------------- |
| Latent Space | Continuous (Gaussian)     | Discrete (codebook)     |
| Output Style | Smooth, blurry            | Crisp, pixel-accurate   |
| Use Case     | Interpolation, generation | Compression, deployment |

In practice, the difference was immediately visible: VQ-VAE produced sharper eyes, better skin texture, and preserved the facial layout more accurately.


📊 Comparison Results

| Model  | Resolution | Epochs | Output Size | Compression Rate | Visual Quality |
| ------ | ---------- | ------ | ----------- | ---------------- | -------------- |
| VAE    | 64×64      | 20     | 1.54 KB     | 60.85×           | ⭐⭐☆☆☆          |
| VAE    | 64×64      | 50     | 1.62 KB     | 57.84×           | ⭐⭐⭐☆☆          |
| VQ-VAE | 64×64      | 20     | 1.62 KB     | 57.98×           | ⭐⭐⭐⭐☆          |
| VQ-VAE | 128×128    | 50     | 3.66 KB     | 25.58×           | ⭐⭐⭐⭐⭐          |

🖼️ Visual Comparison

VQ-VAE 128×128 – 50 Epochs

*(image: vqvae_128_50ep)*

VQ-VAE 64×64 – 20 Epochs

*(image: vqvae_64_20ep)*

VAE 64×64 – 20 Epochs

*(image: vae_64_20ep)*

VAE 64×64 – 50 Epochs

*(image: vae_64_50ep)*


📉 Loss Curves & Insights

VAE Training Loss

*(image: vae_loss)*


  • Converges smoothly after ~35 epochs
  • Most gain occurs early (first 20 epochs)

VQ-VAE Training Losses

*(image: vqvae_loss)*

  • Breakdown: total, reconstruction, and VQ commitment loss
  • VQ loss stabilizes quickly while reconstruction improves more gradually

🧠 Takeaways

  • VAE is easier to train and interpret but suffers from blur due to probabilistic sampling
  • VQ-VAE captures high-frequency structure better and preserves identity at higher compression
  • At 64×64, both models compress extremely well, but VQ-VAE outperforms visually
  • At 128×128, VQ-VAE dominates in realism and perceptual clarity

💻 Run the Code Yourself

```bash
git clone https://github.com/Ertugrulmutlu/VQVAE-and-VAE
cd VQVAE-and-VAE
pip install -r requirements.txt
python main.py
```

If you found this comparison helpful or insightful, consider ⭐ starring the GitHub repository — and feel free to reach out with feedback or questions!

— Ertuğrul Mutlu (GitHub: [Ertugrulmutlu](https://github.com/Ertugrulmutlu))
