I kept running into a recurring “KL term” while reading Rezende & Mohamed (2015), “Variational Inference with Normalizing Flows,” and realized my mental model was slightly off. I used to treat the KL term as basically “cross-entropy,” so I thought of it as just another classification-like penalty. That belief is sometimes practically correct, but it is misleading in exactly the context where KL shows up most prominently: variational inference and normalizing flows. This note is my attempt to resolve that confusion.

(Primary motivation: Rezende & Mohamed, 2015.)


1. What KL Divergence Actually Is

KL divergence measures how far one probability distribution is from another:

\[D_{KL}(P\|Q) = \mathbb{E}_{x\sim P}\left[\log \frac{P(x)}{Q(x)}\right].\]

A few properties I need to keep front-of-mind:

  • Not symmetric: $D_{KL}(P\|Q) \neq D_{KL}(Q\|P)$.
  • Not a “distance” metric (no triangle inequality).
  • It is an expectation under $P$, meaning the direction strongly affects behavior.

The direction is not cosmetic: it changes optimization behavior (mode-covering vs mode-seeking).
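
To make the asymmetry concrete for myself, here is a tiny NumPy sketch with made-up discrete distributions (the `kl` helper and the toy numbers are mine, not from the paper):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# Two toy distributions over three outcomes.
p = [0.7, 0.2, 0.1]
q = [0.3, 0.4, 0.3]

print(kl(p, q))  # D_KL(P || Q)
print(kl(q, p))  # D_KL(Q || P) -- in general a different number
```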


2. Where My Confusion Came From: KL vs Cross-Entropy

I previously thought:

“KL term is basically cross-entropy.”

This is only conditionally true.

The key identity is:

\[D_{KL}(P\|Q) = H(P,Q) - H(P),\]

where

  • cross-entropy: $H(P,Q) = -\mathbb{E}_{x\sim P}[\log Q(x)]$
  • entropy: $H(P) = -\mathbb{E}_{x\sim P}[\log P(x)]$

So if $P$ is fixed, then $H(P)$ is constant, and minimizing $D_{KL}(P\|Q)$ is equivalent to minimizing cross-entropy $H(P,Q)$.

That is why in standard supervised classification with fixed labels, it feels like “cross-entropy = KL”.
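
A quick numerical check of that identity, reusing the same kind of toy vectors as above (my own sketch, nothing paper-specific):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, float)
    return float(-np.sum(p * np.log(p)))

def cross_entropy(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(-np.sum(p * np.log(q)))

p = [0.7, 0.2, 0.1]   # "target" distribution (e.g. labels)
q = [0.3, 0.4, 0.3]   # model distribution

kl_pq = cross_entropy(p, q) - entropy(p)   # D_KL(P||Q) = H(P,Q) - H(P)
print(kl_pq)
# If p is held fixed, entropy(p) is a constant, so minimizing cross_entropy(p, q)
# over q is the same as minimizing kl_pq -- the classification shortcut.
```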

The correction to my belief

The “KL term” is not intrinsically cross-entropy. Cross-entropy is what KL reduces to when the target distribution $P$ is fixed and we only optimize $Q$.

That “fixed target” assumption is exactly what breaks in many important ML objectives.


3. The VAE / Variational Inference Setting: KL is Not a “Label Loss”

Rezende & Mohamed frame variational inference as maximizing a lower bound on $\log p(x)$ (the evidence), because the true marginal likelihood is typically intractable. They write the ELBO as:

\[\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x)\|p(z)).\]

This is the core equation in the paper.

Here the KL is:

\[D_{KL}(q_\phi(z|x)\|p(z)).\]

This is the conceptual point I had wrong: this KL is not comparing “predictions vs labels.”
It is comparing:

  • $q_\phi(z|x)$: an approximate posterior produced by the inference network (encoder)
  • $p(z)$: the prior

So the KL term is a regularizer / information constraint: it prevents the inference network from encoding arbitrary information in $z$ and pushes the posterior toward the prior.
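
For the common parameterization where the encoder outputs a diagonal Gaussian $q_\phi(z|x)$ and the prior is $\mathcal{N}(0, I)$ (that setup is the usual VAE one, not specific to the flows paper), this KL has a closed form. A minimal NumPy sketch with made-up encoder outputs:

```python
import numpy as np

def kl_diag_gaussian_vs_standard_normal(mu, log_var):
    """D_KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    mu, log_var = np.asarray(mu, float), np.asarray(log_var, float)
    return float(0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var))

# Hypothetical encoder outputs for one data point x.
mu = np.array([0.5, -1.0])
log_var = np.array([0.0, -0.5])

print(kl_diag_gaussian_vs_standard_normal(mu, log_var))
# The value is 0 only when mu = 0 and log_var = 0, i.e. when q(z|x) equals the prior,
# which is exactly the "pull toward the prior" described above.
```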

Why my cross-entropy intuition fails here

In classification, $P$ is typically fixed (true labels).
In the ELBO, $q_\phi(z|x)$ is learned and changes during training.

So even though the identity $D_{KL} = H(P,Q) - H(P)$ always holds mathematically, the “$H(P)$ is constant” trick is not something I can rely on in intuition. The moving part $q_\phi$ is exactly what I’m optimizing.
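
A small sanity check of that point, with a made-up fixed prior and two candidate posteriors (toy numbers of my own): since the expectation in $D_{KL}(q\|p)$ is taken under $q$, both the cross-entropy term and the entropy term move as $q$ moves.

```python
import numpy as np

def H(p, q=None):
    """Entropy H(p) if q is None, else cross-entropy H(p, q)."""
    p = np.asarray(p, float)
    q = p if q is None else np.asarray(q, float)
    return float(-np.sum(p * np.log(q)))

prior = np.array([0.4, 0.3, 0.2, 0.1])   # fixed p(z)
for q in ([0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]):
    q = np.asarray(q, float)
    # D_KL(q || prior) = H(q, prior) - H(q): both terms change with q here,
    # so "drop the entropy term" is no longer a valid shortcut.
    print(H(q, prior), H(q), H(q, prior) - H(q))
```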


4. Another Important Subtlety: KL Direction Matters

The ELBO uses:

\[D_{KL}(q_\phi(z|x)\|p(z)).\]

This is the “reverse” direction relative to what I mentally associate with supervised learning (which usually resembles $D_{KL}(P\|Q)$ with fixed $P$).

This direction has practical consequences:

  • $D_{KL}(P\|Q)$ (“forward KL”) strongly penalizes $Q$ putting low probability mass where $P$ has mass → tends to be mode-covering.
  • $D_{KL}(Q\|P)$ (“reverse KL”) strongly penalizes $Q$ putting mass where $P$ has low mass → tends to be mode-seeking.

In variational inference, this asymmetry is part of why approximate posteriors can miss modes (the “mode-seeking” behavior is a known limitation). Normalizing flows in Rezende & Mohamed are partly motivated by making $q_\phi(z|x)$ flexible enough to reduce that approximation gap.
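
A small numerical illustration of the two behaviors, with a made-up bimodal target and two candidate approximations, everything discretized on a grid (this is my own toy construction, not from the paper):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def kl(p, q):
    """D_KL(p || q) for discrete distributions on the same grid."""
    return float(np.sum(p * np.log(p / q)))

def normalize(v):
    return v / v.sum()

# Discretize on a grid so the discrete KL is a good stand-in for the continuous one.
x = np.linspace(-6, 6, 2001)

# Bimodal "true" distribution P and two candidate approximations Q.
p        = normalize(0.5 * normal_pdf(x, -2, 0.5) + 0.5 * normal_pdf(x, 2, 0.5))
q_broad  = normalize(normal_pdf(x, 0, 2.0))   # spreads mass over both modes
q_narrow = normalize(normal_pdf(x, 2, 0.5))   # locks onto a single mode

print("forward KL(P||Q):", kl(p, q_broad), kl(p, q_narrow))  # smaller for q_broad
print("reverse KL(Q||P):", kl(q_broad, p), kl(q_narrow, p))  # smaller for q_narrow
```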

5. Why Normalizing Flows Make the KL Term Even More Central

Rezende & Mohamed propose constructing the approximate posterior by transforming a simple base distribution through a sequence of invertible mappings (“normalizing flow”):

  • start: $z_0 \sim q_0(z_0|x)$ (often Gaussian)
  • transform: $z_k = f_k(z_{k-1})$ for invertible $f_k$
  • end: $z_K$ has a more complex distribution $q_K(z_K|x)$

Because the transform is invertible, the density changes via change-of-variables:

\[\log q_K(z_K|x) = \log q_0(z_0|x) - \sum_{k=1}^{K}\log\left|\det \frac{\partial f_k}{\partial z_{k-1}}\right|.\]

This flexibility makes the approximate posterior richer, which tightens the variational bound. This is one of the main contributions of the paper.
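
To see the change-of-variables bookkeeping in code, here is a minimal NumPy sketch of a single planar-flow step, $f(z) = z + u\,h(w^\top z + b)$ with $h = \tanh$ (the planar form is the one from the paper; the parameter values and helper names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2  # latent dimensionality

# Parameters of one planar-flow step (arbitrary values; chosen so the step is invertible).
u = np.array([0.5, -0.3])
w = np.array([1.0, 0.8])
b = 0.2

def planar_step(z):
    """Apply one planar flow step and return (z_new, log|det Jacobian|)."""
    a = w @ z + b
    z_new = z + u * np.tanh(a)
    psi = (1.0 - np.tanh(a) ** 2) * w           # h'(a) * w
    log_det = np.log(np.abs(1.0 + u @ psi))     # det(I + u psi^T) = 1 + u^T psi
    return z_new, log_det

# Start from the base distribution q0 = N(0, I) and track the log-density.
z0 = rng.standard_normal(d)
log_q0 = -0.5 * (z0 @ z0) - 0.5 * d * np.log(2 * np.pi)

z1, log_det = planar_step(z0)
log_q1 = log_q0 - log_det   # change of variables: log q_K = log q_0 - sum of log|det|
print(z1, log_q1)
```

Stacking several such steps just accumulates the log-det terms, which is exactly the sum in the equation above.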

My updated intuition

The KL term is not an “annoying extra penalty.” It is the explicit mismatch measure between what my inference model can represent ($q_\phi$) and what the generative model assumes ($p$).
Normalizing flows are an upgrade to $q_\phi$ so that this mismatch can be reduced without sacrificing scalability.


6. When the “Target Distribution is Not Fixed” Happens in Practice

This directly answers my earlier confusion: when is $P$ not fixed?

In variational inference and flows, the “target” distribution in a KL can involve learnable distributions such as $q_\phi(z|x)$ or teacher/student distributions that change over time. Concrete examples:
  • Variational inference / VAE: $q_\phi(z|x)$ is learned.
  • Normalizing flows in VI: the whole posterior family is learned through transformations.
  • Online distillation / EMA teachers: teacher distributions evolve during training.
  • RL: policies and visitation distributions shift.

This is the world where “KL term = cross-entropy” stops being a reliable mental shortcut.


7. The Corrected Takeaway I Want to Keep

My previous belief:

  • “KL is basically cross-entropy.”

What I now believe:

  • KL is a distribution mismatch measure.
  • Cross-entropy is one special case view of KL when the target distribution is fixed.
  • In variational inference (including normalizing flows), the KL term is the core regularizer that shapes the posterior approximation, and its direction matters.

If I keep that in mind, the ELBO decomposition in Rezende & Mohamed stops looking like “reconstruction loss + random KL penalty” and starts looking like what it really is:

a likelihood-fitting term plus an explicit constraint on how much the posterior is allowed to deviate from the prior, with flows making that posterior expressive enough to be useful.
