KL Divergence, Practically — What I Got Wrong
I ran into a recurring “KL term” while reading Rezende & Mohamed (2015), Variational Inference with Normalizing Flows, and realized my mental model was slightly off. I used to treat the “KL term” as basically “cross-entropy,” so I thought it was just another classification-like penalty. That belief is sometimes practically correct, but it is misleading in the exact context where KL shows up most prominently: variational inference and normalizing flows. This note is my attempt to resolve that confusion.
(Primary motivation: Rezende & Mohamed, 2015.)
1. What KL Divergence Actually Is
KL divergence measures how far one probability distribution is from another:
\[D_{KL}(P\|Q) = \mathbb{E}_{x\sim P}\left[\log \frac{P(x)}{Q(x)}\right].\]
A few properties I need to keep front-of-mind:
- Not symmetric: $D_{KL}(P\|Q) \neq D_{KL}(Q\|P)$.
- Not a “distance” metric (no triangle inequality).
- It is an expectation under $P$, meaning the direction strongly affects behavior.
The direction is not cosmetic: it changes optimization behavior (mode-covering vs mode-seeking).
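To make the asymmetry concrete, here is a tiny NumPy check (the distributions are my own toy example, not from the paper):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions: an expectation under p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * (np.log(p) - np.log(q)))

p = np.array([0.8, 0.1, 0.1])
q = np.array([0.4, 0.3, 0.3])

print(kl(p, q))  # ~0.335
print(kl(q, p))  # ~0.382 -- a different value: KL is not symmetric
```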
2. Where My Confusion Came From: KL vs Cross-Entropy
I previously thought:
“KL term is basically cross-entropy.”
This is only conditionally true.
The key identity is:
\[D_{KL}(P\|Q) = H(P,Q) - H(P),\]
where
- cross-entropy: $H(P,Q) = -\mathbb{E}_{x\sim P}[\log Q(x)]$
- entropy: $H(P) = -\mathbb{E}_{x\sim P}[\log P(x)]$
So if $P$ is fixed, then $H(P)$ is constant, and minimizing $D_{KL}(P\|Q)$ is equivalent to minimizing cross-entropy $H(P,Q)$.
That is why in standard supervised classification with fixed labels, it feels like “cross-entropy = KL”.
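A quick numeric sanity check of the identity (again my own toy example): with $P$ held fixed, the two objectives differ only by the constant $H(P)$.

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # fixed "target", e.g. label distribution
q = np.array([0.5, 0.3, 0.2])  # model distribution being optimized

cross_entropy = -np.sum(p * np.log(q))  # H(P, Q)
entropy = -np.sum(p * np.log(p))        # H(P)
kl = np.sum(p * np.log(p / q))          # D_KL(P || Q)

# The identity holds; with p fixed, H(P) is a constant, so the q that
# minimizes H(P, Q) is the same q that minimizes D_KL(P || Q).
assert np.isclose(kl, cross_entropy - entropy)
```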
The correction to my belief
The “KL term” is not intrinsically cross-entropy. Cross-entropy is what KL reduces to when the target distribution $P$ is fixed and we only optimize $Q$.
That “fixed target” assumption is exactly what breaks in many important ML objectives.
3. The VAE / Variational Inference Setting: KL is Not a “Label Loss”
Rezende & Mohamed frame variational inference as maximizing a lower bound on $\log p(x)$ (the evidence), because the true marginal likelihood is typically intractable. They write the ELBO as:
\[\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x)\|p(z)).\]
This is the core equation in the paper.
Here the KL is:
\[D_{KL}(q_\phi(z|x)\|p(z)).\]
This is the conceptual point I had wrong: this KL is not comparing “predictions vs labels.”
It is comparing:
- $q_\phi(z|x)$: an approximate posterior produced by the inference network (encoder)
- $p(z)$: the prior
So the KL term is a regularizer / information constraint: it prevents the inference network from encoding arbitrary information in $z$ and pushes the posterior toward the prior.
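To ground this: in the common VAE setup where $q_\phi(z|x)$ is a diagonal Gaussian and $p(z) = \mathcal{N}(0, I)$, this KL term has a closed form. A minimal PyTorch sketch (my illustration of the standard formula, not code from the paper):

```python
import torch

def kl_diag_gaussian_to_std_normal(mu, logvar):
    """D_KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims.

    mu, logvar: (batch, latent_dim) tensors produced by the encoder.
    """
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)

mu, logvar = torch.zeros(4, 8), torch.zeros(4, 8)
print(kl_diag_gaussian_to_std_normal(mu, logvar))  # all zeros: q equals the prior
```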
Why my cross-entropy intuition fails here
In classification, $P$ is typically fixed (true labels).
In the ELBO, $q_\phi(z|x)$ is learned and changes during training.
So even though the identity $D_{KL} = H(P,Q) - H(P)$ always holds mathematically, the “$H(P)$ is constant” shortcut is not available here: the moving part, $q_\phi$, is exactly what I’m optimizing.
4. Another Important Subtlety: KL Direction Matters
The ELBO uses:
\[D_{KL}(q_\phi(z|x)\|p(z)).\]
This is the “reverse” direction relative to what I mentally associate with supervised learning (which usually resembles $D_{KL}(P\|Q)$ with fixed $P$).
This direction has practical consequences:
- $D_{KL}(P\|Q)$ (“forward KL”) strongly penalizes $Q$ for putting low probability mass where $P$ has mass → tends to be mode-covering.
- $D_{KL}(Q\|P)$ (“reverse KL”) strongly penalizes $Q$ for putting mass where $P$ has low mass → tends to be mode-seeking.
In variational inference, this asymmetry is part of why approximate posteriors can miss modes (the “mode-seeking” behavior is a known limitation). Normalizing flows in Rezende & Mohamed are partly motivated by making $q_\phi(z|x)$ flexible enough to reduce that approximation gap.
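A toy illustration of that asymmetry (my own example, not from the paper): discretize a bimodal $P$ on a grid and compare a wide “covering” Gaussian against a narrow “seeking” one under both KL directions.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-8, 8, 2001)
dx = x[1] - x[0]

def normalize(d):
    return d / (d.sum() * dx)

def kl(a, b):
    return np.sum(a * np.log(a / b)) * dx

p = normalize(0.5 * norm.pdf(x, -3, 0.7) + 0.5 * norm.pdf(x, 3, 0.7))
q_cover = normalize(norm.pdf(x, 0, 3.0))  # wide: covers both modes
q_seek = normalize(norm.pdf(x, 3, 0.7))   # narrow: locks onto one mode

print(kl(p, q_cover), kl(p, q_seek))  # forward KL prefers the covering q
print(kl(q_cover, p), kl(q_seek, p))  # reverse KL prefers the seeking q
```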
5. Why Normalizing Flows Make the KL Term Even More Central
Rezende & Mohamed propose constructing the approximate posterior by transforming a simple base distribution through a sequence of invertible mappings (“normalizing flow”):
- start: $z_0 \sim q_0(z_0|x)$ (often Gaussian)
- transform: $z_k = f_k(z_{k-1})$ for invertible $f_k$
- end: $z_K$ has a more complex distribution $q_K(z_K|x)$
Because the transform is invertible, the density changes via change-of-variables:
\[\log q_K(z_K|x) = \log q_0(z_0|x) - \sum_{k=1}^{K}\log\left|\det \frac{\partial f_k}{\partial z_{k-1}}\right|.\]
This flexibility makes the approximate posterior richer, which tightens the variational bound. This is one of the main contributions of the paper.
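As a concrete instance, the paper’s planar flow uses $f(z) = z + u\,h(w^\top z + b)$, whose log-det Jacobian is cheap to compute. A minimal PyTorch sketch (it omits the reparameterization of $u$ that the paper uses to guarantee invertibility):

```python
import torch

class PlanarFlow(torch.nn.Module):
    """One planar flow step: f(z) = z + u * tanh(w^T z + b).
    Minimal sketch; omits the constraint on u that guarantees invertibility."""

    def __init__(self, dim):
        super().__init__()
        self.u = torch.nn.Parameter(0.1 * torch.randn(dim))
        self.w = torch.nn.Parameter(0.1 * torch.randn(dim))
        self.b = torch.nn.Parameter(torch.zeros(1))

    def forward(self, z):
        # z: (batch, dim) -> f(z) and log|det df/dz| per sample
        a = z @ self.w + self.b                            # (batch,)
        f_z = z + self.u * torch.tanh(a)[:, None]          # (batch, dim)
        psi = (1 - torch.tanh(a) ** 2)[:, None] * self.w   # h'(a) * w
        log_det = torch.log(torch.abs(1 + psi @ self.u))   # (batch,)
        return f_z, log_det

# Stacking K such steps gives log q_K(z_K|x) = log q_0(z_0|x) - sum_k log_det_k.
```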
My updated intuition
The KL term is not an “annoying extra penalty.” It is the explicit mismatch measure between what my inference model can represent ($q_\phi$) and what the generative model assumes ($p$).
Normalizing flows are an upgrade to $q_\phi$ so that this mismatch can be reduced without sacrificing scalability.
6. When the “Target Distribution is Not Fixed” Happens in Practice
This directly answers my earlier confusion: when is $P$ not fixed?
In variational inference and flows, the “target” distribution in a KL can involve learnable distributions such as $q_\phi(z|x)$, or teacher/student distributions that change over time. Concrete examples:
- Variational inference / VAE: $q_\phi(z|x)$ is learned.
- Normalizing flows in VI: the whole posterior family is learned through transformations.
- Online distillation / EMA teachers: teacher distributions evolve during training.
- RL: policies and visitation distributions shift.
This is the world where “KL term = cross-entropy” stops being a reliable mental shortcut.
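A tiny autograd check of exactly this point (my own toy example): once the “target” $P$ is parameterized, cross-entropy and KL stop giving the same gradients, because $H(P)$ now moves too.

```python
import torch
import torch.nn.functional as F

logits_p = torch.tensor([1.0, 0.0, -1.0], requires_grad=True)  # learnable "target"
logits_q = torch.tensor([0.5, 0.5, 0.0])                       # held fixed here

p = F.softmax(logits_p, dim=0)
q = F.softmax(logits_q, dim=0)

cross_entropy = -(p * q.log()).sum()    # H(P, Q)
kl = (p * (p.log() - q.log())).sum()    # D_KL(P || Q) = H(P,Q) - H(P)

g_ce, = torch.autograd.grad(cross_entropy, logits_p, retain_graph=True)
g_kl, = torch.autograd.grad(kl, logits_p)
print(g_ce)  # the two gradients differ...
print(g_kl)  # ...by the gradient of H(P), which is no longer a constant
```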
7. The Corrected Takeaway I Want to Keep
My previous belief:
- “KL is basically cross-entropy.”
What I now believe:
- KL is a distribution mismatch measure.
- Cross-entropy is a special-case view of KL: when the target distribution is fixed, the two differ only by the constant $H(P)$.
- In variational inference (including normalizing flows), the KL term is the core regularizer that shapes the posterior approximation, and its direction matters.
If I keep that in mind, the ELBO decomposition in Rezende & Mohamed stops looking like “reconstruction loss + random KL penalty” and starts looking like what it really is:
a likelihood-fitting term plus an explicit constraint on how much the posterior is allowed to deviate from the prior, with flows making that posterior expressive enough to be useful.