KL Divergence, Practically — What I Got Wrong
I ran into a recurring “KL term” while reading Rezende & Mohamed (2015), Variational Inference with Normalizing Flows, and realized my mental model was slightly off. I used to treat the “KL term” as basically “cross-entropy,” so I thought it was just another classification-like penalty. That belief is sometimes practically correct, but it is misleading in the exact context where KL shows up most prominently: variational inference and normalizing flows. This note is my attempt to resolve that confusion.
(Primary motivation: Rezende & Mohamed, 2015.)
1. What KL Divergence Actually Is
KL divergence measures how far one probability distribution is from another:
\[D_{KL}(P\|Q) = \mathbb{E}_{x\sim P}\left[\log \frac{P(x)}{Q(x)}\right].\]
A few properties I need to keep front-of-mind:
- Not symmetric: $D_{KL}(P\|Q) \neq D_{KL}(Q\|P)$.
- Not a “distance” metric (no triangle inequality).
- It is an expectation under $P$, meaning the direction strongly affects behavior.
The direction is not cosmetic: it changes optimization behavior (mode-covering vs mode-seeking).
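To make the asymmetry concrete, here is a tiny NumPy check (the distributions are my own toy example, not from the paper):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions: an expectation under p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * (np.log(p) - np.log(q)))

p = np.array([0.8, 0.1, 0.1])
q = np.array([0.4, 0.3, 0.3])

print(kl(p, q))  # ~0.335
print(kl(q, p))  # ~0.382 -- a different value: KL is not symmetric
```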
2. Where My Confusion Came From: KL vs Cross-Entropy
I previously thought:
“KL term is basically cross-entropy.”
This is only conditionally true.
The key identity is:
\[D_{KL}(P\|Q) = H(P,Q) - H(P),\]
where
- cross-entropy: $H(P,Q) = -\mathbb{E}_{x\sim P}[\log Q(x)]$
- entropy: $H(P) = -\mathbb{E}_{x\sim P}[\log P(x)]$
So if $P$ is fixed, then $H(P)$ is constant, and minimizing $D_{KL}(P\|Q)$ is equivalent to minimizing cross-entropy $H(P,Q)$.
That is why in standard supervised classification with fixed labels, it feels like “cross-entropy = KL”.
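A quick numeric sanity check of the identity (again my own toy example): with $P$ held fixed, the two objectives differ only by the constant $H(P)$.

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # fixed "target", e.g. label distribution
q = np.array([0.5, 0.3, 0.2])  # model distribution being optimized

cross_entropy = -np.sum(p * np.log(q))  # H(P, Q)
entropy = -np.sum(p * np.log(p))        # H(P)
kl = np.sum(p * np.log(p / q))          # D_KL(P || Q)

# The identity holds; with p fixed, H(P) is a constant, so the q that
# minimizes H(P, Q) is the same q that minimizes D_KL(P || Q).
assert np.isclose(kl, cross_entropy - entropy)
```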
The correction to my belief
The “KL term” is not intrinsically cross-entropy. Cross-entropy is what KL reduces to when the target distribution $P$ is fixed and we only optimize $Q$.
That “fixed target” assumption is exactly what breaks in many important ML objectives.
3. The VAE / Variational Inference Setting: KL is Not a “Label Loss”
Rezende & Mohamed frame variational inference as maximizing a lower bound on $\log p(x)$ (the evidence), because the true marginal likelihood is typically intractable. They write the ELBO as:
\[\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x)\|p(z)).\]
This is the core equation in the paper.
Here the KL is:
\[D_{KL}(q_\phi(z|x)\|p(z)).\]
This is the conceptual point I had wrong: this KL is not comparing “predictions vs labels.”
It is comparing:
- $q_\phi(z|x)$: an approximate posterior produced by the inference network (encoder)
- $p(z)$: the prior
So the KL term is a regularizer / information constraint: it prevents the inference network from encoding arbitrary information in $z$ and pushes the posterior toward the prior.
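To ground this: in the common VAE setup where $q_\phi(z|x)$ is a diagonal Gaussian and $p(z) = \mathcal{N}(0, I)$, this KL term has a closed form. A minimal PyTorch sketch (my illustration of the standard formula, not code from the paper):

```python
import torch

def kl_diag_gaussian_to_std_normal(mu, logvar):
    """D_KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims.

    mu, logvar: (batch, latent_dim) tensors produced by the encoder.
    """
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)

mu, logvar = torch.zeros(4, 8), torch.zeros(4, 8)
print(kl_diag_gaussian_to_std_normal(mu, logvar))  # all zeros: q equals the prior
```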
Why my cross-entropy intuition fails here
In classification, $P$ is typically fixed (true labels).
In the ELBO, $q_\phi(z|x)$ is learned and changes during training.
So even though the identity $D_{KL} = H(P,Q) - H(P)$ always holds mathematically, the “$H(P)$ is constant” shortcut is not available here: the moving part, $q_\phi$, is exactly what I’m optimizing.
4. Another Important Subtlety: KL Direction Matters
The ELBO uses:
\[D_{KL}(q_\phi(z|x)\|p(z)).\]
This is the “reverse” direction relative to what I mentally associate with supervised learning (which usually resembles $D_{KL}(P\|Q)$ with fixed $P$).
This direction has practical consequences:
- $D_{KL}(P\|Q)$ (“forward KL”) strongly penalizes $Q$ for putting low probability mass where $P$ has mass → tends to be mode-covering.
- $D_{KL}(Q\|P)$ (“reverse KL”) strongly penalizes $Q$ for putting mass where $P$ has low mass → tends to be mode-seeking.
In variational inference, this asymmetry is part of why approximate posteriors can miss modes (the “mode-seeking” behavior is a known limitation). Normalizing flows in Rezende & Mohamed are partly motivated by making $q_\phi(z|x)$ flexible enough to reduce that approximation gap.
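A toy illustration of that asymmetry (my own example, not from the paper): discretize a bimodal $P$ on a grid and compare a wide “covering” Gaussian against a narrow “seeking” one under both KL directions.

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-8, 8, 2001)
dx = x[1] - x[0]

def normalize(d):
    return d / (d.sum() * dx)

def kl(a, b):
    return np.sum(a * np.log(a / b)) * dx

p = normalize(0.5 * norm.pdf(x, -3, 0.7) + 0.5 * norm.pdf(x, 3, 0.7))
q_cover = normalize(norm.pdf(x, 0, 3.0))  # wide: covers both modes
q_seek = normalize(norm.pdf(x, 3, 0.7))   # narrow: locks onto one mode

print(kl(p, q_cover), kl(p, q_seek))  # forward KL prefers the covering q
print(kl(q_cover, p), kl(q_seek, p))  # reverse KL prefers the seeking q
```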
5. Why Normalizing Flows Make the KL Term Even More Central
Rezende & Mohamed propose constructing the approximate posterior by transforming a simple base distribution through a sequence of invertible mappings (“normalizing flow”):
- start: $z_0 \sim q_0(z_0|x)$ (often Gaussian)
- transform: $z_k = f_k(z_{k-1})$ for invertible $f_k$
- end: $z_K$ has a more complex distribution $q_K(z_K|x)$
Because the transform is invertible, the density changes via change-of-variables:
\[\log q_K(z_K|x) = \log q_0(z_0|x) - \sum_{k=1}^{K}\log\left|\det \frac{\partial f_k}{\partial z_{k-1}}\right|.\]
This flexibility makes the approximate posterior richer, which tightens the variational bound. This is one of the main contributions of the paper.
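As a concrete instance, the paper’s planar flow uses $f(z) = z + u\,h(w^\top z + b)$, whose log-det Jacobian is cheap to compute. A minimal PyTorch sketch (it omits the reparameterization of $u$ that the paper uses to guarantee invertibility):

```python
import torch

class PlanarFlow(torch.nn.Module):
    """One planar flow step: f(z) = z + u * tanh(w^T z + b).
    Minimal sketch; omits the constraint on u that guarantees invertibility."""

    def __init__(self, dim):
        super().__init__()
        self.u = torch.nn.Parameter(0.1 * torch.randn(dim))
        self.w = torch.nn.Parameter(0.1 * torch.randn(dim))
        self.b = torch.nn.Parameter(torch.zeros(1))

    def forward(self, z):
        # z: (batch, dim) -> f(z) and log|det df/dz| per sample
        a = z @ self.w + self.b                            # (batch,)
        f_z = z + self.u * torch.tanh(a)[:, None]          # (batch, dim)
        psi = (1 - torch.tanh(a) ** 2)[:, None] * self.w   # h'(a) * w
        log_det = torch.log(torch.abs(1 + psi @ self.u))   # (batch,)
        return f_z, log_det

# Stacking K such steps gives log q_K(z_K|x) = log q_0(z_0|x) - sum_k log_det_k.
```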
My updated intuition
The KL term is not an “annoying extra penalty.” It is the explicit mismatch measure between what my inference model can represent ($q_\phi$) and what the generative model assumes ($p$).
Normalizing flows are an upgrade to $q_\phi$ so that this mismatch can be reduced without sacrificing scalability.
6. When the “Target Distribution is Not Fixed” Happens in Practice
This directly answers my earlier confusion: when is $P$ not fixed?
In variational inference and flows, the “target” distribution in a KL can involve learnable distributions such as $q_\phi(z|x)$, or teacher/student distributions that change over time. Concrete examples:
- Variational inference / VAE: $q_\phi(z|x)$ is learned.
- Normalizing flows in VI: the whole posterior family is learned through transformations.
- Online distillation / EMA teachers: teacher distributions evolve during training.
- RL: policies and visitation distributions shift.
This is the world where “KL term = cross-entropy” stops being a reliable mental shortcut.
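A tiny autograd check of exactly this point (my own toy example): once the “target” $P$ is parameterized, cross-entropy and KL stop giving the same gradients, because $H(P)$ now moves too.

```python
import torch
import torch.nn.functional as F

logits_p = torch.tensor([1.0, 0.0, -1.0], requires_grad=True)  # learnable "target"
logits_q = torch.tensor([0.5, 0.5, 0.0])                       # held fixed here

p = F.softmax(logits_p, dim=0)
q = F.softmax(logits_q, dim=0)

cross_entropy = -(p * q.log()).sum()    # H(P, Q)
kl = (p * (p.log() - q.log())).sum()    # D_KL(P || Q) = H(P,Q) - H(P)

g_ce, = torch.autograd.grad(cross_entropy, logits_p, retain_graph=True)
g_kl, = torch.autograd.grad(kl, logits_p)
print(g_ce)  # the two gradients differ...
print(g_kl)  # ...by the gradient of H(P), which is no longer a constant
```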
7. The Corrected Takeaway I Want to Keep
My previous belief:
- “KL is basically cross-entropy.”
What I now believe:
- KL is a distribution mismatch measure.
- Cross-entropy is a special-case view of KL: when the target distribution is fixed, the two differ only by the constant $H(P)$.
- In variational inference (including normalizing flows), the KL term is the core regularizer that shapes the posterior approximation, and its direction matters.
If I keep that in mind, the ELBO decomposition in Rezende & Mohamed stops looking like “reconstruction loss + random KL penalty” and starts looking like what it really is:
a likelihood-fitting term plus an explicit constraint on how much the posterior is allowed to deviate from the prior, with flows making that posterior expressive enough to be useful.