VAE, Practically — What I Got Wrong
While reading Rezende & Mohamed (2015), “Variational Inference with Normalizing Flows,” I noticed that my mental model of VAEs was slightly off. I was carrying an “engineering” intuition that worked for autoencoders, but it caused confusion the moment I tried to interpret VAEs as variational inference and relate them to forecasting problems like strawberry yield prediction.
This post is my corrected summary, written as a set of “mis-beliefs → corrections,” grounded in the variational inference framing emphasized in the paper.
1) My first mis-belief: “Decoder is the posterior”
What I believed
The decoder is the posterior.
What’s actually true
In VAE terminology:
- The posterior is $p_\theta(z \mid x)$.
- But it is usually intractable in deep generative models.
- So VAEs introduce a tractable approximation $q_\phi(z \mid x)$.
That means:
- Encoder $\approx$ approximate posterior (inference network): $q_\phi(z \mid x) \approx p_\theta(z \mid x)$
- Decoder $\approx$ likelihood / generative model: $p_\theta(x \mid z)$
So the decoder is not “the posterior.” The decoder is one of the ingredients used to define the posterior (via Bayes’ rule), but the direction is the opposite:
- Decoder: $z \rightarrow x$ (generate)
- Encoder: $x \rightarrow z$ (infer)
This mapping is consistent with the paper’s emphasis: variational inference replaces posterior inference with optimization over a variational family, where $q_\phi(z\mid x)$ is the variational distribution and is learned to match $p_\theta(z\mid x)$.
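To make “learned to match” concrete, here is the standard decomposition of the log evidence, written with the symbols above. I am assuming a prior $p(z)$ that does not depend on $x$; this is the textbook identity, not something specific to the flows paper:
\[\log p_\theta(x) = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)}_{\text{ELBO}} + \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big).\]
The last KL term is non-negative, so the ELBO is a lower bound on $\log p_\theta(x)$. For fixed $\theta$, raising the ELBO with respect to $\phi$ is the same as shrinking the gap between the encoder $q_\phi(z \mid x)$ and the true posterior $p_\theta(z \mid x)$.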
2) My second mis-belief: “Encoder compresses, decoder expands”
What I believed
The encoder compresses the input; the decoder expands it back. Therefore, a VAE is a fancy compress–decompress trick.
What’s actually true (and why this confusion happens)
This “compress / expand” idea comes from image VAEs, where:
- $z$ is low-dimensional,
- $x$ is a high-dimensional image,
- and the decoder visually looks like an upsampling network.
But for forecasting (e.g., predicting yield as a scalar), “expand” is not a meaningful concept:
- The target $y$ is often 1-dimensional.
- The decoder does not necessarily “expand”; it often maps $(x, z)$ to the parameters of a distribution over a scalar.
So the right mental model is:
The encoder and decoder are not defined by dimensionality changes.
They are defined by probabilistic roles in variational inference.
- Encoder: a recognition model for approximate posterior inference.
- Decoder: a generative model defining the likelihood.
3) What “VAE inference” really means (and why “generative inference” confused me)
My confusion
The phrase “generative inference” didn’t make sense to me, because inference should mean “estimate hidden state,” not “generate outputs.”
The correction
In probabilistic modeling, “inference” usually means:
compute or approximate the posterior over latent variables.
In a VAE, the true posterior is:
\[p_\theta(z \mid x) = \frac{p_\theta(x \mid z)\,p(z)}{p_\theta(x)}.\]
But VAEs do not compute this directly. Instead, they learn:
\[q_\phi(z \mid x) \approx p_\theta(z \mid x).\]
So:
- VAE inference = run the encoder to obtain $q_\phi(z \mid x)$ (or its parameters).
- Generation / sampling = sample $z \sim p(z)$ (or a conditional prior) and decode $x \sim p_\theta(x\mid z)$.
These are different operations. Inference is “backward reasoning.” Generation is “forward simulation.”
The paper’s message is basically:
we make inference scalable by turning it into an optimization problem, and then amortizing it with a neural network.
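To see the two directions side by side, here is a minimal sketch on a toy linear-Gaussian model, chosen only because its posterior has a closed form. The model, the constants `W` and `S`, and the function names are my own illustrative assumptions; a real VAE would replace `exact_posterior` with a learned encoder $q_\phi(z \mid x)$.

```python
import numpy as np

# Toy model (illustrative assumption, not from the paper):
#   z ~ N(0, 1),   x | z ~ N(W * z, S^2)
W, S = 2.0, 0.5

def generate(n, rng):
    """Forward simulation: sample z from the prior, then x from the likelihood."""
    z = rng.standard_normal(n)
    x = W * z + S * rng.standard_normal(n)
    return z, x

def exact_posterior(x):
    """Backward reasoning: mean and variance of p(z | x) for this toy model."""
    var = S**2 / (W**2 + S**2)
    mean = W * x / (W**2 + S**2)
    return mean, var

rng = np.random.default_rng(0)
z, x = generate(200_000, rng)            # generation: z -> x
post_mean, post_var = exact_posterior(x)  # inference:  x -> z

# Sanity check: the spread of z around the posterior mean should be
# close to the posterior variance.
resid_var = np.var(z - post_mean)
print(resid_var, post_var)
```

In a deep model the posterior has no such formula, which is exactly why the encoder is trained to approximate it.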
4) Strawberry yield forecasting: how I map “state vs observation” properly
Now I apply this to a practical problem:
- I observe temperature and fruit count.
- I want to forecast strawberry yield.
Observed variables (measurements)
Let:
- $T_{1:t}$ = temperature history up to time $t$
- $C_{1:t}$ = fruit count history up to time $t$
- $Y_t$ = yield (today, or at harvest)
I’ll group the observed covariates as $x$:
\[x = (T_{1:t}, C_{1:t}, \text{engineered features}).\]
Latent state (unobserved but important)
A forecasting model often benefits from a hidden state representing things I don’t measure well:
- plant vigor
- stress (heat / water / disease)
- phenological stage
- cultivar or management effects
- microclimate differences
Call that latent crop condition $Z_t$.
So, in Bayesian terms, what I actually want is:
\[p_\theta(Z_t \mid x),\]
i.e., “given observed data, what crop states are plausible?”
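That is exactly a posterior over the hidden state given the measurements. (In the conditional VAE of section 7, this same quantity is modeled directly as a conditional prior, while the encoder targets the posterior that also conditions on yield.)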
5) What the decoder becomes in this forecasting case
In a forecasting-oriented VAE (more precisely, a conditional VAE), the decoder is a probabilistic forecast model:
\[p_\theta(Y \mid x, z).\]
This is the “forward” story:
- if the crop state is $z$
- and covariates are $x$
- then yield $Y$ follows some distribution.
For example, the decoder might output $(\mu_\theta(x,z), \sigma_\theta(x,z))$ for a Gaussian yield distribution.
So yes: the decoder is the forecasting component.
But it is not “expanding” by default—it is defining a likelihood.
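A minimal sketch of such a decoder, assuming a plain linear map for both heads and a softplus to keep $\sigma$ positive. The parameter names, layer shapes, and input sizes are hypothetical; a real model would use a trained network.

```python
import numpy as np

def softplus(a):
    # Numerically stable log(1 + exp(a)); used to keep sigma positive.
    return np.logaddexp(0.0, a)

def decode(x, z, params):
    """Hypothetical linear decoder: (x, z) -> (mu, sigma) of a Gaussian over yield."""
    h = np.concatenate([x, z], axis=-1)
    mu = h @ params["w_mu"] + params["b_mu"]
    sigma = softplus(h @ params["w_sigma"] + params["b_sigma"]) + 1e-6
    return mu, sigma

def gaussian_log_likelihood(y, mu, sigma):
    """Elementwise log N(y; mu, sigma^2)."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))   # 4 plots, 3 covariates (made-up sizes)
z = rng.standard_normal((4, 2))   # 2 latent dimensions (made-up size)
params = {
    "w_mu": rng.standard_normal(5), "b_mu": 0.0,
    "w_sigma": rng.standard_normal(5), "b_sigma": 0.0,
}
mu, sigma = decode(x, z, params)
y = rng.standard_normal(4)        # stand-in for observed yields
log_lik = gaussian_log_likelihood(y, mu, sigma)
```

Note that the output is one `(mu, sigma)` pair per plot: the input has five numbers and the output has two, so nothing is being “expanded.”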
6) Then what is the encoder in this forecasting case?
During training, I have both covariates and yield, so I can infer what latent crop state best explains the outcome:
\[q_\phi(z \mid x, y).\]
Interpretation:
“Given sensors and realized yield, what hidden crop condition must have been present?”
That’s posterior inference (approximate), and it is the core meaning of “VAE inference.”
This is aligned with Rezende & Mohamed’s viewpoint:
the inference model $q_\phi$ is trained to approximate the true posterior while keeping optimization tractable.
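Two small building blocks usually sit behind an encoder like this during training: a reparameterized sample from $q_\phi$, and a closed-form KL divergence against whatever distribution over $z$ the model uses as its reference. The sketch below assumes diagonal Gaussians parameterized by a mean and a log-variance; that choice and the function names are my assumptions, not something the paper prescribes.

```python
import numpy as np

def sample_reparameterized(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), so z is a
    deterministic function of (mu, log_var) plus external noise."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over the last axis."""
    var_q, var_p = np.exp(log_var_q), np.exp(log_var_p)
    per_dim = 0.5 * (
        log_var_p - log_var_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )
    return per_dim.sum(axis=-1)

rng = np.random.default_rng(0)
# Pretend these came out of an encoder for one (x, y) pair (made-up numbers).
mu_q = np.array([[0.5, -1.0]])
log_var_q = np.array([[0.0, -2.0]])

z = sample_reparameterized(mu_q, log_var_q, rng)
# KL against a standard normal reference, used here only as an example.
kl = kl_diag_gaussians(mu_q, log_var_q, np.zeros((1, 2)), np.zeros((1, 2)))
```

The KL term is what penalizes the encoder for claiming more certainty about the crop state than the reference distribution supports.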
7) The forecasting-time detail I initially missed: I need a conditional prior
A key practical point:
At prediction time, I do not know $y$, so I cannot directly use $q_\phi(z \mid x, y)$.
To forecast, I need a distribution over latent states given only covariates:
\[p_\psi(z \mid x),\]
sometimes called a conditional prior.
Then forecasting is:
1) Sample latent crop states: $z^{(k)} \sim p_\psi(z \mid x)$
2) Decode yields: $y^{(k)} \sim p_\theta(y \mid x, z^{(k)})$
3) Aggregate samples to get a predictive distribution.
This is how the model produces:
- a mean forecast
- prediction intervals
- multi-modal outcomes (if relevant)
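The three steps above can be sketched in a few lines. Both `conditional_prior` and `decoder` are hand-written stand-ins with made-up coefficients, purely to show the sampling loop; in practice they would be trained networks.

```python
import numpy as np

def conditional_prior(x):
    """Hypothetical p_psi(z | x): a 1-D Gaussian whose mean depends on x."""
    return 0.5 * x.sum(), 1.0   # (mean, std)

def decoder(x, z):
    """Hypothetical p_theta(y | x, z): Gaussian over yield, one per z sample."""
    mu = 10.0 + x.sum() + 2.0 * z
    sigma = 0.5 * np.ones_like(z)
    return mu, sigma

def forecast(x, n_samples, rng):
    m, s = conditional_prior(x)
    z = m + s * rng.standard_normal(n_samples)        # 1) sample latent states
    mu, sigma = decoder(x, z)
    y = mu + sigma * rng.standard_normal(n_samples)   # 2) decode yields
    return y

rng = np.random.default_rng(0)
x = np.array([0.2, -0.1, 0.4])   # made-up covariates
samples = forecast(x, 50_000, rng)

# 3) aggregate samples into a predictive summary
mean_forecast = samples.mean()
lo, hi = np.quantile(samples, [0.05, 0.95])   # 90% prediction interval
print(mean_forecast, lo, hi)
```

Because the summary is computed from samples, the same code works unchanged if the predictive distribution turns out to be skewed or multi-modal.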
8) Comparing this to LightGBM (the baseline that keeps me honest)
LightGBM is typically:
\[\hat{y} = f_{\text{LGBM}}(x),\]
a direct mapping from engineered covariates to yield.
A VAE-style forecast is instead:
\[z \sim p_\psi(z \mid x), \quad y \sim p_\theta(y \mid x, z).\]
This difference matters when:
- there are hidden factors not captured by $x$,
- the same $x$ maps to multiple plausible yields,
- or uncertainty calibration is important (operations planning, labor scheduling, contracts).
If none of those apply, LightGBM is usually the better engineering choice.
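Written out, the VAE-style forecast is a marginal over the latent state, which in practice I would approximate with $K$ samples (the Monte Carlo form is the usual estimator, stated here as a sketch):
\[p(y \mid x) = \int p_\theta(y \mid x, z)\, p_\psi(z \mid x)\, dz \;\approx\; \frac{1}{K} \sum_{k=1}^{K} p_\theta\big(y \mid x, z^{(k)}\big), \qquad z^{(k)} \sim p_\psi(z \mid x).\]
With a Gaussian decoder, the right-hand side is a mixture of $K$ Gaussians, so it can be skewed or multi-modal even though each component is simple. The direct mapping $\hat{y} = f_{\text{LGBM}}(x)$ returns a single number for the same $x$.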
9) My corrected takeaways (the short list)
What I used to think
- Encoder compresses
- Decoder expands
- Decoder is the posterior
- “Generative inference” means predicting
What I now think
- Encoder is the approximate posterior: $q_\phi(z \mid \cdot)$
- Decoder is the likelihood / generative model: $p_\theta(\cdot \mid z)$
- In forecasting, decoder is best seen as a probabilistic forecaster, not an “expander”
- “VAE inference” means latent state inference via the encoder
- For forecasting, I often need a conditional prior $p_\psi(z \mid x)$ to sample latent states when $y$ is unknown
10) One sentence that finally fixed my mental model
A VAE-style forecaster for strawberry yield is:
a model that learns a latent crop-condition variable $z$ and uses variational inference (via an encoder) to approximate the posterior over $z$, while a decoder defines a probabilistic forecast $p_\theta(y \mid x, z)$ that can be sampled for uncertainty-aware predictions.
This framing made the Rezende & Mohamed (2015) motivation click:
variational inference is the mathematical reason VAEs exist, and “encoder/decoder” are just neural parameterizations of the variational posterior and the likelihood.