Evaluate agents

December 22, 2025

Parts of an Agent You Need to Evaluate

Routers: Function choice and parameter extraction
- Did it call the right skills based on the scenario?
Skills: Can be evaluated with standard LLM evaluations, e.g. for each step of a retrieval skill:
- Embed input query
- Vector DB lookup
- LLM call with retrieved context
Path: The sequence of steps the agent takes; the most challenging part to evaluate at scale
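
A router check can be written as plain code. The sketch below is a minimal illustration, not a definitive implementation: it assumes each router decision can be reduced to a function name plus a dictionary of extracted parameters, and that an expected decision exists for each test scenario. `ToolCall`, `evaluate_router`, and the `lookup_sales` example are hypothetical names, not taken from any specific framework.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One router decision: which skill was called and with what arguments."""
    name: str
    arguments: dict = field(default_factory=dict)


def evaluate_router(expected: ToolCall, actual: ToolCall) -> dict:
    """Return discrete labels for function choice and parameter extraction."""
    function_ok = expected.name == actual.name
    # Parameters only count as correct if the right function was chosen.
    params_ok = function_ok and expected.arguments == actual.arguments
    return {
        "function_choice": "correct" if function_ok else "incorrect",
        "parameter_extraction": "correct" if params_ok else "incorrect",
    }


# Hypothetical scenario: right skill chosen, but the wrong month extracted.
expected = ToolCall("lookup_sales", {"city": "Berlin", "month": "2025-11"})
actual = ToolCall("lookup_sales", {"city": "Berlin", "month": "2025-10"})

result = evaluate_router(expected, actual)
print(result)
```

Scoring function choice and parameter extraction separately shows whether a failure comes from picking the wrong skill or from misreading the query.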

How to evaluate these components

LLM as a Judge: Using other LLMs to evaluate
- It will never be 100% correct
- Tuning your LLM judge prompt can help close this gap
- Always use discrete classification labels (e.g. correct vs. incorrect), not a numeric score such as 1-100% accuracy
- Follow from 13:45 (https://www.youtube.com/watch?v=LpbGpJhndQ0)
Code-based Evals: Using traditional code checks
Human Feedback: Using end-user or human labeler feedback
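
The points above about discrete labels and human feedback can be combined in a rough sketch: a judge prompt that asks for one of two labels, a parser that rejects anything else, and an agreement rate against human labels to see how far the judge is from the labelers. Everything here is an assumption for illustration: the prompt wording, the `judge` and `agreement` helpers, and especially `fake_llm`, which stands in for a real model call so the example runs offline.

```python
from typing import Callable

JUDGE_PROMPT = """You are evaluating an AI agent's answer.
Question: {question}
Answer: {answer}
Reply with exactly one word: correct or incorrect."""

ALLOWED_LABELS = ("correct", "incorrect")


def parse_label(raw: str) -> str:
    """Map the judge's raw text to a discrete label, or 'unparseable'."""
    cleaned = raw.strip().lower().strip(".!\"'")
    return cleaned if cleaned in ALLOWED_LABELS else "unparseable"


def judge(question: str, answer: str, call_llm: Callable[[str], str]) -> str:
    """Ask the judge model for a label; call_llm is any prompt -> text function."""
    prompt = JUDGE_PROMPT.format(question=question, answer=answer)
    return parse_label(call_llm(prompt))


def agreement(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of examples where the judge matches the human label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call (assumption): approves any answer
    # that mentions "Paris", which makes it wrong on the third example.
    return "Correct." if "Paris" in prompt else "incorrect"


examples = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of France?", "Lyon"),
    ("What is the capital of France?", "Not Paris, it is Lyon"),
]
human_labels = ["correct", "incorrect", "incorrect"]

judge_labels = [judge(q, a, fake_llm) for q, a in examples]
print(judge_labels, agreement(judge_labels, human_labels))
```

Tracking the agreement rate while editing the judge prompt gives a concrete way to tell whether a prompt change actually narrows the gap.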


