Parts of an Agent You Need to Evaluate

  • Routers: Function choice and parameter extraction
    • Did it call the right skills based on the scenario?
  • Skills: Can use standard LLM evaluations
    • Embed input query
    • Vector DB lookup
    • LLM call with retrieved context
  • Path: The most challenging to evaluate at scale
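
The router check above (right function, right parameters) can be sketched as a simple comparison against a labeled expectation. This is a minimal illustration, not a standard API: `ToolCall`, `evaluate_router`, and the `lookup_order` example are hypothetical names invented here, and exact-match comparison of arguments is an assumption that may be too strict for free-text parameters.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ToolCall:
    """Hypothetical record of one function call chosen by the router."""
    name: str
    arguments: Dict[str, Any] = field(default_factory=dict)


def evaluate_router(expected: ToolCall, actual: ToolCall) -> Dict[str, bool]:
    """Score function choice and parameter extraction separately."""
    function_ok = expected.name == actual.name
    # Parameters only count as correct when the right function was chosen.
    params_ok = function_ok and expected.arguments == actual.arguments
    return {"function_correct": function_ok, "parameters_correct": params_ok}


# Example: the router picked the right skill but extracted the wrong order ID.
expected = ToolCall("lookup_order", {"order_id": "A123"})
actual = ToolCall("lookup_order", {"order_id": "A124"})
result = evaluate_router(expected, actual)
print(result)  # {'function_correct': True, 'parameters_correct': False}
```

Reporting the two results separately makes it easier to tell whether a failure comes from function choice or from parameter extraction.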

How to evaluate these components

  • LLM as a Judge: Using other LLMs to evaluate
    • It will never be 100% correct
    • Tuning your LLM judge prompt can help close this gap
    • Always use discrete classification labels (e.g. correct vs. incorrect), not a numeric score such as 1-100% accuracy
    • Follow from 13:45 (https://www.youtube.com/watch?v=LpbGpJhndQ0)
  • Code-based Evals: Using traditional code checks
  • Human Feedback: Using end-user or human labeler feedback
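
The discrete-label advice for LLM judges can be sketched as follows. This is a rough illustration under stated assumptions: `JUDGE_PROMPT`, `judge`, and `fake_llm` are hypothetical names, the prompt wording is invented, and `call_llm` stands in for whatever model client you actually use (no real provider API is shown).

```python
from typing import Callable

# The judge may only answer with one of these discrete labels.
ALLOWED_LABELS = ("correct", "incorrect")

# Hypothetical judge prompt; tuning this wording is where most of the work goes.
JUDGE_PROMPT = """You are evaluating an answer to a question.
Question: {question}
Answer: {answer}
Respond with exactly one word: correct or incorrect."""


def judge(question: str, answer: str, call_llm: Callable[[str], str]) -> str:
    """Ask the judge model for a label and reject anything off-list."""
    prompt = JUDGE_PROMPT.format(question=question, answer=answer)
    raw = call_llm(prompt).strip().lower().rstrip(".")
    if raw not in ALLOWED_LABELS:
        # Surface malformed judge output instead of guessing a label.
        return "unparseable"
    return raw


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "Correct."


label = judge("What is 2 + 2?", "4", fake_llm)
print(label)  # correct
```

Because the judge itself is never 100% correct, one option is to compare its labels against a small human-labeled set and count the "unparseable" responses while tuning the prompt.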