A to Z of LLMs evals for Product Managers
AI product evaluation is a core skill for PMs at Anthropic, says Mike Krieger, Chief Product Officer
I see PMs trying to ship GenAI features based on intuition and demos alone. That works fine for pitch decks — not for production.
The moment you’re building any AI-powered feature, whether it’s Smart Reply in Gmail, a Copilot assistant in the Microsoft product suite, or a fraud detection engine at Stripe, you’ll bump into this invisible wall:
“How do we know if it’s working?”
Welcome to the world of evals.
This is your definitive guide — the A to Z for Product Managers.
1. 🤷 What Even Are “Evals”?
If you're a PM working on AI products, think of evals as your new product QA + UX testing + data science validation — rolled into one.
Evaluations (aka Evals) are structured tests that measure how well your AI model is performing across dimensions like:
Correctness (Was it factually right?)
Relevance (Did it address the user’s intent?)
Helpfulness (Did it actually help the user get a task done?)
Safety (Did it avoid toxic or biased outputs?)
UX-fit (Did the tone, length, and structure make sense?)
Why are evals needed?
Because AI models ≠ deterministic systems → Same input ≠ same output every time
Same prompt, different results.
Sometimes they shine.
Sometimes they hallucinate.
Sometimes they surprise you in ways you didn’t expect.
Without evals, you're shipping into the dark.
2. 🛠 Why Evals Matter for Product Managers
As PMs, we own the success metrics for what we ship. But with AI:
Unit tests aren’t enough
QA isn’t sufficient
A single demo can’t reflect real-world usage
Traditional QA (e.g., test cases, assertions) breaks down when outputs are open-ended (text, images, recommendations)
So what should you do?
You need an eval framework.
One that balances:
data science metrics (e.g., BLEU, grounding score)
business outcomes (e.g., CTR, conversions) and
human judgment (e.g., usefulness, tone)
Good PMs build features.
Great PMs define what success looks like.
10x PMs design eval frameworks before the feature is even built.
🔍 Your Role as PM:
You are not just shipping features—you’re shipping probabilistic behavior. Your job is to define what "good enough" means when there’s no ground truth.
📉 Without evals:
You can’t accept/reject models
You can’t measure launch success
You’ll ship silent failures (bias, hallucinations, broken UX)
📦 Let's Start With a Real-World Example: Gmail Smart Reply
You’ve seen those tiny one-tap replies in Gmail, right?
“Sounds good.”
“Let’s do it.”
“Thanks!”
Simple UI. Magical behind the scenes.
Now imagine you're the PM building this feature.
How do you know your model is “good”? What does that even mean?
Here’s what the Gmail team did:
✅ Offline Evals
Train the model on millions of past emails. Then ask:
Does the model predict replies that match real human responses?
What’s the top-1 and top-3 match rate?
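To make “top-1 and top-3 match rate” concrete, here’s a minimal Python sketch. The data is made up for illustration, not Gmail’s actual eval set:

```python
# Minimal sketch: computing top-1 / top-3 match rate for suggested replies.
# The examples below are illustrative placeholders, not real eval data.

eval_set = [
    # (model's ranked suggestions, the reply the human actually sent)
    (["Sounds good.", "Thanks!", "Let's do it."], "Sounds good."),
    (["Thanks!", "Will do.", "Sounds good."], "Sounds good."),
    (["See you then.", "Thanks!", "Got it."], "On my way."),
]

def top_k_match_rate(examples, k):
    """Fraction of examples where the human reply appears in the model's top-k suggestions."""
    hits = sum(1 for suggestions, actual in examples if actual in suggestions[:k])
    return hits / len(examples)

print(f"top-1 match rate: {top_k_match_rate(eval_set, 1):.2f}")  # 0.33
print(f"top-3 match rate: {top_k_match_rate(eval_set, 3):.2f}")  # 0.67
```

In practice the match check would be fuzzier than exact string equality, but the shape of the metric is the same.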
👩‍⚖️ Human Evals
Show sample suggestions to real users or raters:
Is this reply helpful?
Is the tone appropriate?
Would you use this in real life?
📈 Online Evals
After launch, track:
CTR on smart replies
Frequency of usage per session
Impact on email reply time
PM Insight: It’s not just about being correct. It’s about being useful, fast, and trusted.
3. 🧪 The 3 Core Types of Evals
Let’s break them down.
a. Offline Evals
Run your model against a benchmark dataset before launching.
Think of it as automated unit testing for AI.
You check how often the model gets the “right” answer.
🛠 Example:
Your support assistant is summarizing tickets. You compare model summaries to human summaries using metrics like:
ROUGE → Measures how much overlap there is between the model’s summary and the human reference summary (think: shared words and phrases).
Use it to check: Is the model covering the key points humans typically mention?
BLEU → Measures how closely the output’s phrasing matches the human reference, based on overlapping n-grams (often read as a rough proxy for fluency and structure).
Use it to check: Does the summary follow the expected phrasing and read cleanly?
Cosine Similarity → Measures how semantically close the model’s summary is to the human-written one, even if the wording is different.
Use it to check: Does the model convey the same meaning, even if it uses different words?
As a PM, you don’t need to calculate these — but you should know what they tell you, and where they fall short. For instance, a summary can score high on ROUGE but still be confusing or redundant — which is why you pair offline evals with human judgment.
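If you’re curious what these metrics look like under the hood, here’s a minimal sketch in plain Python with toy summaries. Real pipelines use libraries and embedding models, but the idea is the same:

```python
# Minimal sketch of two offline metrics, computed by hand on toy data.
from collections import Counter
import math

reference = "Customer reports login failures after the password reset on Monday"
candidate = "User cannot log in after resetting their password on Monday"

def tokens(text):
    return text.lower().split()

def rouge1_recall(ref, cand):
    """Share of reference words that also appear in the candidate (a ROUGE-1-style recall)."""
    ref_counts, cand_counts = Counter(tokens(ref)), Counter(tokens(cand))
    overlap = sum(min(count, cand_counts[word]) for word, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

def cosine_similarity(ref, cand):
    """Cosine similarity between bag-of-words vectors (real setups use embeddings instead)."""
    a, b = Counter(tokens(ref)), Counter(tokens(cand))
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm

print(f"ROUGE-1 recall: {rouge1_recall(reference, candidate):.2f}")      # 0.40
print(f"cosine similarity: {cosine_similarity(reference, candidate):.2f}")  # 0.40
```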
b. Human Evals
Real humans score model outputs based on qualitative dimensions:
Was this helpful?
Did it sound robotic?
Did it hallucinate?
Was it biased?
You can use:
Internal team members (PMs, designers, QA)
External vendors (Scale, Surge, Toloka)
Crowdsourced raters
You might create a human eval rubric that looks something like this — no need for complex tooling, just a structured way to assess quality across dimensions.
Here’s how you’d break it down:
Relevance – 4/5
The response was mostly aligned with the intent of the prompt. It covered the key topic, though it could have gone a bit deeper in one area.
Factual Accuracy – 3/5
There was one claim that didn’t seem quite right or lacked supporting evidence. It wasn’t outright false, but it raised doubts.
Tone – 5/5
The response sounded clear, polite, and in line with the brand voice. No robotic phrasing or awkwardness.
Usefulness – 4/5
The output would actually help a user get their job done. There’s some room for improvement, but it’s solid and usable.
You can adapt these categories and scoring scales based on your use case. What matters is that you’re evaluating consistently and across multiple dimensions — not just “does it look good.”
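A spreadsheet is usually enough for this, but if you want to track rubric scores over time, even a tiny script works. Here’s a minimal sketch; the dimension names and scores are illustrative:

```python
# Minimal sketch: a human eval rubric captured as structured data.
from statistics import mean

ratings = [
    # one dict per (output, rater) pair, scored 1-5 on each rubric dimension
    {"relevance": 4, "factual_accuracy": 3, "tone": 5, "usefulness": 4},
    {"relevance": 5, "factual_accuracy": 4, "tone": 4, "usefulness": 4},
    {"relevance": 3, "factual_accuracy": 3, "tone": 5, "usefulness": 3},
]

def average_by_dimension(rows):
    """Average score per rubric dimension across all rated outputs."""
    return {dim: round(mean(row[dim] for row in rows), 2) for dim in rows[0]}

print(average_by_dimension(ratings))
# {'relevance': 4.0, 'factual_accuracy': 3.33, 'tone': 4.67, 'usefulness': 3.67}
```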
PM Insight: Human evals take time. But they catch what metrics miss: nuance, tone, and weird edge cases.
c. Online Evals (In-Production)
Once the model is live, you track behavioral signals from real users.
Key online eval metrics:
CTR (click-through rate)
Conversion rate
Undo/Delete/Exit events
Time to complete task
Feedback buttons ("👍👎")
🧠 Real gold: Connect online evals to business outcomes.
Example: Your AI Copilot helps users create dashboards. If dashboards created per user goes up post-AI, that's your online eval win.
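As a deliberately simplified example, here’s how a CTR-style online metric could be computed from raw event logs. The event names and the log itself are hypothetical:

```python
# Minimal sketch: turning raw event logs into an online eval metric.

events = [
    {"user": "u1", "event": "smart_reply_shown"},
    {"user": "u1", "event": "smart_reply_clicked"},
    {"user": "u2", "event": "smart_reply_shown"},
    {"user": "u2", "event": "smart_reply_shown"},
    {"user": "u2", "event": "smart_reply_clicked"},
    {"user": "u3", "event": "smart_reply_shown"},
]

def click_through_rate(log):
    """Clicks divided by impressions for the suggested replies."""
    shown = sum(1 for e in log if e["event"] == "smart_reply_shown")
    clicked = sum(1 for e in log if e["event"] == "smart_reply_clicked")
    return clicked / shown if shown else 0.0

print(f"CTR: {click_through_rate(events):.2f}")  # 2 clicks / 4 impressions = 0.50
```

The interesting part is the join back to business outcomes: segment that CTR by use case and cohort, and compare it against task completion or reply time.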
4. 🧭 Anatomy of a Good Eval Framework
A proper eval setup should answer:
✅ What are the goals of this feature?
✅ What does a “good” output look like?
✅ How will we measure offline performance?
✅ What’s our human eval rubric?
✅ What are the success metrics in production?
✅ What is our threshold for launch?
✅ How do we detect regressions?
PM Tip: Treat your eval suite like test cases. Re-run it on every new model version.
5. 📉 Common Failure Modes You Should Evaluate For
You don’t need to be a data scientist to catch when your AI is going off-track. What you need is clarity on what to look for, and a set of test prompts to catch them early. Here are the most common failure patterns — and what they actually look like in real life:
🔮 Hallucination
The model invents facts that were never true or never present in the context.
Example: It says “Einstein was born in Canada” when he was actually born in Germany.
These can sneak into product experiences without you realizing — and erode user trust.
😬 Toxicity
The model generates content that is offensive, discriminatory, or inappropriate.
This could be subtle — like biased jokes or tone-deaf phrasing — or outright harmful.
Even one slip-up here can be a reputational risk for your brand.
🔁 Repetition
Sometimes the model gets stuck in a loop or repeats phrases unnecessarily.
Example: “The issue is with the issue due to the issue.”
This usually signals poor prompt engineering or decoding strategies.
⚖️ Bias
The model gives different responses based on gender, race, or other demographic inputs — even when the query is the same.
Example: Asking “Tell me about a good leader” and getting male examples consistently.
🌀 Incoherence
The output may sound grammatically correct but lacks logical flow.
Example: “I agree with the disagreement of agreement.”
It might look okay at first glance, but it breaks down when read closely.
🎯 Overconfidence
The model sounds extremely certain — even when it’s wrong.
Example: Giving a clear “yes” or “no” answer to something subjective or unclear.
This is especially dangerous in domains like legal, health, or finance.
As a PM, you should create dedicated prompt sets to evaluate for each of these. Build a “hallucination test suite,” a “toxicity probe set,” and so on. Run these before every major model update — just like regression tests in traditional software.
Trust me: catching one of these early can save your team weeks of debugging and your users a lot of frustration.
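Here’s a minimal sketch of what one of those probe sets could look like in code. Everything in it is an assumption: the prompts, the expected facts, and the call_model() placeholder you’d wire up to your own model:

```python
# Minimal sketch: a hand-built probe set for one failure mode (hallucination),
# checked against known facts. Prompts, facts, and call_model() are hypothetical.

hallucination_probes = [
    {"prompt": "Where was Albert Einstein born?", "must_contain": "Germany"},
    {"prompt": "What year was the first iPhone released?", "must_contain": "2007"},
]

def call_model(prompt: str) -> str:
    """Placeholder for your actual model call (API client, internal endpoint, etc.)."""
    raise NotImplementedError

def run_probes(probes):
    """Return the probes whose outputs are missing the expected fact."""
    failures = []
    for probe in probes:
        output = call_model(probe["prompt"])
        if probe["must_contain"].lower() not in output.lower():
            failures.append({"prompt": probe["prompt"], "output": output})
    return failures

# Run before every major model update, like a regression test:
# failures = run_probes(hallucination_probes)
# assert not failures, f"Hallucination probes failed: {failures}"
```

A toxicity or bias probe set has the same shape; only the prompts and the check change.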
6. 🧰 Tools of the Trade
Here’s a short list of tools to help you design and run evals:
OpenAI Evals – framework to test GPT models
Ragas – evaluation framework for RAG applications
LangSmith – logs and traces for LangChain apps
TruLens – monitor and evaluate LLM-based applications
Promptfoo – compare prompt outputs across models
Scale / Surge / Toloka – human feedback services
7. 📖 Case Study: Microsoft Copilot
When Microsoft shipped Copilot (for Word, Excel, PowerPoint), they didn’t just run “accuracy” tests. They evaluated:
Grounding: Are responses backed by data in the document?
Actionability: Did users complete the task faster?
Safety: Are toxic/junk suggestions filtered out?
Fluency: Is it readable and natural?
They built internal dashboards for evals, tied to every model update. PMs, not just researchers, used those dashboards to make launch decisions.
PM Takeaway: AI features don’t just fail because they’re “wrong.” They fail because they’re useless, creepy, slow, or untrustworthy. Evals catch that.
8. 🔁 Regression Testing in AI Is a Different Beast
In traditional software, you write unit tests. If something breaks, the test fails.
With AI, it's trickier. A small model change may improve one type of response, but worsen another.
That’s why you need regression eval suites:
Create 100–200 “golden” test prompts across core use cases
Evaluate every model version on the same set
Track score changes over time
Don’t ship unless the new version matches or beats the last one (see the sketch below)
PM Insight: Your eval suite is your safety net. Build it before the model breaks.
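Here is a minimal sketch of that ship/no-ship gate, assuming you already have per-use-case scores for both model versions. The numbers below are made up:

```python
# Minimal sketch: gating a release on golden-set scores.
# Scores are illustrative; in practice they come from your offline/human evals.

last_version_scores = {"summarize_ticket": 0.82, "draft_reply": 0.76, "classify_intent": 0.91}
new_version_scores  = {"summarize_ticket": 0.85, "draft_reply": 0.71, "classify_intent": 0.92}

def regressions(old, new, tolerance=0.02):
    """Use cases where the new model is worse than the old one by more than the tolerance."""
    return {case: (old[case], new[case])
            for case in old if new[case] < old[case] - tolerance}

regressed = regressions(last_version_scores, new_version_scores)
if regressed:
    print("Do not ship. Regressions found:", regressed)
else:
    print("Safe to ship: new version matches or beats the last one everywhere.")
```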
9. 🔐 What About Ongoing Monitoring?
Evals don’t stop after launch. Set up:
Alerting for spikes in error rates
Monitoring hallucination rate in prod
Feedback loop for thumbs down responses
Shadow mode testing for new models
Treat evals like experimentation infrastructure, not QA checkboxes.
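To give a flavor of what that alerting could look like, here’s a minimal sketch of a rolling thumbs-down monitor. The window size, threshold, and alert() hook are assumptions you’d tune for your product:

```python
# Minimal sketch: alerting when the thumbs-down rate spikes in production.
from collections import deque

WINDOW = 500        # most recent feedback events to consider
THRESHOLD = 0.15    # alert if >15% of recent responses get a thumbs down

recent_feedback = deque(maxlen=WINDOW)  # True = thumbs down, False = thumbs up

def record_feedback(thumbs_down: bool) -> None:
    """Call this from your feedback handler each time a user rates a response."""
    recent_feedback.append(thumbs_down)
    rate = sum(recent_feedback) / len(recent_feedback)
    if len(recent_feedback) == WINDOW and rate > THRESHOLD:
        alert(f"Thumbs-down rate {rate:.0%} over last {WINDOW} responses")

def alert(message: str) -> None:
    """Placeholder: send to Slack, PagerDuty, or whatever monitoring tool you use."""
    print("ALERT:", message)
```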
10. 🧳 Wrapping Up: What Makes a PM “Eval-Literate”?
If you’re working on AI, being “eval-literate” means:
✅ You define what “good” means for your model
✅ You know which metrics matter (and which ones don’t)
✅ You can challenge engineers when metrics don’t reflect usefulness
✅ You can balance product tradeoffs — speed vs accuracy, usefulness vs latency
✅ You can confidently say:
“We’ve tested this across X dimensions. We’re ready to ship.”
❤️ Liked this? There’s More.
If this helped you, you’ll love my 6-hour live course —
AI for Product Managers, where we go deep into:
Evaluations
Prompting techniques
RAG vs Fine-tuning
LLM Architecture
Safety, Bias & Monitoring
💥 5,000+ PMs from Razorpay, Uber, Swiggy, CRED, Intuit, and Amazon have taken it.
👉 Register now — 45% off today