What Is an Experiment Scorecard? How to Know Why an A/B Test Won or Lost

June 16, 2026

James Villacci

Your CSAT went up four points last quarter. Leadership is thrilled. The number is up and to the right, the slide looks great, and everyone moves on to the next thing.

Here's the problem with that moment of celebration: you can't say why the number moved. And if you can't say why it moved, you can't say how to move it again. You're left hoping the next quarter is kind to you, which is not a strategy.

I've spent a long time watching teams get this wrong, and I've gotten it wrong myself. The failure is in treating a single measurement as the end of the inquiry rather than the beginning of it.

A score tells you the temperature, not the fire

A single metric is a smoke detector; it tells you something is burning, and it tells you fast. But what it never tells you is where the fire is, what started it, or whether you're dealing with burnt toast or a house going up in flames.

NPS is the easiest example to pick on, but this applies to any number a team starts treating as the single source of truth, from conversion rate to activation to task success. Each metric compresses a messy, human reality into a single number that's easy to report and easy to rally around. That compression is the whole appeal, but it's also the whole danger. The number is clean precisely because it threw away everything that would tell you what to do next.

I'm not arguing against metrics. Quant researchers live and die by good measurement, and a well-chosen number is the fastest way to know that something changed. I'm arguing against letting the number have the last word. The score moved, which is good. Now go find the three behaviors and the two attitudes sitting underneath it, because that's where the action actually lives.

The researchers who earn lasting trust inside a company are the ones who treat a metric as the first question, not the final answer. They see a four-point lift and immediately ask what changed in how people use the product and how they feel about it. The number is the trigger, while the understanding is the work.

Enter the experiment score card

Nowhere does single-metric thinking do more damage than in experimentation.

Most teams read an A/B test like a scoreboard. Win or lose, ship or kill, and then move on to the next test. If you run that way for a year, you end up with a backlog of "winning" experiments and still no real model of why your users behave the way they do. You optimized your way to a local maximum but learned almost nothing transferable along the way.

Years ago I started scoring experiments on three categories instead of one, which I came to think of as a “score card.” This score card changed the way my teams worked.

The first score card column is the business metric: Did the number we cared about move? This is the part every team already tracks, and on its own, it's the score everyone fixates on.

The second column is behavior: What did people actually do differently in the variant? Where did they click, where did they hesitate, and what did they ignore entirely? The behavior starts to give you the mechanism behind the outcome. This is where heatmaps and session replays earn their place because they show you the “how” that a conversion number cannot.

The third column is attitude: How did the change make people feel, and did it shift their perception in the direction you intended? This is the column almost everyone skips, usually because it's the hardest to capture in the moment. It's also, in my experience, where the real explanation tends to hide.

Read all three columns together and something useful happens: a "losing" test stops being a failure. A variant that lost on conversion but clearly changed how people understood the product just handed you a lesson you can apply to the next ten experiments. A win you can't explain is fragile because you don't know which part to protect, but a loss you understand is fuel because you know exactly what to try next.
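For readers who think in code, here is a minimal sketch of the three columns as a data structure. Everything in it is an assumption made for illustration: the `Scorecard` class, its field names, the reading labels, and the sample values are hypothetical shorthand, not part of any tool. It also treats any positive lift as a win, which glosses over statistical significance.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Scorecard:
    """One experiment read across three columns (hypothetical structure)."""
    experiment: str
    metric_lift: float  # column 1: change in the business metric vs. control
    behaviors: List[str] = field(default_factory=list)  # column 2: what people did
    attitudes: List[str] = field(default_factory=list)  # column 3: how people felt

    def reading(self) -> str:
        # Simplification: any positive lift counts as a win. A real readout
        # would use whatever significance test the team already relies on.
        won = self.metric_lift > 0
        # "Explained" here means both of the other columns have findings.
        explained = bool(self.behaviors) and bool(self.attitudes)
        if won:
            return "explained win" if explained else "fragile win"
        return "instructive loss" if explained else "unexplained loss"


# Invented sample values, for illustration only.
card = Scorecard(
    experiment="new-pricing-layout",
    metric_lift=-0.02,
    behaviors=["more scrolling on the plan comparison table"],
    attitudes=["respondents described the layout as cluttered"],
)
print(card.reading())
```

The point of the sketch is that the reading depends on all three fields at once: the same negative lift with empty behavior and attitude columns would come back as an unexplained loss.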

Not every experiment needs to win, but every experiment should teach you something. If experimentation exists to generate knowledge and not just winners, then how you read a test matters every bit as much as how you run it.

Measuring the attitude

The catch, of course, is that third column. While behavioral data is everywhere, attitudinal data is scarce because someone has to deliberately go ask for it.

So most behavioral data never gets paired with a single word about “why.” Every test answers "what happened" and almost none of them answer "what were people actually thinking." That gap is the cheapest, most overlooked research opportunity most teams have sitting right in front of them, and closing it doesn't require a full research cycle.

The move is straightforward. When a variant goes live, you target a short survey at the users inside that variant. Now you have behavioral data from the experiment and attitudinal data from the survey on the same users in the same moment. When the test wins, you know what changed in their heads. When it loses, you still walk away with the why.
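The pairing itself is just a join on user ID. The sketch below shows one way it might look, assuming you can export three things: variant assignments, behavioral events, and survey answers. The function name, the input shapes, the event names, and the sample rows are all hypothetical and do not reflect any particular product's API or export format.

```python
from collections import Counter


def pair_variant_data(assignments, events, responses):
    """Group behavior and attitude by variant, joined on user ID.

    assignments: dict of user_id -> variant name
    events:      list of (user_id, event_name) tuples
    responses:   dict of user_id -> free-text survey answer
    """
    paired = {}
    for variant in set(assignments.values()):
        paired[variant] = {"events": Counter(), "responses": []}

    for user_id, event_name in events:
        variant = assignments.get(user_id)
        if variant is not None:  # skip users outside the experiment
            paired[variant]["events"][event_name] += 1

    for user_id, answer in responses.items():
        variant = assignments.get(user_id)
        if variant is not None:  # skip answers from users outside the experiment
            paired[variant]["responses"].append(answer)

    return paired


# Invented sample data, for illustration only.
assignments = {"u1": "control", "u2": "variant", "u3": "variant"}
events = [("u2", "clicked_cta"), ("u3", "abandoned_form"), ("u1", "clicked_cta")]
responses = {"u3": "The new layout felt cluttered."}

paired = pair_variant_data(assignments, events, responses)
print(paired["variant"]["responses"])
```

Once both kinds of data sit under the same variant key, the behavior and attitude columns can be read side by side instead of living in separate systems.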

This is exactly what we built experiment-paired surveys for at Sprig. You point a study at a specific variant, whether by URL or by a list of user IDs, and capture the attitude alongside the behavior. The survey isn't a static form, either. It can ask an AI-driven follow-up based on what someone just told you, so when a user says the new layout "felt cluttered," the next question digs into which part.

None of this replaces deep, generative research. The score card is not a substitute for the foundational studies that tell you what's worth building in the first place. What it does is make the everyday flood of experiments far more instructive without pulling a researcher onto every single test.

Read the signals, not the score

A score tells you the room got hotter, behavior tells you where the heat is coming from, and attitude tells you why someone lit the match in the first place. You want all three, and the good news is you can have all three on the experiments you're already running. Most teams have two of the three columns sitting in systems they already own; they've just never been read together.

Reporting a single number isn't the enemy, but treating it as the end of the conversation is. The next time a metric moves, resist the urge to celebrate or panic, and treat it as the opening line of a better question. Then go find the behaviors and the attitudes that explain it.

That's how a number stops being a thing you just report and starts being a thing that helps you understand.