The Experiment Scorecard: A/B testing in the age of Agentic AI

June 16, 2026

James Villacci

Your CSAT went up four points last quarter. Leadership is thrilled. The number is up and to the right, the slide looks great, and everyone moves on to the next thing.

Here's the problem with that moment of celebration: you can't say why the number moved. And if you can't say why it moved, you can't say how to move it again. You're left hoping the next quarter is kind to you, which is not a strategy. 

I've spent a long time watching teams get this wrong, and I've gotten it wrong myself. The failure is in treating a single measurement as the end of the inquiry rather than the beginning of it.

A score tells you the temperature, not the fire

A single metric is a smoke detector; it tells you something is burning, and it tells you fast. But what it never tells you is where the fire is, what started it, or whether you're dealing with burnt toast or a house going up in flames.

NPS is the easiest example to pick on, but this applies to any number a team starts treating as the single source of truth, from conversion rate to activation to task success. Each metric compresses a messy, human reality into a single number that's easy to report and easy to rally around. That compression is the whole appeal, but it's also the whole danger. The number is clean precisely because it threw away everything that would tell you what to do next.

I'm not arguing against metrics. Quant researchers live and die by good measurement, and a well-chosen number is the fastest way to know that something changed. I'm arguing against letting the number have the last word. The score moved, which is good. Now go find the three behaviors and the two attitudes sitting underneath it, because that's where the action actually lives.

The researchers who earn lasting trust inside a company are the ones who treat a metric as the first question, not the final answer. They see a four-point lift and immediately ask what changed in how people use the product and how they feel about it. The number is the trigger, while the understanding is the work.

Enter the experiment scorecard

Nowhere does single-metric thinking do more damage than in experimentation.

Most teams read an A/B test like a scoreboard. Win or lose, ship or kill, and then move on to the next test. If you run that way for a year, you end up with a backlog of "winning" experiments and still no real model of why your users behave the way they do. You optimized your way to a local maximum but learned almost nothing transferable along the way.

Years ago I started scoring experiments on three columns instead of one, a format I came to think of as a “scorecard.” It changed the way my teams worked.

The first scorecard column is the business metric: Did the number we cared about move? This is the part every team already tracks, and on its own, it's the score everyone fixates on.

The second column is behavior: What did people actually do differently in the variant? Where did they click, where did they hesitate, and what did they ignore entirely? The behavior starts to give you the mechanism behind the outcome. This is where heatmaps and session replays earn their place because they show you the “how” that a conversion number cannot.

The third column is attitude: How did the change make people feel, and did it shift their perception in the direction you intended? This is the column almost everyone skips, usually because it's the hardest to capture in the moment. It's also, in my experience, where the real explanation tends to hide.

Read all three columns together and something useful happens: a "losing" test stops being a failure. A variant that lost on conversion but clearly changed how people understood the product just handed you a lesson you can apply to the next ten experiments. A win you can't explain is fragile because you don't know which part to protect, but a loss you understand is fuel because you know exactly what to try next.
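To make the three-column reading concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the class name, the field names, and the four labels are illustrations of the idea, not a format prescribed by this article or by any tool. It also treats any positive lift as a "win," which skips the statistical-significance check a real readout would need.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExperimentScorecard:
    """Hypothetical three-column record for one A/B test variant."""
    name: str
    metric_lift: float  # column 1: change in the business metric
    behaviors: List[str] = field(default_factory=list)  # column 2: what people did differently
    attitudes: List[str] = field(default_factory=list)  # column 3: what people said they felt

    def reading(self) -> str:
        # Simplification: a positive lift counts as a win (no significance test here).
        won = self.metric_lift > 0
        # A result counts as "explained" only when both other columns are filled in.
        explained = bool(self.behaviors) and bool(self.attitudes)
        if won and explained:
            return "win you can explain"
        if won:
            return "fragile win"
        if explained:
            return "loss you understand"
        return "loss with nothing learned yet"


card = ExperimentScorecard(
    name="new layout",
    metric_lift=-0.8,
    behaviors=["ignored the sidebar"],
    attitudes=["felt cluttered"],
)
print(card.reading())
```

The point of the sketch is the last two branches: a variant that lost on the metric but has both other columns filled in is read differently from one that simply lost.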

Not every experiment needs to win, but every experiment should teach you something. If experimentation exists to generate knowledge and not just winners, then how you read a test matters every bit as much as how you run it.

Measuring the attitude

The catch, of course, is that third column. While behavioral data is everywhere, attitudinal data is scarce because someone has to deliberately go ask for it.

So most behavioral data never gets paired with a single word about “why.” Every test answers "what happened" and almost none of them answer "what were people actually thinking." Closing that gap is the cheapest, most overlooked research opportunity most teams have sitting right in front of them, and it doesn't require a full research cycle.

The move is straightforward. When a variant goes live, you target a short survey at the users inside that variant. Now you have behavioral data from the experiment and attitudinal data from the survey on the same users in the same moment. When the test wins, you know what changed in their heads. When it loses, you still walk away with the why.
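The pairing described above amounts to a join on user ID. The following is a rough sketch assuming you can export three things keyed by user: variant assignment, a conversion flag, and survey answers. The dictionaries, the function name, and the sample values are all made up for illustration; this is not any product's API.

```python
# Hypothetical exports, each keyed by user ID (sample values are invented).
assignments = {"u1": "control", "u2": "variant", "u3": "variant"}
conversions = {"u1": True, "u2": False, "u3": True}
survey_responses = {"u2": "felt cluttered", "u3": "easier to find pricing"}


def pair_by_user(assignments, conversions, responses, variant):
    """Return one row per user in the given variant, with behavior and attitude side by side."""
    rows = []
    for user_id, arm in assignments.items():
        if arm != variant:
            continue
        rows.append({
            "user_id": user_id,
            "converted": conversions.get(user_id, False),
            "response": responses.get(user_id),  # None if the user did not answer
        })
    return rows


paired = pair_by_user(assignments, conversions, survey_responses, "variant")
for row in paired:
    print(row)
```

Each row now holds what a user did and what that same user said, which is the raw material for the second and third columns.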

This is exactly what we built experiment-paired surveys for at Sprig. You point a study at a specific variant, whether by URL or by a list of user IDs, and capture the attitude alongside the behavior. The survey isn't a static form, either. It can ask an AI-driven follow-up based on what someone just told you, so when a user says the new layout "felt cluttered," the next question digs into which part.

None of this replaces deep, generative research. The scorecard is not a substitute for the foundational studies that tell you what's worth building in the first place. What it does is make the everyday flood of experiments far more instructive without pulling a researcher onto every single test.

Read the signals, not the score

A score tells you the room got hotter, behavior tells you where the heat is coming from, and attitude tells you why someone lit the match in the first place. You want all three, and the good news is you can have all three on the experiments you're already running. Most teams have two of the three columns sitting in systems they already own; they've just never been read together.

Reporting a single number isn't the enemy, but treating it as the end of the conversation is. The next time a metric moves, resist the urge to celebrate or panic, and treat it as the opening line of a better question. Then go find the behaviors and the attitudes that explain it.

That's how a number stops being a thing you just report and starts being a thing that helps you understand.
