How Anna Shows Her Work

AI observations are not the same as AI evidence. Here's how Anna backs every finding with statistical proof you can stand behind in a meeting.

By Anna · ~6 min read · Updated Mar 6, 2026

You found the insight. Revenue from paid search is up 34% this quarter. You're about to put it in the deck for your VP.

Then the thought hits: but is it actually up? Or is this just noise?

This is the gap that keeps smart people from trusting their own data. Not the finding — the proof. You can see the number. You just can't defend it.

Observations vs. evidence

Most AI tools stop at observations. "Revenue appears to be higher in Q3." "Channel B seems to outperform Channel A." Appears. Seems. Weasel words dressed up as analysis.

An observation tells you what the data looks like. Evidence tells you whether you should bet on it.

The difference is statistical testing. When Anna finds a pattern, she doesn't just report it. She tests it. And she tells you the result in plain language, not only as a p-value.

A concrete example

Say you're a marketing manager comparing two acquisition channels — paid search and paid social — over the last six months. You upload the CSV. You ask Anna: Which channel is actually performing better?

Here's what Anna comes back with:

Paid Search avg: $14.2K monthly attributed revenue (+20%)
Paid Social avg: $11.8K monthly attributed revenue
Effect size: Large (p = 0.003, Cohen's d = 0.82)

Monthly attributed revenue across six months: an example of the kind of pattern Anna tests for significance, not just direction.

The observation: Paid search generated an average of $14,200/month in attributed revenue. Paid social generated $11,800/month. That's a $2,400 gap.

The evidence: Anna runs Welch's t-test on the monthly revenue figures. The result: t = 3.18, p = 0.003, Cohen's d = 0.82 (large effect). She chose Welch's t-test because the two channels' variances weren't equal. She translates it:

"Paid search outperforms paid social by $2,400/month on average. This difference is statistically significant: if the two channels truly performed the same, a gap this large would show up only about 0.3% of the time. The effect size is large, meaning this isn't a marginal difference. It's a real gap."

That's the difference between "it looks like search is better" and "search is better, here's the proof, and here's how confident I am."
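If you want to check the arithmetic behind a result like this yourself, here is a minimal Python sketch of Welch's t statistic, its Welch-Satterthwaite degrees of freedom, and Cohen's d, using only the standard library. The monthly figures are hypothetical placeholders, not the data behind the example above, and the function names are illustrative rather than anything Anna exposes. The p-value itself comes from a t-distribution with the computed degrees of freedom; in SciPy, `scipy.stats.ttest_ind(a, b, equal_var=False)` runs the same test.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test."""
    na, nb = len(a), len(b)
    # Squared standard error of each group's mean, from sample variances.
    se2_a = statistics.variance(a) / na
    se2_b = statistics.variance(b) / nb
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df


def cohens_d(a, b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


# Hypothetical monthly attributed revenue in $K (six months per channel).
paid_search = [13.1, 14.8, 13.9, 15.2, 14.0, 14.2]
paid_social = [11.0, 12.5, 11.6, 12.3, 11.4, 12.0]

t_stat, dof = welch_t(paid_search, paid_social)
print(f"t = {t_stat:.2f}, df = {dof:.1f}, d = {cohens_d(paid_search, paid_social):.2f}")
```

Welch's version differs from the classic Student's t-test in the degrees-of-freedom line: it does not assume the two groups share a variance, which is why it suits channels with different month-to-month swings.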

When Anna reports a finding, look for three things: the direction (up or down), the significance (p-value in plain English), and the effect size (how much it matters practically). All three together make a finding you can present without caveats.

Why this matters in the meeting

Your VP doesn't care about p-values. They care about making the right call.

But there's a difference between presenting "paid search revenue is higher" and presenting "paid search outperforms paid social by $2,400/month — statistically significant, large effect size, based on six months of data." The first invites challenge. The second invites a decision.

Anna gives you the backing. The statistical test is right there. The methodology is right there. The sample size is right there.

What happens when the data says "not sure"

This is the part most AI tools skip entirely.

Sometimes the difference between two groups isn't significant. Sometimes the sample is too small to draw a conclusion. Sometimes the trend is real but the effect size is trivial — statistically significant but practically meaningless.

Anna tells you that too. She'll say something like: "There's a 2.1% difference in conversion rate between the two landing pages, but it's not statistically significant (p = 0.23). You'd need about 4 more weeks of data at current traffic to detect a meaningful difference."

That's not a non-answer. That's a genuinely useful answer. It means: don't make a decision yet. Keep running the test. Come back when you have more data.

Knowing when you can't conclude something is just as valuable as knowing when you can. It prevents the premature call — pulling the budget from a channel that might actually be working, or shipping a landing page change based on two weeks of inconclusive data.
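An estimate like "about 4 more weeks" usually comes from a power calculation. The sketch below uses the standard normal-approximation formula for comparing two conversion rates. It is one common way to produce such an estimate, not a description of Anna's internals. The conversion rates, traffic numbers, the 5% significance level, and the 80% power target are all assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def required_n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2)


def extra_weeks_needed(n_required, n_so_far, weekly_per_group):
    """Whole weeks of additional traffic needed to reach the required sample."""
    shortfall = max(0, n_required - n_so_far)
    return math.ceil(shortfall / weekly_per_group)


# Hypothetical inputs: 10% vs 12% conversion, 2,000 visitors per page so far,
# 500 new visitors per page each week.
n_needed = required_n_per_group(0.10, 0.12)
print(n_needed, extra_weeks_needed(n_needed, n_so_far=2000, weekly_per_group=500))
```

The smaller the difference you want to detect, the faster the required sample grows, because the difference is squared in the denominator.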

The confidence chain

Every finding Anna produces follows the same structure:

  1. What she found — the pattern, trend, or difference, stated plainly
  2. How she tested it — the statistical method, chosen based on your data's characteristics
  3. How confident she is — significance level and effect size, translated to English
  4. What it means for your decision — the practical implication, not just the statistical one

This chain is what turns an AI output into something you can present. Not because it looks impressive — because it's defensible. Your VP can push back on the number, and you can point to the test. Your stakeholder can ask "are you sure?" and you can say yes, and explain why.
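If you keep findings in your own notes or slides, the four-part chain maps naturally onto a small record. This is a hypothetical structure for illustration only: the class name, field names, and the implication text are assumptions, not Anna's actual output format.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    what: str         # the pattern, trend, or difference, stated plainly
    method: str       # the statistical method and why it fits the data
    confidence: str   # significance and effect size in plain English
    implication: str  # what it means for the decision

    def as_talking_points(self):
        """Return the chain as four labeled lines, ready to paste into a deck."""
        labels = (
            "What she found",
            "How she tested it",
            "How confident she is",
            "What it means",
        )
        values = (self.what, self.method, self.confidence, self.implication)
        return [f"{label}: {value}" for label, value in zip(labels, values)]


finding = Finding(
    what="Paid search averages $2,400/month more attributed revenue than paid social.",
    method="Welch's t-test on six months of monthly revenue (unequal variances).",
    confidence="Statistically significant (p = 0.003), large effect (Cohen's d = 0.82).",
    implication="Illustrative only: weigh paid search more heavily in the next budget review.",
)

for line in finding.as_talking_points():
    print(line)
```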

The real fear

Nobody says this out loud, but the fear is simple: What if I present AI-generated numbers and they're wrong?

Fair. That's a career risk, not a data risk.

The answer isn't to avoid using AI for analysis. It's to use AI that shows its work. Anna doesn't ask you to trust her. She shows you the evidence and lets you decide. The statistical test is right there. The confidence interval is right there. The sample size is right there.

You're not presenting AI-generated numbers. You're presenting statistically tested findings that an AI helped you produce. There's a meaningful difference.

Your numbers are only as strong as the evidence behind them. Anna makes sure the evidence is there.

FAQ: how Anna shows her work

What is the difference between an observation and statistical evidence?

An observation is a pattern visible in the data: "revenue is up", "conversion rate looks higher". Evidence is a demonstration that the pattern survives a hypothesis test given the sample size and variance. Anna runs the test, reports the result in plain English, and only states a finding as a conclusion once the evidence is there.

Which statistical tests does Anna use?

She picks based on the data shape. Welch's t-test for two-group continuous comparisons with unequal variance, chi-squared for categorical contingency, ANOVA for multi-group comparisons, regression for relationships between continuous variables, and bootstrapped confidence intervals when the distribution is skewed. She tells you which method she chose and why.
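To make the mapping concrete, here is a toy Python function that turns a rough description of the data into one of the test names listed above. The function, its parameters, and the order in which the rules are checked are all assumptions made for illustration. The list above does not say how Anna resolves overlaps, and it does not cover the equal-variance case, where this sketch falls back to the classic Student's t-test.

```python
def choose_test(outcome, n_groups=2, equal_variance=False, skewed=False,
                continuous_predictor=False):
    """Map a rough description of the data to a test name (illustrative rules only)."""
    if outcome == "categorical":
        return "chi-squared test"
    if outcome != "continuous":
        raise ValueError(f"unknown outcome type: {outcome!r}")
    if continuous_predictor:
        # Relationship between two continuous variables.
        return "regression"
    if skewed:
        # Skewed distributions: avoid leaning on normality assumptions.
        return "bootstrapped confidence interval"
    if n_groups > 2:
        return "ANOVA"
    if n_groups == 2:
        # Assumption: equal variances fall back to Student's t-test.
        return "Student's t-test" if equal_variance else "Welch's t-test"
    raise ValueError("need at least two groups to compare")


print(choose_test("continuous", n_groups=2, equal_variance=False))
print(choose_test("categorical"))
```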

How does Anna translate a p-value into plain English?

She converts "p = 0.003" into "if there were no real difference, a gap this large would show up only about 0.3% of the time." Effect size is reported separately as small/medium/large with a Cohen's d or equivalent. Both numbers are visible if a reviewer wants to verify the underlying statistic.
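A translation step like this is simple to sketch. The two helpers below are hypothetical, not Anna's code. The wording describes the p-value as how often a gap at least this large would appear if there were no real difference, which is the textbook definition. The effect-size cutoffs are Cohen's conventional benchmarks (0.2, 0.5, 0.8), and the "negligible" label for anything below 0.2 is an added assumption.

```python
def describe_p_value(p):
    """Phrase a p-value as a plain-English sentence."""
    if p < 0.001:
        percent = "less than 0.1"
    else:
        # One decimal place, with any trailing ".0" trimmed (23.0 -> 23).
        percent = f"{p * 100:.1f}".rstrip("0").rstrip(".")
    return (
        "If there were no real difference, a gap at least this large "
        f"would show up about {percent}% of the time."
    )


def effect_size_label(d):
    """Label Cohen's d using the conventional 0.2 / 0.5 / 0.8 benchmarks."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"


print(describe_p_value(0.003))
print(effect_size_label(0.82))
```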

What does Anna do when the result is not significant?

She says so. The article above gives a concrete example: "There's a 2.1% difference in conversion rate but it's not statistically significant (p = 0.23). You'd need about 4 more weeks of data at current traffic to detect a meaningful difference." That is a useful answer because it prevents the premature call.

How much data do I need before Anna can show evidence?

There is no universal minimum, but Anna will flag when a sample is too small for confident inference and recommend combining groups, lengthening the window, or accepting a wider confidence band. Underpowered findings get reported as inconclusive, not presented as significant.

Can I show the methodology to a colleague or auditor?

Yes. Every Anna report includes the statistical test used, the input data, the assumptions checked (equal variance, normality, sample size), and the calculation. The work is visible — that is the difference between AI observation and AI evidence.

Does Anna replace a data analyst for high-stakes decisions?

For routine analysis — comparing channels, testing a landing page change, measuring a launch — yes, for most teams. For high-stakes statistical work (experimental design, custom modelling, peer-reviewed analysis) a human analyst is still the right call. Anna handles the volume of routine questions so the analyst can focus on the harder ones.
