Where Your Data Actually Goes

A plain-English walkthrough of what happens to your file from upload to insight to deletion. No marketing speak, just the architecture.

By Anna · ~8 min read · Updated Mar 15, 2026

You're about to upload a file with employee salary data. Or customer emails. Or revenue numbers your board hasn't seen yet. Where does this actually go?

That's a reasonable question. Most AI tools answer it with a trust badge and a link to a privacy policy nobody reads. Here's the actual answer, in plain English, for hey anna — and what it means in practice for the questions security teams ask first.

The shape of the answer, up front

Before the architecture diagram, the numbers. The methodology for this post is the architecture itself, shown in three views: the Sankey diagram first, then retention over time, then the comparison table. If those three views agree, the prose is just commentary.

  • Full file copies on servers: 0. Only encrypted object storage holds it.
  • Browser-side compute: 100%. Python runs in your tab, not on a server.
  • Training data sent to Anthropic: 0%. API terms prohibit it, contractually.
  • Deletion latency: immediate. No soft-delete, no 30-day retention.
  • Sub-processors in the data path: 2. Cloudflare and Anthropic.
  • SOC 2 status today: on the path. Audit not yet completed; see below.

The full journey of your file

Your file touches four systems. Here's the actual flow.

Width of each ribbon is roughly the share of the dataset that travels along it. Teal: your file or its computed results, staying inside Cloudflare or your browser. Sage: metadata only — schema, conversation history, audit log entries. Coral: the small slice of column samples and aggregates Anna sends to Anthropic to reason about the data.

A few things to read off this diagram before the prose explanations:

  • The CSV splits into two ribbons immediately. The full file goes to R2; a working copy stays in the browser tab so Pyodide can run Python against it without round-trips to a server.
  • D1 only ever receives metadata — schema, row counts, your conversation history, audit log entries. The cells of your spreadsheet do not live in the database.
  • The only ribbon that leaves Cloudflare is the coral one. It carries column metadata, summary statistics, and small samples to Anthropic. The full dataset never travels along it.
  • Anthropic's response comes back the same way and is treated as inference output, not training data. That is a contractual property of the Anthropic API, covered below.

1. Upload: Cloudflare R2

When you drag a CSV into hey anna, it goes to Cloudflare R2 — object storage on Cloudflare's network. R2 stores the file encrypted at rest with AES-256. No third-party storage providers are in the path; the file does not leave Cloudflare for storage.

2. Metadata and structure: Cloudflare D1

The database that tracks your datasets, conversations, and reports runs on Cloudflare D1, a SQLite-based database at the edge. D1 holds metadata — file names, column types, row counts, conversation history. The actual data values stay in R2. D1 knows the shape of your data, not the contents.
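
To make "the shape of your data, not the contents" concrete, here is a minimal sketch of what a metadata-only schema could look like. It uses Python's built-in `sqlite3` module as a stand-in for D1, since D1 is SQLite-based. The table names, column names, and the `register_dataset` helper are hypothetical and are not taken from hey anna's actual schema.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE datasets (
    id         INTEGER PRIMARY KEY,
    file_name  TEXT NOT NULL,
    r2_key     TEXT NOT NULL,      -- pointer to the stored object, not the data
    row_count  INTEGER NOT NULL
);
CREATE TABLE dataset_columns (
    dataset_id INTEGER NOT NULL REFERENCES datasets(id),
    name       TEXT NOT NULL,
    dtype      TEXT NOT NULL
);
"""


def register_dataset(conn, file_name, r2_key, row_count, columns):
    """Record the shape of a dataset; cell values are never passed in."""
    cur = conn.execute(
        "INSERT INTO datasets (file_name, r2_key, row_count) VALUES (?, ?, ?)",
        (file_name, r2_key, row_count),
    )
    dataset_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO dataset_columns (dataset_id, name, dtype) VALUES (?, ?, ?)",
        [(dataset_id, name, dtype) for name, dtype in columns],
    )
    conn.commit()
    return dataset_id


conn = sqlite3.connect(":memory:")  # in-memory stand-in for a D1 database
conn.executescript(SCHEMA)
dataset_id = register_dataset(
    conn,
    "salaries.csv",
    "uploads/salaries.csv",
    1200,
    [("employee_id", "integer"), ("department", "text"), ("salary", "float")],
)
```

Nothing in this sketch has a column that could hold a salary figure or an email address. The only link back to the file is the storage key.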

3. Analysis: your browser (Pyodide)

This is the part that surprises people.

When Anna runs Python analysis — statistical tests, data transformations, chart generation — the code executes in your browser. Not on a server. Not in the cloud. In a WebAssembly sandbox running Pyodide (a full Python environment compiled to run in the browser).

Your data is pulled into the browser tab, the Python runs locally, and the results render on screen. The intermediate calculations, transformed datasets, and generated charts are produced and rendered client-side. hey anna's servers do not see them.

This is a deliberate architectural decision, not a limitation. Browser-based execution means your data does not travel to a compute server for processing. It stays in the tab.
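
As a rough illustration of the kind of code that runs in the tab, the sketch below groups rows and averages a column entirely in memory. It uses only the Python standard library so it runs anywhere, including under Pyodide. A real analysis would more likely use Pandas. The function name and the sample data are invented for this example.

```python
import csv
import io
import statistics
from collections import defaultdict


def mean_by_group(csv_text, group_col, value_col):
    """Group rows by one column and average another, entirely in memory."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[group_col]].append(float(row[value_col]))
    return {key: statistics.mean(values) for key, values in groups.items()}


# Invented sample data standing in for the working copy held by the tab.
sample_csv = "department,salary\nengineering,100\nengineering,120\nsales,80\n"
result = mean_by_group(sample_csv, "department", "salary")
print(result)  # {'engineering': 110.0, 'sales': 80.0}
```

The function takes text and returns a dictionary. There is no network call anywhere in it, which is the property the architecture relies on.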

4. AI reasoning: Anthropic's Claude API

When Anna reasons about your data — interpreting patterns, deciding which statistical test to run, writing the narrative for your report — she uses Claude, Anthropic's large language model.

This means: yes, parts of your data are sent to Anthropic's API. Specifically, the parts Anna needs to reason about — column names, summary statistics, sample values, and the results of her analysis.

Here is what matters:

  • Anthropic does not train on API data. Anthropic's API terms prohibit using customer data for model training. Data sent through the API is used to generate a response and is not used to train or improve their models. This applies to all API customers, including hey anna.
  • Anna does not send the entire file. She sends what she needs for the current reasoning step — typically column metadata, aggregated statistics, and small samples. The full dataset stays in R2 and in your browser.
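
A sketch of what "column metadata, aggregated statistics, and small samples" could look like as a payload. The function name, the payload keys, and the default of three sample rows are assumptions made for illustration; the actual request hey anna builds is not documented in this post.

```python
import csv
import io
import statistics


def build_reasoning_payload(csv_text, sample_rows=3):
    """Reduce a CSV to column names, per-column summaries, and a few rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    summary = {}
    for col in columns:
        try:
            values = [float(r[col]) for r in rows]
        except (ValueError, TypeError):
            # Non-numeric column: report only how many distinct values it has.
            summary[col] = {"type": "text", "distinct": len({r[col] for r in rows})}
            continue
        summary[col] = {
            "type": "numeric",
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
        }
    return {
        "columns": columns,
        "row_count": len(rows),
        "summary": summary,
        "sample": rows[:sample_rows],  # small slice; the other rows are left out
    }


csv_text = "name,salary\na,10\nb,20\nc,30\nd,40\ne,60\n"
payload = build_reasoning_payload(csv_text)
```

With five input rows, the payload carries three sample rows plus aggregates. The remaining rows contribute only to the statistics.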

If you're evaluating hey anna for sensitive data, the key question is whether Anthropic's API data-handling terms meet your requirements. Those terms are public and contractual, not just policy. Anna can answer methodology questions about the analysis; she cannot waive Anthropic's contract on your behalf.

How your data ages — minute 1 to "after delete"

The Sankey shows where data flows. It does not show how long each piece sticks around. This one does.

Retention over time across the four locations data can live. Teal R2 storage persists until you delete. Sage D1 metadata persists with it. The grey browser-tab copy disappears the moment the tab closes. The coral Anthropic-API slice is in flight only during a reasoning step; hey anna does not keep it once the response returns.

Two things worth highlighting from this view:

  • The browser-tab copy is the most ephemeral piece of the system. Close the tab and the working copy is gone — no client-side cache survives a refresh.
  • The Anthropic-API slice exists only for the duration of a reasoning step. hey anna does not hold it after the response returns. On Anthropic's side, API inputs and outputs are stored for up to 30 days for abuse detection; nothing about that data flow makes it eligible for training.

hey anna versus the alternatives

The phrase "AI analyst tool" covers wildly different architectures. The difference matters for your security review.

| Property | ChatGPT (Code Interpreter) | Typical cloud AI analyst | hey anna |
|---|---|---|---|
| Where Python actually runs | OpenAI servers (Code Interpreter) | Vendor's cloud compute | Your browser (Pyodide / WebAssembly) |
| Full file persistence on vendor servers | Yes, while session is live | Yes, on vendor storage | Encrypted in Cloudflare R2; no other server copies |
| Training-data policy | Opt-out for ChatGPT consumer; API excluded by default | Varies; read the DPA | Anthropic API: contractually excluded |
| Deletion latency | Up to 30 days | 30-90 days typical | Immediate; no soft-delete window |
| Browser-side processing | No | Rare | Default for every analysis |
| Single-tenant data scope | Per-account, shared infra | Per-account, shared infra | Per-account, no shared data layer |
| Supports a SOC 2 evidence path | OpenAI is SOC 2 Type 2 | Varies | Infra ready; hey anna audit not yet complete |
Comparison is based on publicly documented architectures as of 2026-03. The hey anna entries reflect current shipped behaviour; the SOC 2 entry is the honest version of the answer, not the aspirational one.

The row that matters most for regulated buyers is the third one. "API excluded from training by default" is a different contractual statement than "we promise not to train on your data." The first is enforceable; the second is policy.

The three questions security teams ask first

Where is my data stored, geographically? Cloudflare R2 supports regional storage. By default, hey anna stores data in Cloudflare's global network with EU-localised storage available on request. If you have a strict residency requirement (EU-only, US-only, no transit through specific regions), this is a configuration conversation — reach out before the security review, not after.

Who else processes my data? Two sub-processors are in the data path: Cloudflare (for R2 storage, D1 metadata, and Workers compute) and Anthropic (for the reasoning step). Both publish their own SOC 2 reports and DPAs. No third-party analytics vendor sees your dataset contents; product analytics is scoped to UI events, not row-level data.

Can I get an audit log? Yes. D1 records dataset access, conversation timestamps, and report generation per user. The log is queryable through the API and surfaced in account settings. It does not include row-level read events — those happen in your browser and hey anna's servers cannot observe them.
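
For a sense of what a per-user audit query could look like, here is a minimal sketch using `sqlite3` as a stand-in for D1. The `audit_log` table, its columns, the event names, and both helper functions are hypothetical. The post does not describe the real schema or API.

```python
import sqlite3

log_db = sqlite3.connect(":memory:")  # in-memory stand-in for D1
log_db.execute(
    """
    CREATE TABLE audit_log (
        id         INTEGER PRIMARY KEY,
        user_id    TEXT NOT NULL,
        event      TEXT NOT NULL,   -- e.g. 'dataset_access', 'report_generated'
        dataset_id INTEGER,
        created_at TEXT NOT NULL    -- ISO 8601 timestamp
    )
    """
)


def record_event(conn, user_id, event, dataset_id, created_at):
    """Append one server-observed event; row-level reads are not recorded."""
    conn.execute(
        "INSERT INTO audit_log (user_id, event, dataset_id, created_at) "
        "VALUES (?, ?, ?, ?)",
        (user_id, event, dataset_id, created_at),
    )
    conn.commit()


def events_for_user(conn, user_id):
    """Return one user's events in chronological order."""
    return conn.execute(
        "SELECT event, dataset_id, created_at FROM audit_log "
        "WHERE user_id = ? ORDER BY created_at, id",
        (user_id,),
    ).fetchall()


record_event(log_db, "user-1", "dataset_access", 7, "2026-03-01T09:00:00Z")
record_event(log_db, "user-2", "dataset_access", 9, "2026-03-01T09:05:00Z")
record_event(log_db, "user-1", "report_generated", 7, "2026-03-01T09:10:00Z")
history = events_for_user(log_db, "user-1")
```

The log in this sketch can only contain events a server took part in, which matches the limitation described above: reads that happen inside the browser never produce a row.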

What hey anna does not do

Sometimes the clearest way to explain a security posture is to list what doesn't happen.

No training on your data

Not by hey anna. Not by Anthropic. Not by Cloudflare. No model in this system improves because of your file.

No third-party data sharing

Your data is not sold, syndicated, benchmarked, or pooled. There is no "anonymised aggregate" carve-out in the terms.

No persistent server-side analysis results

Python output lives in your browser. Unless you save it as a report, hey anna's servers do not retain the computed result.

No cross-user data access

Your datasets, conversations, and reports are scoped to your account. No shared data layer between users.

No soft-delete retention

Delete means delete. No 30-day window. No recoverable trash. Once it is gone from R2 and D1, it is gone.

No silent third-party analytics on your data

Product analytics tracks UI events — clicks, page views — never the contents of your spreadsheet.

What about deletion?

When you delete a dataset in hey anna, three things happen:

  1. The file is removed from R2 (object storage).
  2. The metadata is removed from D1 (database), including conversation history tied to that dataset.
  3. Any vector embeddings tied to the dataset are purged from Cloudflare Vectorize.

Deletion is immediate and permanent. No "soft delete" period, no 30-day retention window, no backup you cannot control.
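
The three steps can be sketched as a hard delete across three stores. The `Stores` class below uses in-memory dictionaries as stand-ins for R2, D1, and Vectorize, and every name in it is hypothetical. It illustrates the pattern (remove the record, keep no tombstone), not hey anna's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class Stores:
    """In-memory stand-ins for R2, D1, and Vectorize (illustrative only)."""
    objects: dict = field(default_factory=dict)     # storage key -> file bytes
    metadata: dict = field(default_factory=dict)    # dataset id -> metadata
    embeddings: dict = field(default_factory=dict)  # vector id -> (dataset id, vector)


def delete_dataset(stores, dataset_id):
    """Hard-delete a dataset everywhere; returns False if it does not exist."""
    meta = stores.metadata.get(dataset_id)
    if meta is None:
        return False
    # 1. Remove the file from object storage.
    stores.objects.pop(meta["r2_key"], None)
    # 2. Remove the metadata record itself; no 'deleted_at' flag is kept.
    del stores.metadata[dataset_id]
    # 3. Purge every embedding that belongs to this dataset.
    stale = [vid for vid, (owner, _) in stores.embeddings.items() if owner == dataset_id]
    for vid in stale:
        del stores.embeddings[vid]
    return True


stores = Stores()
stores.objects["uploads/a.csv"] = b"id,salary\n1,100\n"
stores.objects["uploads/b.csv"] = b"id,email\n1,x@example.com\n"
stores.metadata[1] = {"file_name": "a.csv", "r2_key": "uploads/a.csv"}
stores.metadata[2] = {"file_name": "b.csv", "r2_key": "uploads/b.csv"}
stores.embeddings["v1"] = (1, [0.1, 0.2])
stores.embeddings["v2"] = (2, [0.3, 0.4])

deleted = delete_dataset(stores, 1)
```

After the call, dataset 1 is absent from all three stores and dataset 2 is untouched. A soft-delete design would instead set a flag on the metadata record and leave the object in place, which is the behaviour the post says is not used.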

If you are subject to GDPR or have specific data residency requirements, reach out — Cloudflare R2 supports regional storage and the rest of the stack inherits its region.

The SOC 2 question

If you work in HR, finance, healthcare, or any regulated industry, you are probably looking for a SOC 2 Type II badge. Fair.

hey anna does not have SOC 2 certification yet. That is the honest answer. It is on the roadmap — infrastructure choices (Cloudflare's platform, browser-based computation, Anthropic's API terms) were made with this path in mind. But the audit has not happened yet.

If SOC 2 is a hard requirement today, email support@heyanna.studio — Cal will share the current timeline and what evidence is available in the meantime. If your security review is based on understanding the actual data flow and infrastructure rather than a compliance badge, this post gives you the full picture.

The infrastructure stack, summarised

| Layer | Technology | What it handles | Where it runs |
|---|---|---|---|
| File storage | Cloudflare R2 | Raw uploaded files | Cloudflare's network |
| Database | Cloudflare D1 | Metadata, conversations, reports | Cloudflare edge |
| Vector index | Cloudflare Vectorize | Memory embeddings (column metadata) | Cloudflare edge |
| Computation | Pyodide (WebAssembly) | Python analysis, charts, transforms | Your browser |
| AI reasoning | Anthropic Claude API | Pattern interpretation, narrative | Anthropic's infrastructure |
| Application | Cloudflare Workers | API, auth, routing | Cloudflare edge |

Things you should double-check before signing off

Three items where you are in a better position to verify than this post is:

  1. Anthropic's current API data-handling terms. They are public and contractual, but they do update. Confirm the version in force at the time of your contract.
  2. Your own residency requirement against R2's region map. If you need EU-only or US-only storage with no transit, ask before onboarding so the bucket is provisioned in the right region from day one.
  3. Whether your team's security review accepts an architecture-and-DPA paper trail today, with SOC 2 evidence to follow. Some reviews block until the badge exists; others accept the underlying controls. Knowing which kind of review you are running saves a month.

FAQ

Where is my data stored?

Raw files live in Cloudflare R2, encrypted at rest with AES-256. Metadata (schemas, conversation history, audit logs) lives in Cloudflare D1. Memory embeddings live in Cloudflare Vectorize. All three are Cloudflare services on Cloudflare's network. Regional storage is supported: by default storage is global on Cloudflare's edge, and EU-localised storage is available on request.

Does Anthropic train on my data?

No. Anthropic's API terms prohibit training on customer data sent through the API, and that contract applies to every API customer including hey anna. The samples and statistics Anna sends during a reasoning step are used to generate a response and are not kept by hey anna afterwards. Anthropic retains API inputs and outputs for up to 30 days for abuse detection, which is separate from training.

Can I delete everything?

Yes. Deletion is immediate and permanent. When you delete a dataset, the file is purged from R2, all metadata and conversation history are removed from D1, and any embeddings are purged from Vectorize. No soft-delete window, no backup you cannot reach. Account deletion follows the same pattern across every record tied to your account.

Is hey anna SOC 2 certified?

Not yet. The audit is on the roadmap, and the infrastructure choices (Cloudflare platform, browser-based compute, Anthropic API terms) were made with the SOC 2 path in mind. If SOC 2 is a hard requirement for your security review today, email support@heyanna.studio for the current timeline and the evidence available in the interim.

Can I use hey anna with GDPR or HIPAA data?

GDPR-covered data: yes, with the standard caveats around data subject rights, processing terms, and EU residency. hey anna acts as a data processor, Cloudflare and Anthropic act as sub-processors, and a DPA is available. HIPAA: not today. A Business Associate Agreement is not in place, so hey anna is not appropriate for protected health information at this time.

Where does the Python actually run?

In your browser tab, inside a WebAssembly sandbox running Pyodide. Pyodide is a full Python distribution (the same CPython interpreter, compiled to WebAssembly) with NumPy, Pandas, and other scientific libraries available. Code authored by Anna executes locally against your data; results render in the same tab. hey anna's servers do not run your analysis code or see the computed output.

What metadata is stored about me and my data?

In D1: account identifiers, dataset names, column names and types, row counts, conversation history (prompts and Anna's responses, but not the row values she reasoned about), and report titles and contents you have saved. In Vectorize: vector embeddings of column metadata and saved memory items, never the underlying row values. Product analytics tracks UI events (which pages you visit, which buttons you click) and is scoped to UI behaviour, not data contents.

Who can see my data inside hey anna?

Operationally, no one routinely. Your datasets are scoped to your account; there is no shared data layer between users and no internal "view as user" tool that touches dataset contents. For incident response, a small number of engineers can access R2 buckets with audit logging — that access path is documented and is the same path a SOC 2 audit will eventually inspect.