
Honest Zeros: Why a New Data-Pipeline Dashboard Should Launch With Empty Cells

When a new dashboard launches with drift at 0.04σ, recall at 0.94, and schema validity at 99.4%, the team running it has already lost the argument. The dashboard is not measuring anything yet.

Reid Spachman
TL;DR
  • Synthetic launch numbers are commitments the team cannot keep.
  • Empty cells force the team to build the missing measurement, not tune the visible one.
  • A pipeline at hour zero has no drift baseline. Print '—'.
  • Vectors-total and ingestion-run counts are honest at hour zero. Drift, recall, and schema validity are not.

An honest zero is a numerical metric that reads as 0, "—", or null on a public dashboard because the underlying measurement has not yet been computed against real production data. Most data-pipeline dashboards launch with synthetic baselines instead: drift at 0.04σ, recall at 0.94, schema validity at 99.4%. The numbers look right. The dashboard works as a sales asset. The team has already lost the argument with anyone who asks how the numbers were computed.

The choice between honest zeros and synthetic baselines is the first operating discipline a data-for-AI team commits to. Every later decision flows from it.

What is the difference between honest zeros and synthetic baselines?

A retrieval pipeline at hour zero has a small set of metrics that can be computed from production state, and a larger set that cannot. The split is structural, not stylistic: drift requires a prior batch to compute against, retrieval-recall requires a held-out evaluation set to score, and schema validity requires at least one eval cycle to have run. None of those things exist on the first day of a new pipeline.

Metric | Honest at hour zero? | What's needed to compute
Corpora count | Yes | SELECT COUNT(*) FROM corpora
Vectors total | Yes | Sum of points across vector sinks (Qdrant, pgvector, Pinecone)
Last run age | Yes | MAX(completed_at) in the run audit table
Records per corpus | Yes | Run audit table aggregation
24h ingestion success/failure | Yes | Run audit table grouped by status
Centroid drift | No | Prior batch baseline + current batch
Label-distribution KL | No | Prior batch label histogram + current batch histogram
Retrieval recall@10 | No | Held-out (query, answer) eval set
Schema validity % | No | Baseline schema run + current records
Mean time to first byte | No | Quiet baseline period for percentiles
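
The structural point is easiest to see in code. Below is a minimal sketch of a centroid-drift check, assuming batch embeddings arrive as NumPy arrays; the function name, the sigma definition, and how the prior batch is stored are illustrative, not Diagest's actual implementation. The part that matters is the early return: at hour zero the honest value is None, and the dashboard prints "—".

    import numpy as np

    def centroid_drift_sigma(current_batch: np.ndarray,
                             prior_batch: np.ndarray | None) -> float | None:
        # Distance between batch centroids, expressed in units of the prior
        # batch's spread. Returns None when there is no prior batch: at hour
        # zero the honest value is "not yet measurable", not 0.0.
        if prior_batch is None or len(prior_batch) == 0:
            return None  # the dashboard renders this as "—"
        prior_centroid = prior_batch.mean(axis=0)
        current_centroid = current_batch.mean(axis=0)
        # One sigma = standard deviation of prior-batch distances to its centroid.
        sigma = np.linalg.norm(prior_batch - prior_centroid, axis=1).std()
        if sigma == 0:
            return None
        return float(np.linalg.norm(current_centroid - prior_centroid) / sigma)

The same None-first shape applies to the rest of the bottom half: recall needs the held-out eval set, validity needs a baseline schema run, latency percentiles need a quiet baseline window.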

A dashboard taking the honest-zeros posture prints real values in the top half of that table and "—" in the bottom half, then explains nothing else. The reader who knows what these metrics are knows why "—" is correct. The reader who does not learns fast.

Why are synthetic launch numbers a trap?

Every synthetic number is a commitment to a future real number. The first time the real metric arrives, the team either has to match the synthetic baseline or ship a regression. Teams tend to tune the metric rather than the underlying pipeline. Energy goes into making the chart green, not into making retrieval better.

The pattern is well documented in the broader ML observability literature: Google's data-cascades paper found that vague metric specifications and unmaintained data baselines drive the largest production-quality failures in deployed ML systems. The MLCommons benchmark family, the HELM leaderboards, and the Ragas framework all share a common discipline: every published metric is computed against a specified evaluation set on a defined cadence. Without that specification, the number is a marketing artifact.

The same logic applies one layer down. A public dashboard either shows numbers that are computed against a specified eval cycle, or it shows "—". There is no honest third option.

Why does the discipline matter more than the appearance?

A serious prospect on a serious vendor call will ask, within the first thirty minutes, how a specific metric was computed. The team that synthesized 99.4% on day one is now improvising. The team that printed "—" and is now showing the same dashboard with a real 99.4% they can trace to a specific evaluation cycle is doing technical due diligence in real time. The prospect can tell the difference. The investor can tell the difference. The next engineer the team tries to hire can tell the difference.

There is a second-order effect. The team running a dashboard with "—" in three columns has a permanent reminder of what is not measured yet. Every standup, every weekly review, every quarterly plan benefits from those dashes. The team running a dashboard with synthetic baselines has the opposite: a perpetual reason to keep the appearance up. Energy goes into the wrong place.

How do production teams operate this discipline?

Three patterns from teams shipping retrieval at scale in 2026:

1. Build the run-audit table first. Before any drift or recall is computed, the pipeline writes one row per ingestion run with status, started-at, completed-at, docs-fetched, docs-after-dedup, and an error-message column. That table is the source of truth for the top-half honest metrics. Every other measurement layers on top.
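
A minimal sketch of that table and the per-run write, using sqlite3 so it runs anywhere; the table name, columns, and types follow the list above but are an illustration, not Diagest's actual schema.

    import sqlite3

    DDL = """
    CREATE TABLE IF NOT EXISTS ingestion_runs (
        run_id           TEXT PRIMARY KEY,
        corpus           TEXT NOT NULL,
        status           TEXT NOT NULL,   -- 'succeeded' | 'failed'
        started_at       TEXT NOT NULL,   -- ISO-8601 UTC
        completed_at     TEXT,
        docs_fetched     INTEGER,
        docs_after_dedup INTEGER,
        error_message    TEXT
    )"""

    def record_run(conn: sqlite3.Connection, run: dict) -> None:
        # One row per ingestion run; `run` maps column names to values.
        # Last-run age, records per corpus, and the 24h success/failure
        # summary are all plain aggregations over this table.
        conn.execute(DDL)  # idempotent thanks to IF NOT EXISTS
        conn.execute(
            "INSERT INTO ingestion_runs VALUES "
            "(:run_id, :corpus, :status, :started_at, :completed_at, "
            ":docs_fetched, :docs_after_dedup, :error_message)",
            run,
        )
        conn.commit()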

2. Tie metric publication to the eval cycle, not the publish cycle. The dashboard publishes whatever metrics the last eval cycle measured. If the eval has not run yet, the cell stays "—". This is the same pattern used by Arize AI and WhyLabs for model-drift monitoring: the measurement schedule defines the dashboard, not the reverse.
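
A sketch of that gate, assuming the eval harness records each completed cycle in an eval_cycles table (a hypothetical name and shape): before the first completed cycle, every eval-derived metric comes back as None and the cell stays "—".

    import sqlite3

    def latest_eval_metrics(conn: sqlite3.Connection) -> dict[str, float | None]:
        # Publish whatever the last completed eval cycle measured; nothing more.
        row = conn.execute(
            "SELECT recall_at_10, schema_validity_pct FROM eval_cycles "
            "WHERE status = 'completed' ORDER BY completed_at DESC LIMIT 1"
        ).fetchone()
        if row is None:
            # No eval has run yet: the dashboard keeps printing "—".
            return {"recall_at_10": None, "schema_validity_pct": None}
        return {"recall_at_10": row[0], "schema_validity_pct": row[1]}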

3. Ship through a one-way publish valve. Internal infrastructure writes HMAC-signed metrics through a single POST endpoint. The public site verifies and persists. There is no read path back into the internal data plane. Vendor-side examples of similar asymmetric architectures: Cloudflare's public status page, GitHub's public uptime dashboard, Vercel's status surface. All three publish honest operating data without exposing internals.
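
Both sides of that valve fit in the Python standard library. A minimal sketch, assuming a shared secret provisioned out of band and a single metrics POST endpoint; the names and payload shape are illustrative.

    import hashlib, hmac, json

    SECRET = b"provisioned-out-of-band"  # known only to the pipeline and the public site

    def sign_payload(payload: dict) -> tuple[bytes, str]:
        # Pipeline side: serialize deterministically, sign, POST body + signature.
        body = json.dumps(payload, sort_keys=True).encode()
        return body, hmac.new(SECRET, body, hashlib.sha256).hexdigest()

    def verify_payload(body: bytes, signature: str) -> bool:
        # Public-site side: verify, then persist to the site's own store.
        # There is no read path back into the internal data plane, so a
        # compromised public surface can at worst show stale metrics.
        expected = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, signature)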

What is the minimum credible launch checklist?

For a new retrieval pipeline shipping a public dashboard, the minimum credible launch includes:

  • The corpora list, with each row's schedule mode visible.
  • A vectors-total counter, fetched live from the sink.
  • A 24-hour ingestion-run summary that does not invent runs that have not happened.
  • A drift chart that renders blank when no drift has been measured.
  • A retrieval-recall chart that renders blank when no eval has been run.
  • A timestamp on the last successful publish.
  • A footer link explaining what the dashboard measures and what it does not.

That is enough. The team has the rest of the year to fill in the empty cells, and every cell they fill in will be a real measurement.
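
The render rule behind those blank charts and empty cells is one small function. A sketch, assuming unmeasured metrics arrive as None:

    DASH = "\u2014"  # the character the dashboard prints for unmeasured cells

    def cell(value: float | None, fmt: str = "{:.2f}") -> str:
        # A measured value renders as a number; an unmeasured one renders as "—".
        # There is no branch that invents a plausible-looking default.
        return fmt.format(value) if value is not None else DASH

    # cell(None) -> "—"    cell(0.9412) -> "0.94"    cell(99.42, "{:.1f}%") -> "99.4%"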

This is the posture we took with Diagest. The dashboard at diagest.ixprt.com/pipeline shows real per-corpus record counts today. It shows real vector totals. It shows drift, recall, and validity as "—" because the eval cycle has not yet run against the data we backfilled this week. Those columns will fill in over the next few cycles. We are not going to fabricate them in the interim.

What is the one test to run before launching any data-pipeline dashboard?

Before you launch a data-pipeline dashboard, look at every cell with a number in it and ask: if a sophisticated engineer asks on a call tomorrow how this specific value was computed, can I answer in under a minute, with a specific query, on a specific table, in a specific schema?

If the answer is no, that cell should be "—" until the answer is yes.
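
What an under-a-minute answer can look like in practice, using the run-audit table sketched earlier (names still illustrative): one query, one table, one schema.

    import sqlite3

    def ingestion_summary_24h(conn: sqlite3.Connection) -> dict[str, int]:
        # The answer for the 24h ingestion cell: runs completed in the last
        # day, grouped by status, straight from ingestion_runs.
        rows = conn.execute(
            "SELECT status, COUNT(*) FROM ingestion_runs "
            "WHERE completed_at >= datetime('now', '-1 day') GROUP BY status"
        ).fetchall()
        return {status: count for status, count in rows}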

Diagest is the data-for-AI pipeline we are building toward at ixprt. For the broader architecture see the Diagest product page, and for the case that the dashboard belongs in public at all see Why Retrieval Drift Goes Undetected: The Case for Public Pipeline Dashboards.

Frequently asked

Won't prospects be unimpressed by a dashboard full of dashes?

Sophisticated prospects ask how the numbers were calculated. The team that can defend its numbers wins. The team that can't loses on the second call.

How long until the dashes turn into numbers on a new retrieval pipeline?

For drift: one ingestion cycle to establish baseline, one more to measure delta. For retrieval recall: as long as it takes to build the held-out evaluation set, then nightly.

What if our backfill data produces real numbers from day one?

Then publish them. The rule is honesty, not pessimism.

Founder at ixprt. Building Diagest, AssetModel, and DailyWallStreet. Based in New York.

Want to skip the work?

Diagest absorbs the parse / clean / dedup / chunk / embed work and hands your AI exactly what it needs.

Contact us now →