Why Retrieval Drift Goes Undetected: The Case for Public Pipeline Dashboards
If your retrieval pipeline has a private dashboard nobody outside the team looks at, the dashboard is doing half its job. The other half is forcing the team to keep it honest.
- Drift, retrieval-recall, and schema-validity numbers belong on a public page, not a private one.
- A public dashboard forces honest numbers. Honest numbers force the team to keep them healthy.
- Start with three columns: docs ingested, drift over the last 24h, retrieval recall on a held-out set.
- Empty cells are fine on day one. Fabricated values are not.
Retrieval drift is the gradual divergence between the corpus a vector index was built against and the corpus the model is queried against, measured as embedding-centroid shift, label-distribution KL divergence, or retrieval-recall decline on a held-out set. Most teams discover retrieval drift through customer complaints. The teams that catch it first publish the measurements to a page anyone can read.
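As a rough sketch of what those three signals reduce to in code (numpy-based; the function names, the plain Euclidean centroid distance, and the recall formula are illustrative simplifications, not a prescription for any particular pipeline):

```python
import numpy as np

def centroid_shift(prev_batch: np.ndarray, new_batch: np.ndarray) -> float:
    """Distance between the embedding centroids of two ingestion batches."""
    return float(np.linalg.norm(prev_batch.mean(axis=0) - new_batch.mean(axis=0)))

def label_kl(baseline_counts: np.ndarray, current_counts: np.ndarray,
             eps: float = 1e-9) -> float:
    """KL divergence of the current label distribution from the baseline."""
    p = baseline_counts / baseline_counts.sum()
    q = current_counts / current_counts.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def recall_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int = 10) -> float:
    """Mean recall@k over a held-out eval set of (retrieved, relevant) pairs."""
    per_query = [len(set(r[:k]) & rel) / max(len(rel), 1)
                 for r, rel in zip(retrieved, relevant)]
    return float(np.mean(per_query))
```

A nightly job that recomputes these three numbers is enough to populate the eval-cycle columns in the table further down.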
The structural reason teams miss drift is that the operator's dashboard lives behind a VPN, behind a login, behind a quarterly review that rarely gets scheduled on time. The dashboard exists. The dashboard is private. The dashboard does not get read. By the time an internal review catches a declining recall number, downstream consumers have been getting worse answers for weeks.
What changes when the retrieval dashboard is public?
A public dashboard is an operating discipline, not a marketing surface. Three things shift the moment retrieval metrics are visible from outside the team:
The team stops shipping fabricated baselines. If drift goes on the front page, the drift had better be measured. If recall goes on the front page, the recall had better come from a real eval set.
The team starts reading their own numbers. A dashboard the team rarely shares gets one weekly glance. A dashboard at a public URL gets a glance every time the team links it to a prospect, an investor, or a candidate.
The team builds the missing pieces. The first time a column reads "—" on the public page is the last time it stays that way for long.
What goes on a public retrieval-drift dashboard?
The minimum useful public surface for a retrieval pipeline has three regions: a headline KPI strip, a per-corpus health table, and an operating chart over time. The table below lays out the working set of columns most teams find load-bearing once the dashboard ships.
| Column | Source | Honest at hour zero? | Updated |
|---|---|---|---|
| Corpora count | `SELECT COUNT(*) FROM corpora` | Yes | On publish |
| Vectors total | Sum across Qdrant / pgvector / equivalent | Yes | On publish |
| Last run age | `MAX(completed_at)` in run audit table | Yes | On publish |
| Records per corpus | `SUM(docs_after_dedup)` from run audit | Yes | On publish |
| Schedule mode | Per-corpus scheduled/manual/paused enum | Yes | On publish |
| Centroid drift | Wasserstein or KS-stat against prior batch centroid | No | Nightly eval |
| Label KL | KL divergence on label distribution vs. baseline | No | Nightly eval |
| Retrieval recall@10 | Held-out eval set scored against current index | No | Nightly eval |
| Schema validity % | Share of records matching current `schema_version` | No | Per cycle |
Columns marked "No" at hour zero render as "—" until the underlying eval cycle has run at least once. The dashboard is honest about what it has not measured yet. Anyone reading the page can see the operating tempo of the pipeline at a glance.
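For the columns that are honest at hour zero, the publish step is little more than running the queries from the Source column and shipping the results. A minimal sketch, assuming a sqlite3-style connection and a run-audit table named `ingestion_runs` (the table and connection names are hypothetical; the queries mirror the Source column above):

```python
# Hour-zero KPI queries; table names other than `corpora` are assumptions.
HOUR_ZERO_QUERIES = {
    "corpora_count": "SELECT COUNT(*) FROM corpora",
    "last_run_completed_at": "SELECT MAX(completed_at) FROM ingestion_runs",
    "records_per_corpus": (
        "SELECT corpus_id, SUM(docs_after_dedup) "
        "FROM ingestion_runs GROUP BY corpus_id"
    ),
}

def snapshot_hour_zero(conn) -> dict:
    """Run each hour-zero query and return a payload ready for the publish valve."""
    payload = {}
    for name, sql in HOUR_ZERO_QUERIES.items():
        payload[name] = conn.execute(sql).fetchall()  # sqlite3-style execute shortcut
    return payload
```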
Why do empty cells matter more than full ones?
Every team building data-for-AI infrastructure faces a temptation in the first week: fill the dashboard with synthetic baselines so it looks lived-in. Drift is set to 0.02σ because that is what drift "looks like" on production systems. Schema validity is set to 99.4% because the model is producing valid records. Retrieval recall is set to 0.94 because that is what a healthy index returns.
Every one of those numbers is a commitment. The dashboard now owes the world a follow-up. When the real metric comes in, it has to land at least that high or the team has shipped a regression on day one. Teams that start with synthetic baselines spend the next month tuning the numbers to match the synthetic baselines they should never have shipped. Energy goes into the chart, not into retrieval quality.
The cleaner discipline is the inverse: empty cells render as "—", headline KPIs reflect the actual count, and the eval cycle runs on its own schedule. The first time the dashboard publishes a real drift number, the team has earned that number. The first time it publishes a real recall figure, the team has built the held-out set. Everything on the page is something the team can defend.
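The rendering rule is small enough to state as code. A hypothetical `render_metric` helper, purely illustrative: a metric that has never been measured renders as the placeholder, never as a plausible-looking default.

```python
from typing import Optional

PLACEHOLDER = "\u2014"  # the "—" an unmeasured cell renders as

def render_metric(value: Optional[float], fmt: str = "{:.2f}") -> str:
    """Render a measured value, or the placeholder if the eval has never run."""
    return PLACEHOLDER if value is None else fmt.format(value)

# render_metric(None) -> "—"    (eval cycle has not run yet)
# render_metric(0.87) -> "0.87" (a number the team can defend)
```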
That discipline of honest empty cells is the framing behind the Diagest pipeline at ixprt. The dashboard at diagest.ixprt.com/pipeline is live. The numbers on it are the operating numbers of the pipeline, on production data, at the moment of the last publish. Several columns are "—" today because the eval cycle has not yet run against the freshly backfilled corpora. They will fill in. We are not going to fabricate them in the interim.
How does this differ from existing observability tools?
Most production teams already run observability stacks over their ingestion and model infrastructure: Datadog, Grafana, Honeycomb, Weights & Biases, Arize AI, WhyLabs, Fiddler. All of these are private surfaces by default, designed for internal engineering and ML teams.
A public retrieval-drift dashboard is a different artifact. It is built around three constraints those stacks do not impose:
1. The surface is unauthenticated by design. No login. No SAML. No analyst-only Slack channel. A reader from outside the company can read it the same way a reader inside the company can.
2. The data flows through a one-way publish valve. No public service has a network path back into the internal data plane. The internal system signs and pushes a curated payload; the public site verifies and persists. The architecture is asymmetric on purpose, similar in spirit to the public status pages vendors like Cloudflare have run for years (a minimal signing sketch follows this list).
3. The dashboard renders what is measured, never what is plausible. A standard observability dashboard often surfaces synthetic check results, alerting thresholds, or rolling averages computed from sparse data. A public retrieval dashboard either has a real measurement to surface or shows "—".
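A minimal sketch of that second constraint, assuming an HMAC over the serialized payload as the signature; a production valve may use asymmetric keys, and the `store` object stands in for whatever the public site persists into.

```python
import hashlib
import hmac
import json

def sign_payload(payload: dict, secret: bytes) -> dict:
    """Internal side: serialize, sign, and return the envelope to push outward."""
    body = json.dumps(payload, sort_keys=True).encode()
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return {"body": body.decode(), "signature": signature}

def verify_and_persist(envelope: dict, secret: bytes, store) -> bool:
    """Public side: verify, then persist; there is no call back into the data plane."""
    body = envelope["body"].encode()
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, envelope["signature"]):
        return False  # reject and stop; never reach back inward
    store.save(json.loads(body))
    return True
```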
How should a team operationalize this in one quarter?
Three operating practices for a team that wants the same discipline:
Publish the dashboard before the metrics are pretty. Ship the page on day one with whatever real numbers exist. Drift over zero days of ingestion is undefined. Print "—" and move on.
Treat empty cells as a forcing function. Every "—" on the dashboard is a small, visible commitment that the underlying measurement will get built. If a cell has been empty for a quarter, the team has decided that metric is not load-bearing. Either build it or remove the column.
Gate the publish path behind a signed valve. Internal infrastructure should reach the public surface through a one-way valve only. The internal system signs the payload; the public site verifies and persists. There is no read path from the public site back into the internal data plane. That separation is what makes the dashboard safe to keep open.
What does a public drift dashboard mean for a team shipping this year?
A public drift dashboard does three jobs at once: it is a recruiting page, a sales page, and an operating discipline. Engineers reading the page understand the team takes retrieval seriously. Prospects reading the page understand the pipeline is not a slide-deck capability. The team reading the page understands they cannot let any of those numbers slip without someone noticing.
Diagest is the data-for-AI pipeline we are building toward at ixprt. The public dashboard for it lives at diagest.ixprt.com/pipeline. For the broader architecture see the Diagest product page, and for the discipline of starting with empty cells see Honest Zeros: What a Real Data-Pipeline Dashboard Looks Like at Hour Zero.
Doesn't a public retrieval dashboard expose internals competitors could copy?
It exposes operating discipline. The internals are the corpus, the models, and the retrieval logic. None of that has to ship. What ships is whether the team is doing the work.
What if the dashboard numbers look bad on day one?
If the numbers look bad, the team has a problem worth solving. A hidden bad number is the same problem with a longer fuse.
How often should a retrieval-drift dashboard refresh?
Drift and recall are 24-hour metrics. A nightly publish cycle is enough. Ingestion-run health benefits from a per-hour refresh.
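One way to wire those two cadences, sketched with APScheduler; the scheduler choice and the job bodies are assumptions, not a prescription.

```python
from apscheduler.schedulers.blocking import BlockingScheduler

def run_nightly_eval():
    ...  # recompute centroid drift, label KL, and recall@10 against the held-out set

def refresh_ingestion_health():
    ...  # update last-run age and per-corpus record counts

scheduler = BlockingScheduler()
scheduler.add_job(run_nightly_eval, "cron", hour=2)            # nightly eval cycle
scheduler.add_job(refresh_ingestion_health, "cron", minute=0)  # hourly run health
scheduler.start()
```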
Want to skip the work?
Diagest absorbs the parse / clean / dedup / chunk / embed work and hands your AI exactly what it needs.
Contact us now →