
Raw data in.
AI-ready data out.

A pipeline that ingests at scale, cleans the noise, organizes what remains, and hands AI systems exactly what they need — at the shape and quality they need it in.

diagest.ixprt.com / pipeline · v1.07 · LIVE
SYS OPERATIONAL · INGEST 240k/s · PROC LAG 14ms · WORKERS 128/128 · GPU 14×H100 · MODELS 9 LOADED · QUEUE 1,402 · REGION US-EAST-1 · UTC 14:32:08.117 · EVAL GREEN
THROUGHPUT · 240k/s · 7d avg 215k
VOLUME (24H) · 1.6 TB · +8.2% vs 7d
SCHEMA VALID · 99.42% · drift 0.04σ
NOISE REMOVED · −38.4% · post-dedup + filter
AI-READY OUT · 982 M vectors/day
P99 LAT · 42 ms · SLO 80 ms
$/GB · $0.014 · unit cost
Source connectors · 11 active
S3 · stream · 1.2 TB
HTTP API · poll · 340 M
Webhooks · push · 4.1 M
PDF / DOCX · batch · 82 M
Postgres · CDC · 19 M
Kafka · stream · 208 M
RSS / feeds · poll · 880 K
Email · IMAP batch · 14 K
Snowflake · batch
Pipeline DAG · 8 stages · live throughput
INGEST → PARSE → CLEAN → VALIDATE → DEDUP → FILTER → CHUNK → EMBED → OUT
Live event log
DEDUP · s3-bucket-a · 142,803 near-duplicates merged via MinHash · 98.7% · 0s
PARSE · PDF/86 · extracted 4,210 structured rows from 86 reports · 100% · 3s
CHUNK · corpus-7 · tokenized + chunked 12.4k docs · 512t window · 512t · 8s
EMBED · ix-embed-l · 982M vectors written → vector store · 1024d · 12s
DRIFT · schema/v17 · RSS feed-3 dropped column "publishedTs" → backfill · 0.4σ · 31s
EVAL · retrieval · recall@10 = 0.94 on holdout · drift 0.02 · 0.94 · 1m
Embedding space · UMAP proj · 50k sample · 1024 → 2D
3 clusters · 982M pts · silhouette 0.71 · kNN k=8
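
A projection like the one above could be reproduced with the open-source umap-learn package. A minimal sketch, assuming the 50k sampled vectors sit in a NumPy array (the random data here is a stand-in, and clustering the 2D projection is an illustrative choice, not necessarily how the dashboard computes it):

import numpy as np
import umap                                   # pip install umap-learn
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Stand-in for the 50k-row sample of 1024-d embeddings.
rng = np.random.default_rng(0)
vectors = rng.standard_normal((50_000, 1024)).astype(np.float32)

# 1024 → 2D with k=8 neighbors, matching the dashboard's settings.
proj = umap.UMAP(n_neighbors=8, n_components=2, metric="cosine").fit_transform(vectors)

# Cluster the projection and score it; the dashboard reports 3 clusters, silhouette 0.71.
labels = KMeans(n_clusters=3, n_init=10).fit_predict(proj)
print("silhouette:", round(silhouette_score(proj, labels), 2))
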
Models & parsers
IX-EMBED-L · v3.2 · 1024d · R@10 0.94
IX-CHUNK · v1.8 · semantic · 512t
IX-DEDUP · v2.1 · MinHash · 256 perm
IX-PARSE-PDF · v4.0 · table-aware
IX-CLASS · v2.6 · 14 labels · F1 0.88
IX-NER · v1.4 · 22 entities
IX-CLEAN · v3.1 · regex + LM rules
AI-ready outputs · 6 sinks
Vector store · Qdrant · 982 M · live
Parquet · S3 · 3.2 TB · live
JSONL · S3 · 1.1 TB · live
HF dataset · Hub · 122 k · live
REST API · HTTPS · 42 ms · live
Snowflake · batch · paused
Inferred schema · v17 · corpus-7
// auto-generated · drift 0.04σ
id          string · uuid         // 100% present
source      enum · 11 levels      // 100% present
timestamp   datetime · ISO8601    // 100% present
content     text · < 32k tokens   // 99.8% present
embedding   float32[1024]         // 100% present
labels      array<enum>           // 88% present · 14 levels
entities    array<span>           // 76% present · 22 types
quality     float · [0,1]         // 100% present · μ 0.91
lineage     array<ref>            // 100% present
Quality scorecard · last 24h
Schema validity · 99.4%
Completeness · 96.1%
Uniqueness (post-dedup) · 99.0%
Freshness P95 · 88.0%
Embedding coverage · 100%
Retrieval recall@10 · 94.0%
Drift (KS-stat avg) · 0.14
VECTORS 982,114,206 · DEDUP RECALL 98.7% · CORPUS corpus-7 · CHUNK MEAN 487t · UNIQUE TOKENS 2.41 B · RETRIEVAL R@10 0.94 · SCHEMA DRIFT 0.04σ · CACHE HIT 87.4% · $/GB $0.014 · REGION US-EAST-1 · SLA 99.97%

Who it's for

Built for teams burning time on data prep.

Diagest replaces the bespoke ETL + cleaning + chunking + embedding work that sits between raw inputs and a model that's actually useful.

AI Labs · Frontier Research

Train on cleaner data

Deduplicated, schema-aligned, drift-monitored corpora. We absorb the messy work so your team focuses on architecture, not pipelines.

Enterprise Data Teams

One retrieval-ready surface

Bring private documents, databases, and event streams into a single retrieval surface. Vector store, Parquet, JSONL, or REST — your call.

Research Organizations

Convert archives to corpora

PDF/DOCX archives, RSS feeds, and historical APIs become AI-queryable, with provenance, quality scoring, and retrieval-recall measurement built in.

Custom AI Products

Stop burning weeks on prep

The ingestion, cleaning, chunking, and embedding work between your raw data and a shippable AI product is exactly what Diagest replaces.

How it works

From any source to any AI-ready sink.

01 · Connect

Any source

S3, HTTP APIs, webhooks, PDF/DOCX, Postgres CDC, Kafka, RSS, email. New connectors on request.

02 · Process

8-stage pipeline

Ingest → parse → clean → validate → dedup → filter → chunk → embed. Quality + drift tracked end-to-end. (The full flow is sketched in code after these steps.)

03 · Deliver

AI-ready outputs

Vector store, Parquet, JSONL, HF dataset, REST. One schema, full provenance, query-ready.
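
A minimal sketch of those three steps in Python. Every name here is illustrative (this page shows no SDK), and the stage order follows the 8-stage DAG above:

from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass
class Record:
    source: str
    content: str

STAGES = ["parse", "clean", "validate", "dedup", "filter", "chunk", "embed"]

def apply_stage(stage: str, rec: Optional[Record]) -> Optional[Record]:
    # Placeholder: a real stage transforms the record or drops it (returns None).
    return rec

def ingest(sources: Iterable[dict]) -> Iterable[Record]:
    # 01 · Connect: normalize whatever a connector hands us into records.
    for src in sources:
        yield Record(source=src["name"], content=src["payload"])

def process(records: Iterable[Record]) -> Iterable[Record]:
    # 02 · Process: run each record through the remaining pipeline stages.
    for rec in records:
        for stage in STAGES:
            rec = apply_stage(stage, rec)
            if rec is None:
                break
        if rec is not None:
            yield rec

def deliver(records: Iterable[Record], sinks: list) -> None:
    # 03 · Deliver: fan each AI-ready record out to every configured sink.
    for rec in records:
        for sink in sinks:
            sink.write(rec)

deliver(process(ingest([{"name": "s3", "payload": "raw bytes ..."}])), sinks=[])
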

Get on the platform.

Tell us about your sources, your model, and what AI-ready output you need. We'll come back with a scoped pilot.

FAQ

Frequently asked.

What is Diagest?

A data-for-AI pipeline. It consumes large data volumes from any source, cleans and deduplicates them, filters out noise, and organizes what remains into AI-ready outputs — vectors, Parquet, JSONL, or REST API.
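
For the deduplication step, the dashboard credits MinHash with 256 permutations (IX-DEDUP v2.1). A minimal sketch of that technique using the open-source datasketch package; the 0.8 similarity threshold and word-level shingles are assumptions, not Diagest's actual settings:

from datasketch import MinHash, MinHashLSH

def minhash(text: str, num_perm: int = 256) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):   # word shingles; production systems often use n-grams
        m.update(token.encode("utf-8"))
    return m

docs = {
    "a": "raw data in ai ready data out",
    "b": "raw data in ai ready data out today",   # near-duplicate of "a"
    "c": "an entirely different document body",
}

lsh = MinHashLSH(threshold=0.8, num_perm=256)     # threshold is an assumption
for key, text in docs.items():
    lsh.insert(key, minhash(text))

print(lsh.query(minhash(docs["a"])))              # likely ['a', 'b']: candidates to merge
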

What sources does Diagest ingest?

Object storage (S3 and equivalents), HTTP APIs, webhooks, PDF and DOCX archives, SQL databases via CDC, Kafka streams, RSS feeds, and email mailboxes. New connectors are added on request.

What output formats does Diagest produce?

Vector store (Qdrant-compatible), Parquet on S3, JSONL on S3, Hugging Face datasets, REST API, and on-demand sinks. All outputs share a single schema with provenance and quality scoring.
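
What a single record on that shared schema might look like when written to the JSONL sink; field names follow the inferred schema v17 shown above, and every value is invented for illustration:

import json

record = {
    "id": "9f1c2e4a-7b3d-4c8e-a1f0-2d5b6c7e8901",  # string · uuid
    "source": "s3",                                 # enum · 11 levels
    "timestamp": "2025-01-14T14:32:08Z",            # datetime · ISO8601
    "content": "Quarterly report, section 3 ...",   # text · < 32k tokens
    "embedding": [0.013, -0.087, 0.142],            # float32[1024], truncated here
    "labels": ["report", "finance"],                # array<enum> · 14 levels
    "entities": [{"type": "ORG", "start": 0, "end": 9}],  # array<span> · 22 types
    "quality": 0.93,                                # float · [0,1]
    "lineage": ["s3://bucket-a/reports/q3.pdf"],    # array<ref>
}

with open("corpus-7.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")              # one JSON object per line
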

How does Diagest handle data quality and drift?

Schema validity, completeness, uniqueness, freshness, embedding coverage, retrieval recall, and KS-stat drift are tracked per corpus per 24h. Drift alerts trigger schema-version bumps and backfills.
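
The drift number is a two-sample Kolmogorov–Smirnov statistic. A minimal sketch of such a check with scipy, where the 0.1 alert threshold and the synthetic distributions are assumptions:

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=10_000)   # field values, prior window
current = rng.normal(loc=0.2, scale=1.0, size=10_000)     # same field, fresh 24h window

stat, p_value = ks_2samp(reference, current)
print(f"KS statistic: {stat:.3f}  p={p_value:.3g}")

DRIFT_THRESHOLD = 0.1   # assumption; per the FAQ, an alert bumps the schema version
if stat > DRIFT_THRESHOLD:
    print("drift alert: bump schema version and trigger backfill")
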

What is the latency profile?

Streaming sources reach AI-ready output in tens of milliseconds. Batch sources are throughput-bound and usually finish within the hour for typical corpus sizes.

How is pricing structured?

Per-GB processed plus a flat platform fee. Contact us for quotes calibrated to your sources, output sinks, and SLA needs.

From the blog

Notes on what we're building.