RAG Ingestion
Build retrieval indexes deliberately before agent runs need them.
RAG quality starts before the agent runs. Ingestion is the application-owned workflow that loads source material, normalizes it, chooses ids and metadata, embeds it, writes it to a vector store, and decides when indexes are refreshed.
Do not rebuild a retrieval index inside a hot prompt path. Build it during deploy, startup, admin ingestion, or background work.
Scenario
A support agent answers product questions from Markdown docs, PDFs, release notes, and internal policy pages. The application needs a repeatable ingestion job so every agent run sees fresh, filtered, traceable context.
When to Use It
Use this pattern when:
- knowledge is larger than static .context(...)
- documents change outside the request path
- documents need metadata filters such as tenant, product, visibility, version, or locale
- retrieval evidence should be debugged in traces and evals
- multiple agents share the same knowledge source
Architecture Shape
| Layer | Responsibility |
|---|---|
| loader | read files, directories, PDFs, bytes, or application records |
| normalizer | clean text, split sections, remove boilerplate, attach source ids |
| metadata | tenant, product, visibility, version, locale, source path, updated time |
| embedding job | call embedDocuments(...) with bounded concurrency |
| vector store | store embedded documents and expose an index |
| refresh policy | rebuild or incrementally update at an explicit boundary |
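The embedding job row asks for bounded concurrency, which the Code Example section sets with `concurrency: 2`. As an illustration of what that bound means, here is a minimal sketch of a worker pool that never has more than `limit` calls in flight. `mapWithConcurrency` is a hypothetical helper written for this page, not an Anvia API, and the sketch makes no claim about how `embedDocuments` is implemented internally.

```typescript
// Hypothetical helper (not part of Anvia): run `worker` over `items`
// with at most `limit` calls in flight, preserving input order in the output.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new Error("limit must be a positive integer");
  }
  const results = new Array<R>(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until none are left.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  }

  const runnerCount = Math.min(limit, items.length);
  const runners: Promise<void>[] = [];
  for (let i = 0; i < runnerCount; i++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results;
}
```

A low limit keeps ingestion within the embedding provider's rate limits; raising it trades that safety for throughput.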
Code Example
import { InMemoryVectorStore, embedDocuments } from "@anvia/core";
import { FileLoader, fileLoaderToDocuments } from "@anvia/core/loaders";
import { embeddings } from "./models";

const loaded = await fileLoaderToDocuments(
  FileLoader.withGlob("content/support/**/*.md")
    .readWithPath()
    .ignoreErrors(),
);

const documents = loaded.map((doc) => ({
  id: doc.id,
  title: doc.metadata.path,
  body: doc.text,
  product: "support",
  visibility: "public",
}));

const embedded = await embedDocuments(embeddings, documents, {
  id: (doc) => doc.id,
  content: (doc) => `${doc.title}\n${doc.body}`,
  metadata: (doc) => ({
    product: doc.product,
    visibility: doc.visibility,
    title: doc.title,
  }),
  concurrency: 2,
});

export const supportDocsIndex = InMemoryVectorStore.fromDocuments(embedded).index(embeddings);

For production storage, build the same embedded document shape and write it to the vector store your application owns.
Document Ids and Metadata
Stable ids make refreshes and debugging easier.
const document = {
  id: `support:${locale}:${slug}:${sectionId}`,
  title,
  body,
  locale,
  product,
  visibility,
  sourceUrl,
  updatedAt,
};

Metadata should support the filters and trace evidence you need later. Do not rely on prompt instructions to prevent a tenant or visibility leak.
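One way to get ids of that shape is to derive every part from the source itself, so repeated ingestion of the same file produces the same id. The sketch below is an assumption-laden illustration: `slugify`, `IdParts`, and `stableDocumentId` are hypothetical names, and deriving `slug` from the source path and `sectionId` from the section heading is one possible policy, not something Anvia prescribes.

```typescript
// Hypothetical helpers for illustration; none of these names come from Anvia.

// Lowercase the input and collapse every run of non-alphanumerics into "-".
// Non-ASCII letters are also collapsed; a real job may want to transliterate.
function slugify(input: string): string {
  return input
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

interface IdParts {
  locale: string;
  sourcePath: string; // e.g. "content/support/Billing FAQ.md"
  sectionHeading: string; // empty string for the part before the first heading
}

// Build an id in the `support:${locale}:${slug}:${sectionId}` shape
// using only values that come from the source document.
function stableDocumentId(parts: IdParts): string {
  const pathWithoutExtension = parts.sourcePath.replace(/\.[a-z0-9]+$/i, "");
  const slug = slugify(pathWithoutExtension);
  const sectionId = slugify(parts.sectionHeading) || "root";
  return `support:${parts.locale.toLowerCase()}:${slug}:${sectionId}`;
}
```

Ids built this way change when a file moves or a heading is renamed. If that is too fragile for your content, an explicit id in front matter or in the source record is a sturdier input.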
Chunking Ownership
Anvia loaders convert source files into documents. Your application owns semantic chunking policy.
| Source | Chunk default |
|---|---|
| Markdown docs | one document per page or heading section |
| PDFs | one document per page, then merge or split if needed |
| tickets or records | one document per record or summary |
| code files | one document per file or symbol-level section |
| long policies | one document per rule group with source id |
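For the Markdown row, "one document per heading section" can be sketched as a small splitter. This is an illustrative policy under stated assumptions, not an Anvia feature: `splitMarkdownByHeading` and `Section` are hypothetical names, only `#` and `##` headings start a new section, and lines inside fenced code blocks are never treated as headings.

```typescript
// Hypothetical chunker for illustration; not part of Anvia.
interface Section {
  sectionId: string;
  heading: string; // empty for text before the first heading
  body: string;
}

function splitMarkdownByHeading(markdown: string): Section[] {
  const sections: Section[] = [];
  const seen = new Map<string, number>();
  let heading = "";
  let lines: string[] = [];
  let inFence = false;

  // Close the section collected so far and start an empty one.
  const flush = (): void => {
    const body = lines.join("\n").trim();
    if (heading !== "" || body !== "") {
      const base =
        heading.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "") || "intro";
      // Repeated headings get a numeric suffix so ids stay unique per page.
      const count = (seen.get(base) ?? 0) + 1;
      seen.set(base, count);
      sections.push({ sectionId: count === 1 ? base : `${base}-${count}`, heading, body });
    }
    lines = [];
  };

  for (const line of markdown.split(/\r?\n/)) {
    // Track fenced code blocks so a "# comment" inside code is not a heading.
    if (/^\s*(`{3,}|~{3,})/.test(line)) {
      inFence = !inFence;
    }
    const match = inFence ? null : /^(#{1,2})\s+(.*)$/.exec(line);
    if (match) {
      flush();
      heading = (match[2] ?? "").trim();
    } else {
      lines.push(line);
    }
  }
  flush();
  return sections;
}
```

Each returned `sectionId` can feed the section part of a document id, and `heading` is a natural candidate for the title that gets embedded alongside the body.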
Refresh Strategy
| Strategy | Use when |
|---|---|
| deploy-time rebuild | docs change with code deploys |
| startup build | small indexes and local development |
| background ingestion | docs change frequently or source data is remote |
| tenant-specific index | strict tenant isolation is required |
| metadata-filtered shared index | source store supports reliable filtering |
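Background ingestion usually wants an incremental update rather than a full rebuild. A minimal sketch of that decision, assuming the application keeps a map of document id to content fingerprint from the previous run: `fingerprint`, `planRefresh`, and the plan shape are hypothetical, and the hash is a small FNV-1a style function over UTF-16 code units, used here only as a self-contained stand-in for a real content hash such as SHA-256.

```typescript
// Hypothetical refresh planner for illustration; not part of Anvia.
interface SourceDoc {
  id: string;
  content: string;
}

interface RefreshPlan {
  toEmbed: string[]; // new or changed documents
  toDelete: string[]; // documents that disappeared from the source
  unchanged: string[]; // documents whose fingerprint matches the last run
}

// Non-cryptographic FNV-1a style hash, rendered as 8 hex characters.
function fingerprint(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return ("0000000" + (hash >>> 0).toString(16)).slice(-8);
}

// Compare the current source documents with the fingerprints stored last run.
function planRefresh(
  previous: ReadonlyMap<string, string>,
  current: readonly SourceDoc[],
): RefreshPlan {
  const plan: RefreshPlan = { toEmbed: [], toDelete: [], unchanged: [] };
  const seen = new Set<string>();
  for (const doc of current) {
    seen.add(doc.id);
    if (previous.get(doc.id) === fingerprint(doc.content)) {
      plan.unchanged.push(doc.id);
    } else {
      plan.toEmbed.push(doc.id);
    }
  }
  previous.forEach((_hash, id) => {
    if (!seen.has(id)) {
      plan.toDelete.push(id);
    }
  });
  return plan;
}
```

The plan only works if ids are stable: an id that changes between runs shows up as one deletion plus one new embedding, which costs money and briefly removes the document from search.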
Failure Modes
| Failure | Fix |
|---|---|
| stale answers | define refresh triggers and store updatedAt metadata |
| weak retrieval | improve chunking, titles, content selector, and source text |
| tenant leakage | separate indexes or enforce metadata filters before search |
| slow ingestion | bound concurrency and move ingestion off request paths |
| hard-to-debug context | include stable ids, source paths, titles, and versions |
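The tenant leakage fix depends on filtering before ranking, so a forbidden document can never reach the prompt however similar it is to the query. The sketch below shows that ordering over a plain in-memory array. Everything in it is hypothetical (`IndexedDoc`, `Caller`, `searchForCaller`, and the toy cosine scorer); a production vector store would apply the same predicate through its own filter mechanism.

```typescript
// Hypothetical in-memory search for illustration; not part of Anvia.
interface IndexedDoc {
  id: string;
  vector: number[];
  metadata: { tenant: string; visibility: "public" | "internal" };
}

interface Caller {
  tenant: string;
  canSeeInternal: boolean;
}

// Cosine similarity; returns 0 when either vector has zero length.
function cosine(a: readonly number[], b: readonly number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const length = Math.min(a.length, b.length);
  for (let i = 0; i < length; i++) {
    const x = a[i] as number;
    const y = b[i] as number;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  return denominator === 0 ? 0 : dot / denominator;
}

// Apply the access filter first, then score and rank only what the caller may see.
function searchForCaller(
  docs: readonly IndexedDoc[],
  queryVector: readonly number[],
  caller: Caller,
  topK: number,
): { id: string; score: number }[] {
  return docs
    .filter(
      (doc) =>
        doc.metadata.tenant === caller.tenant &&
        (caller.canSeeInternal || doc.metadata.visibility === "public"),
    )
    .map((doc) => ({ id: doc.id, score: cosine(queryVector, doc.vector) }))
    .sort((left, right) => right.score - left.score)
    .slice(0, topK);
}
```

The caller's tenant and permissions should come from the authenticated request, never from the model's output or the user's message text.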
Test Checklist
- Test loaders against representative files and PDFs.
- Snapshot normalized documents before embedding.
- Verify ids are stable across repeated ingestion.
- Verify metadata includes tenant, visibility, product, and source fields where needed.
- Run search probes and assert expected document ids appear.
- Add eval cases for known questions that depend on retrieved facts.
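The search-probe item can be automated with a small harness that runs known queries and reports which expected ids are missing. This is a sketch under assumptions: `Probe`, `ProbeFailure`, `SearchFn`, and `runProbes` are hypothetical names, and `SearchFn` stands in for whatever function your application uses to query its index and return document ids.

```typescript
// Hypothetical probe harness for illustration; not part of Anvia.
interface Probe {
  query: string;
  expectedIds: string[];
  topK?: number; // defaults to 5
}

interface ProbeFailure {
  query: string;
  missingIds: string[];
  returnedIds: string[];
}

// Adapter over the application's real search call: query in, ranked ids out.
type SearchFn = (query: string, topK: number) => Promise<string[]>;

// Run every probe and collect the ones whose expected ids did not all appear.
async function runProbes(search: SearchFn, probes: readonly Probe[]): Promise<ProbeFailure[]> {
  const failures: ProbeFailure[] = [];
  for (const probe of probes) {
    const returnedIds = await search(probe.query, probe.topK ?? 5);
    const missingIds = probe.expectedIds.filter((id) => returnedIds.indexOf(id) === -1);
    if (missingIds.length > 0) {
      failures.push({ query: probe.query, missingIds, returnedIds });
    }
  }
  return failures;
}
```

In a test suite, assert that `runProbes` returns an empty array; on failure, the returned ids show what the index surfaced instead.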
