
RAG Ingestion

Build retrieval indexes deliberately before agent runs need them.

RAG quality starts before the agent runs. Ingestion is the application-owned workflow that loads source material, normalizes it, chooses ids and metadata, embeds it, writes it to a vector store, and decides when indexes are refreshed.

Do not rebuild a retrieval index inside a hot prompt path. Build it during deploy, startup, admin ingestion, or background work.
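
One way to keep the build off the request path is to start it once at process startup and let every handler await the same promise. The following is a minimal sketch under that assumption; `SearchIndex`, `buildIndex`, `getIndex`, and `handleQuestion` are hypothetical names for illustration, not Anvia APIs.

```typescript
// Hypothetical index shape; stands in for whatever your vector store exposes.
type SearchIndex = { search: (query: string) => Promise<string[]> };

// Counts builds so the "build once" behavior can be checked.
let buildCount = 0;

// Placeholder for the real ingestion job (load, normalize, embed, write).
async function buildIndex(): Promise<SearchIndex> {
  buildCount += 1;
  const ids = ["support:en:billing:intro"];
  return { search: async () => ids };
}

// Memoize the promise so concurrent callers share a single build.
let indexPromise: Promise<SearchIndex> | undefined;

export function getIndex(): Promise<SearchIndex> {
  if (!indexPromise) {
    indexPromise = buildIndex();
  }
  return indexPromise;
}

// Request handlers only await the shared index; they never rebuild it.
export async function handleQuestion(question: string): Promise<string[]> {
  const index = await getIndex();
  return index.search(question);
}

// At startup: kick off the build before traffic arrives.
void getIndex();
```

Because the promise itself is cached, a failed build stays failed until the process restarts or the cache is cleared, so a real job would likely add retry or reset logic at the refresh boundary.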

Scenario

A support agent answers product questions from Markdown docs, PDFs, release notes, and internal policy pages. The application needs a repeatable ingestion job so every agent run sees fresh, filtered, traceable context.

When to Use It

Use this pattern when:

  • the knowledge is too large to supply as static .context(...)
  • documents change outside the request path
  • documents need metadata filters such as tenant, product, visibility, version, or locale
  • retrieval evidence needs to be inspectable in traces and evals
  • multiple agents share the same knowledge source

Architecture Shape

| Layer | Responsibility |
| --- | --- |
| loader | read files, directories, PDFs, bytes, or application records |
| normalizer | clean text, split sections, remove boilerplate, attach source ids |
| metadata | tenant, product, visibility, version, locale, source path, updated time |
| embedding job | call embedDocuments(...) with bounded concurrency |
| vector store | store embedded documents and expose an index |
| refresh policy | rebuild or incrementally update at an explicit boundary |

Code Example

import { InMemoryVectorStore, embedDocuments } from "@anvia/core";
import { FileLoader, fileLoaderToDocuments } from "@anvia/core/loaders";
import { embeddings } from "./models";

// Load the Markdown files matching the glob, keeping each file's path.
const loaded = await fileLoaderToDocuments(
  FileLoader.withGlob("content/support/**/*.md")
    .readWithPath()
    .ignoreErrors(),
);

// Shape each loaded file into an application document with explicit metadata.
const documents = loaded.map((doc) => ({
  id: doc.id,
  title: doc.metadata.path,
  body: doc.text,
  product: "support",
  visibility: "public",
}));

// Embed title plus body; the selectors choose the id, content, and metadata.
const embedded = await embedDocuments(embeddings, documents, {
  id: (doc) => doc.id,
  content: (doc) => `${doc.title}\n${doc.body}`,
  metadata: (doc) => ({
    product: doc.product,
    visibility: doc.visibility,
    title: doc.title,
  }),
  // Bounded concurrency for the embedding job.
  concurrency: 2,
});

// Build the index once here so agent runs can share it.
export const supportDocsIndex = InMemoryVectorStore.fromDocuments(embedded).index(embeddings);

For production storage, build the same embedded document shape and write it to the vector store your application owns.

Document Ids and Metadata

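Stable ids make refreshes and debugging easier: a re-ingested document can replace its earlier version instead of appearing as a new record, and a retrieved chunk in a trace can be matched back to its source.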

const document = {
  id: `support:${locale}:${slug}:${sectionId}`,
  title,
  body,
  locale,
  product,
  visibility,
  sourceUrl,
  updatedAt,
};

Metadata should support the filters and trace evidence you need later. Do not rely on prompt instructions to prevent a tenant or visibility leak.
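
To illustrate enforcing tenant and visibility rules in code rather than in the prompt, here is a small sketch of an application-level scope check. `ScoredDocument`, `RequestScope`, and `filterByScope` are hypothetical names, and the metadata fields are assumed to match what ingestion wrote. Where the vector store supports filtered queries, pushing the same conditions into the query itself is usually preferable; this check can then act as a second line of defense.

```typescript
type Visibility = "public" | "internal";

// Assumed metadata written during ingestion.
interface DocMetadata {
  tenant: string;
  visibility: Visibility;
  product: string;
}

// Hypothetical shape of a retrieval candidate.
interface ScoredDocument {
  id: string;
  score: number;
  metadata: DocMetadata;
}

// What the current request is allowed to see.
interface RequestScope {
  tenant: string;
  allowInternal: boolean;
}

// Drop every candidate outside the caller's scope, then rank what is left.
export function filterByScope(
  candidates: ScoredDocument[],
  scope: RequestScope,
  topK: number,
): ScoredDocument[] {
  return candidates
    .filter((doc) => doc.metadata.tenant === scope.tenant)
    .filter((doc) => scope.allowInternal || doc.metadata.visibility === "public")
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

The scope comes from the authenticated request, never from model output, so a prompt injection cannot widen it.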

Chunking Ownership

Anvia loaders convert source files into documents. Your application owns semantic chunking policy.

| Source | Chunk default |
| --- | --- |
| Markdown docs | one document per page or heading section |
| PDFs | one document per page, then merge or split if needed |
| tickets or records | one document per record or summary |
| code files | one document per file or symbol-level section |
| long policies | one document per rule group with source id |
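
As an illustration of the "one document per heading section" default for Markdown, the sketch below splits a page on level 1 to 3 headings and derives a section id from the source path and heading. `SectionDocument`, `slugify`, and `splitMarkdownByHeading` are hypothetical application helpers, not Anvia APIs. The sketch deliberately ignores fenced code blocks (a `#` comment inside one would be read as a heading) and does not de-duplicate repeated headings, both of which a real chunker would need to handle.

```typescript
// Hypothetical normalized section produced by the application's chunker.
interface SectionDocument {
  id: string;
  title: string;
  body: string;
  sourcePath: string;
}

// Turn a heading into an id-safe fragment, e.g. "Refund Policy" -> "refund-policy".
function slugify(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

export function splitMarkdownByHeading(
  sourcePath: string,
  markdown: string,
): SectionDocument[] {
  const sections: SectionDocument[] = [];
  // Text before the first heading is collected under an "intro" section.
  let title = "intro";
  let lines: string[] = [];

  // Emit the section collected so far, skipping empty ones.
  const flush = (): void => {
    const body = lines.join("\n").trim();
    if (body.length > 0) {
      sections.push({ id: `${sourcePath}#${slugify(title)}`, title, body, sourcePath });
    }
    lines = [];
  };

  for (const line of markdown.split(/\r?\n/)) {
    const heading = /^#{1,3}\s+(.*)$/.exec(line);
    if (heading) {
      flush();
      title = (heading[1] ?? "").trim();
    } else {
      lines.push(line);
    }
  }
  flush();
  return sections;
}
```

Because the id depends only on the path and heading text, re-running the splitter on unchanged input yields the same ids, which is what the refresh and debugging steps rely on.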

Refresh Strategy

| Strategy | Use when |
| --- | --- |
| deploy-time rebuild | docs change with code deploys |
| startup build | small indexes and local development |
| background ingestion | docs change frequently or source data is remote |
| tenant-specific index | strict tenant isolation is required |
| metadata-filtered shared index | source store supports reliable filtering |
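
For background ingestion, one possible way to update incrementally is to store a content fingerprint per document id and compare it on the next run, so only new or changed documents are re-embedded. The sketch below assumes stable ids; `fingerprint`, `planRefresh`, and `RefreshPlan` are hypothetical names, and the djb2-style hash is a stand-in chosen to keep the example dependency-free (a production job would more likely use a cryptographic hash such as SHA-256).

```typescript
interface SourceDocument {
  id: string;
  content: string;
}

interface RefreshPlan {
  upsert: string[]; // new or changed: re-embed and write
  remove: string[]; // no longer in the source: delete from the store
  unchanged: string[]; // identical content: skip
}

// Small non-cryptographic fingerprint (djb2, xor variant) over UTF-16 code units.
export function fingerprint(text: string): string {
  let hash = 5381;
  for (let i = 0; i < text.length; i += 1) {
    hash = ((hash * 33) ^ text.charCodeAt(i)) >>> 0;
  }
  return hash.toString(16);
}

// `previous` maps document id -> fingerprint stored at the last ingestion.
export function planRefresh(
  previous: Map<string, string>,
  current: SourceDocument[],
): RefreshPlan {
  const plan: RefreshPlan = { upsert: [], remove: [], unchanged: [] };
  const seen = new Set<string>();

  for (const doc of current) {
    seen.add(doc.id);
    if (previous.get(doc.id) === fingerprint(doc.content)) {
      plan.unchanged.push(doc.id);
    } else {
      plan.upsert.push(doc.id);
    }
  }

  // Anything stored previously but absent from the source should be removed.
  previous.forEach((_hash, id) => {
    if (!seen.has(id)) {
      plan.remove.push(id);
    }
  });

  return plan;
}
```

The plan is computed before any embedding call, so an unchanged corpus costs only the hashing pass.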

Failure Modes

| Failure | Fix |
| --- | --- |
| stale answers | define refresh triggers and store updatedAt metadata |
| weak retrieval | improve chunking, titles, content selector, and source text |
| tenant leakage | separate indexes or enforce metadata filters before search |
| slow ingestion | bound concurrency and move ingestion off request paths |
| hard-to-debug context | include stable ids, source paths, titles, and versions |
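
The "bound concurrency" fix applies to any ingestion stage that calls a remote service, such as custom loaders or vector store writes. As a rough sketch of the idea, the helper below runs a fixed number of worker lanes over a list and preserves input order; `mapWithConcurrency` is a hypothetical helper written for this example, not part of Anvia.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results are returned in the same order as the input.
export async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array<R>(items.length);
  let next = 0;

  // Each lane repeatedly claims the next unprocessed index until none remain.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next;
      next += 1;
      results[index] = await worker(items[index] as T, index);
    }
  }

  const lanes = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: lanes }, () => runLane()));
  return results;
}
```

A rejected worker call rejects the whole batch; a job that should tolerate partial failure would catch errors inside `worker` and record them per item.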

Test Checklist

  • Test loaders against representative files and PDFs.
  • Snapshot normalized documents before embedding.
  • Verify ids are stable across repeated ingestion.
  • Verify metadata includes tenant, visibility, product, and source fields where needed.
  • Run search probes and assert expected document ids appear.
  • Add eval cases for known questions that depend on retrieved facts.
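
Two of the checks above, id stability and search probes, reduce to simple set comparisons. The helpers below are a sketch of how a test might express them; `unstableIds` and `missingProbeIds` are hypothetical names, and the id lists are assumed to come from your own ingestion run and search call.

```typescript
// Ids that appear in only one of two ingestion runs over the same source.
// An empty result means ids were stable across the runs.
export function unstableIds(firstRun: string[], secondRun: string[]): string[] {
  const first = new Set(firstRun);
  const second = new Set(secondRun);
  const onlyFirst = firstRun.filter((id) => !second.has(id));
  const onlySecond = secondRun.filter((id) => !first.has(id));
  return [...onlyFirst, ...onlySecond];
}

// Expected document ids that a search probe failed to return.
export function missingProbeIds(resultIds: string[], expectedIds: string[]): string[] {
  const returned = new Set(resultIds);
  return expectedIds.filter((id) => !returned.has(id));
}
```

In a test, both helpers would be asserted to return an empty array, and a non-empty result names exactly which ids to investigate.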