Loaders
File and PDF ingestion helpers for retrieval preprocessing.
Import from @anvia/core/loaders.
Loaders are async iterables for ingestion pipelines. They read source material and yield LoaderResult<T> values by default. Call .ignoreErrors() when a batch should skip unreadable files instead of returning failed results.
Loaders are intentionally not exported from the root @anvia/core entry point because they depend on Node filesystem and PDF extraction packages.
LoaderResult
type LoaderResult<T> =
| { ok: true; value: T }
| { ok: false; error: unknown };
type FileSource = { path: string } | { path: "<memory>"; bytes: Uint8Array };
type FileReadWithPath = { path: string; text: string };
type PdfSource = { path: string } | { path: "<memory>"; bytes: Uint8Array };
type PdfReadWithPath = { path: string; text: string };
type PdfPage = { pageNumber: number; text: string };
type PdfPageWithPath = { path: string; pageNumber: number; text: string };Purpose: keeps batch ingestion from throwing on the first failed item and names the file/PDF records yielded by loader modes.
Return behavior: .ignoreErrors() unwraps successful values and filters failures.
FileLoader
FileLoader.withGlob(pattern);
FileLoader.withDir(directory);
FileLoader.fromBytes(bytes);
FileLoader.fromBytesMany(bytesArray);
loader.read();
loader.readWithPath();
loader.ignoreErrors();Purpose: read UTF-8 text files from globs, directories, or memory.
Return behavior:
| Method | Output |
|---|---|
.read() | string text |
.readWithPath() | { path, text } |
fromBytes(...) | uses "<memory>" as the path |
withDir(...) reads direct files only; subdirectories are ignored.
PdfFileLoader
PdfFileLoader.withGlob(pattern);
PdfFileLoader.withDir(directory);
PdfFileLoader.fromBytes(bytes);
PdfFileLoader.fromBytesMany(bytesArray);
loader.read();
loader.readWithPath();
loader.byPage();
loader.ignoreErrors();Purpose: extract text from PDF files or PDF bytes.
Return behavior:
| Method | Output |
|---|---|
.read() | full PDF text |
.readWithPath() | { path, text } |
.byPage() | { pageNumber, text } with zero-based page numbers |
.readWithPath().byPage() | { path, pageNumber, text } |
PDF page splitting is the only built-in loader chunking behavior.
Document Adapters
fileToDocument(file);
fileLoaderToDocuments(loader);
pdfToDocument(pdf);
pdfLoaderToDocuments(loader);
pdfPageToDocument(page);
pdfPageLoaderToDocuments(loader);Purpose: convert successful loader outputs into Anvia Document[] for retrieval preprocessing.
Text documents include source and mediaType: "text/plain" metadata. PDF documents include source and mediaType: "application/pdf" metadata. PDF page documents also include a string pageNumber metadata value.
For workflow guidance, see Embed Documents.
