Testing and Observability
Test deterministic harness boundaries before inspecting model behavior.
Most correctness should be tested before a provider call is made. Tools, service wrappers, retrieval filters, pipeline steps, prompt wrappers, runners, and error mapping can all be checked with fakes.
Use provider-backed tests, Studio, traces, and evals after the application boundaries are covered.
Test Tools Directly
Tools have schemas and executable handlers. Call them directly or through a ToolSet to verify validation, permissions, expected states, and output shape.
const tools = createSupportToolSet({
  userId: "user_123",
  tenantId: "tenant_123",
  orders: fakeOrders,
  tickets: fakeTickets,
});

const result = await tools.call(
  "lookup_order",
  JSON.stringify({ orderId: "A-100" }),
);

expect(JSON.parse(result)).toEqual({
  status: "found",
  orderId: "A-100",
  fulfillmentStatus: "shipped",
});

This keeps product policy testable without asking a model to choose the right branch.
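The same direct-call approach covers the branches a model should never be trusted to pick, such as denied and missing records. The sketch below is self-contained and does not use the framework's ToolSet: `lookupOrder`, `OrderStore`, and the `forbidden` and `not_found` states are hypothetical names chosen for illustration.

```typescript
// Hypothetical types for illustration; not part of the framework API.
type Order = { orderId: string; tenantId: string; fulfillmentStatus: string };

type LookupResult =
  | { status: "found"; orderId: string; fulfillmentStatus: string }
  | { status: "not_found"; orderId: string }
  | { status: "forbidden"; orderId: string };

interface OrderStore {
  get(orderId: string): Order | undefined;
}

// A plain handler: the scope comes from the application, the args from the model.
function lookupOrder(
  scope: { tenantId: string; orders: OrderStore },
  args: { orderId: string },
): LookupResult {
  const order = scope.orders.get(args.orderId);
  if (!order) return { status: "not_found", orderId: args.orderId };
  // The permission rule lives here, so it can be tested without a provider call.
  if (order.tenantId !== scope.tenantId) {
    return { status: "forbidden", orderId: args.orderId };
  }
  return {
    status: "found",
    orderId: order.orderId,
    fulfillmentStatus: order.fulfillmentStatus,
  };
}

// In-memory fake standing in for the real order service.
const orderRecords = new Map<string, Order>([
  ["A-100", { orderId: "A-100", tenantId: "tenant_123", fulfillmentStatus: "shipped" }],
  ["B-200", { orderId: "B-200", tenantId: "tenant_999", fulfillmentStatus: "pending" }],
]);
const fakeOrderStore: OrderStore = { get: (orderId) => orderRecords.get(orderId) };

const supportScope = { tenantId: "tenant_123", orders: fakeOrderStore };
const found = lookupOrder(supportScope, { orderId: "A-100" });
const denied = lookupOrder(supportScope, { orderId: "B-200" });
const missing = lookupOrder(supportScope, { orderId: "Z-000" });
```

Each of the three results can be asserted on its `status` field, which keeps one test per policy branch.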
Test Runners With Fakes
Runner tests should verify application behavior: validation, auth, history loading, scoped tool creation, trace metadata, persistence, and known errors.
it("persists new messages after a successful support turn", async () => {
  const conversations = fakeConversations({
    history: [Message.user("Earlier question")],
  });

  const result = await runSupportTurn({
    conversationId: "conv_123",
    message: "Where is order A-100?",
    auth: fakeAuth({ userId: "user_123", tenantId: "tenant_123" }),
    conversations,
    services: fakeServices(),
  });

  expect(result.ok).toBe(true);
  expect(conversations.append).toHaveBeenCalledWith(
    "conv_123",
    expect.any(Array),
  );
});

If the runner currently creates the real agent internally, extract a factory dependency for tests or keep provider-backed tests narrow. Do not make every route test depend on a live model.
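One way to extract that factory dependency is sketched below. It is a simplified stand-in rather than the framework's runner: `TurnAgent`, `RunnerDeps`, `runTurn`, and `makeFakeDeps` are hypothetical names, and a real agent exposes a richer interface than a single `reply` method.

```typescript
// Hypothetical shapes; the real runner and agent types will differ.
interface TurnAgent {
  reply(message: string): Promise<string>;
}

interface RunnerDeps {
  // Factory dependency: production passes a real builder, tests pass a fake.
  createAgent: (scope: { tenantId: string }) => TurnAgent;
  appendMessages: (conversationId: string, messages: string[]) => Promise<void>;
}

type TurnResult =
  | { ok: true; reply: string }
  | { ok: false; error: "empty_message" };

async function runTurn(
  input: { conversationId: string; tenantId: string; message: string },
  deps: RunnerDeps,
): Promise<TurnResult> {
  if (input.message.trim() === "") return { ok: false, error: "empty_message" };
  const agent = deps.createAgent({ tenantId: input.tenantId });
  const reply = await agent.reply(input.message);
  // Persist only after the turn has succeeded.
  await deps.appendMessages(input.conversationId, [input.message, reply]);
  return { ok: true, reply };
}

// Test doubles: nothing below makes a provider call.
function makeFakeDeps() {
  const appended: Array<{ conversationId: string; messages: string[] }> = [];
  const deps: RunnerDeps = {
    createAgent: () => ({ reply: async (message) => `echo: ${message}` }),
    appendMessages: async (conversationId, messages) => {
      appended.push({ conversationId, messages });
    },
  };
  return { deps, appended };
}
```

A test calls `runTurn` with `makeFakeDeps().deps` and inspects `appended`, while the route wiring passes a factory that builds the real agent.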
Test Retrieval Filters
Retrieval bugs are often permission bugs. Test the filter or index handle before testing answer quality.
const results = await tenantKnowledge.search("refund policy", {
  tenantId: "tenant_123",
  topK: 5,
});

expect(results.every((result) => result.metadata.tenantId === "tenant_123")).toBe(true);

Then use Studio, traces, or evals to inspect whether the retrieved context produces the answer you expect.
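A tenant-bound handle makes the filter hard to forget, because callers never pass the tenant per query. The following is a minimal sketch over an in-memory fake. `Doc`, `SearchIndex`, and `scopeToTenant` are hypothetical names, and the substring match only stands in for real retrieval.

```typescript
// Hypothetical document and index shapes for illustration only.
type Doc = { id: string; text: string; metadata: { tenantId: string } };

interface SearchIndex {
  search(
    query: string,
    options: { topK: number; filter: (doc: Doc) => boolean },
  ): Doc[];
}

// The handle is bound to one tenant at creation time, so every query it
// issues carries the tenant filter.
function scopeToTenant(index: SearchIndex, tenantId: string) {
  return {
    search(query: string, topK: number): Doc[] {
      return index.search(query, {
        topK,
        filter: (doc) => doc.metadata.tenantId === tenantId,
      });
    },
  };
}

// Fake index: naive substring match over a fixed corpus. The filter runs
// before truncation so other tenants' hits never consume topK slots.
const corpus: Doc[] = [
  { id: "d1", text: "Refund policy: 30 days", metadata: { tenantId: "tenant_123" } },
  { id: "d2", text: "Refund policy: 14 days", metadata: { tenantId: "tenant_999" } },
  { id: "d3", text: "Shipping times", metadata: { tenantId: "tenant_123" } },
  { id: "d4", text: "Refund exceptions for digital goods", metadata: { tenantId: "tenant_123" } },
];

const fakeIndex: SearchIndex = {
  search: (query, options) =>
    corpus
      .filter((doc) => doc.text.toLowerCase().includes(query.toLowerCase()))
      .filter(options.filter)
      .slice(0, options.topK),
};

const tenantDocs = scopeToTenant(fakeIndex, "tenant_123").search("refund", 5);
const otherTenantDocs = scopeToTenant(fakeIndex, "tenant_999").search("refund", 5);
```

The corpus deliberately contains a matching document from another tenant, so the test fails if the filter is ever dropped.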
Observe Runs
Attach observers when you need traces, usage records, external reporting, or debugging evidence.
const agent = new AgentBuilder("support", model)
  .instructions("Answer support questions clearly.")
  .observe(observer)
  .build();

Use stable trace names such as support-chat, ticket-summary, or retrieval-answer. Include application identifiers in trace metadata when they are safe to store.
const response = await agent
  .prompt(message)
  .withTrace({
    name: "support-chat",
    userId,
    metadata: {
      tenantId,
      conversationId,
      channel: "web",
    },
  })
  .send();

Trace metadata should help connect an agent run to application state without leaking secrets or large records.
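One way to enforce that rule is to build metadata through an allowlist instead of spreading request objects into the trace. The helper below is a sketch under assumptions: `buildTraceMetadata`, the allowed keys, and the 64-character cap are illustrative choices, not framework behavior.

```typescript
// Hypothetical helper: keeps only allowlisted, short, primitive values.
const ALLOWED_KEYS = ["tenantId", "conversationId", "channel"] as const;
const MAX_VALUE_LENGTH = 64;

type TraceMetadata = Partial<Record<(typeof ALLOWED_KEYS)[number], string>>;

function buildTraceMetadata(candidate: Record<string, unknown>): TraceMetadata {
  const metadata: TraceMetadata = {};
  for (const key of ALLOWED_KEYS) {
    const value = candidate[key];
    // Skip objects, arrays, and missing values: identifiers only.
    if (typeof value !== "string" && typeof value !== "number") continue;
    // Truncate so an unexpectedly large value cannot bloat the trace.
    metadata[key] = String(value).slice(0, MAX_VALUE_LENGTH);
  }
  return metadata;
}

const safeMetadata = buildTraceMetadata({
  tenantId: "tenant_123",
  conversationId: "conv_123",
  channel: "web",
  apiKey: "sk-should-never-be-stored", // not allowlisted, so it is dropped
  customerRecord: { email: "a@example.com" }, // not allowlisted, so it is dropped
});
```

An allowlist fails closed: a new field on the request stays out of traces until someone decides it is safe to store.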
Use Studio During Iteration
Use Studio to inspect messages, tool calls, approvals, sessions, traces, and quick prompts while the workflow is still changing.
import { Studio } from "@anvia/studio";
import { supportAgent } from "./ai/support-agent";

new Studio([supportAgent]).start({ port: 3000 });

For request-scoped agents, create a safe development scope with fake or sandbox services and register that built agent in Studio. Do not wire Studio directly to production credentials just to test prompt behavior.
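A small guard can make the safe development scope explicit before any agent is built from it. This sketch is not part of Studio: the `Environment` values, the `kind` tag on services, and `createStudioScope` are all hypothetical conventions an application could adopt.

```typescript
// Hypothetical guard; environment names and the service shape are assumptions.
type Environment = "development" | "test" | "staging" | "production";

interface SupportServices {
  kind: "fake" | "sandbox" | "live";
  lookupOrder(orderId: string): { orderId: string; fulfillmentStatus: string };
}

function createStudioScope(env: Environment, services: SupportServices) {
  if (env === "production") {
    throw new Error("Refusing to create a Studio scope in production");
  }
  if (services.kind === "live") {
    throw new Error("Studio scopes must use fake or sandbox services");
  }
  // Fixed development identity: no real user or tenant is involved.
  return { userId: "dev_user", tenantId: "dev_tenant", services };
}

const fakeSupportServices: SupportServices = {
  kind: "fake",
  lookupOrder: (orderId) => ({ orderId, fulfillmentStatus: "shipped" }),
};

const devScope = createStudioScope("development", fakeSupportServices);
```

The request-scoped agent would then be built from `devScope` and passed to Studio in place of a module-level agent.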
Use Evals for Repeatable Behavior
Use evals when the workflow has known prompts, regression cases, or output expectations that should be checked repeatedly.
Keep eval targets narrow:
| Eval target | Good use |
|---|---|
| tool output | permission and state contracts |
| runner output | product response shape and known error mapping |
| agent output | answer quality for stable prompts |
| extractor output | schema adherence and downstream contract |
| pipeline output | stage composition and typed workflow behavior |
Evals should complement unit tests, not replace them. Unit tests prove app-owned boundaries. Evals watch model-dependent behavior.
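The shape of a narrow eval can be sketched as a list of named cases, each with a prompt and a check, run against a single target. This is an illustrative harness, not the framework's eval API: `EvalCase`, `runEvals`, and `stubAgent` are hypothetical, and the target is synchronous only to keep the sketch short, where a real agent target would be asynchronous.

```typescript
// Hypothetical eval harness: a target is any function from prompt to output.
type EvalCase = {
  name: string;
  prompt: string;
  check: (output: string) => boolean;
};

type EvalReport = { passed: string[]; failed: string[] };

function runEvals(target: (prompt: string) => string, cases: EvalCase[]): EvalReport {
  const report: EvalReport = { passed: [], failed: [] };
  for (const evalCase of cases) {
    const output = target(evalCase.prompt);
    (evalCase.check(output) ? report.passed : report.failed).push(evalCase.name);
  }
  return report;
}

// Regression cases for one stable workflow.
const supportCases: EvalCase[] = [
  {
    name: "mentions-order-id",
    prompt: "Where is order A-100?",
    check: (output) => output.includes("A-100"),
  },
  {
    name: "no-refund-guarantee",
    prompt: "Can I get a refund?",
    check: (output) => !/guarantee/i.test(output),
  },
];

// Stub standing in for the agent; its second answer deliberately regresses.
const stubAgent = (prompt: string): string =>
  prompt.includes("A-100")
    ? "Order A-100 has shipped."
    : "We guarantee a refund within 24 hours.";

const evalReport = runEvals(stubAgent, supportCases);
```

Named cases make a regression readable in CI output: here `evalReport.failed` contains `no-refund-guarantee`, which points at the behavior that drifted.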
Harness Test Matrix
| Area | Test without provider | Inspect with provider |
|---|---|---|
| input validation | empty, malformed, unsupported payloads | not needed |
| auth and permissions | fake users, tenants, denied reads and writes | trace denied tool calls |
| tools | direct ToolSet.call(...) with fake services | observe model tool choice |
| context assembly | snapshot prompt messages or scoped facts | inspect retrieved documents in traces |
| history and persistence | fake conversation store calls | verify multi-turn behavior |
| guardrails | hook cancellation, approval decisions, idempotency | inspect approval and cancellation traces |
| final output | runner response shape | eval answer quality |
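The input validation row is the cheapest one to cover, since it needs no fakes at all. The table-driven sketch below assumes a hypothetical `validateTurnInput` function, payload shape, and error codes. It illustrates the pattern and is not the runner's actual contract.

```typescript
// Hypothetical request shape and error codes for a support-turn runner.
type ValidInput = { conversationId: string; message: string };

type ValidationResult =
  | { ok: true; input: ValidInput }
  | { ok: false; error: "malformed" | "empty_message" | "unsupported_channel" };

const SUPPORTED_CHANNELS = new Set(["web", "email"]);

function validateTurnInput(payload: unknown): ValidationResult {
  if (typeof payload !== "object" || payload === null) {
    return { ok: false, error: "malformed" };
  }
  const { conversationId, message, channel } = payload as Record<string, unknown>;
  if (typeof conversationId !== "string" || typeof message !== "string") {
    return { ok: false, error: "malformed" };
  }
  if (message.trim() === "") return { ok: false, error: "empty_message" };
  if (typeof channel !== "string" || !SUPPORTED_CHANNELS.has(channel)) {
    return { ok: false, error: "unsupported_channel" };
  }
  return { ok: true, input: { conversationId, message: message.trim() } };
}

// Table-driven cases: one row per payload class named in the matrix.
const validationCases: Array<[unknown, string]> = [
  [null, "malformed"],
  [{ conversationId: 1, message: "hi", channel: "web" }, "malformed"],
  [{ conversationId: "conv_123", message: "  ", channel: "web" }, "empty_message"],
  [{ conversationId: "conv_123", message: "hi", channel: "fax" }, "unsupported_channel"],
  [{ conversationId: "conv_123", message: "hi", channel: "web" }, "ok"],
];

const outcomes = validationCases.map(([payload]) => {
  const validation = validateTurnInput(payload);
  return validation.ok ? "ok" : validation.error;
});
```

Comparing `outcomes` against the second column of `validationCases` gives one assertion per row, and a new payload class is a one-line addition.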
Practical Rule
Test product policy with fakes and direct calls. Use provider tests, Studio, traces, and evals to inspect model behavior after the application-owned boundaries are already covered.
Related Patterns
- Use Tool Validation and Contracts for schema and direct-call tests.
- Use MCP Tool Inspection for server capability checks.
- Use Eval Strategy for repeatable model-dependent checks.
- Use Tracing and Debugging to connect traces, logs, and eval outcomes.
- Use Production Readiness Checklist for deployment validation.
