Testing and Observability

Test deterministic harness boundaries before inspecting model behavior.

Most correctness should be tested before a provider call is made. Tools, service wrappers, retrieval filters, pipeline steps, prompt wrappers, runners, and error mapping can all be checked with fakes.

Use provider-backed tests, Studio, traces, and evals after the application boundaries are covered.
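As a rough illustration of what "fake" means here, the sketch below builds an in-memory order service that records its calls. The `Order` and `OrderService` shapes and the `createFakeOrders` helper are hypothetical names chosen for this example, not part of any library API.

```typescript
// Hypothetical order shape and service contract; real ones will differ.
interface Order {
  orderId: string;
  tenantId: string;
  fulfillmentStatus: "pending" | "shipped" | "delivered";
}

interface OrderService {
  findOrder(tenantId: string, orderId: string): Promise<Order | undefined>;
}

// In-memory fake: deterministic, no network, and it records every lookup
// so a test can assert how the code under test used the service.
function createFakeOrders(seed: Order[]) {
  const calls: Array<{ tenantId: string; orderId: string }> = [];
  const service: OrderService = {
    async findOrder(tenantId, orderId) {
      calls.push({ tenantId, orderId });
      return seed.find(
        (order) => order.tenantId === tenantId && order.orderId === orderId,
      );
    },
  };
  return { service, calls };
}
```

A test can pass `service` wherever the real order client would go and then inspect `calls` to confirm that the tenant and order identifiers were forwarded unchanged.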

Test Tools Directly

Tools have schemas and executable handlers. Call them directly or through a ToolSet to verify validation, permissions, expected states, and output shape.

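// Build a ToolSet scoped to one user and tenant, backed by fake services.
const tools = createSupportToolSet({
  userId: "user_123",
  tenantId: "tenant_123",
  orders: fakeOrders,
  tickets: fakeTickets,
});

// Call the tool by name with JSON-encoded arguments; the result is a JSON string.
const result = await tools.call(
  "lookup_order",
  JSON.stringify({ orderId: "A-100" }),
);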

expect(JSON.parse(result)).toEqual({
  status: "found",
  orderId: "A-100",
  fulfillmentStatus: "shipped",
});

This keeps product policy testable without asking a model to choose the right branch.
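The branches worth covering are usually the unhappy ones: malformed arguments and records the caller may not see. The sketch below shows one way a handler could expose those branches as plain return values. `lookupOrder`, `ToolScope`, `LookupResult`, the status names, and the policy of reporting another tenant's order as not found are all assumptions made for illustration, not the ToolSet API.

```typescript
// Hypothetical record and scope types; names are illustrative only.
interface Order {
  orderId: string;
  tenantId: string;
  fulfillmentStatus: string;
}

interface ToolScope {
  tenantId: string;
  orders: Map<string, Order>;
}

type LookupResult =
  | { status: "found"; orderId: string; fulfillmentStatus: string }
  | { status: "not_found"; orderId: string }
  | { status: "invalid_input"; reason: string };

// Takes raw JSON arguments so the validation branch is testable too.
function lookupOrder(scope: ToolScope, rawArgs: string): LookupResult {
  let args: unknown;
  try {
    args = JSON.parse(rawArgs);
  } catch {
    return { status: "invalid_input", reason: "arguments are not valid JSON" };
  }
  const orderId = (args as { orderId?: unknown } | null)?.orderId;
  if (typeof orderId !== "string" || orderId.length === 0) {
    return { status: "invalid_input", reason: "orderId must be a non-empty string" };
  }
  const order = scope.orders.get(orderId);
  // Report another tenant's order as not found so the tool never confirms it exists.
  if (!order || order.tenantId !== scope.tenantId) {
    return { status: "not_found", orderId };
  }
  return { status: "found", orderId, fulfillmentStatus: order.fulfillmentStatus };
}
```

Each branch can then be asserted with one direct call: a malformed string, a missing `orderId`, an order owned by a different tenant, and the happy path.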

Test Runners With Fakes

Runner tests should verify application behavior: validation, auth, history loading, scoped tool creation, trace metadata, persistence, and known errors.

it("persists new messages after a successful support turn", async () => {
  const conversations = fakeConversations({
    history: [Message.user("Earlier question")],
  });

  const result = await runSupportTurn({
    conversationId: "conv_123",
    message: "Where is order A-100?",
    auth: fakeAuth({ userId: "user_123", tenantId: "tenant_123" }),
    conversations,
    services: fakeServices(),
  });

  expect(result.ok).toBe(true);
  // Assumes fakeConversations exposes append as a mock function (vi.fn() or jest.fn()).
  expect(conversations.append).toHaveBeenCalledWith(
    "conv_123",
    expect.any(Array),
  );
});

If the runner currently creates the real agent internally, extract a factory dependency for tests or keep provider-backed tests narrow. Do not make every route test depend on a live model.
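A minimal sketch of the factory extraction, assuming a simplified runner: the runner receives a `createAgent` function instead of constructing the agent itself, so a test can substitute a canned agent. `TurnAgent`, `ConversationStore`, `RunnerDeps`, `runTurn`, and the error codes are hypothetical and much smaller than a real runner would be.

```typescript
// Hypothetical minimal agent surface the runner depends on.
interface TurnAgent {
  reply(history: string[], message: string): Promise<string>;
}

interface ConversationStore {
  load(conversationId: string): Promise<string[]>;
  append(conversationId: string, messages: string[]): Promise<void>;
}

interface RunnerDeps {
  conversations: ConversationStore;
  // Production passes a factory that builds the real agent; tests pass a fake.
  createAgent: (scope: { tenantId: string }) => TurnAgent;
}

type TurnResult =
  | { ok: true; reply: string }
  | { ok: false; error: "empty_message" | "agent_failed" };

async function runTurn(
  deps: RunnerDeps,
  input: { conversationId: string; tenantId: string; message: string },
): Promise<TurnResult> {
  // Validation happens before any agent is created.
  if (input.message.trim() === "") {
    return { ok: false, error: "empty_message" };
  }
  const history = await deps.conversations.load(input.conversationId);
  const agent = deps.createAgent({ tenantId: input.tenantId });

  let reply: string;
  try {
    reply = await agent.reply(history, input.message);
  } catch {
    // Map agent failures to a known error and skip persistence.
    return { ok: false, error: "agent_failed" };
  }
  await deps.conversations.append(input.conversationId, [input.message, reply]);
  return { ok: true, reply };
}
```

With this shape, a route test can pass a `createAgent` that returns a fixed reply or throws, and assert both the response and whether `append` was called, without touching a provider.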

Test Retrieval Filters

Retrieval bugs are often permission bugs. Test the filter or index handle before testing answer quality.

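// Assumes the fake index also holds matching documents from other tenants.
const results = await tenantKnowledge.search("refund policy", {
  tenantId: "tenant_123",
  topK: 5,
});

// Guard against a vacuous pass: every() is true for an empty array.
expect(results.length).toBeGreaterThan(0);
expect(
  results.every((result) => result.metadata.tenantId === "tenant_123"),
).toBe(true);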

Then use Studio, traces, or evals to inspect whether the retrieved context produces the answer you expect.
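For a filter test to mean anything, the fixture needs a document from another tenant that would otherwise rank well. The sketch below uses a hypothetical in-memory index with naive keyword scoring. `createTenantKnowledge` and the `Doc` shape are assumptions for illustration, and the handle fixes the tenant at construction rather than per query, which differs from the call shown above.

```typescript
interface Doc {
  id: string;
  text: string;
  metadata: { tenantId: string };
}

// Hypothetical tenant-scoped handle: the tenant is fixed at construction,
// so a caller cannot widen the filter on an individual query.
function createTenantKnowledge(index: Doc[], tenantId: string) {
  return {
    search(query: string, topK: number): Doc[] {
      const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
      return index
        .filter((doc) => doc.metadata.tenantId === tenantId) // permission filter first
        .map((doc) => ({
          doc,
          // Naive score: number of query terms that appear in the text.
          score: terms.filter((term) => doc.text.toLowerCase().includes(term)).length,
        }))
        .filter((hit) => hit.score > 0)
        .sort((a, b) => b.score - a.score)
        .slice(0, topK)
        .map((hit) => hit.doc);
    },
  };
}

const fixture: Doc[] = [
  { id: "d1", text: "Refund policy: refunds within 30 days", metadata: { tenantId: "tenant_123" } },
  // Adversarial fixture: a strong match that belongs to a different tenant.
  { id: "d2", text: "Refund policy for enterprise refund requests", metadata: { tenantId: "tenant_999" } },
  { id: "d3", text: "Shipping times by region", metadata: { tenantId: "tenant_123" } },
];

const knowledge = createTenantKnowledge(fixture, "tenant_123");
const hits = knowledge.search("refund policy", 5);
const leaked = hits.some((doc) => doc.metadata.tenantId !== "tenant_123"); // false
```

Applying the permission filter before scoring means a high-ranking foreign document can never displace a permitted one from the top results.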

Observe Runs

Attach observers when you need traces, usage records, external reporting, or debugging evidence.

const agent = new AgentBuilder("support", model)
  .instructions("Answer support questions clearly.")
  .observe(observer)
  .build();

Use stable trace names such as support-chat, ticket-summary, or retrieval-answer. Include application identifiers in trace metadata when they are safe to store.

const response = await agent
  .prompt(message)
  .withTrace({
    name: "support-chat",
    userId,
    metadata: {
      tenantId,
      conversationId,
      channel: "web",
    },
  })
  .send();

Trace metadata should help connect an agent run to application state without leaking secrets or large records.
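One way to enforce this is to build metadata through an allowlist instead of spreading request objects into the trace. The helper below is a sketch: `buildTraceMetadata`, the allowed keys, and the 64-character cap are assumptions chosen for the example.

```typescript
type TraceMetadata = Record<string, string>;

// Hypothetical allowlist; only these keys ever reach the trace.
const ALLOWED_KEYS = ["tenantId", "conversationId", "channel"] as const;
const MAX_VALUE_LENGTH = 64;

function buildTraceMetadata(candidate: Record<string, unknown>): TraceMetadata {
  const metadata: TraceMetadata = {};
  for (const key of ALLOWED_KEYS) {
    const value = candidate[key];
    // Keep only short string identifiers; objects, unknown keys, and long blobs are dropped.
    if (typeof value === "string" && value.length <= MAX_VALUE_LENGTH) {
      metadata[key] = value;
    }
  }
  return metadata;
}

const safeMetadata = buildTraceMetadata({
  tenantId: "tenant_123",
  conversationId: "conv_123",
  channel: "web",
  apiKey: "sk-not-for-traces", // not on the allowlist, so it is dropped
  order: { orderId: "A-100", items: ["..."] }, // not on the allowlist, so it is dropped
});
```

Because the helper is a pure function, a unit test can assert that secrets and nested records never appear in its output.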

Use Studio During Iteration

Use Studio to inspect messages, tool calls, approvals, sessions, traces, and quick prompts while the workflow is still changing.

import { Studio } from "@anvia/studio";
import { supportAgent } from "./ai/support-agent";

new Studio([supportAgent]).start({ port: 3000 });

For request-scoped agents, create a safe development scope with fake or sandbox services and register that built agent in Studio. Do not wire Studio directly to production credentials just to test prompt behavior.

Use Evals for Repeatable Behavior

Use evals when the workflow has known prompts, regression cases, or output expectations that should be checked repeatedly.

Keep eval targets narrow:

| Eval target | Good use |
| --- | --- |
| tool output | permission and state contracts |
| runner output | product response shape and known error mapping |
| agent output | answer quality for stable prompts |
| extractor output | schema adherence and downstream contract |
| pipeline output | stage composition and typed workflow behavior |

Evals should complement unit tests, not replace them. Unit tests prove app-owned boundaries. Evals watch model-dependent behavior.
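To make the idea of a narrow eval concrete, here is a framework-free sketch: a list of cases, each with a prompt and a deterministic check, run against any async target. `EvalCase`, `EvalReport`, and `runEval` are hypothetical names, and this is not the API of any eval tool mentioned in this guide.

```typescript
interface EvalCase {
  name: string;
  prompt: string;
  // Deterministic check on the output; keep it narrow and explainable.
  check: (output: string) => boolean;
}

interface EvalReport {
  passed: number;
  failed: string[];
}

async function runEval(
  target: (prompt: string) => Promise<string>,
  cases: EvalCase[],
): Promise<EvalReport> {
  const report: EvalReport = { passed: 0, failed: [] };
  for (const evalCase of cases) {
    try {
      const output = await target(evalCase.prompt);
      if (evalCase.check(output)) {
        report.passed += 1;
      } else {
        report.failed.push(evalCase.name);
      }
    } catch {
      // An error thrown by the target or the check counts as a failed case.
      report.failed.push(evalCase.name);
    }
  }
  return report;
}

const supportCases: EvalCase[] = [
  {
    name: "mentions order id",
    prompt: "Where is order A-100?",
    check: (output) => output.includes("A-100"),
  },
  {
    name: "no raw JSON",
    prompt: "Summarize ticket T-1",
    check: (output) => !output.trim().startsWith("{"),
  },
];
```

The `target` can be the agent, a runner, or an extractor wrapped in a function. Swapping in a deterministic fake target is also a cheap way to test the eval harness itself.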

Harness Test Matrix

| Area | Test without provider | Inspect with provider |
| --- | --- | --- |
| input validation | empty, malformed, unsupported payloads | not needed |
| auth and permissions | fake users, tenants, denied reads and writes | trace denied tool calls |
| tools | direct ToolSet.call(...) with fake services | observe model tool choice |
| context assembly | snapshot prompt messages or scoped facts | inspect retrieved documents in traces |
| history and persistence | fake conversation store calls | verify multi-turn behavior |
| guardrails | hook cancellation, approval decisions, idempotency | inspect approval and cancellation traces |
| final output | runner response shape | eval answer quality |
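The idempotency entry in the guardrails row is easy to leave untested. As a sketch of what a provider-free check could exercise, the wrapper below runs a write once per idempotency key and returns the stored result on repeats. `createIdempotentWriter`, the key format, and the ticket shape are assumptions for illustration only.

```typescript
// Hypothetical wrapper: runs the write once per idempotency key and
// returns the stored result for repeated calls with the same key.
function createIdempotentWriter<T>(write: (payload: string) => T) {
  const results = new Map<string, T>();
  return (idempotencyKey: string, payload: string): T => {
    if (results.has(idempotencyKey)) {
      return results.get(idempotencyKey) as T;
    }
    const result = write(payload);
    results.set(idempotencyKey, result);
    return result;
  };
}

// Fake ticket store that counts writes so a test can detect duplicates.
const createdTickets: string[] = [];
const createTicket = createIdempotentWriter((subject: string) => {
  createdTickets.push(subject);
  return { ticketId: `T-${createdTickets.length}`, subject };
});

const first = createTicket("call_1", "Refund for A-100");
const retry = createTicket("call_1", "Refund for A-100"); // same key: no second write
const other = createTicket("call_2", "Refund for B-200");
```

A test would assert that `retry` returns the same ticket as `first` and that the fake store recorded two writes, not three. This sketch keeps results in process memory; a real guardrail would need storage that survives restarts.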

Practical Rule

Test product policy with fakes and direct calls. Use provider tests, Studio, traces, and evals to inspect model behavior after the application-owned boundaries are already covered.