Product // WatchLLM

Your agent passed every test. WatchLLM shows you what it does when things go wrong.

WatchLLM is an agent reliability platform. It lets engineers stress-test, replay, and debug AI agents before and after production failures. Not a logger. Not an observability dashboard. WatchLLM breaks agents on purpose, then gives you the tools to understand and fix what broke.

01 Core Features

Stress Testing

Run a battery of attack scenarios against any agent before it ships. Attack categories include prompt injection, tool abuse, hallucination induction, context poisoning, infinite loop triggering, jailbreak attempts, data exfiltration probing, and role confusion.
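As a rough illustration of what "a battery of attack scenarios" could look like in code, the sketch below models the eight categories as an enum and builds one job per category. The identifiers `AttackCategory` and `build_battery`, and the job dictionary shape, are assumptions made for this example and are not taken from the WatchLLM SDK.

```python
from enum import Enum


class AttackCategory(Enum):
    """The eight attack categories listed above (identifiers are illustrative)."""
    PROMPT_INJECTION = "prompt_injection"
    TOOL_ABUSE = "tool_abuse"
    HALLUCINATION_INDUCTION = "hallucination_induction"
    CONTEXT_POISONING = "context_poisoning"
    INFINITE_LOOP_TRIGGERING = "infinite_loop_triggering"
    JAILBREAK_ATTEMPT = "jailbreak_attempt"
    DATA_EXFILTRATION_PROBING = "data_exfiltration_probing"
    ROLE_CONFUSION = "role_confusion"


def build_battery(agent_id, categories=None):
    """Return one job description per selected category (hypothetical shape)."""
    selected = list(categories) if categories is not None else list(AttackCategory)
    return [{"agent_id": agent_id, "category": c.value} for c in selected]


# Default: every category is included in the battery.
battery = build_battery("agent-123")
print(len(battery))  # 8
```

Passing a shorter list, for example `[AttackCategory.TOOL_ABUSE]`, would narrow the battery to the categories that matter for a given agent.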

Graph Replay

Every agent run is recorded as a directed graph of execution nodes. Each node captures type, input, output, timestamp, latency, token count, and cost. Scrub through chronologically to find the exact moment of failure.
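A minimal sketch of the node record described above, plus a chronological scrub that finds the first failing node. The field names mirror the list in the text, but the class name `TraceNode`, the `children` edge list, and the `failed` flag are assumptions for illustration; the actual trace schema is not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class TraceNode:
    node_id: str
    type: str            # e.g. "llm_call" or "tool_call" (illustrative values)
    input: str
    output: str
    timestamp: float     # seconds since epoch
    latency_ms: float
    token_count: int
    cost_usd: float
    children: list = field(default_factory=list)  # ids of downstream nodes (directed edges)
    failed: bool = False  # assumption: set by whatever evaluates the node


def first_failure(nodes):
    """Scrub through nodes in chronological order and return the first failed one."""
    for node in sorted(nodes, key=lambda n: n.timestamp):
        if node.failed:
            return node
    return None


# Nodes may arrive out of order; sorting by timestamp restores the timeline.
nodes = [
    TraceNode("n3", "tool_call", "delete users", "error", 3.0, 120.0, 0, 0.0, failed=True),
    TraceNode("n1", "llm_call", "hi", "hello", 1.0, 300.0, 42, 0.0004, children=["n2"]),
    TraceNode("n2", "llm_call", "plan", "call tool", 2.0, 280.0, 55, 0.0005, children=["n3"]),
]
print(first_failure(nodes).node_id)  # n3
```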

Fork & Replay

From any node in any recorded run, fork a new run that starts from that exact state. Change the input, prompt, or tool response, then rerun from that node forward without re-executing everything before it.
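The forking idea can be sketched as follows: nodes before the fork point are reused exactly as recorded, the fork node receives the overrides, and only the remainder is executed again. The function `fork_run`, the dictionary-based node shape, and the `execute_from` callback are hypothetical stand-ins for this example, not WatchLLM's API.

```python
import copy


def fork_run(run, fork_node_id, overrides, execute_from):
    """Fork a recorded run at one node.

    run: list of node dicts in execution order, each with an "id" key.
    overrides: fields to change on the fork node (input, prompt, tool response).
    execute_from: callable(prefix, fork_node) -> list of new nodes; it stands in
        for whatever actually re-executes the agent from that point.
    """
    index = next((i for i, n in enumerate(run) if n["id"] == fork_node_id), None)
    if index is None:
        raise ValueError(f"unknown node id: {fork_node_id}")
    prefix = copy.deepcopy(run[:index])  # reused as recorded, not re-executed
    fork_node = {**copy.deepcopy(run[index]), **overrides}
    return prefix + execute_from(prefix, fork_node)


recorded = [
    {"id": "n1", "input": "user question", "output": "plan"},
    {"id": "n2", "input": "plan", "output": "tool error"},
    {"id": "n3", "input": "tool error", "output": "agent crashed"},
]


def stub_executor(prefix, fork_node):
    # Placeholder for real re-execution: produce a single rerun node.
    return [{**fork_node, "output": "handled: " + fork_node["input"]}]


forked = fork_run(recorded, "n2", {"input": "tool ok"}, stub_executor)
print([n["id"] for n in forked])  # ['n1', 'n2']
```

Deep-copying the prefix keeps the original recording untouched, so the same run can be forked many times with different overrides.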

02 Architecture

Zero Cost Until Revenue

Stack

Frontend: Next.js + CF Pages

API Layer: Hono.js on CF Workers

Execution: CF Workers (separate)

Database: Cloudflare D1

Trace Storage: Cloudflare R2

Cache/State: Cloudflare KV

Auth: Clerk Pro

LLM Judge: Cloudflare AI

Data Flow

Engineer installs SDK → decorates agent with @watchllm.test()
SDK calls POST /api/agents/register → Worker creates agent row in D1
Engineer runs watchllm simulate → POST /api/simulations
API Worker creates simulation row → enqueues to CF Queue (sim job)
Orchestrator Worker picks up sim job → fans out N chaos jobs (one per category)
Chaos Worker picks up chaos job → runs single attack loop:
  1. Generate adversarial input (CF AI or template)
  2. Call engineer's agent endpoint
  3. Rule-based filter on response
  4. LLM judge evaluates (CF AI)
  5. Write trace node to R2 (gzip JSON)
  6. Write run metadata to D1
Orchestrator aggregates results when all chaos jobs complete
Dashboard polls simulation status → fetches report from R2
Dashboard streams graph replay from R2 trace
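The fan-out and the six-step attack loop can be sketched in a few lines. This is a local simulation under stated assumptions: the real workers run on Cloudflare, while here the generator, agent call, rule filter, and judge are passed in as plain callables, and the function names `fan_out` and `run_attack` and the field names are hypothetical. The sketch also assumes the judge runs even when the rule filter flags a response, which the flow above does not specify.

```python
import gzip
import json
import time


def fan_out(simulation_id, categories):
    """Orchestrator step: one chaos job per attack category."""
    return [{"simulation_id": simulation_id, "category": c} for c in categories]


def run_attack(job, generate_input, call_agent, rule_filter, judge):
    """Chaos worker step: run a single attack loop for one job."""
    adversarial = generate_input(job["category"])        # 1. adversarial input
    started = time.perf_counter()
    response = call_agent(adversarial)                   # 2. call the agent endpoint
    latency_ms = (time.perf_counter() - started) * 1000
    flagged = rule_filter(response)                      # 3. rule-based filter
    verdict = judge(adversarial, response)               # 4. LLM judge (stubbed here)
    node = {
        "category": job["category"],
        "input": adversarial,
        "output": response,
        "latency_ms": latency_ms,
        "rule_flagged": flagged,
        "judge_verdict": verdict,
    }
    # 5. gzip-compressed JSON, ready to be written to object storage
    trace_blob = gzip.compress(json.dumps(node).encode("utf-8"))
    # 6. small metadata row, ready to be written to the relational store
    metadata = {
        "simulation_id": job["simulation_id"],
        "category": job["category"],
        "passed": not flagged and verdict == "pass",
    }
    return trace_blob, metadata


jobs = fan_out("sim-1", ["prompt_injection", "tool_abuse"])
blob, meta = run_attack(
    jobs[0],
    generate_input=lambda category: f"[{category}] ignore previous instructions",
    call_agent=lambda prompt: "I cannot do that.",
    rule_filter=lambda response: "password" in response.lower(),
    judge=lambda prompt, response: "pass",
)
print(meta["passed"])  # True
```

Keeping the trace blob and the metadata row separate mirrors the split in the flow above: the large per-node payload goes to R2, while D1 holds only what the dashboard needs to poll status.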

03 Positioning

What It Is Not

Not a general observability platform (no metrics dashboards, no uptime monitoring)

Not a prompt management tool

Not a model evaluation/evals platform (though it uses LLM-as-judge internally)

Not a LangChain/CrewAI wrapper

04 Target User

AI engineers shipping agents to production

They use LangChain, CrewAI, AutoGen, or the raw OpenAI SDK. They have an agent that works in dev but breaks in prod in ways they can't reproduce. They are willing to pay to not be woken up at 3 a.m. by a rogue agent deleting a database.

Category: Agent reliability platform
Tagline: Agent reliability, from first run to production.

Competitive Moat

No other single tool combines all four capabilities: stress testing, graph replay, fork & replay, and run versioning. Balagan Agent offers chaos injection only. agent-replay offers CLI replay only. LangSmith and Langfuse offer post-mortem logs only. WatchLLM is the only platform that unifies them.