Changelog
All notable changes to AgentLens will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[0.2.0] - 2026-06-18
Two major additions in one release: Codex as a second execution engine (behind a new pluggable engine abstraction), and auto-judge (an optional LLM that evaluates the running trajectory against a rubric and can early-exit the loop). Every run is now explicitly labeled with its engine across the CLI, metadata, trajectories, and web UI. Also adds OpenRouter routing for Codex, experiment lifecycle hooks, and Codex goal/token-budget support.
Breaking change (internal API): `ATIFAdapter` now consumes normalized `EngineEvent`s instead of Claude SDK messages (see Changed). The YAML config, CLI, and `run_meta.json` surfaces are backward compatible; existing `claude_code` configs run unchanged (the new `engine` field defaults to `claude_code`).
Added - Codex engine
- Engine abstraction (`src/harness/engines/`): a normalized `EngineEvent` model (`base.py`) plus an `Engine` interface with two implementations, `claude_code.py` (Claude Agent SDK) and `codex.py` (Codex CLI). Engines translate their native message/event streams into a common event model that the ATIF adapter consumes, so shadow git, diffs, change tracking, and ATIF mapping are identical across engines. Add a new engine by implementing `Engine` and registering it in `engines/__init__.py`.
- Codex engine: runs `codex exec --json` (codex-cli >= 0.135) and maps its event stream (`thread.started`, `turn.*`, `item.*`) to ATIF. Codex tool items (`command_execution`, `file_change`, `mcp_tool_call`, `web_search`) carry their call and result together, so each is mapped to a single step with an inline observation. Unknown item types are preserved as tool calls (forward-compatible).
- `engine` config field: `claude_code` (default) or `codex`. Defaulting preserves existing behavior for all current configs.
- `sandbox_mode` config field: Codex sandbox policy, one of `read-only`, `workspace-write` (default), or `danger-full-access`.
- Engine labeling everywhere: `engine` is recorded in `run_meta.json` (and replay metadata), ATIF trajectory `extra.engine`, ATIF `Agent.name`, the auto-generated run-dir slug (`<ts>_<engine>_<model>`), `harness list` / `harness inspect` output, and a colored engine badge in the web UI run list and run header.
- Codex API capture: the capture proxy now parses the OpenAI Responses API SSE stream in addition to the Anthropic Messages API, detected per-request by path (`/responses` vs `/messages`) and normalized onto one capture schema. For Codex, the runner targets the resolved upstream (OpenAI or OpenRouter) and routes Codex through the proxy via a custom model provider (`-c model_providers.proxy...`).
- Codex via OpenRouter: `provider: openrouter` routes the Codex engine through OpenRouter, so any OpenRouter model slug can drive Codex (`engine: codex`, `provider: openrouter`, `model: "openai/gpt-5.3-codex"`). AgentLens injects the required `model_providers` block automatically (`base_url=https://openrouter.ai/api/v1`, `wire_api=responses`, `env_key=OPENROUTER_API_KEY`) and validates that the model is a full vendor-prefixed slug. Capture works too: the proxy forwards `OPENROUTER_API_KEY` upstream. `provider` for Codex is now constrained to `openai` (default) or `openrouter`.
- `sandbox_workspace_network_access` config field: Codex only. Overrides `sandbox_workspace_write.network_access` for `workspace-write` runs (leave unset for the Codex default).
- `codex_goal_token_budget` / `codex_goal_objective` config fields (and `harness run --codex-goal-token-budget`): Codex only. Prepend an instruction asking Codex to call `create_goal` with the given token budget/objective before substantive work, and `update_goal` on completion. The budget lives on the `create_goal` tool, not a Codex exec flag.
- Codex resample: `harness resample` / `resample-edit` are now format-aware: they detect OpenAI vs Anthropic requests and use the correct endpoint, `Authorization: Bearer` auth, request shape, and response summarization.
- Codex turn-level replay: `harness replay` works for Codex via the rollout transcript. `transcript_codex.py` parses `~/.codex/sessions` rollout JSONL into turns, `--list-turns` lists them, and replaying a turn resets the filesystem via the existing shadow-git worktree machinery and resumes Codex from a truncated rollout (`-c experimental_resume=...`) so it regenerates the branch. Turn 1 re-runs from the original prompt.
- Codex subagent capture: Codex's native multi-agent workflow is captured as linked subagent trajectories, matching the Claude Code output shape. Set `codex_multi_agent: true` to enable `features.multi_agent`; when Codex spawns agents (surfaced as `collab_tool_call` `spawn_agent` / `wait` items in the headless stream), AgentLens locates each spawned thread's rollout file by thread id, rebuilds it into an ATIF trajectory (`rollout_entries_to_events`), and attaches a `SubagentTrajectoryRef` to the parent's `spawn_agent` step. A single spawn fanning out to multiple threads links multiple refs.
- `codex_multi_agent` config field (default `false`): enable Codex subagent spawning.
- Example/test configs: `examples/codex.yaml`, `tests/smoke_codex.yaml`.
- Tests: `test_codex_engine.py`, `test_codex_capture.py`, `test_transcript_codex.py`, and engine config tests; `test_atif_adapter.py` rewritten against the `EngineEvent` model.
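To illustrate the engine abstraction described above, here is a minimal sketch of a normalized event model with an engine registry. Only the names `EngineEvent`, `Engine`, and `get_engine` come from this changelog; the fields, the `run` method, `register_engine`, and `EchoEngine` are hypothetical and will differ from the real code in `src/harness/engines/`.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterator, Protocol


@dataclass
class EngineEvent:
    """Hypothetical normalized event; the real model lives in base.py."""
    kind: str  # assumed labels such as "message" or "tool_call"
    payload: Dict[str, Any] = field(default_factory=dict)


class Engine(Protocol):
    """Hypothetical interface: an engine yields normalized events."""
    name: str

    def run(self, prompt: str) -> Iterator[EngineEvent]: ...


# Maps an engine name to a factory that builds it (assumed registry shape).
_REGISTRY: Dict[str, Callable[[], Engine]] = {}


def register_engine(name: str, factory: Callable[[], Engine]) -> None:
    _REGISTRY[name] = factory


def get_engine(name: str = "claude_code") -> Engine:
    """Look up an engine by its config name, failing loudly on typos."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        known = sorted(_REGISTRY)
        raise ValueError(f"unknown engine {name!r}; known: {known}") from None


class EchoEngine:
    """Toy engine used only to exercise the registry."""
    name = "echo"

    def run(self, prompt: str) -> Iterator[EngineEvent]:
        yield EngineEvent(kind="message", payload={"text": prompt})


register_engine("echo", EchoEngine)
events = list(get_engine("echo").run("hello"))
```

Because downstream consumers only see `EngineEvent`s, a new engine in this sketch needs no changes outside its own class and one registration call.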
Added - Auto-judge
- Auto-judge (`judge:` config block, `src/harness/judge.py`): a judge LLM evaluates the live trajectory against a `rubric` every `every_n_turns` agent turns and returns a structured verdict (`{flagged, reason, confidence}`). The judge sees the trajectory so far: messages, tool calls, observations, and (unless `include_reasoning: false`) the agent's reasoning. It runs independently of the agent engine, so it judges both Claude Code and Codex runs.
- Early exit: when a verdict is flagged and `early_exit: true`, the session stops gracefully after the current turn: the engine stream is closed and the underlying agent process (Codex subprocess) or SDK query is terminated. Trajectory and diff are still saved; the session is marked flagged + early-exited.
- Configurable judge backend: `provider: anthropic` (Messages API), `openai`, or `openrouter` (Chat Completions). For any other compatible endpoint, set `base_url` + `api_key_env`. Judge sampling is configurable (`max_tokens`, `temperature`).
- Outputs: verdicts are written to `session_NN/judge.jsonl`; `run_meta.json` gains per-session `judge_flagged` / `judge_early_exit` / `judge_verdict_count` and run totals `judge_flagged_sessions` / `judge_early_exits`; `harness inspect` shows a `⚑ flagged` marker. Example config: `examples/judge.yaml`.
- Tests: `test_judge.py` (verdict parsing, trajectory rendering, backend resolution, missing-key handling).
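The cadence and early-exit behavior described above can be sketched as follows. The verdict fields and the `every_n_turns` / `early_exit` options come from the entries above; the function names, the JSON wire format, and the stand-in judge are assumptions for illustration, not the code in `src/harness/judge.py`.

```python
import json
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Verdict:
    flagged: bool
    reason: str
    confidence: float


def parse_verdict(raw: str) -> Verdict:
    """Parse a JSON verdict (assumed wire format) into a typed record."""
    data = json.loads(raw)
    return Verdict(
        flagged=bool(data["flagged"]),
        reason=str(data.get("reason", "")),
        confidence=float(data.get("confidence", 0.0)),
    )


def should_judge(step_count: int, every_n_turns: int) -> bool:
    """Fire when the step count is a positive multiple of every_n_turns."""
    return every_n_turns > 0 and step_count > 0 and step_count % every_n_turns == 0


def run_with_judge(
    steps: Iterable[str],
    judge: Callable[[List[str]], str],
    every_n_turns: int,
    early_exit: bool,
) -> Tuple[List[str], List[Verdict]]:
    """Consume agent steps, judging periodically; stop after a flagged step."""
    seen: List[str] = []
    verdicts: List[Verdict] = []
    for count, step in enumerate(steps, start=1):
        seen.append(step)  # the current step is always kept
        if should_judge(count, every_n_turns):
            verdict = parse_verdict(judge(seen))
            verdicts.append(verdict)
            if verdict.flagged and early_exit:
                break
    return seen, verdicts


def fake_judge(trajectory: List[str]) -> str:
    """Stand-in for the judge LLM: flags any destructive command."""
    flagged = any("rm -rf" in step for step in trajectory)
    return json.dumps({"flagged": flagged, "reason": "demo", "confidence": 0.9})


steps = ["ls", "cat a.txt", "rm -rf /tmp/x", "echo done", "ls", "pwd"]
seen, verdicts = run_with_judge(steps, fake_judge, every_n_turns=2, early_exit=True)
```

In this sketch the destructive command at step 3 is only noticed at step 4, the next judging point, which shows the trade-off between judge cost and detection latency that `every_n_turns` controls.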
Added - Experiment lifecycle hooks
- `pre_run_commands` / `post_run_commands` config fields (`HookCommandConfig`): lists of shell commands run before / after the agent sessions, engine-independent. Each command receives `HARNESS_RUN_DIR` and `HARNESS_WORK_DIR` in its environment; useful for local services, fixture setup, and grading scripts. Per-command `cwd`, `timeout_seconds` (default 30), and `check` (default true) are configurable. `post_run_commands` run in a `finally` block, so they fire even if a session errors.
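The hook semantics above can be sketched with `subprocess.run`. The environment variable names and the `cwd` / `timeout_seconds` / `check` options come from the entry above; the `HookCommand` class, the function names, the default working directory, and the choice to run pre-hooks outside the `try` are assumptions. The demo commands assume a POSIX shell.

```python
import os
import subprocess
import tempfile
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class HookCommand:
    """Illustrative stand-in for HookCommandConfig."""
    command: str
    cwd: Optional[str] = None
    timeout_seconds: float = 30
    check: bool = True


def run_hooks(hooks: List[HookCommand], run_dir: str, work_dir: str) -> None:
    """Run each hook through the shell with the harness directories exported."""
    env = {**os.environ, "HARNESS_RUN_DIR": run_dir, "HARNESS_WORK_DIR": work_dir}
    for hook in hooks:
        subprocess.run(
            hook.command,
            shell=True,
            cwd=hook.cwd or work_dir,  # assumed default working directory
            env=env,
            timeout=hook.timeout_seconds,
            check=hook.check,  # raises CalledProcessError on non-zero exit
        )


def run_experiment(
    pre: List[HookCommand],
    post: List[HookCommand],
    sessions: Callable[[], None],
    run_dir: str,
    work_dir: str,
) -> None:
    run_hooks(pre, run_dir, work_dir)  # assumption: a failing pre-hook aborts the run
    try:
        sessions()
    finally:
        run_hooks(post, run_dir, work_dir)  # fires even if a session errors


def demo() -> List[str]:
    """Run hooks around a crashing session and list what they created."""
    with tempfile.TemporaryDirectory() as run_dir, tempfile.TemporaryDirectory() as work_dir:
        pre = [HookCommand('echo pre > "$HARNESS_RUN_DIR/pre.txt"')]
        post = [HookCommand('echo post > "$HARNESS_RUN_DIR/post.txt"')]

        def failing_sessions() -> None:
            raise RuntimeError("session crashed")

        try:
            run_experiment(pre, post, failing_sessions, run_dir, work_dir)
        except RuntimeError:
            pass
        return sorted(os.listdir(run_dir))


created = demo()
```

Both files exist after the crashed session, which is the property that makes post-run hooks suitable for grading scripts and service teardown.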
Fixed
- Codex engine stderr deadlock: the engine read the Codex subprocess's stderr only after its stdout stream closed. When Codex emits a large stderr payload (e.g. a ~500KB "failed to refresh available models" error, which OpenRouter triggers on every run because Codex can't decode its `/models` response), that output fills the OS pipe buffer, Codex blocks writing stderr, stops producing stdout, and the run hangs indefinitely. stderr is now drained concurrently via a background task, so neither stream can stall the other. This was latent for the OpenAI path (its model refresh succeeds) but blocked the Codex-via-OpenRouter path entirely.
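The fix pattern can be sketched with standard `asyncio` subprocess APIs. This is a minimal reproduction under assumed conditions, not the engine's actual code: a child Python process stands in for Codex and writes about 1 MB to stderr before printing a single stdout line, and the names `drain` and `run_child` are hypothetical.

```python
import asyncio
import sys

# Child process: floods stderr first, then emits one stdout line.
# Reading stdout to EOF before touching stderr would hang on this child
# once the stderr pipe buffer fills.
CHILD = (
    "import sys; sys.stderr.write('x' * 1_000_000); sys.stderr.flush(); "
    "print('done', flush=True)"
)


async def drain(stream: asyncio.StreamReader, sink: bytearray) -> None:
    """Read a stream to EOF in chunks so the writer never blocks on a full pipe."""
    while chunk := await stream.read(65536):
        sink.extend(chunk)


async def run_child() -> tuple[list[str], int]:
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", CHILD,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stderr_buf = bytearray()
    # Background task drains stderr while the loop below consumes stdout.
    stderr_task = asyncio.create_task(drain(proc.stderr, stderr_buf))
    lines = []
    async for raw in proc.stdout:
        lines.append(raw.decode().rstrip())
    await stderr_task
    await proc.wait()
    return lines, len(stderr_buf)


# The timeout turns a regression into a fast failure instead of a hang.
stdout_lines, stderr_bytes = asyncio.run(asyncio.wait_for(run_child(), timeout=30))
```

Keeping the stderr bytes in a buffer rather than discarding them means the diagnostic text is still available for error reporting after the run.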
Changed
- BREAKING (internal API): `ATIFAdapter` now consumes normalized `EngineEvent`s, not Claude SDK messages. `process_message(sdk_msg)` is replaced by `process_event(event)` (a `process_message` alias remains but now expects an `EngineEvent`). Code that drove the adapter with raw `claude_agent_sdk` message objects must now go through an engine (or construct `EngineEvent`s). The adapter no longer imports `claude_agent_sdk`.
- `runner.py` is engine-agnostic: it selects an engine via `get_engine()` and no longer imports the Claude SDK directly. SDK-specific message translation and transcript-copying moved into `engines/claude_code.py`. `run_session()` gained additive optional params (`resume_rollout_path`).
- `build_provider_env()` is engine-aware: it returns an empty env for Codex (which authenticates via `~/.codex/auth.json` or `OPENAI_API_KEY`); Claude Code provider routing is unchanged.
- `provider` defaults to `openai` for the Codex engine when not explicitly set, and is now constrained to `openai` or `openrouter` for Codex (it selects the Codex model provider; for `claude_code` it remains the Anthropic-routing concept).
- Write-detection extended to Codex tool names (`command_execution`, `file_change`) so per-step file-write attribution works for both engines.
- Runner captures a provisional `session_id` from the engine's init / `thread.started` event so transcript and diff artifacts are still saved when a session early-exits (judge) before its terminal result event.
- The Claude Agent SDK's harmless anyio "cancel scope" background-task error (emitted when its query generator is closed early) is suppressed via a scoped asyncio exception handler; all other loop errors pass through.
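A scoped exception handler like the one in the last entry could look roughly like this. It uses the standard `loop.get_exception_handler()` / `loop.set_exception_handler()` APIs; the context-manager name and the substring match on "cancel scope" are assumptions, and the real implementation may identify the error differently.

```python
import asyncio
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List


@contextmanager
def suppress_cancel_scope_errors(loop: asyncio.AbstractEventLoop) -> Iterator[None]:
    """Temporarily drop 'cancel scope' loop errors; forward everything else."""
    previous = loop.get_exception_handler()

    def handler(lp: asyncio.AbstractEventLoop, context: Dict[str, Any]) -> None:
        text = f"{context.get('message', '')} {context.get('exception')!r}"
        if "cancel scope" in text.lower():  # assumed matching heuristic
            return
        if previous is not None:
            previous(lp, context)
        else:
            lp.default_exception_handler(context)

    loop.set_exception_handler(handler)
    try:
        yield
    finally:
        loop.set_exception_handler(previous)  # restore on exit, even on error


# Demo: record what reaches the pre-existing handler.
loop = asyncio.new_event_loop()
seen: List[str] = []
loop.set_exception_handler(lambda lp, ctx: seen.append(ctx["message"]))

with suppress_cancel_scope_errors(loop):
    loop.call_exception_handler({
        "message": "Task exception was never retrieved",
        "exception": RuntimeError(
            "Attempted to exit cancel scope in a different task than it was entered in"
        ),
    })
    loop.call_exception_handler({"message": "something else", "exception": ValueError("boom")})

restored = loop.get_exception_handler()
loop.close()
```

Restoring the previous handler in `finally` keeps the suppression limited to the block that closes the query generator, so unrelated errors elsewhere in the run are never hidden.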
Notes / limitations
- The `agents:` config block is Claude-Code-only: it defines Claude `AgentDefinition`s. Configs combining `engine: codex` with `agents:` are rejected at validation. Codex subagents are configured natively (TOML files in `~/.codex/agents/`) and enabled via `codex_multi_agent`, not via `agents:`.
- Codex API capture/resample require an API key with active billing: `OPENAI_API_KEY` for `provider: openai`, or `OPENROUTER_API_KEY` for `provider: openrouter`. Capture routes through a custom model provider using API-key auth because the built-in providers' base URLs cannot be overridden. Trajectories, diffs, change tracking, and turn-level replay work with `codex login` (subscription) auth alone for the OpenAI path; set `capture_api_requests: false` in that case.
- Cost is not reported by the Codex CLI, so `total_cost_usd` is `null` for Codex runs (token usage is still captured).
- The auto-judge requires an API key for its backend (no Claude subscription auth) and does not route through the capture proxy, so judge calls are not part of `api_captures.jsonl`. Judge cadence counts agent steps: it fires when the agent-step count is a multiple of `every_n_turns`.
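The validation rules mentioned in this release (engine default, Codex provider constraint, vendor-prefixed OpenRouter slugs, and rejection of `agents:` with Codex) can be summarized in a small sketch. The field names come from this changelog; the function, the dict-based config, and the error messages are hypothetical, and the project's real validation will differ.

```python
from typing import Any, Dict

ENGINES = {"claude_code", "codex"}
CODEX_PROVIDERS = {"openai", "openrouter"}


def validate_config(cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of cfg with defaults resolved, or raise ValueError."""
    engine = cfg.get("engine", "claude_code")  # default preserves old configs
    if engine not in ENGINES:
        raise ValueError(f"engine must be one of {sorted(ENGINES)}, got {engine!r}")

    resolved = {**cfg, "engine": engine}
    if engine != "codex":
        return resolved  # Claude Code provider routing is left untouched

    if cfg.get("agents"):
        raise ValueError("agents: is Claude-Code-only; use codex_multi_agent for Codex")

    provider = cfg.get("provider") or "openai"
    if provider not in CODEX_PROVIDERS:
        raise ValueError(f"Codex provider must be one of {sorted(CODEX_PROVIDERS)}")

    if provider == "openrouter":
        # A full slug looks like "vendor/model"; both halves must be present.
        vendor, sep, name = str(cfg.get("model", "")).partition("/")
        if not (vendor and sep and name):
            raise ValueError("OpenRouter models need a vendor-prefixed slug")

    resolved["provider"] = provider
    return resolved


ok = validate_config(
    {"engine": "codex", "provider": "openrouter", "model": "openai/gpt-5.3-codex"}
)
```

Failing at validation time, before any session starts, avoids spending a run on a combination that one engine would silently ignore.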
[0.1.1] - 2026-03-18
Fixed
- Replay filesystem reset for chained sessions: when replaying session N > 1, the filesystem was incorrectly reset to `baseline` (pre-experiment state) instead of the end state of session N-1. This caused the agent to see stale files (e.g. an empty MEMORY.md instead of one populated by prior sessions). The fix falls back to `session_{N-1}` when no file-write tags exist within the current session before the replay turn.
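The corrected fallback order can be sketched as a small selection function. The ref names `baseline` and `session_{N-1}` come from the entry above; the function name, its arguments, and the assumption that write tags arrive in chronological order are hypothetical.

```python
from typing import List


def choose_reset_ref(session_index: int, write_tags_before_turn: List[str]) -> str:
    """Pick the shadow-git ref to restore before replaying a turn.

    write_tags_before_turn: file-write tags recorded in the current session
    before the replay turn, oldest first (assumed ordering).
    """
    if write_tags_before_turn:
        # The most recent write inside this session is the closest state.
        return write_tags_before_turn[-1]
    if session_index > 1:
        # No writes yet in this session: start from the prior session's end state.
        return f"session_{session_index - 1}"
    # First session with no writes: the pre-experiment state is correct.
    return "baseline"
```

Under these assumptions, the bug corresponds to skipping the middle branch, so a replay early in session 3 would have restored `baseline` instead of `session_2`.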
Changed
- Removed experiment configs from version control (already in `.gitignore`, now untracked).
[0.1.0] - 2026-03-17
Initial release.
Added
- Experiment runner: YAML-based config for multi-session Claude Code experiments via the Claude Agent SDK
- ATIF trajectory capture: every agent step, tool call, observation, and thinking block captured in ATIF v1.6 format
- Shadow git change tracking: invisible bare git repo tracks all file changes with per-step write attribution and unified diffs
- Session modes: `isolated` (fresh conversation, files persist), `chained` (conversation resumes), `forked` (independent branches from a base session)
- Flexible forking: `fork_from` on individual sessions to fork from any prior session, not just session 1
- Session replicates: `count: N` runs the same session N times as independent replicates with `_rNN` directory suffixes
- Subagent capture: separate ATIF trajectories for each subagent invocation, linked to parent via `SubagentTrajectoryRef`
- API request capture: local reverse proxy captures raw request/response bodies, system prompts, tool definitions, token usage, and compaction events
- Turn-level resampling: replay a specific API request N times to study response variance (stateless, no tool execution)
- Intervention testing: edit captured API requests (assistant text, tool results, system prompt) and resample with modified inputs; available from both CLI (`harness resample-edit`) and web UI
- Session-level resampling: re-run a forked session N times with full tool execution (`harness resample-session`)
- Turn-level replay: branch execution from any API turn with exact-match context, filesystem reset via git worktrees, and full tool execution; replicates run in parallel (`harness replay`)
- Transcript capture: Claude Code transcript JSONL copied into session output for replay support
- UUID map: per-turn correlation across transcript, ATIF trajectory, and raw API dumps using `tool_call_id` as join key
- Web UI: SvelteKit interface for browsing runs, viewing trajectories, memory diffs, API captures, resamples, edit & resample, and file changelogs
- CLI: `harness run`, `list`, `inspect`, `resample`, `resample-edit`, `resample-session`, `replay`
- Provider support: OpenRouter (default), Anthropic, AWS Bedrock, GCP Vertex AI
- Memory file: auto-seeded file in working directory for cross-session note persistence