Agent Context Health Evaluation for AI Workflows

Agents fail on bad context. CtxGov checks AI-facing repo and workflow context for stale claims, conflicting instructions, unsupported releases, unsafe action guidance, and hidden terminal failures before execution.

Before: AI-facing context

README.md: v0.6.3 is released and safe to publish.
release notes: External deploy, proactive outreach, public benchmark, and package publication remain pending.
AGENTS.md: Run the deploy script after tests pass.
terminal.log: FAILED tests/test_release.py::test_release_url_not_404

After: CtxGov report

stale_claim: Release wording is stronger than the readiness evidence. Evidence: "v0.6.3 is released and safe to publish."
unsupported_release_claim: Release URL must exist before public copy can point to it. Evidence: "test_release_url_not_404"
conflicting_policy: Deploy instruction conflicts with pending external approval. Evidence: "External deploy ... remain pending."
unsafe_action_guidance: Side-effect action lacks explicit approval and rollback. Evidence: "Run the deploy script after tests pass."
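
The cross-check in the example above, where a readiness claim is compared against a failing log line, can be illustrated with a small Python sketch. This is not CtxGov's implementation. It assumes the context is available as a mapping from file name to text, and the names `Finding`, `check_readiness_against_log`, `FAILED_LINE`, and `READY_WORDS` are hypothetical, as are the two regular expressions.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    finding_type: str  # taxonomy label, e.g. "hidden_terminal_failure"
    source: str        # file the evidence came from
    evidence: str      # exact text span that triggered the finding


# Assumed patterns: a pytest-style failure line and readiness wording.
FAILED_LINE = re.compile(r"^FAILED\s+\S+", re.MULTILINE)
READY_WORDS = re.compile(r"\b(?:released|safe to publish)\b", re.IGNORECASE)


def check_readiness_against_log(context, log_name="terminal.log"):
    """Flag readiness claims that coexist with failures in the terminal log."""
    log_text = context.get(log_name, "")
    failures = [m.group(0) for m in FAILED_LINE.finditer(log_text)]
    claims = [
        (name, line.strip())
        for name, text in context.items()
        if name != log_name
        for line in text.splitlines()
        if READY_WORDS.search(line)
    ]
    # Only report when both sides of the contradiction are present.
    if not failures or not claims:
        return []
    findings = [Finding("stale_claim", name, line) for name, line in claims]
    findings += [Finding("hidden_terminal_failure", log_name, f) for f in failures]
    return findings


if __name__ == "__main__":
    sample = {
        "README.md": "v0.6.3 is released and safe to publish.",
        "release notes": "External deploy, proactive outreach, public benchmark, "
                         "and package publication remain pending.",
        "AGENTS.md": "Run the deploy script after tests pass.",
        "terminal.log": "FAILED tests/test_release.py::test_release_url_not_404",
    }
    for finding in check_readiness_against_log(sample):
        print(f"{finding.finding_type} {finding.source}: {finding.evidence}")
```

Each finding keeps the literal evidence text, so a reviewer can trace it back to the source line rather than trusting a summary.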

Problem

AI agents execute against context assembled from many surfaces: repository docs, rules files, release notes, memory summaries, saved traces, and terminal logs. When those surfaces disagree, the model can produce polished work from broken premises.

Stale

Old claims outlive the evidence that once supported them.

Conflicting

Two instructions grant incompatible authority.

Unsupported

Copy references a release, benchmark, or capability without an artifact.

Unsafe

Action guidance skips approval, rollback, or side-effect boundaries.

60-Second Demo

The demo story is a before/after report. The left side shows a small sample repo with stale, conflicting, unsupported, unsafe, hidden-failure, and Memory X-Ray L1 gaps; the right side shows a context-health report with finding type, evidence span, and claim boundaries.

View the companion demo GIF or inspect the reproducible demo report.

Failure Taxonomy

| Finding Type | Evidence Span | Action |
| --- | --- | --- |
| stale_claim | Current-facing claim contradicted by fresher source. | Downgrade, caveat, or refresh the claim. |
| conflicting_policy | Two context files authorize incompatible behavior. | Pick the authoritative source or block execution. |
| unsupported_release_claim | Release, tag, package, benchmark, or demo claim without artifact. | Create the artifact or remove the claim. |
| unsafe_action_guidance | Instruction asks an agent to run/write/deploy without approval. | Require side-effect approval and rollback. |
| hidden_terminal_failure | Log shows failure while handoff says pass or ready. | Preserve the failure and rerun verification. |
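
For readers who want to consume these labels programmatically, the taxonomy above can be encoded as a closed set of finding types with a lookup for the recommended action. The sketch below is one possible encoding, not CtxGov's API: `FindingType`, `RECOMMENDED_ACTION`, and `recommended_action` are hypothetical names, while the label strings and action texts are taken from the taxonomy.

```python
from enum import Enum


class FindingType(str, Enum):
    STALE_CLAIM = "stale_claim"
    CONFLICTING_POLICY = "conflicting_policy"
    UNSUPPORTED_RELEASE_CLAIM = "unsupported_release_claim"
    UNSAFE_ACTION_GUIDANCE = "unsafe_action_guidance"
    HIDDEN_TERMINAL_FAILURE = "hidden_terminal_failure"


# Action text copied from the taxonomy; the mapping itself is illustrative.
RECOMMENDED_ACTION = {
    FindingType.STALE_CLAIM: "Downgrade, caveat, or refresh the claim.",
    FindingType.CONFLICTING_POLICY: "Pick the authoritative source or block execution.",
    FindingType.UNSUPPORTED_RELEASE_CLAIM: "Create the artifact or remove the claim.",
    FindingType.UNSAFE_ACTION_GUIDANCE: "Require side-effect approval and rollback.",
    FindingType.HIDDEN_TERMINAL_FAILURE: "Preserve the failure and rerun verification.",
}


def recommended_action(label: str) -> str:
    """Return the taxonomy action for a label; unknown labels raise ValueError."""
    return RECOMMENDED_ACTION[FindingType(label)]


if __name__ == "__main__":
    print(recommended_action("hidden_terminal_failure"))
```

Using an enum means an unrecognized label fails loudly instead of silently passing through a report.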

Benchmark

The companion evaluation artifact lives in ctxgov/agent-context-evals. Its case sets currently comprise 50 v0.2 trace-pattern labeled cases, 12 v0.3 review-intake cases, 20 v0.4 hard negatives, 160 v0.5 deterministic mutation cases with 206 labels, 60 v0.6 adversarial hard negatives, 96 v0.7 trace-shaped cases, and 12 hidden-holdout public case texts.

Alongside the cases, the artifact ships evidence spans, a regex baseline, an offline LLM-judge harness, CtxGov heuristic and native doctor adapter modes, and offline GitHub/CI/rules/registry/transcript/memory adapters. Evaluation tooling includes single-label and multi-label scoring, automated error analysis, per-finding metrics, and evidence-span diagnostics. A technical report draft, a review packet, and a demo fixture are also included.

160 v0.5 mutation cases
96 v0.7 trace-shaped cases
60 v0.6 adversarial clean controls
0 public benchmark claims
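
The multi-label scoring and per-finding metrics mentioned in this section can be sketched as follows. This is a generic precision/recall/F1 computation under stated assumptions, not the scoring code in the evaluation artifact: each case is assumed to carry a set of gold finding labels and a set of predicted labels, a hard negative is modeled as a case with an empty gold set, and the function name `per_finding_metrics` is hypothetical.

```python
from collections import Counter


def per_finding_metrics(gold, predicted):
    """Compute precision, recall, and F1 per finding label.

    gold and predicted are parallel lists of label sets, one set per case.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of cases")

    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_labels, pred_labels in zip(gold, predicted):
        for label in pred_labels & gold_labels:
            tp[label] += 1
        for label in pred_labels - gold_labels:
            fp[label] += 1  # includes any prediction on a hard negative
        for label in gold_labels - pred_labels:
            fn[label] += 1

    metrics = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        predicted_total = tp[label] + fp[label]
        gold_total = tp[label] + fn[label]
        precision = tp[label] / predicted_total if predicted_total else 0.0
        recall = tp[label] / gold_total if gold_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


if __name__ == "__main__":
    gold = [{"stale_claim"}, set(), {"stale_claim", "unsafe_action_guidance"}]
    predicted = [{"stale_claim"}, {"stale_claim"}, {"unsafe_action_guidance"}]
    for label, scores in per_finding_metrics(gold, predicted).items():
        print(label, scores)
```

Scoring per label rather than per case keeps a strong result on one finding type from masking false positives on clean controls for another.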

Limitations

CtxGov is not a security scanner, sandbox, agent harness, provider SDK, memory backend, automatic remediation agent, or universal benchmark. Until independently reviewed trace-derived labels and administered holdout results exist, the evaluation materials remain limited to public v0.2 scaffold data, a v0.3 review-ready packet, v0.4 synthetic hard negatives, v0.5 deterministic mutation data, v0.6 adversarial hard negatives, and v0.7 trace-shaped local data.