Agents fail on bad context. CtxGov checks AI-facing repo and workflow context for stale claims, conflicting instructions, unsupported releases, unsafe action guidance, and hidden terminal failures before execution.
v0.6.3 is released and safe to publish.
External deploy, proactive outreach, public benchmark, and package publication remain pending.
Run the deploy script after tests pass.
FAILED tests/test_release.py::test_release_url_not_404
v0.6.3 is released and safe to publish.
test_release_url_not_404
External deploy ... remain pending.
Run the deploy script after tests pass.
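The sample above pairs a "released and safe to publish" claim with a pytest-style `FAILED` line. A minimal sketch of how such a hidden terminal failure could be surfaced is shown below. It is not CtxGov's implementation: the function name `find_hidden_failures`, both regular expressions, and the output field names are assumptions made for illustration.

```python
import re

# Hypothetical patterns for illustration only; not CtxGov's actual rules.
FAILED_TEST = re.compile(r"^FAILED\s+(\S+)::(\w+)")
READY_CLAIM = re.compile(r"\b(released|safe to publish|ready)\b", re.IGNORECASE)


def find_hidden_failures(lines):
    """Pair every ready/released claim with every FAILED test line in the same context."""
    failures = []
    claims = []
    for number, line in enumerate(lines, start=1):
        match = FAILED_TEST.match(line)
        if match:
            # Keep the line number and the test name as the evidence span.
            failures.append((number, match.group(2)))
        elif READY_CLAIM.search(line):
            claims.append((number, line.strip()))
    # A claim is only reported when at least one failure is present.
    return [
        {"claim_line": c_no, "claim": claim, "failure_line": f_no, "test": test}
        for c_no, claim in claims
        for f_no, test in failures
    ]


context = [
    "v0.6.3 is released and safe to publish.",
    "External deploy, proactive outreach, public benchmark, and package publication remain pending.",
    "Run the deploy script after tests pass.",
    "FAILED tests/test_release.py::test_release_url_not_404",
]

for finding in find_hidden_failures(context):
    print(finding)
```

With the four sample lines, the sketch reports one finding that links the release claim on line 1 to `test_release_url_not_404` on line 4. A real checker would also need to weigh source freshness and authority, which this sketch ignores.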
AI agents execute against context assembled from many surfaces: repository docs, rules files, release notes, memory summaries, saved traces, and terminal logs. When those surfaces disagree, the model can produce polished work from broken premises.
Old claims outlive the evidence that once supported them.
Two instructions grant incompatible authority.
Copy references a release, benchmark, or capability without an artifact.
Action guidance skips approval, rollback, or side-effect boundaries.
The demo story is a before/after report: left side shows a small sample repo with stale, conflicting, unsupported, unsafe, hidden-failure, and Memory X-Ray L1 gaps; right side shows a context-health report with finding type, evidence span, and claim boundaries.
View the companion demo GIF or inspect the reproducible demo report.
| Finding Type | Evidence Span | Action |
|---|---|---|
| stale_claim | Current-facing claim contradicted by fresher source. | Downgrade, caveat, or refresh the claim. |
| conflicting_policy | Two context files authorize incompatible behavior. | Pick the authoritative source or block execution. |
| unsupported_release_claim | Release, tag, package, benchmark, or demo claim without artifact. | Create the artifact or remove the claim. |
| unsafe_action_guidance | Instruction asks an agent to run/write/deploy without approval. | Require side-effect approval and rollback. |
| hidden_terminal_failure | Log shows failure while handoff says pass or ready. | Preserve the failure and rerun verification. |
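The finding types and actions in the table can be modeled as a small lookup structure. The sketch below is one possible shape, not CtxGov's own data model: the class names `FindingType` and `Finding`, the `source` and `evidence` fields, and the `terminal.log` file name are hypothetical. Only the type identifiers and action strings come from the table.

```python
from dataclasses import dataclass
from enum import Enum


class FindingType(str, Enum):
    STALE_CLAIM = "stale_claim"
    CONFLICTING_POLICY = "conflicting_policy"
    UNSUPPORTED_RELEASE_CLAIM = "unsupported_release_claim"
    UNSAFE_ACTION_GUIDANCE = "unsafe_action_guidance"
    HIDDEN_TERMINAL_FAILURE = "hidden_terminal_failure"


# Action text copied from the table above.
RECOMMENDED_ACTION = {
    FindingType.STALE_CLAIM: "Downgrade, caveat, or refresh the claim.",
    FindingType.CONFLICTING_POLICY: "Pick the authoritative source or block execution.",
    FindingType.UNSUPPORTED_RELEASE_CLAIM: "Create the artifact or remove the claim.",
    FindingType.UNSAFE_ACTION_GUIDANCE: "Require side-effect approval and rollback.",
    FindingType.HIDDEN_TERMINAL_FAILURE: "Preserve the failure and rerun verification.",
}


@dataclass(frozen=True)
class Finding:
    finding_type: FindingType
    source: str    # hypothetical field: file or log the evidence came from
    evidence: str  # hypothetical field: the quoted evidence span

    @property
    def action(self) -> str:
        return RECOMMENDED_ACTION[self.finding_type]


finding = Finding(
    FindingType.HIDDEN_TERMINAL_FAILURE,
    "terminal.log",  # made-up file name for the example
    "FAILED tests/test_release.py::test_release_url_not_404",
)
print(finding.finding_type.value, "->", finding.action)
```

Keeping the action text in a single mapping means a report renderer and a policy gate would read the same wording, under the assumption that each finding type has exactly one recommended action.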
The companion evaluation artifact lives in ctxgov/agent-context-evals. Its case sets currently comprise 50 v0.2 trace-pattern labeled cases, 12 v0.3 review-intake cases, 20 v0.4 hard negatives, 160 v0.5 deterministic mutation cases with 206 labels, 60 v0.6 adversarial hard negatives, 96 v0.7 trace-shaped cases, and 12 hidden-holdout public case texts, along with evidence spans. Tooling includes a regex baseline, an offline LLM-judge harness, CtxGov heuristic and native doctor adapter modes, and offline GitHub/CI/rules/registry/transcript/memory adapters. Evaluation support covers single-label and multi-label scoring, automated error analysis, per-finding metrics, and evidence-span diagnostics. The artifact also contains a technical report draft, a review packet, and a demo fixture.
v0.5 mutation cases
v0.7 trace-shaped cases
v0.6 adversarial clean controls
public benchmark claims
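The multi-label scoring and per-finding metrics mentioned for the evaluation artifact can be illustrated with a short sketch. It assumes each case carries a set of gold labels and a set of predicted labels, with hard negatives represented as an empty gold set. The function name `per_label_metrics` and the toy data are made up for illustration and do not reproduce the artifact's scorer or its results.

```python
from collections import Counter


def per_label_metrics(gold, predicted):
    """Compute precision and recall per finding type over multi-label cases."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of cases")
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_labels, predicted_labels in zip(gold, predicted):
        for label in predicted_labels & gold_labels:
            tp[label] += 1
        for label in predicted_labels - gold_labels:
            fp[label] += 1  # includes any prediction on a hard negative
        for label in gold_labels - predicted_labels:
            fn[label] += 1
    metrics = {}
    for label in set(tp) | set(fp) | set(fn):
        predicted_total = tp[label] + fp[label]
        gold_total = tp[label] + fn[label]
        metrics[label] = {
            "precision": tp[label] / predicted_total if predicted_total else 0.0,
            "recall": tp[label] / gold_total if gold_total else 0.0,
        }
    return metrics


# Toy data, invented for this example only.
gold = [{"stale_claim"}, {"stale_claim", "hidden_terminal_failure"}, set()]
predicted = [{"stale_claim"}, {"hidden_terminal_failure"}, {"stale_claim"}]

for label, scores in sorted(per_label_metrics(gold, predicted).items()):
    print(label, scores)
```

In the toy data, the third case is a hard negative, so the `stale_claim` prediction on it counts as a false positive and lowers that label's precision.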
CtxGov is not a security scanner, sandbox, agent harness, provider SDK, memory backend, automatic remediation agent, or universal benchmark. Until independently reviewed trace-derived labels and administered holdout results exist, the evaluation materials remain limited to public v0.2 scaffold data, v0.4 synthetic hard negatives, v0.5 deterministic mutation data, v0.6 adversarial hard negatives, v0.7 trace-shaped local data, and a v0.3 review-ready packet.