Quorum, task-aware agent orchestrator
Honest finding
K=3 adversarial verification cut false positives from 27.8% to 0.0% (95% CI [11.1, 50.0] to [0, 0]; recall 100% to 77.8%) on a 36-snippet labeled set including prompt-injection traps.
FP rate, before27.8%
FP rate, after K=30.0%
Recall100% to 77.8%
Held-out real target3/3 bugs, 0 surviving FPs
Cost per run~$0.25
Test suite58 tests, CI green
Cost-routing claim (DeepSeek to Haiku to Sonnet to Opus) is operator-gated on an Anthropic key. The harness is committed and wired; the live multi-tier number requires a key to produce.
Aegis, adaptive red-team gauntlet
Honest finding (the sophisticated one)
A reasoning model is significantly more robust: injection ASR 49.3% vs 68.1%, p=0.0012. But the full defense stack erases that gap entirely (1.7% vs 2.8%, p=0.40, not significant). The defense is the story, not the model.
Injection ASR, reasoning model49.3%
Injection ASR, standard68.1% (p=0.0012)
ASR with full defense stack1.7% vs 2.8% (p=0.40)
Adaptation lift (scaled)24.0% to 29.9%
Defense reduction29.2% to 4.2% (-25%)
Test suite78 tests, CI green
Adaptation lift was a null at small n. Became significant only after scaling the benchmark. McNemar b=17/c=0. Framed as a methodology lesson: scaling is the legit power lever, not p-hacking.
FieldAgent, CUAD contract red-flag finder
Honest finding (the null)
The "agentic chunking lift" is model-specific noise, not a real advantage. +0.45 on DeepSeek collapsed to +0.07 on a fair rerun (CIs overlap; ties on Sonnet). The truncation artifact is documented and in the case study.
Detection F10.548 (P=0.741, R=0.435)
95% CI[0.460, 0.637]
vs. keyword floor+0.21 F1 (robust)
Agentic lift (claimed)+0.45 (truncation artifact)
Agentic lift (fair rerun)+0.07 (CIs overlap)
Test suite47 tests, CI green
A 4-proxy council (taste, pragmatism, intent, anti-drift) votes on every self-improvement before it ships. Pipeline: adversary agent finds regressions, editor agents verify independently, merger deduplicates, council votes, escalate on disagreement. 576 tests gate every patch.
Internal infra. No public URL. Below: a real pipeline run, replayed as a lab-notebook log.
adversary→editors (x3)→merger→council (4 proxies)→escalate on split→patch + 576 tests