2026-06-07 · pg. 1 · Thomas Peng

Builds agentic systems. Reports the nulls.

Graphic designer turned AI-native engineer. Four artifacts, one shared kernel. Deterministic scoring, adversarial verification, cost-gated runs, and honest findings including the ones that did not pan out.

thomas@thomaspeng.ca View work GitHub

4 public artifacts. 1 shared kernel (Quorum core/). Target: Applied AI / Forward-Deployed / Agent Engineer at a frontier lab.

Thesis / kernel narrative

One substrate. Three problems. Honest scores.

Shared kernel

Quorum's core/ (cost-aware routing, K-adversarial verification, full tracing) ships in all four artifacts. That is a real substrate, not a portfolio theme.

Deterministic scoring

Every evaluation uses exact-match, span-IoU, or chi-square. No LLM judge in the success path. Results are reproducible offline via make eval-dry.

Honest nulls

FieldAgent's agentic chunking lift collapsed from +0.45 to +0.07 on a fair rerun. That is in the case study, not buried. At a frontier lab, honest nulls read as rigor.

Case studies

The work, entry by entry.

Artifact 1 / Flagshipgithub.com/7P3ng/quorum

Quorum, task-aware agent orchestrator

Honest finding

K=3 adversarial verification cut false positives from 27.8% to 0.0% (95% CI [11.1, 50.0] to [0, 0]; recall 100% to 77.8%) on a 36-snippet labeled set including prompt-injection traps.

FP rate, before27.8%

FP rate, after K=30.0%

Recall100% to 77.8%

Held-out real target3/3 bugs, 0 surviving FPs

Cost per run~$0.25

Test suite58 tests, CI green

Cost-routing claim (DeepSeek to Haiku to Sonnet to Opus) is operator-gated on an Anthropic key. The harness is committed and wired; the live multi-tier number requires a key to produce.

Live demo (tap to open)

Quorum trace UI, live (external link)

Quorum trace UI, liveOpen live

Artifact 2github.com/7P3ng/aegis

Aegis, adaptive red-team gauntlet

Honest finding (the sophisticated one)

A reasoning model is significantly more robust: injection ASR 49.3% vs 68.1%, p=0.0012. But the full defense stack erases that gap entirely (1.7% vs 2.8%, p=0.40, not significant). The defense is the story, not the model.

Injection ASR, reasoning model49.3%

Injection ASR, standard68.1% (p=0.0012)

ASR with full defense stack1.7% vs 2.8% (p=0.40)

Adaptation lift (scaled)24.0% to 29.9%

Defense reduction29.2% to 4.2% (-25%)

Test suite78 tests, CI green

Adaptation lift was a null at small n. Became significant only after scaling the benchmark. McNemar b=17/c=0. Framed as a methodology lesson: scaling is the legit power lever, not p-hacking.

Live demo (tap to open)

Aegis demo, live (external link)

Aegis demo, liveOpen live

Artifact 3github.com/7P3ng/fieldagent

FieldAgent, CUAD contract red-flag finder

Honest finding (the null)

The "agentic chunking lift" is model-specific noise, not a real advantage. +0.45 on DeepSeek collapsed to +0.07 on a fair rerun (CIs overlap; ties on Sonnet). The truncation artifact is documented and in the case study.

Detection F10.548 (P=0.741, R=0.435)

95% CI[0.460, 0.637]

vs. keyword floor+0.21 F1 (robust)

Agentic lift (claimed)+0.45 (truncation artifact)

Agentic lift (fair rerun)+0.07 (CIs overlap)

Test suite47 tests, CI green

Live demo (tap to open)

FieldAgent live demo (external link)

FieldAgent live demoOpen live

Artifact 4 / Methodology

Skill-Tuning Council: watch it run.

A 4-proxy council (taste, pragmatism, intent, anti-drift) votes on every self-improvement before it ships. Pipeline: adversary agent finds regressions, editor agents verify independently, merger deduplicates, council votes, escalate on disagreement. 576 tests gate every patch.

Internal infra. No public URL. Below: a real pipeline run, replayed as a lab-notebook log.

adversaryeditors (x3)mergercouncil (4 proxies)escalate on splitpatch + 576 tests

Eval discipline

How results are built and verified.

Deterministic scoring

Exact match, span-IoU, chi-square. No LLM judge in the success path. Findings are reproducible with make eval-dry.

Adversarial verification

K skeptic agents independently challenge each finding before it ships. Catches prompt-injection traps and false positives upstream.

Cost-gated runs

Model routing routes cheapest-capable first: DeepSeek to Haiku to Sonnet to Opus. Concurrency capped at 8. Full runs cost ~$0.25.

Honest nulls

Truncation artifacts, overlapping CIs, not-significant p-values: all documented, all in the case studies. A null is data.

Held-out test sets

20 held-out CUAD contracts (FieldAgent), 36-snippet labeled set (Quorum). Metrics reported on held-out data, not train.

Scaling as the lever

Aegis adaptation lift was a null at small n. Became significant after scaling. Framed as a methodology lesson, not buried.

Contact

Let's talk.

Open to Applied AI, Forward-Deployed Engineer, and Agent Engineer roles at frontier labs. Reach out by email or explore the work on GitHub.

thomas@thomaspeng.ca github.com/7P3ng

EOF

End of notebook. thomas@thomaspeng.ca, 2026.