Posts

The State of Robotics (May 2026)

A field briefing on the physical-AI industry as of May 2026: ~2,500 companies across 24 use cases — where the money and talent sit, when it was built, where the AI runs — plus a browsable directory.

Does Caveman Mode Actually Work? I Benchmarked It (Properly This Time).

TL;DR: Caveman is a Claude plugin that adds a ~700-token system prompt telling the model to talk tersely. I benchmarked it with n=3 runs × 15 prompts × 6 conditions × 2 models via the direct Anthropic API (540+ calls). Then I QA’d my own analysis and found it was riddled with methodology errors. Here’s the actually-honest story. Task type Caveman vs terse (“Answer concisely.”) Statistical significance Short-answer, 4-7 (math, code Q&A) +2.5% to +10% (caveman costs MORE) Not significant Short-answer, 4-6 -3% to -16% Not significant Long-form, 4-7 (tutorials, docs) -5% to +10% (caveman barely helps or hurts) Not significant Long-form, 4-6 -55% to -59% Large effect; paired t-test p≈0.012-0.028 (n=5 prompts) Caveman’s effect is concentrated almost entirely in one cell: long-form generation on claude-opus-4-6. Everywhere else it’s either noise or a slight cost increase. ...

We Benchmarked MCP Against Code Generation. MCP Won (Mostly).

TL;DR: For a small, well-designed API (10 tools), structured MCP tool calls consistently outscore code generation on correctness — 0.99 vs 0.97 — and the gap concentrates in tasks where domain-specific logic matters. Adding a reference document to MCP tools costs 6–14% more tokens with zero accuracy gain. Cloudflare’s search+execute pattern matches MCP accuracy but uses more tokens. With MCP tools, Haiku is within 1% of Opus at 1/12 the cost. ...

Finding the Human in the Machine

50+ open-source orgs are rebuilding how they evaluate contributors. Here’s what’s emerging.

Hidden Technical Debt in Agentic Systems

Agentic systems are replaying hidden technical debt from early ML, and the missing control plane is where the biggest risk accumulates.