Does Caveman Mode Actually Work? I Benchmarked It (Properly This Time).

TL;DR: Caveman is a Claude plugin that adds a ~700-token system prompt telling the model to talk tersely. I benchmarked it with n=3 runs × 15 prompts × 6 conditions × 2 models via the direct Anthropic API (540+ calls). Then I QA’d my own analysis and found it was riddled with methodology errors. Here’s the actually-honest story.

| Task type | Cost, caveman vs terse (“Answer concisely.”) | Statistical significance |
|---|---|---|
| Short-answer, 4-7 (math, code Q&A) | +2.5% to +10% (caveman costs MORE) | Not significant |
| Short-answer, 4-6 | -3% to -16% | Not significant |
| Long-form, 4-7 (tutorials, docs) | -5% to +10% (caveman barely helps or hurts) | Not significant |
| Long-form, 4-6 | -55% to -59% | Large effect; paired t-test p ≈ 0.012–0.028 (n=5 prompts) |

Caveman’s effect is concentrated almost entirely in one cell: long-form generation on claude-opus-4-6. Everywhere else it’s either noise or a slight cost increase. ...
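For context, the headline statistic is just a paired t-test over per-prompt costs under the two conditions. A minimal sketch, with hypothetical per-prompt output-token counts standing in for the real data (which is in the full post):

```python
# Paired t-test sketch: the same 5 long-form prompts under both conditions,
# so each caveman/terse pair differs only in the system prompt.
# The token counts below are HYPOTHETICAL placeholders, not the post's data.
from scipy.stats import ttest_rel

caveman_tokens = [1210, 980, 1440, 1105, 1320]   # hypothetical: caveman system prompt
terse_tokens   = [2900, 2410, 3350, 2720, 3010]  # hypothetical: "Answer concisely."

t_stat, p_value = ttest_rel(caveman_tokens, terse_tokens)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

With n=5 the test has little power, which is why the effect only clears significance in the one cell where it is huge.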

April 20, 2026 · 9 min · npow

We Benchmarked MCP Against Code Generation. MCP Won (Mostly).

TL;DR: For a small, well-designed API (10 tools), structured MCP tool calls consistently outscore code generation on correctness — 0.99 vs 0.97 — and the gap concentrates in tasks where domain-specific logic matters. Adding a reference document to MCP tools costs 6–14% more tokens with zero accuracy gain. Cloudflare’s search+execute pattern matches MCP accuracy but uses more tokens. With MCP tools, Haiku is within 1% of Opus at 1/12 the cost. ...
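To make the Haiku-vs-Opus line concrete: if accuracy is within a point while each call costs roughly 1/12 as much, cost per correct answer falls by nearly the same factor. A quick back-of-envelope sketch; the dollar figures are made up and Haiku’s accuracy is assumed from the “within 1%” claim, only the ~12× ratio comes from the post:

```python
# Cost per correct answer under hypothetical per-call prices ~12x apart.
# Accuracies are illustrative (haiku assumed 0.98, i.e. "within 1%" of opus).
models = {
    "opus":  {"accuracy": 0.99, "cost_per_call": 0.060},  # hypothetical $/call
    "haiku": {"accuracy": 0.98, "cost_per_call": 0.005},  # ~1/12 of the above
}
for name, m in models.items():
    print(f"{name}: ${m['cost_per_call'] / m['accuracy']:.4f} per correct answer")
```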

March 18, 2026 · 12 min · npow

Distillation Isn’t an Attack. It’s a Moat Test.

Distillation reveals which capabilities are economically reproducible.

March 2, 2026 · 2 min · npow