Does Caveman Mode Actually Work? I Benchmarked It (Properly This Time).

TL;DR: Caveman is a Claude plugin that adds a ~700-token system prompt telling the model to talk tersely. I benchmarked it with n=3 runs × 15 prompts × 6 conditions × 2 models via the direct Anthropic API (540+ calls). Then I QA’d my own analysis and found it was riddled with methodology errors. Here’s the actually-honest story.

| Task type | Caveman vs. terse (“Answer concisely.”) | Statistical significance |
|---|---|---|
| Short-answer, 4-7 (math, code Q&A) | +2.5% to +10% (caveman costs MORE) | Not significant |
| Short-answer, 4-6 | -3% to -16% | Not significant |
| Long-form, 4-7 (tutorials, docs) | -5% to +10% (caveman barely helps or hurts) | Not significant |
| Long-form, 4-6 | -55% to -59% | Large effect; paired t-test p ≈ 0.012-0.028 (n=5 prompts) |

Caveman’s effect is concentrated almost entirely in one cell: long-form generation on claude-opus-4-6. Everywhere else it’s either noise or a slight cost increase. ...
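For the one significant cell, a paired t-test makes sense because each prompt acts as its own control: you compare caveman vs. terse token counts prompt by prompt, then test whether the per-prompt differences are centered on zero. Here is a minimal sketch of that computation, assuming output-token counts as the cost metric; the numbers below are hypothetical placeholders, not the benchmark’s data.

```python
# Paired t-test sketch for the long-form cell: each prompt is its own control.
# Token counts here are HYPOTHETICAL placeholders, not the article's data.
from math import sqrt
from statistics import mean, stdev

terse   = [1200, 1500, 1100, 1400, 1300]  # output tokens, "Answer concisely." baseline
caveman = [500, 620, 480, 560, 530]       # output tokens, caveman system prompt

# Per-prompt percent change relative to the terse baseline.
mean_delta_pct = mean((c - t) / t * 100 for c, t in zip(caveman, terse))

# Paired t statistic: mean(d) / (sd(d) / sqrt(n)); the p-value then comes
# from a t distribution with n - 1 degrees of freedom (n=5 prompts).
diffs = [c - t for c, t in zip(caveman, terse)]
n = len(diffs)
t_stat = mean(diffs) / (stdev(diffs) / sqrt(n))
```

With n=5, the degrees of freedom are only 4, which is why even a large effect like the long-form cell lands at p ≈ 0.01-0.03 rather than something far smaller.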

April 20, 2026 · 9 min · npow