This article will be published soon
Scheduled for: 2026-07-01 18:52
The article will appear automatically when the time arrives — no need to refresh.
The Problem of AI Model Usage Costs
With the increasing use of AI models such as Claude Code, Codex, and Gemini, cost has become a major obstacle for developers and companies. These models are priced based on the number of tokens consumed, and every extra word or symbol in the response means higher costs [citation:4][citation:5]. Long, polite responses containing pleasantries, apologies, and lengthy explanations significantly inflate usage bills.
The Solution: Caveman Plugin
A new plugin called Caveman emerged to solve this problem by changing the models’ response style. The plugin removes pleasantries, apologies, and lengthy explanations, keeping only the essential and concise information [citation:5].
Instead of an AI responding with:
“You’re right, there’s a bug in the code, I apologize for not noticing it earlier, and I have now fixed it.”
The response becomes simply:
“Bug exist. Bug fix. Code good.”
This approach, resembling “caveman” communication, focuses on conveying meaning with the fewest possible words [citation:1][citation:2].
Cost Savings
According to initial tests, this approach reduces token consumption by 65% to 75% [citation:1][citation:3][citation:5]. This significant saving translates directly to a sharp drop in model operating costs, especially in development environments that rely heavily on frequent and numerous queries.
For example, if using a particular model costs $10 per million tokens, reducing consumption by 70% means the same task would cost only $3. On the scale of large companies using these models daily, this can save millions annually.
Caveman Usage in Major Companies
Reports indicate that the Caveman plugin is already being used inside major tech companies such as OpenAI, NVIDIA, and GitHub [citation:5][citation:7]. Shayne Sweeney, Director of Engineering at OpenAI, has contributed to its development by adding support for Codex, demonstrating the organization’s confidence in this technology and its effectiveness [citation:4][citation:5].
Uber reportedly burned through its entire annual AI budget in four months, and companies like Walmart have introduced usage caps. An internal memo from electrical and digital infrastructure giant Legrand explicitly lists “use ‘caveman skill’ to reduce output consumption” as one of four high-impact practices to control costs [citation:5].
How Caveman Works
The plugin works by either modifying the prompt sent to the model or by processing the received response. It adds clear instructions for the model to respond in a direct and concise style, avoiding any unnecessary words, reminding the model to focus only on the essence [citation:3].
Key features include [citation:1][citation:2][citation:3]:
| Skill | What It Does |
|---|---|
/caveman [lite\|full\|ultra\|wenyan] |
Compresses every reply. Levels stick until session end. |
/caveman-commit |
Generates Conventional Commit messages with ≤50 character subject. |
/caveman-review |
Produces one-line PR comments. |
/caveman-stats |
Shows real session token usage + lifetime savings + USD. |
/caveman-compress <file> |
Rewrites memory files into caveman-speak, cutting ~46% input tokens every session. |
The plugin offers multiple levels of “grunt”: Lite (drop filler), Full (default caveman), Ultra (telegraphic), or Wenyan (classical Chinese, even shorter) [citation:1][citation:3]. It also preserves technical accuracy, compressing only surrounding language while keeping code, commands, URLs, paths, and numbers intact [citation:5].
Benchmarks
Real token counts from the Claude API confirm the effectiveness [citation:1][citation:3]:
| Task | Normal | Caveman | Saved |
|---|---|---|---|
| Explain React re-render bug | 1180 | 159 | 87% |
| Fix auth middleware token expiry | 704 | 121 | 83% |
| Set up PostgreSQL connection pool | 2347 | 380 | 84% |
| Explain git rebase vs merge | 702 | 292 | 58% |
| Refactor callback to async/await | 387 | 301 | 22% |
| Architecture: microservices vs monolith | 446 | 310 | 30% |
| Review PR for security issues | 678 | 398 | 41% |
| Docker multi-stage build | 1042 | 290 | 72% |
| Debug PostgreSQL race condition | 1200 | 232 | 81% |
| Implement React error boundary | 3454 | 456 | 87% |
| Average | 1214 | 294 | 65% |
A separate test by 404 Media with Claude Code logged a 5,800-token, 65% saving on one session [citation:5]. Elastic Labs independently measured 63.6% average reduction across eight Elasticsearch scenarios with zero accuracy loss [citation:4].
Other Cost-Saving Considerations
Critics rightly point out that output tokens are often the smaller cost driver. Long input contexts, bloated prompt histories, and agent loops burning tokens in the background do more damage. Structural fixes matter more [citation:4][citation:12]:
- Prompt pruning
- RAG (injecting only relevant data instead of entire databases)
- Small-model routing for intake tasks
- Token caching at roughly 10% of standard input price
However, Caveman remains a useful, cheap intervention worth trying first before downgrading to a smaller model. Prompt engineering is now partly a budget tool, not just a quality tool [citation:9].
Conclusion
The Caveman plugin represents a shift in how developers handle AI costs. Instead of treating pleasantries and apologies as part of the response, developers and companies can rely on the “caveman” approach to save time and money, making AI more efficient and cost-effective. As the use of AI models continues to rise, these tools are expected to become a core part of any developer’s toolkit.
Quick Links
https://github.com/juliusbrussee/caveman
Published in the Artificial Intelligence section – Tools and Cost Optimization
Reactions & Comments
Sign in with GitHub to leave a comment or react. Powered by Giscus (stored in GitHub Discussions)