On LLM Task-Specification
March 13, 2025
Smarter prompts can cut cost and boost accuracy
Rule of thumb: whenever a prompt feels like “too much to swallow at once,” split the task—then check every few months whether new models make recombining worthwhile.
When accuracy is mission-critical, many teams instinctively reach for a larger model. Yet model size is only one lever on the cost-accuracy tradeoff curve. An equally powerful (and often cheaper) lever is task-specification: deciding how much work you hand the model at one time.
Example: Refactoring a Codebase
It is easy to find cases where LLMs fail. On a personal note, I decided to refactor the code of my old game, Cell Division (https://mattmotoki.github.io/cell-division/). It was written in plain JavaScript, CSS, and HTML, and I wanted to port it to React. A one-shot prompt failed. Instead, I used Cursor to write tests and break the existing codebase into smaller chunks: the board, the game, the AI, and so on. Then I asked it for a bare-bones implementation of the game; we implemented the scoring mechanics next, followed by the AI. Breaking the problem into smaller chunks made the refactor possible.
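The staged workflow above can be sketched as a simple loop: one model call per chunk, with the test suite as a checkpoint between stages. Everything here is a hypothetical stand-in (`ask_llm`, `run_tests`, and the stage descriptions are illustrative, not the exact prompts I used):

```python
# Stage-wise refactor loop: one prompt per chunk, tests gate each step.

STAGES = [
    "Port the board rendering to a React component.",
    "Port the core game loop and state management.",
    "Port the scoring mechanics.",
    "Port the AI opponent.",
]

def ask_llm(prompt: str) -> str:
    """Stand-in for a real model call (e.g. via an IDE agent)."""
    return f"<patch for: {prompt}>"

def run_tests(patch: str) -> bool:
    """Stand-in for the project's test suite; always passes in this sketch."""
    return True

def refactor_in_stages(stages):
    applied = []
    for stage in stages:
        patch = ask_llm(stage)
        if not run_tests(patch):
            # Stop early rather than stacking untested changes.
            raise RuntimeError(f"Tests failed after: {stage}")
        applied.append(patch)
    return applied

patches = refactor_in_stages(STAGES)
```

The point of the structure is the checkpoint: each stage is small enough for the model to handle, and a failing test stops the loop before errors compound.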
Example: Multi-level Product Extraction
While building a Hawaiʻi coffee directory I faced thousands of products across dozens of businesses, each formatted differently. One giant prompt crashed into context limits and hallucinations. A three-tier workflow solved it.
| Level | What the prompt asked | Context data | # API calls | Tokens/call |
|---|---|---|---|---|
| 1. Zero-shot grouping | “Classify each product into broad buckets: coffee, food, subscriptions, experiences …” | [fill] | [fill] | [fill] |
| 2. Generic extraction | “For products in the same bucket, return name, price, short description.” | [fill] | [fill] | [fill] |
| 3. Bucket-specific details | If bucket = coffee → “Add roast level, bean type (whole / ground / pod), origin, tasting notes.” | [fill] | [fill] | [fill] |
The same lightweight Gemini 2.0 Flash-Lite model handled every tier. I could have broken Level 3 into separate prompts for each attribute, but a single call extracted all five features just fine; extra decomposition would have added orchestration overhead without improving accuracy.
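A minimal sketch of the tiered routing logic, assuming a hypothetical `call_model` wrapper around the model API (the bucket names and prompts are illustrative, and Tier 2 is stubbed out):

```python
import json

def call_model(prompt: str, data: str) -> str:
    """Stand-in for a real model API call; fakes a Tier-1 answer."""
    return json.dumps({"bucket": "coffee"})

def classify(product_text: str) -> str:
    # Tier 1: zero-shot grouping into a broad bucket.
    out = call_model(
        "Classify this product into one bucket: "
        "coffee, food, subscriptions, experiences.",
        product_text,
    )
    return json.loads(out)["bucket"]

# Tier 3 prompts keyed by the Tier-1 bucket; only coffee is shown here.
DETAIL_PROMPTS = {
    "coffee": "Add roast level, bean type (whole / ground / pod), "
              "origin, tasting notes.",
}

def extract(product_text: str) -> dict:
    bucket = classify(product_text)  # Tier 1
    record = {"bucket": bucket}
    # Tier 2 (generic name/price/description extraction) would run here;
    # it is omitted in this sketch.
    # Tier 3: attach the bucket-specific prompt only when one exists.
    if bucket in DETAIL_PROMPTS:
        record["detail_prompt"] = DETAIL_PROMPTS[bucket]
    return record
```

The routing table is the key design choice: Tier 1 output selects which Tier 3 prompt runs, so each call sees only the context it needs.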
Why does this work? Smaller scopes trim irrelevant tokens, letting a modest model focus on the essentials—a dynamic that e-commerce teams have used to cut token bills by half while raising exact-match accuracy.
The Task Specification Tradeoff
- Higher signal-to-noise – fewer distractor tokens per decision.
- Right-sizing – mid-size models often outperform frontier models when the task is narrow.
- Parallelism – independent subtasks can run concurrently.
- Cost proportionality – you pay only for the capabilities you invoke.
But splitting is not free. Each extra hop adds glue code, latency, and failure modes. Both practice and literature show a U-shaped curve: too little decomposition wastes capacity; too much drowns you in orchestration overhead. The sweet spot is usually “a handful of coherent subtasks,” not dozens.
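The parallelism benefit falls out of the standard library once subtasks are independent; a sketch, where `summarize_chunk` is a hypothetical stand-in for one I/O-bound model call per subtask:

```python
from concurrent.futures import ThreadPoolExecutor

def summarize_chunk(chunk: str) -> str:
    """Stand-in for an I/O-bound model call on one subtask."""
    return chunk.upper()

chunks = ["board", "game loop", "scoring", "ai"]

# Fan the independent subtasks out across threads; map preserves order,
# so results[i] corresponds to chunks[i].
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(summarize_chunk, chunks))
```

Threads are enough here because model calls are I/O-bound; the orchestration cost the paragraph above warns about is exactly this glue code, which grows with every extra split.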
Conclusion
Model capabilities shift fast. Boundaries that felt necessary last quarter may already be obsolete. Keep a lightweight evaluation loop:
- Re-benchmark a monolith whenever a new model arrives.
- Measure task affinity – do grouped subtasks still help, or hurt?
- Re-split only where the data prove the benefit remains.
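The evaluation loop above can be kept as small as a single comparison on a fixed eval set. A sketch, assuming hypothetical `run_monolith` / `run_pipeline` wrappers (the stubs below just make the sketch runnable):

```python
EVAL_SET = [("kona coffee", "coffee"), ("farm tour", "experiences")]

def run_monolith(text: str) -> str:
    """Stand-in: one big prompt handling the task end to end."""
    return "coffee" if "coffee" in text else "experiences"

def run_pipeline(text: str) -> str:
    """Stand-in: the decomposed multi-call pipeline."""
    return "coffee" if "coffee" in text else "experiences"

def accuracy(run, eval_set):
    return sum(run(x) == y for x, y in eval_set) / len(eval_set)

def choose_design(eval_set):
    mono = accuracy(run_monolith, eval_set)
    split = accuracy(run_pipeline, eval_set)
    # Prefer the monolith on a tie: fewer hops, less glue code.
    return "monolith" if mono >= split else "pipeline"
```

Re-running `choose_design` whenever a new model ships is the whole loop: if the monolith has caught up, recombine and delete the orchestration code.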
Take-away
Task-specification is a controllable design choice. Before paying for a larger model, test whether a smarter breakdown—or a newly capable monolith—delivers better accuracy for less money. The cheapest, most reliable solution is often in the spec, not the size.