Tool calling in production: patterns that work and mistakes to avoid
Practical patterns for building reliable tool-calling agents with Claude, GPT-4o, and Gemini — based on what actually holds up at scale.
Tool calling (function calling) is now supported across all major frontier models, but reliability, latency, and correctness still vary significantly in practice. Here's what consistently works.
Define narrow, composable tools
Tools with fewer than 5 parameters and a single clear purpose call more reliably than broad multi-parameter tools. Instead of a `query_database(sql, format, max_rows, timeout, explain)` tool, prefer `run_sql(query)` and `get_schema(table)` separately. Models make fewer mistakes when each tool has one job.
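As a concrete illustration, the broad tool from above can be split into two single-purpose schemas. These follow the OpenAI-style function schema format; the tool names come from the text, but the descriptions are illustrative:

```python
# Two narrow tools instead of one broad query_database(sql, format,
# max_rows, timeout, explain) tool. Each schema has exactly one
# required parameter and one clear job.
run_sql = {
    "type": "function",
    "function": {
        "name": "run_sql",
        "description": "Execute a read-only SQL query and return rows as JSON.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "SQL query to run."}
            },
            "required": ["query"],
        },
    },
}

get_schema = {
    "type": "function",
    "function": {
        "name": "get_schema",
        "description": "Return column names and types for one table.",
        "parameters": {
            "type": "object",
            "properties": {
                "table": {"type": "string", "description": "Table name."}
            },
            "required": ["table"],
        },
    },
}

tools = [run_sql, get_schema]
```

Note that each schema lists exactly one property in `required`; the model never has to guess which of five optional parameters matter.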
Parallel tool calling
GPT-4o, Claude, and Gemini all support requesting multiple tool calls in a single model response. Enable this — it's often the difference between a 5-second and a 15-second agentic turn when the model needs to gather data from multiple sources.
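On the application side, the speedup only materializes if you actually execute the requested calls concurrently rather than one by one. A minimal sketch, assuming a hypothetical local registry of tool functions and the OpenAI-style `(id, name, arguments)` shape for each requested call:

```python
import concurrent.futures
import json

# Hypothetical registry mapping tool names to local Python functions.
TOOLS = {
    "get_weather": lambda city: {"city": city, "temp_c": 21},
    "get_stock": lambda ticker: {"ticker": ticker, "price": 187.0},
}

def execute_parallel(tool_calls):
    """Run every tool call from one model turn concurrently and return
    results in the original request order, keyed by tool_call_id."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(TOOLS[c["name"]], **json.loads(c["arguments"]))
            for c in tool_calls
        ]
        return [
            {"tool_call_id": c["id"], "content": json.dumps(f.result())}
            for c, f in zip(tool_calls, futures)
        ]

calls = [
    {"id": "call_1", "name": "get_weather", "arguments": '{"city": "Oslo"}'},
    {"id": "call_2", "name": "get_stock", "arguments": '{"ticker": "ACME"}'},
]
results = execute_parallel(calls)
```

Returning results in request order keeps the tool-result messages aligned with the model's tool-call IDs, which all three providers require.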
Validation and retry
Always validate tool outputs before returning them to the model. If a tool call returns malformed data, catching it at the boundary and returning a structured error is more reliable than letting the model try to interpret it. Claude and GPT-4o handle tool errors well when the error message is specific.
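A sketch of that boundary check, assuming a hypothetical contract where well-formed tool output is a JSON object containing a `rows` field; the exact validation rules would depend on your tools:

```python
import json

def safe_tool_result(raw: str) -> dict:
    """Validate raw tool output before it goes back to the model.
    On failure, return a specific structured error rather than the
    malformed payload itself."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return {
            "is_error": True,
            "content": f"Tool returned invalid JSON at position {e.pos}: {e.msg}. "
                       "Retry the call or try a different tool.",
        }
    if not isinstance(data, dict) or "rows" not in data:
        # 'rows' is a hypothetical required field for this example.
        return {
            "is_error": True,
            "content": "Tool output is missing the required 'rows' field.",
        }
    return {"is_error": False, "content": json.dumps(data)}
```

The key detail is that the error string names the exact problem (position, missing field); a bare "tool failed" gives the model nothing to act on.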
Structured outputs for reliability
When you need the model to emit deterministic JSON after consuming tool results, use structured outputs (`response_format` with a JSON schema) rather than prompting for JSON in the system message. The reliability gap is significant — structured outputs virtually eliminate JSON parse errors.
Model-specific notes
Claude's tool calling is more conservative — it tends to ask clarifying questions rather than guess parameters, which is better for interactive agents but can be frustrating in automated pipelines. Use `tool_choice: any` to force tool use when you know it's always appropriate. GPT-4o defaults to more aggressive parallel tool calls; set `parallel_tool_calls: false` if ordering matters.
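The two request tweaks above can be sketched as SDK keyword arguments. The field names match the Anthropic and OpenAI Python SDKs respectively; everything else is illustrative:

```python
# Anthropic: force Claude to call at least one tool instead of
# asking a clarifying question (Anthropic's tool_choice format).
anthropic_kwargs = {
    "tool_choice": {"type": "any"},
}

# OpenAI: disable parallel tool calls when execution order matters.
openai_kwargs = {
    "parallel_tool_calls": False,
}
```

Note the formats differ: Anthropic's `tool_choice` is an object with a `type` field, while OpenAI's `parallel_tool_calls` is a plain boolean on the request.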