Real-world lessons from architecting and shipping MCP-based agentic AI workflows at scale — pitfalls, patterns, and performance tips from the frontlines at Intuit.
When Anthropic released the Model Context Protocol (MCP) spec in late 2024, it quietly shifted how we think about AI integration in production systems. Instead of ad-hoc API wrappers, MCP gives you a structured, typed interface between an AI model and the tools it can use. At Intuit, I was one of the first engineers tasked with taking it from proof of concept to production, and the experience taught me a lot.
Our MCP server for QuickBooks content generation sits between the AI model and a set of internal CMS APIs. Here's the simplified flow we settled on after several iterations of design and real-world testing:
```typescript
// mcp-server/src/index.ts
import { MCPServer } from '@anthropic/mcp-sdk';

const server = new MCPServer({
  name: 'quickbooks-content-agent',
  version: '1.0.0',
});

server.addTool('generate_content', {
  description: 'Generate QB help article from a topic',
  inputSchema: {
    type: 'object',
    properties: {
      topic: { type: 'string' },
      audience: { type: 'string' },
      tone: { type: 'string', enum: ['formal', 'conversational'] },
    },
    required: ['topic', 'audience'],
  },
  handler: async (input) => {
    const draft = await callInternalCMS(input);
    return { content: draft };
  },
});

server.listen(3000);
console.log('MCP Server running on port 3000');
```
The single biggest reliability improvement came from strict JSON Schema validation on every tool input and output. Early in development, the AI would occasionally pass malformed inputs that crashed our downstream CMS APIs. Adding Zod validation at the MCP boundary reduced these errors to zero overnight.
```typescript
import { z } from 'zod';

const GenerateInputSchema = z.object({
  topic: z.string().min(3).max(200),
  audience: z.enum(['small-business', 'accountant', 'enterprise']),
  tone: z.enum(['formal', 'conversational']).default('conversational'),
});

// Validate before passing to the handler
const parsed = GenerateInputSchema.safeParse(rawInput);
if (!parsed.success) {
  throw new MCPToolError('Invalid input', parsed.error);
}
```
AI agents can be surprisingly aggressive about calling tools in loops. Without rate limiting, a single runaway agent session cost us $80 in LLM API calls during QA testing. We added a token-bucket rate limiter per session and a hard cap on tool call depth.
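The per-session limiter can be sketched as a small token bucket. The class and method names below are illustrative, not from our production code; the depth cap is just a counter checked alongside it.

```typescript
// Token-bucket limiter per session: refills `ratePerSec` tokens per second,
// up to `capacity` (the maximum burst of tool calls a session can make).
class SessionRateLimiter {
  private buckets = new Map<string, { tokens: number; last: number }>();

  constructor(
    private capacity: number,   // max burst of tool calls
    private ratePerSec: number  // sustained tool calls per second
  ) {}

  tryAcquire(sessionId: string, now = Date.now()): boolean {
    const b = this.buckets.get(sessionId) ?? { tokens: this.capacity, last: now };
    // Refill proportionally to elapsed time, capped at capacity
    b.tokens = Math.min(
      this.capacity,
      b.tokens + ((now - b.last) / 1000) * this.ratePerSec
    );
    b.last = now;
    if (b.tokens < 1) {
      this.buckets.set(sessionId, b);
      return false; // budget exhausted — reject this tool call
    }
    b.tokens -= 1;
    this.buckets.set(sessionId, b);
    return true;
  }
}
```

When `tryAcquire` returns false, the server surfaces a retryable error to the agent instead of executing the tool, which stops runaway loops without killing the session.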
Traditional request logging doesn't cut it for agentic workflows. We built a custom trace logger that records the full tool call chain per session — input, output, latency, token count, and model version. This was invaluable for debugging non-deterministic failures that happened only in production.
```typescript
// trace-logger.ts — Full session observability
interface ToolCallTrace {
  sessionId: string;
  toolName: string;
  input: unknown;
  output: unknown;
  latencyMs: number;
  tokenCount: number;
  model: string;
  timestamp: string;
  error?: string;
}

// The tool's output isn't known until fn() resolves, so the wrapper captures
// it here rather than requiring the caller to supply it up front.
export async function withTrace<T>(
  meta: Omit<ToolCallTrace, 'output' | 'latencyMs' | 'timestamp'>,
  fn: () => Promise<T>
): Promise<T> {
  const start = Date.now();
  try {
    const result = await fn();
    await logTrace({
      ...meta,
      output: result,
      latencyMs: Date.now() - start,
      timestamp: new Date().toISOString(),
    });
    return result;
  } catch (err) {
    await logTrace({
      ...meta,
      output: undefined,
      error: String(err),
      latencyMs: Date.now() - start,
      timestamp: new Date().toISOString(),
    });
    throw err;
  }
}
```
Our MCP server sits in the critical path for content authoring. When the AI model is unavailable or rate-limited, we fall back to a template-based generation system so authors are never blocked. Never make an AI agent a single point of failure in a production workflow.
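A minimal sketch of that fallback path. `generateWithModel` and `generateFromTemplate` are stand-ins for our internal functions, not real APIs; here the model call is stubbed to fail so the template path is exercised.

```typescript
// Stub: simulates the AI path being unavailable or rate-limited.
async function generateWithModel(input: { topic: string }): Promise<string> {
  throw new Error('model rate-limited');
}

// Deterministic template fallback — always produces a usable draft skeleton.
function generateFromTemplate(input: { topic: string }): string {
  return `# ${input.topic}\n\n(Template draft — author to fill in details.)`;
}

// Authors are never blocked: if the AI path fails for any reason,
// fall through to the template generator.
async function generateDraft(input: { topic: string }): Promise<string> {
  try {
    return await generateWithModel(input);
  } catch {
    return generateFromTemplate(input);
  }
}
```

In production the catch branch also distinguishes timeouts from hard errors, but the shape is the same: the AI is an accelerator, not a dependency.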
We made the mistake early on of hardcoding prompts inside the MCP handler functions. When we needed to update a prompt, it required a full deployment. We moved all prompts to a versioned config store — now prompt updates are zero-downtime config changes, and we can A/B test prompt variants with zero code changes.
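The store itself can be very simple. This is a hypothetical in-memory sketch (our real store is a backed service); the `PromptStore` name and `{{variable}}` template syntax are illustrative assumptions.

```typescript
interface PromptRecord {
  id: string;
  version: number;
  template: string; // uses {{variable}} placeholders
}

class PromptStore {
  private prompts = new Map<string, PromptRecord>();

  // Only a strictly newer version replaces the active prompt,
  // so publishing an update is a zero-downtime config change.
  put(rec: PromptRecord): void {
    const current = this.prompts.get(rec.id);
    if (!current || rec.version > current.version) {
      this.prompts.set(rec.id, rec);
    }
  }

  // Render the active version with the given variables.
  render(id: string, vars: Record<string, string>): string {
    const rec = this.prompts.get(id);
    if (!rec) throw new Error(`Unknown prompt: ${id}`);
    return rec.template.replace(/\{\{(\w+)\}\}/g, (_, k) => vars[k] ?? '');
  }
}
```

A/B testing falls out naturally: keep two prompt ids (or a variant field) live and pick one per session at render time.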
After 6 months in production, the QuickBooks Content Agent has generated over 4,000 draft articles, reduced average authoring time from 45 minutes to 30 minutes per article (a 33% reduction), and achieved a 94% author satisfaction rate on first drafts — surpassing our original 30% effort reduction target.