# AI Integration Patterns for Web Applications
Practical patterns for integrating LLMs into web apps: streaming responses, handling rate limits, cost optimization, and more.
Integrating LLMs into web apps sounds simple until you hit production. Here are the patterns that work.
## Pattern 1: Streaming Responses
Users hate waiting for long responses. Stream them:
```typescript
export async function POST(req: Request) {
  const { prompt } = await req.json();
  const stream = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });
  const reader = stream.toReadableStream();
  return new Response(reader);
}
```
On the frontend:
```typescript
const response = await fetch("/api/chat", { method: "POST", body });
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  // toReadableStream() sends newline-delimited JSON chunks, not raw text
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop() ?? ""; // keep any partial line for the next read
  for (const line of lines) {
    if (!line.trim()) continue;
    const text = JSON.parse(line).choices[0]?.delta?.content ?? "";
    setResponse(prev => prev + text); // update UI in real-time
  }
}
```
Users see responses appearing word-by-word. Way better UX than waiting 5 seconds for the complete response.
## Pattern 2: Fallback Chains
Never trust a single API:
```typescript
async function generateText(prompt: string) {
  try {
    // Try Claude first (best quality)
    return await anthropic.messages.create({...});
  } catch (error) {
    try {
      // Fall back to GPT
      return await openai.chat.completions.create({...});
    } catch (error) {
      // Fall back to local model
      return await localLLM.generate(prompt);
    }
  }
}
```
This prevents outages from taking down your app.
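Two levels of nesting is already awkward; a third provider makes it worse. The same chain flattens into a loop over an ordered provider list (a sketch; the `Provider` shape and `generate` signature here are my own wrappers, not any vendor SDK):

```typescript
type Provider = {
  name: string;
  generate: (prompt: string) => Promise<string>;
};

// Providers in order of preference; each entry wraps one vendor SDK call.
async function generateWithFallback(
  providers: Provider[],
  prompt: string
): Promise<string> {
  const errors: string[] = [];
  for (const provider of providers) {
    try {
      return await provider.generate(prompt);
    } catch (err) {
      // Record the failure and move on to the next provider
      errors.push(`${provider.name}: ${(err as Error).message}`);
    }
  }
  throw new Error(`All providers failed: ${errors.join("; ")}`);
}
```

Adding a fourth provider is now one array entry instead of another nesting level, and the final error tells you why every link in the chain failed.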
## Pattern 3: Cost Optimization
LLM costs add up fast. Strategies:
**Use cheaper models for low-risk tasks:**
```typescript
// High-stakes: use GPT-4
const response = await gpt4(prompt);

// Low-stakes: use gpt-3.5-turbo (1/10th the cost)
const summary = await gpt35(prompt);
```
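If the split matters in more than one place, it helps to centralize it in a small routing table keyed by task type (task names and model picks below are illustrative):

```typescript
type Task = "chat" | "summarize" | "classify";

// Illustrative routing table: pay for the expensive model only where quality matters
const MODEL_FOR_TASK: Record<Task, string> = {
  chat: "gpt-4",             // user-facing, high stakes
  summarize: "gpt-3.5-turbo",
  classify: "gpt-3.5-turbo",
};

function modelFor(task: Task): string {
  return MODEL_FOR_TASK[task];
}
```

One table to audit when prices or models change, instead of model names scattered across call sites.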
**Batch similar requests:**
```typescript
// Bad: 100 separate API calls
for (const item of items) {
  await openai.chat.completions.create({ messages: [item] });
}

// Good: batch them (sketch only; the real OpenAI Batch API
// takes an uploaded .jsonl input file, not inline requests)
const responses = await openai.batch.create({
  requests: items.map(item => ({ messages: [item] }))
});
```
**Cache responses:**
```typescript
const cache = new Map();

async function askLLM(question: string) {
  if (cache.has(question)) {
    return cache.get(question);
  }
  const response = await openai.chat.completions.create({...});
  cache.set(question, response);
  return response;
}
```
For DocMind, caching reduced API costs by 40%.
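The `Map` above never evicts anything and misses prompts that differ only in whitespace or case. A sketch with key normalization and TTL expiry (the one-hour TTL is arbitrary; tune it to how stale an answer can be):

```typescript
type CacheEntry = { value: string; expiresAt: number };

const TTL_MS = 60 * 60 * 1000; // 1 hour; adjust per use case
const cache = new Map<string, CacheEntry>();

// Normalize so trivial whitespace/case differences still hit the cache
function cacheKey(model: string, question: string): string {
  return `${model}:${question.trim().toLowerCase()}`;
}

function getCached(model: string, question: string): string | undefined {
  const key = cacheKey(model, question);
  const entry = cache.get(key);
  if (!entry) return undefined;
  if (Date.now() > entry.expiresAt) {
    cache.delete(key); // expired; evict and treat as a miss
    return undefined;
  }
  return entry.value;
}

function setCached(model: string, question: string, value: string): void {
  cache.set(cacheKey(model, question), {
    value,
    expiresAt: Date.now() + TTL_MS,
  });
}
```

For multi-instance deployments the same shape moves to Redis with the TTL set on the key.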
## Pattern 4: Rate Limiting
APIs have rate limits. Handle them gracefully:
```typescript
import pLimit from "p-limit";

const limit = pLimit(5); // 5 concurrent requests

const promises = items.map(item =>
  limit(() => openai.chat.completions.create({...}))
);
await Promise.all(promises);
```
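If you'd rather skip the dependency, a minimal concurrency gate is only a few lines (a sketch: no queue fairness or cancellation, just a cap on in-flight calls):

```typescript
function createLimiter(max: number) {
  let active = 0;
  const waiting: (() => void)[] = [];

  return async function limit<T>(fn: () => Promise<T>): Promise<T> {
    // Re-check after every wake-up so the cap holds under races
    while (active >= max) {
      await new Promise<void>(resolve => waiting.push(resolve));
    }
    active++;
    try {
      return await fn();
    } finally {
      active--;
      waiting.shift()?.(); // wake the next waiter, if any
    }
  };
}
```

Usage mirrors p-limit: `const limit = createLimiter(5); await Promise.all(items.map(i => limit(() => callApi(i))));`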
Also implement exponential backoff:
```typescript
async function retryWithBackoff<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxRetries - 1) throw error;
      const delay = Math.pow(2, attempt) * 1000; // 1s, 2s, 4s
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw new Error("unreachable"); // satisfies the type checker
}
```
## Pattern 5: Token Counting
Track tokens before sending requests:
```typescript
import { encodingForModel } from "js-tiktoken";

const enc = encodingForModel("gpt-4");
const tokens = enc.encode(prompt).length;

if (tokens > 4000) {
  console.warn("Prompt too long, truncating...");
}
```
Prevents surprise $100 bills from long prompts.
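Warning is a start; actually trimming to a token budget is barely more work. This sketch is tokenizer-agnostic (the `Tokenizer` type is mine) so it can be tested without the library, but the object js-tiktoken's `encodingForModel` returns fits the shape:

```typescript
type Tokenizer = {
  encode: (text: string) => number[];
  decode: (tokens: number[]) => string;
};

// Keep at most maxTokens tokens, dropping the tail of the prompt
function truncateToTokens(tok: Tokenizer, text: string, maxTokens: number): string {
  const tokens = tok.encode(text);
  if (tokens.length <= maxTokens) return text;
  return tok.decode(tokens.slice(0, maxTokens));
}
```

For chat history it's usually better to drop whole oldest messages than to cut mid-sentence, but the same count-then-slice idea applies.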
## Pattern 6: Safety Rails
Not all AI output is safe. Add guards:
```typescript
// Check for dangerous content
const toxicity = await perspective.analyze(response);
if (toxicity.score > 0.7) {
  return "I can't help with that.";
}

// Check for hallucinations (for RAG)
if (!response.includes(citation)) {
  return "I don't have this information.";
}

// Check for PII leaks
const hasPII = /\d{3}-\d{2}-\d{4}/.test(response); // SSN pattern
if (hasPII) {
  return "Response redacted for privacy.";
}
```
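Refusing the whole response is heavy-handed when you can redact just the matches. A sketch with a couple of patterns (these regexes are simplistic illustrations, not production-grade PII detection):

```typescript
// Simplistic illustrative patterns; real PII detection needs a dedicated library
const PII_PATTERNS: [RegExp, string][] = [
  [/\b\d{3}-\d{2}-\d{4}\b/g, "[SSN]"],
  [/\b[\w.+-]+@[\w-]+\.[\w.]+\b/g, "[EMAIL]"],
];

function redactPII(text: string): string {
  let out = text;
  for (const [pattern, label] of PII_PATTERNS) {
    out = out.replace(pattern, label);
  }
  return out;
}
```

The user still gets the useful part of the answer, and the label tells them something was removed rather than silently dropping it.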
## Pattern 7: Observability
Track everything:
```typescript
logger.info({
  event: "llm_request",
  model: "gpt-4",
  tokens_used: 234,
  latency_ms: 1500,
  cost_cents: 12,
  error: null
});
```
This data is gold:
- Which models are slow?
- Which requests cost the most?
- What's failing?
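To make `cost_cents` trustworthy, compute it from the token usage the API reports and a per-model price table instead of eyeballing it (prices below are placeholders; check your vendor's current price sheet):

```typescript
// Placeholder prices in cents per 1K tokens; keep in sync with vendor pricing
const PRICE_CENTS_PER_1K: Record<string, { input: number; output: number }> = {
  "gpt-4": { input: 3, output: 6 },
  "gpt-3.5-turbo": { input: 0.05, output: 0.15 },
};

function costCents(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICE_CENTS_PER_1K[model];
  if (!price) throw new Error(`No pricing for model: ${model}`);
  return (
    (inputTokens / 1000) * price.input +
    (outputTokens / 1000) * price.output
  );
}
```

Failing loudly on an unknown model is deliberate: a silent zero would quietly corrupt every cost dashboard built on these logs.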
## When NOT to Use LLMs
### Small, deterministic tasks
If you can write if/else logic, do it. LLMs are overkill and expensive.
### Real-time latency-critical apps
User hovers over a button, needs result in <50ms. LLMs take 500ms+. Use a rule engine.
### Zero-tolerance for errors
Medical diagnosis, legal contracts. LLMs hallucinate. Use expert systems or ask a human.
## Conclusion
LLM integration is becoming standard. The difference between good and bad implementations is usually observability and fallbacks. Know your costs, cache aggressively, and always have a plan B.