# AI Integration Patterns for Web Applications
Practical patterns for integrating LLMs into web apps: streaming responses, handling rate limits, cost optimization, and more.
Integrating LLMs into web apps sounds simple until you hit production. Here are the patterns that work.
## Pattern 1: Streaming Responses
Users hate waiting for long responses. Stream them:
```typescript
export async function POST(req: Request) {
  const { prompt } = await req.json();
  const stream = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });
  const reader = stream.toReadableStream();
  return new Response(reader);
}
```
On the frontend:
```typescript
const response = await fetch("/api/chat", { method: "POST", body });
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  // toReadableStream() sends newline-delimited JSON chunks, not raw text
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop() ?? ""; // keep any partial line for the next read
  for (const line of lines) {
    if (!line.trim()) continue;
    const text = JSON.parse(line).choices[0]?.delta?.content ?? "";
    setResponse(prev => prev + text); // update UI in real-time
  }
}
```
Users see responses appearing word-by-word. Way better UX than waiting 5 seconds for the complete response.
## Pattern 2: Fallback Chains
Never trust a single API:
```typescript
async function generateText(prompt: string) {
  try {
    // Try Claude first (best quality)
    return await anthropic.messages.create({...});
  } catch (error) {
    try {
      // Fall back to GPT
      return await openai.chat.completions.create({...});
    } catch (error) {
      // Fall back to local model
      return await localLLM.generate(prompt);
    }
  }
}
```
This prevents outages from taking down your app.
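Two levels of nesting is already awkward; a third provider makes it worse. The same chain flattens into a loop over an ordered provider list (a sketch; the `Provider` shape and `generate` signature here are my own wrappers, not any vendor SDK):

```typescript
type Provider = {
  name: string;
  generate: (prompt: string) => Promise<string>;
};

// Providers in order of preference; each entry wraps one vendor SDK call.
async function generateWithFallback(
  providers: Provider[],
  prompt: string
): Promise<string> {
  const errors: string[] = [];
  for (const provider of providers) {
    try {
      return await provider.generate(prompt);
    } catch (err) {
      // Record the failure and move on to the next provider
      errors.push(`${provider.name}: ${(err as Error).message}`);
    }
  }
  throw new Error(`All providers failed: ${errors.join("; ")}`);
}
```

Adding a fourth provider is now one array entry instead of another nesting level, and the final error tells you why every link in the chain failed.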
## Pattern 3: Cost Optimization
LLM costs add up fast. Strategies:
**Use cheaper models for low-risk tasks:**
```typescript
// High-stakes: use GPT-4
const response = await gpt4(prompt);

// Low-stakes: use gpt-3.5-turbo (1/10th the cost)
const summary = await gpt35(prompt);
```
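If the split matters in more than one place, it helps to centralize it in a small routing table keyed by task type (task names and model picks below are illustrative):

```typescript
type Task = "chat" | "summarize" | "classify";

// Illustrative routing table: pay for the expensive model only where quality matters
const MODEL_FOR_TASK: Record<Task, string> = {
  chat: "gpt-4",             // user-facing, high stakes
  summarize: "gpt-3.5-turbo",
  classify: "gpt-3.5-turbo",
};

function modelFor(task: Task): string {
  return MODEL_FOR_TASK[task];
}
```

One table to audit when prices or models change, instead of model names scattered across call sites.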
**Batch similar requests:**
```typescript
// Bad: 100 separate API calls
for (const item of items) {
  await openai.chat.completions.create({ messages: [item] });
}

// Good: batch them (sketch only; the real OpenAI Batch API
// takes an uploaded .jsonl input file, not inline requests)
const responses = await openai.batch.create({
  requests: items.map(item => ({ messages: [item] }))
});
```
**Cache responses:**
```typescript
const cache = new Map();

async function askLLM(question: string) {
  if (cache.has(question)) {
    return cache.get(question);
  }
  const response = await openai.chat.completions.create({...});
  cache.set(question, response);
  return response;
}
```
For DocMind, caching reduced API costs by 40%.
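The `Map` above never evicts anything and misses prompts that differ only in whitespace or case. A sketch with key normalization and TTL expiry (the one-hour TTL is arbitrary; tune it to how stale an answer can be):

```typescript
type CacheEntry = { value: string; expiresAt: number };

const TTL_MS = 60 * 60 * 1000; // 1 hour; adjust per use case
const cache = new Map<string, CacheEntry>();

// Normalize so trivial whitespace/case differences still hit the cache
function cacheKey(model: string, question: string): string {
  return `${model}:${question.trim().toLowerCase()}`;
}

function getCached(model: string, question: string): string | undefined {
  const key = cacheKey(model, question);
  const entry = cache.get(key);
  if (!entry) return undefined;
  if (Date.now() > entry.expiresAt) {
    cache.delete(key); // expired; evict and treat as a miss
    return undefined;
  }
  return entry.value;
}

function setCached(model: string, question: string, value: string): void {
  cache.set(cacheKey(model, question), {
    value,
    expiresAt: Date.now() + TTL_MS,
  });
}
```

For multi-instance deployments the same shape moves to Redis with the TTL set on the key.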
## Pattern 4: Rate Limiting
APIs have rate limits. Handle them gracefully:
```typescript
import pLimit from "p-limit";

const limit = pLimit(5); // 5 concurrent requests

const promises = items.map(item =>
  limit(() => openai.chat.completions.create({...}))
);
await Promise.all(promises);
```
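If you'd rather skip the dependency, a minimal concurrency gate is only a few lines (a sketch: no queue fairness or cancellation, just a cap on in-flight calls):

```typescript
function createLimiter(max: number) {
  let active = 0;
  const waiting: (() => void)[] = [];

  return async function limit<T>(fn: () => Promise<T>): Promise<T> {
    // Re-check after every wake-up so the cap holds under races
    while (active >= max) {
      await new Promise<void>(resolve => waiting.push(resolve));
    }
    active++;
    try {
      return await fn();
    } finally {
      active--;
      waiting.shift()?.(); // wake the next waiter, if any
    }
  };
}
```

Usage mirrors p-limit: `const limit = createLimiter(5); await Promise.all(items.map(i => limit(() => callApi(i))));`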
Also implement exponential backoff:
```typescript
async function retryWithBackoff<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxRetries - 1) throw error;
      const delay = Math.pow(2, attempt) * 1000; // 1s, 2s, 4s
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw new Error("unreachable"); // satisfies the type checker
}
```
## Pattern 5: Token Counting
Track tokens before sending requests:
```typescript
import { encodingForModel } from "js-tiktoken";

const enc = encodingForModel("gpt-4");
const tokens = enc.encode(prompt).length;

if (tokens > 4000) {
  console.warn("Prompt too long, truncating...");
}
```
Prevents surprise $100 bills from long prompts.
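Warning is a start; actually trimming to a token budget is barely more work. This sketch is tokenizer-agnostic (the `Tokenizer` type is mine) so it can be tested without the library, but the object js-tiktoken's `encodingForModel` returns fits the shape:

```typescript
type Tokenizer = {
  encode: (text: string) => number[];
  decode: (tokens: number[]) => string;
};

// Keep at most maxTokens tokens, dropping the tail of the prompt
function truncateToTokens(tok: Tokenizer, text: string, maxTokens: number): string {
  const tokens = tok.encode(text);
  if (tokens.length <= maxTokens) return text;
  return tok.decode(tokens.slice(0, maxTokens));
}
```

For chat history it's usually better to drop whole oldest messages than to cut mid-sentence, but the same count-then-slice idea applies.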
## Pattern 6: Safety Rails
Not all AI output is safe. Add guards:
```typescript
// Check for dangerous content
const toxicity = await perspective.analyze(response);
if (toxicity.score > 0.7) {
  return "I can't help with that.";
}

// Check for hallucinations (for RAG)
if (!response.includes(citation)) {
  return "I don't have this information.";
}

// Check for PII leaks
const hasPII = /\d{3}-\d{2}-\d{4}/.test(response); // SSN pattern
if (hasPII) {
  return "Response redacted for privacy.";
}
```
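Refusing the whole response is heavy-handed when you can redact just the matches. A sketch with a couple of patterns (these regexes are simplistic illustrations, not production-grade PII detection):

```typescript
// Simplistic illustrative patterns; real PII detection needs a dedicated library
const PII_PATTERNS: [RegExp, string][] = [
  [/\b\d{3}-\d{2}-\d{4}\b/g, "[SSN]"],
  [/\b[\w.+-]+@[\w-]+\.[\w.]+\b/g, "[EMAIL]"],
];

function redactPII(text: string): string {
  let out = text;
  for (const [pattern, label] of PII_PATTERNS) {
    out = out.replace(pattern, label);
  }
  return out;
}
```

The user still gets the useful part of the answer, and the label tells them something was removed rather than silently dropping it.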
## Pattern 7: Observability
Track everything:
```typescript
logger.info({
  event: "llm_request",
  model: "gpt-4",
  tokens_used: 234,
  latency_ms: 1500,
  cost_cents: 12,
  error: null
});
```
This data is gold:
- Which models are slow?
- Which requests cost the most?
- What's failing?
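To make `cost_cents` trustworthy, compute it from the token usage the API reports and a per-model price table instead of eyeballing it (prices below are placeholders; check your vendor's current price sheet):

```typescript
// Placeholder prices in cents per 1K tokens; keep in sync with vendor pricing
const PRICE_CENTS_PER_1K: Record<string, { input: number; output: number }> = {
  "gpt-4": { input: 3, output: 6 },
  "gpt-3.5-turbo": { input: 0.05, output: 0.15 },
};

function costCents(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICE_CENTS_PER_1K[model];
  if (!price) throw new Error(`No pricing for model: ${model}`);
  return (
    (inputTokens / 1000) * price.input +
    (outputTokens / 1000) * price.output
  );
}
```

Failing loudly on an unknown model is deliberate: a silent zero would quietly corrupt every cost dashboard built on these logs.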
## When NOT to Use LLMs
### Small, deterministic tasks
If you can write if/else logic, do it. LLMs are overkill and expensive.
### Real-time latency-critical apps
User hovers over a button, needs result in <50ms. LLMs take 500ms+. Use a rule engine.
### Zero-tolerance for errors
Medical diagnosis, legal contracts. LLMs hallucinate. Use expert systems or ask a human.
## Conclusion
LLM integration is becoming standard. The difference between good and bad implementations is usually observability and fallbacks. Know your costs, cache aggressively, and always have a plan B.