
AI Integration Patterns for Web Applications

Practical patterns for integrating LLMs into web apps: streaming responses, handling rate limits, cost optimization, and more.

11 min read
Updated Mar 1, 2026

Integrating LLMs into web apps sounds simple until you hit production. Here are the patterns that work.

## Pattern 1: Streaming Responses

Users hate waiting for long responses. Stream them:

import OpenAI from "openai";

const openai = new OpenAI();

export async function POST(req: Request) {
  const { prompt } = await req.json();

  const stream = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [{ role: "user", content: prompt }],
    stream: true,
  });

  // Forward only the text deltas as plain UTF-8 so the client can decode
  // them directly (the raw stream would emit JSON chunk objects)
  const encoder = new TextEncoder();
  return new Response(
    new ReadableStream({
      async start(controller) {
        for await (const chunk of stream) {
          const text = chunk.choices[0]?.delta?.content ?? "";
          if (text) controller.enqueue(encoder.encode(text));
        }
        controller.close();
      },
    })
  );
}

On the frontend:

const response = await fetch("/api/chat", { method: "POST", body });
const reader = response.body!.getReader();
const decoder = new TextDecoder();

while (true) {
  const { value, done } = await reader.read();
  if (done) break;

  // stream: true keeps multi-byte characters split across chunks intact
  const text = decoder.decode(value, { stream: true });
  setResponse(prev => prev + text); // update UI in real time
}

Users see responses appearing word-by-word. Way better UX than waiting 5 seconds for the complete response.

## Pattern 2: Fallback Chains

Never trust a single API:

async function generateText(prompt: string) {
  try {
    // Try Claude first (best quality)
    return await anthropic.messages.create({...});
  } catch (error) {
    try {
      // Fall back to GPT
      return await openai.chat.completions.create({...});
    } catch (error) {
      // Fall back to local model
      return await localLLM.generate(prompt);
    }
  }
}

This prevents a single provider outage from taking down your app.
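
A fallback chain only helps if a hung provider fails fast. One way to enforce that is a small timeout wrapper; this is a sketch (`withTimeout` is a hypothetical helper, and the SDKs' built-in `timeout`/abort options may be preferable in practice):

```typescript
// Hypothetical helper: reject if the underlying call takes longer than `ms`,
// so the chain can move on to the next provider.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    promise.then(
      value => { clearTimeout(timer); resolve(value); },
      err => { clearTimeout(timer); reject(err); }
    );
  });
}
```

Each provider call in the chain can then be wrapped, e.g. `await withTimeout(anthropic.messages.create({...}), 10_000)`, so a stalled request falls through to the next provider instead of hanging the whole chain.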

## Pattern 3: Cost Optimization

LLM costs add up fast. Strategies:

**Use cheaper models for low-risk tasks:**

// High-stakes: use GPT-4
const response = await gpt4(prompt);

// Low-stakes: use gpt-3.5-turbo (1/10th the cost)
const summary = await gpt35(prompt);

**Batch similar requests:**

// Bad: 100 separate API calls
for (const item of items) {
  await openai.chat.completions.create({ messages: [item] });
}

// Good: the Batch API — upload a JSONL file (one request per line),
// then create a batch that's processed asynchronously at a discount
const file = await openai.files.create({ file: batchFile, purpose: "batch" });
const batch = await openai.batches.create({
  input_file_id: file.id,
  endpoint: "/v1/chat/completions",
  completion_window: "24h",
});

**Cache responses:**

const cache = new Map();

async function askLLM(question: string) {
  if (cache.has(question)) {
    return cache.get(question);
  }
  
  const response = await openai.chat.completions.create({...});
  cache.set(question, response);
  return response;
}

For DocMind, caching reduced API costs by 40%.
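
A bare Map never evicts, so stale answers stick around forever. A slightly more production-shaped variant adds a TTL; this is an illustrative sketch (the `makeCachedFetcher` helper and the TTL value are mine, not DocMind's actual cache):

```typescript
// Hypothetical TTL cache: entries older than ttlMs are refetched.
type Entry<T> = { value: T; expiresAt: number };

function makeCachedFetcher<T>(fetcher: (key: string) => Promise<T>, ttlMs: number) {
  const cache = new Map<string, Entry<T>>();
  return async (key: string): Promise<T> => {
    const hit = cache.get(key);
    if (hit && hit.expiresAt > Date.now()) return hit.value; // fresh hit, no API call
    const value = await fetcher(key);
    cache.set(key, { value, expiresAt: Date.now() + ttlMs });
    return value;
  };
}
```

Wrap the real LLM call as the `fetcher`, and every repeated question inside the TTL window costs nothing.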

## Pattern 4: Rate Limiting

APIs have rate limits. Handle them gracefully:

import pLimit from "p-limit";

const limit = pLimit(5); // 5 concurrent requests

const promises = items.map(item =>
  limit(() => openai.chat.completions.create({...}))
);

await Promise.all(promises);

Also implement exponential backoff:

async function retryWithBackoff<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxRetries - 1) throw error;
      const delay = Math.pow(2, attempt) * 1000; // 1s, 2s, 4s
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw new Error("unreachable"); // satisfies the type checker
}

## Pattern 5: Token Counting

Track tokens before sending requests:

import { encodingForModel } from "js-tiktoken";

const enc = encodingForModel("gpt-4");
const tokens = enc.encode(prompt).length;

if (tokens > 4000) {
  console.warn("Prompt too long, truncating...");
}

Prevents surprise $100 bills from long prompts.
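
The snippet above only warns; it never truncates. Here's a sketch of the truncation step, written against a minimal `Encoder` interface of my own so the same helper works with js-tiktoken's encoder (which exposes `encode`/`decode`) or any other tokenizer:

```typescript
// Minimal tokenizer interface (my abstraction, not part of js-tiktoken).
interface Encoder {
  encode(text: string): number[];
  decode(tokens: number[]): string;
}

// Hard-cap text at maxTokens: encode, slice the token array, decode back.
function truncateToBudget(enc: Encoder, text: string, maxTokens: number): string {
  const tokens = enc.encode(text);
  if (tokens.length <= maxTokens) return text; // already within budget
  return enc.decode(tokens.slice(0, maxTokens));
}
```

Truncating on token boundaries (rather than characters) means the cost cap you enforce matches what the API will actually bill.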

## Pattern 6: Safety Rails

Not all AI output is safe. Add guards:

// Check for dangerous content (e.g. via a Perspective API client)
const toxicity = await perspective.analyze(response);
if (toxicity.score > 0.7) {
  return "I can't help with that.";
}

// Check for hallucinations (for RAG)
if (!response.includes(citation)) {
  return "I don't have this information.";
}

// Check for PII leaks
const hasPII = /\d{3}-\d{2}-\d{4}/.test(response); // SSN pattern
if (hasPII) {
  return "Response redacted for privacy.";
}

## Pattern 7: Observability

Track everything:

logger.info({
  event: "llm_request",
  model: "gpt-4",
  tokens_used: 234,
  latency_ms: 1500,
  cost_cents: 12,
  error: null
});

This data is gold:
- Which models are slow?
- Which requests cost the most?
- What's failing?
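
One way to collect those fields without sprinkling timers through every call site is an instrumentation wrapper; this is a sketch (the record shape mirrors the log example above, and the helper name is illustrative):

```typescript
// Hypothetical wrapper: times the call and returns a structured record
// alongside the result, ready to hand to your logger.
type LLMLog = {
  event: "llm_request";
  model: string;
  latency_ms: number;
  error: string | null;
};

async function instrumented<T>(
  model: string,
  call: () => Promise<T>
): Promise<{ result: T | null; log: LLMLog }> {
  const start = Date.now();
  try {
    const result = await call();
    return { result, log: { event: "llm_request", model, latency_ms: Date.now() - start, error: null } };
  } catch (err) {
    return { result: null, log: { event: "llm_request", model, latency_ms: Date.now() - start, error: String(err) } };
  }
}
```

Feed the returned `log` straight into `logger.info`, and attach token and cost fields from the API response where available.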

## When NOT to Use LLMs

### Small, deterministic tasks

If you can write if/else logic, do it. LLMs are overkill and expensive.

### Real-time latency-critical apps

User hovers over a button, needs result in <50ms. LLMs take 500ms+. Use a rule engine.

### Zero-tolerance for errors

Medical diagnosis, legal contracts. LLMs hallucinate. Use expert systems or ask a human.

## Conclusion

LLM integration is becoming standard. The difference between good and bad implementations is usually observability and fallbacks. Know your costs, cache aggressively, and always have a plan B.
    

Vasanth Kumar

Full-Stack Engineer & AI Product Builder

4+ years of experience building scalable web applications and AI-powered products. Passionate about end-to-end product development, clean architecture, and solving real-world problems.
