How to Integrate AI Into Your Web App (Next.js + OpenAI + RAG, 2026 Guide)

Users do not just want apps that store and retrieve data anymore. They expect software that understands context, drafts text, summarizes documents, and answers questions in natural language. AI integration in your web app has moved from differentiator to baseline expectation.

This guide is what I actually do when a client asks me to add AI to their existing Next.js or MERN application. It is not "wrap ChatGPT in a textarea." It covers the production patterns: an orchestration layer, RAG over proprietary data, streaming, rate limiting, and cost controls.

The architecture that actually works

LLMs are stateless and have no idea what your database contains. Building an AI feature means building an orchestration layer between three things:

Your frontend (the UI that triggers the AI feature).
Your backend (Next.js API routes or a separate Node/Go service).
The LLM provider (OpenAI, Anthropic Claude, Google Gemini).

The frontend must never talk to the LLM directly. Every AI call goes through your backend so you can authenticate the user, check rate limits, attach business context (chunks retrieved by RAG), inject system prompts, and meter usage per tenant.
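The backend checks described above can be sketched as a single gatekeeping function that runs before any LLM call. This is only an illustration under stated assumptions: `AiRequest`, `TenantState`, and `prepareLlmCall` are hypothetical names, and the quota model is invented for the example.

```typescript
// Hypothetical shapes: none of these names come from a real library.
interface AiRequest {
  userId: string | null; // null means the session could not be authenticated
  tenantId: string;
  question: string;
}

interface TenantState {
  tokensUsedThisMonth: number;
  monthlyTokenQuota: number;
}

type Prepared =
  | { ok: true; system: string; prompt: string }
  | { ok: false; status: 401 | 429; reason: string };

// Runs the checks in the order the text lists them, before any LLM call.
function prepareLlmCall(
  req: AiRequest,
  tenant: TenantState,
  retrievedChunks: string[],
): Prepared {
  // 1. Authenticate the user.
  if (req.userId === null) {
    return { ok: false, status: 401, reason: "not authenticated" };
  }
  // 2. Enforce the tenant quota (metering after the call is not shown).
  if (tenant.tokensUsedThisMonth >= tenant.monthlyTokenQuota) {
    return { ok: false, status: 429, reason: "monthly quota exhausted" };
  }
  // 3. Attach business context and inject the system prompt.
  const context = retrievedChunks.join("\n---\n");
  return {
    ok: true,
    system: "Answer using only the provided context.",
    prompt: `Context:\n${context}\n\nQuestion: ${req.question}`,
  };
}
```

Only a result with `ok: true` should ever reach the provider; the rejected cases map directly to HTTP status codes in a Route Handler.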

Pattern 1: Simple generative feature

The starting point is one focused generative feature. For example, a SaaS for real estate agents that turns 5 bullet points into a listing description.

The flow:

User submits bullets from the React form.
Frontend POSTs to your Next.js API route.
API route authenticates, builds the prompt, calls the LLM, streams back the response.

import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";

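export async function POST(req: Request) {
  // Authenticate the user and check rate limits here before spending tokens.
  const { propertyDetails } = await req.json();

  // Start the model call; the result streams tokens as they are generated.
  const result = await streamText({
    model: openai("gpt-4o"),
    system:
      "You write luxurious, inviting real estate listing descriptions in 3 short paragraphs.",
    prompt: `Property bullet points: ${propertyDetails}`,
  });

  // Return a streaming HTTP response the client can render incrementally.
  return result.toDataStreamResponse();
}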

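That is a working streaming endpoint in about 15 lines using the Vercel AI SDK. Add authentication and rate limiting before you call it production grade. The SDK handles streaming, error retries, and provider switching, although the response helper has been renamed across major versions of the SDK, so match the call to the version you install. I recommend it over calling the raw OpenAI SDK directly in 90 percent of cases.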
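The route above passes the request body straight into the prompt. A minimal sketch of validating it first might look like the following, assuming the client sends `propertyDetails` as an array of strings; `parseBullets`, `buildListingPrompt`, and both limits are hypothetical.

```typescript
// Hypothetical limits; tune them to your own product.
const MAX_BULLETS = 10;
const MAX_BULLET_LENGTH = 200;

// Returns a cleaned list of bullets, or null when the body is not usable.
function parseBullets(body: unknown): string[] | null {
  if (typeof body !== "object" || body === null) return null;
  const details = (body as { propertyDetails?: unknown }).propertyDetails;
  if (!Array.isArray(details)) return null;

  const bullets = details
    .filter((item): item is string => typeof item === "string")
    .map((item) => item.trim().slice(0, MAX_BULLET_LENGTH))
    .filter((item) => item.length > 0)
    .slice(0, MAX_BULLETS);

  return bullets.length > 0 ? bullets : null;
}

// Builds a user prompt similar to the one in the route, one bullet per line.
function buildListingPrompt(bullets: string[]): string {
  const lines = bullets.map((bullet) => `- ${bullet}`).join("\n");
  return `Property bullet points:\n${lines}`;
}
```

Capping the number and length of bullets also caps the input tokens a single request can consume, which keeps one oversized payload from inflating the bill.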

Pattern 2: Structured output (when prose is not enough)

Most AI features eventually need structured output, not just text: tagging emails by intent, extracting line items from invoices, classifying tickets. For this, use structured outputs via Zod schemas.

import { generateObject } from "ai";
import { z } from "zod";
import { openai } from "@ai-sdk/openai";

const ticketSchema = z.object({
  category: z.enum(["billing", "technical", "general"]),
  urgency: z.enum(["low", "medium", "high"]),
  confidence: z.number().min(0).max(1),
  summary: z.string().max(140),
});

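// routeToBilling stands in for your own routing helper; it is not part of the SDK.
declare function routeToBilling(ticketId: string, summary: string): Promise<void>;

export async function triageTicket(ticket: { id: string; body: string }) {
  const { object } = await generateObject({
    model: openai("gpt-4o"),
    schema: ticketSchema,
    prompt: `Classify this support ticket: ${ticket.body}`,
  });

  // object is now type safe and validated against ticketSchema
  if (object.confidence > 0.9 && object.category === "billing") {
    await routeToBilling(ticket.id, object.summary);
  }
  return object;
}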

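This pattern turns the "the AI returned malformed JSON" class of bugs that plague naive integrations into a validation error you can catch and handle. Note that the confidence field is the model's own estimate, not a calibrated probability, so treat the 0.9 threshold as a starting point to tune.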

Pattern 3: Retrieval Augmented Generation (RAG)

When your AI needs to answer questions about your data, simple prompts will not work. LLMs have context limits and hallucinate when asked about things they were not trained on.

Retrieval Augmented Generation (RAG) is the architecture that solves this. The AI does not memorize your data. It searches your data, retrieves the relevant pieces, and uses them as context for each answer.

The four phases of a RAG system

Ingestion and chunking. Take your data (PDFs, docs, FAQs, past emails) and split it into 500 to 1000 token chunks. LangChain has good built in splitters.
Embedding. Convert each chunk to a numerical vector using an embedding model like OpenAI's text-embedding-3-small. Each chunk, whatever its length, becomes one 1536 dimensional vector (the model's default output size).
Storage. Store vectors in a vector database. For most projects, Supabase pgvector is the right call because it lives next to your existing Postgres data and avoids a separate service. Pinecone or Weaviate make sense at scale.
Retrieval and generation. When a user asks a question, embed the query, find the top 5 nearest vectors, and inject those chunks into the LLM prompt with instructions to answer only from the provided context.
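The ingestion and chunking phase can be sketched without any library. The version below is a rough illustration, not a replacement for a real splitter: it counts words rather than tokens, and the overlap between neighbouring chunks is an assumption of this sketch (a common practice, but not something the steps above prescribe). `chunkText` is a hypothetical name.

```typescript
// Splits text into overlapping chunks. Words stand in for tokens here, which
// is only an approximation; a real tokenizer would give exact counts.
function chunkText(text: string, chunkSize = 500, overlap = 50): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("need chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const words = text.split(/\s+/).filter((word) => word.length > 0);
  const chunks: string[] = [];
  const step = chunkSize - overlap;

  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    // Stop once a chunk reaches the end, so no tail-only duplicate is emitted.
    if (start + chunkSize >= words.length) break;
  }
  return chunks;
}
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some words twice.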

Minimal RAG implementation with Supabase pgvector

import { openai } from "@ai-sdk/openai";
import { embed, generateText } from "ai";
import { createClient } from "@supabase/supabase-js";

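// The service key bypasses Row Level Security, so keep this client server-side
// and rely on tenant_filter below for tenant isolation.
const supabase = createClient(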
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_KEY!,
);

export async function answerQuestion(question: string, tenantId: string) {
  // 1. Embed the user's question
  const { embedding } = await embed({
    model: openai.embedding("text-embedding-3-small"),
    value: question,
  });

  // 2. Find the 5 most similar chunks for this tenant
  const { data: chunks, error } = await supabase.rpc("match_documents", {
    query_embedding: embedding,
    match_threshold: 0.78,
    match_count: 5,
    tenant_filter: tenantId,
  });
  if (error) throw new Error(`match_documents failed: ${error.message}`);

  // 3. Inject chunks into the prompt (no matches yields an empty context)
  const context = (chunks ?? [])
    .map((c: { content: string }) => c.content)
    .join("\n---\n");

  const { text } = await generateText({
    model: openai("gpt-4o"),
    system:
      "Answer using only the provided context. If the answer is not in the context, say you don't know.",
    prompt: `Context:\n${context}\n\nQuestion: ${question}`,
  });

  return text;
}

This is roughly the architecture I shipped for VistaChat (hospitality inbox assistant) and other production grade AI integrations.
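The snippet above calls a `match_documents` database function that is not shown. As a rough in-memory illustration of what such a function is assumed to do (filter by tenant, score by cosine similarity, apply the threshold, return the top matches), consider the sketch below. `StoredChunk` and `matchChunks` are hypothetical names, and in production this work would run inside Postgres rather than in application code.

```typescript
// Hypothetical row shape; the real table is not described in the text.
interface StoredChunk {
  tenantId: string;
  content: string;
  embedding: number[];
}

// Cosine similarity: 1 means same direction, 0 means orthogonal vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  const dot = a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
  const norm = (v: number[]): number =>
    Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  const denominator = norm(a) * norm(b);
  return denominator === 0 ? 0 : dot / denominator;
}

// Mirrors the RPC parameters used above: threshold 0.78, at most 5 matches.
function matchChunks(
  rows: StoredChunk[],
  queryEmbedding: number[],
  tenantId: string,
  matchThreshold = 0.78,
  matchCount = 5,
): { content: string; similarity: number }[] {
  return rows
    .filter((row) => row.tenantId === tenantId) // tenant isolation comes first
    .map((row) => ({
      content: row.content,
      similarity: cosineSimilarity(row.embedding, queryEmbedding),
    }))
    .filter((hit) => hit.similarity > matchThreshold)
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, matchCount);
}
```

Filtering by tenant before scoring matters more than the scoring itself: a chunk from another tenant must never be a candidate, however similar it is.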

Five production grade practices (skip these and your AI feature will break)

1. Never expose API keys to the client

The number of side projects with OPENAI_API_KEY shipped in a public NEXT_PUBLIC_ variable is depressing. Every API call must go through your backend. Period.
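One way to catch this mistake early is a small check over environment variable names. The sketch below is a heuristic only; `findExposedSecrets` and its name pattern are hypothetical. It relies on the fact that Next.js inlines any variable prefixed with NEXT_PUBLIC_ into the browser bundle.

```typescript
// Returns the names of environment variables that look like secrets but carry
// the NEXT_PUBLIC_ prefix, which Next.js inlines into the browser bundle.
function findExposedSecrets(
  env: Record<string, string | undefined>,
): string[] {
  // Hypothetical heuristic: adjust the pattern to your own naming scheme.
  const looksSecret = /(API_KEY|SECRET|TOKEN|SERVICE_KEY)/i;
  return Object.keys(env)
    .filter((name) => name.startsWith("NEXT_PUBLIC_") && looksSecret.test(name))
    .sort();
}
```

Calling it with `process.env` in a build or CI step and failing when the result is non-empty turns a silent leak into a loud error.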

2. Implement rate limiting and per tenant quotas

LLM APIs are billed by token. One malicious user looping a generative endpoint can ring up a 10,000 USD bill overnight. Add:

IP based rate limiting at the edge (Upstash ratelimit + Vercel Edge Middleware).
Per user quotas tracked in your database.
Hard kill switches if usage spikes 10x.
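The three safeguards above can be sketched in a few lines. This is an in-memory illustration with hypothetical names (`FixedWindowLimiter`, `shouldKillSwitch`); it is not the Upstash API, and real deployments keep the counters in Redis or a database so they survive restarts and are shared across instances.

```typescript
// In-memory fixed-window limiter. State is lost on restart and is not shared
// across instances, so treat this as a sketch of the logic only.
class FixedWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly windows = new Map<string, { startedAt: number; count: number }>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true when the call is allowed and records it against the key
  // (an IP address for edge limiting, a user id for per user quotas).
  allow(key: string, now: number = Date.now()): boolean {
    const current = this.windows.get(key);
    if (current === undefined || now - current.startedAt >= this.windowMs) {
      this.windows.set(key, { startedAt: now, count: 1 });
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}

// Kill switch: trips when usage in the current hour reaches 10x the baseline.
function shouldKillSwitch(
  tokensThisHour: number,
  baselineTokensPerHour: number,
): boolean {
  return baselineTokensPerHour > 0 && tokensThisHour >= 10 * baselineTokensPerHour;
}
```

A fixed window is the simplest scheme and allows short bursts at window boundaries; a sliding window or token bucket smooths that out if it matters for your traffic.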

3. Always stream

LLM responses take 3 to 20 seconds. A spinner for that long feels broken. Streaming responses word by word, like ChatGPT, drops perceived latency dramatically. The Vercel AI SDK handles this in one line.

4. Pick the right model per task

In 2026 the smart move is model routing: cheap and fast for simple tasks, premium for complex reasoning.

Classification, tagging, short summaries: GPT-4o-mini, Gemini 2.5 Flash, Claude Haiku 4.5.
General chat, drafting, RAG answers: GPT-4o, Claude Sonnet 4.6.
Long context, careful reasoning, agentic work: Claude Opus 4.7 (1M context), Gemini 2.5 Pro.

A well routed app can spend around 70 percent less overall than one that uses the premium model everywhere, depending on the task mix.
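To make the routing idea concrete, here is a small sketch with worked arithmetic. The model labels come from the tiers above, but the prices are invented round numbers and `TaskKind`, `routeModel`, and `workloadCostUsd` are hypothetical names, so read the result as an illustration of the mechanism rather than a benchmark.

```typescript
type TaskKind = "classification" | "chat" | "deep_reasoning";

interface Route {
  label: string; // display name from the tiers above, not an API model id
  usdPerMillionTokens: number; // invented round price, for arithmetic only
}

const ROUTES: Record<TaskKind, Route> = {
  classification: { label: "GPT-4o-mini", usdPerMillionTokens: 1 },
  chat: { label: "GPT-4o", usdPerMillionTokens: 10 },
  deep_reasoning: { label: "Claude Opus 4.7", usdPerMillionTokens: 30 },
};

const TASK_KINDS: TaskKind[] = ["classification", "chat", "deep_reasoning"];

// Picks the tier for a task; callers decide the task kind up front.
function routeModel(task: TaskKind): string {
  return ROUTES[task].label;
}

// Total cost of a workload, either routed per task or always on the top tier.
function workloadCostUsd(
  tokensByTask: Record<TaskKind, number>,
  alwaysPremium: boolean,
): number {
  return TASK_KINDS.reduce((total, task) => {
    const route = alwaysPremium ? ROUTES.deep_reasoning : ROUTES[task];
    return total + (tokensByTask[task] / 1e6) * route.usdPerMillionTokens;
  }, 0);
}
```

With 6M classification tokens, 3M chat tokens, and 1M reasoning tokens, the routed cost under these made-up prices is 66 USD against 300 USD for the premium model everywhere, a 78 percent reduction. Real savings depend entirely on your task mix and current provider pricing.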

5. Log everything

Log prompt, response, tokens used, latency, and user ID for every AI call. You will need this for debugging, billing reconciliation, and prompt iteration. Tools like Langfuse, Helicone, or a simple Supabase table all work.
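A log row covering the fields listed above might be shaped like this. The sketch assumes you already have token counts and timestamps from the provider response; `AiCallLog` and `buildAiCallLog` are hypothetical names, and where the row is stored is left open.

```typescript
// Hypothetical log row; the field list mirrors the sentence above.
interface AiCallLog {
  userId: string;
  model: string;
  prompt: string;
  response: string;
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
  latencyMs: number;
  createdAt: string; // ISO 8601 timestamp
}

// Derives the computed fields so every call site logs them the same way.
function buildAiCallLog(input: {
  userId: string;
  model: string;
  prompt: string;
  response: string;
  promptTokens: number;
  completionTokens: number;
  startedAtMs: number;
  finishedAtMs: number;
}): AiCallLog {
  return {
    userId: input.userId,
    model: input.model,
    prompt: input.prompt,
    response: input.response,
    promptTokens: input.promptTokens,
    completionTokens: input.completionTokens,
    totalTokens: input.promptTokens + input.completionTokens,
    latencyMs: input.finishedAtMs - input.startedAtMs,
    createdAt: new Date(input.finishedAtMs).toISOString(),
  };
}
```

Keeping prompt and completion tokens separate pays off at billing time, because providers usually price input and output differently.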

What I would skip in 2026

LangChain heavyweight chains. Use it for primitives (text splitters, document loaders) but the new Vercel AI SDK plus structured outputs handle most production needs more cleanly.
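Vector only RAG for small datasets. If your whole corpus fits comfortably in the model's context window, just stuff it all in context and let the model figure it out. Do the arithmetic first: 1000 chunks of 500 tokens is 500,000 tokens, which fits a 1M context model but not the 128k window of GPT-4o. Vector DBs add complexity that is only worth it at scale.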
Building everything from scratch. If you need a chat UI, use assistant-ui or the AI SDK's prebuilt components. Save build time for the proprietary parts.
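Whether a small dataset can skip the vector database comes down to simple arithmetic on token counts. A rough check, with a hypothetical `fitsInContext` helper and an assumed 4,000 token reserve for the system prompt, question, and answer:

```typescript
// Rough check: does the whole corpus fit in one prompt, leaving room for the
// question and the answer? All inputs are estimates, not tokenizer output.
function fitsInContext(
  chunkCount: number,
  tokensPerChunk: number,
  contextWindowTokens: number,
  reservedTokens = 4000, // assumed headroom, adjust to your prompts
): boolean {
  return chunkCount * tokensPerChunk + reservedTokens <= contextWindowTokens;
}
```

For example, `fitsInContext(100, 1000, 128000)` is true, while `fitsInContext(1000, 500, 128000)` is false because 500,000 tokens of chunks exceed a 128k window. Fitting is also not the same as affordable: every request pays for the full context again.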

Frequently asked questions

What is the easiest way to add AI to a Next.js app?

Install ai (Vercel AI SDK) and @ai-sdk/openai, add your OPENAI_API_KEY to env, and use streamText in a Route Handler. You can ship a working AI feature in under 50 lines.

What is RAG in simple terms?

RAG (Retrieval Augmented Generation) is how you make an AI answer questions about your data without retraining the model. You split your data into chunks, convert them to vectors, and at query time retrieve the most relevant chunks and feed them to the LLM as context.

Should I use OpenAI, Claude, or Gemini?

Use whichever has the best price to quality ratio for your task. GPT-4o is the strongest all rounder. Claude Sonnet 4.6 wins on long context and reasoning. Gemini 2.5 wins on cost and multimodal. Most production apps use 2 to 3 providers via the AI SDK and switch based on the task.

Is RAG better than fine tuning?

For 95 percent of business use cases, yes. RAG is faster to build, cheaper, easier to update (just re-embed changed docs), and produces verifiable answers grounded in your data. Fine tuning is for narrow style or format adjustments, not knowledge.
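The "just re-embed changed docs" step can be sketched as a comparison of content hashes. This is one possible approach under stated assumptions, not a prescribed one: `fnv1a` and `docsToReembed` are hypothetical helpers, and the sketch assumes you stored a hash next to each document when it was last embedded.

```typescript
// FNV-1a 32-bit hash: small and dependency free. Fine for change detection in
// a sketch; a cryptographic hash such as SHA-256 is the safer production choice.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Returns ids of documents that are new or whose content changed since the
// hashes were last stored, i.e. the only ones that need new embeddings.
function docsToReembed(
  storedHashes: Map<string, string>,
  currentDocs: { id: string; content: string }[],
): string[] {
  return currentDocs
    .filter((doc) => storedHashes.get(doc.id) !== fnv1a(doc.content))
    .map((doc) => doc.id);
}
```

Unchanged documents are skipped entirely, so an update costs embedding calls only for what actually changed, whereas fine tuning would need a new training run.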

How much does AI integration cost to run?

Inference costs scale with usage. A typical SaaS feature handling 10,000 AI calls per month costs 30 to 200 USD per month on OpenAI or Anthropic. RAG infrastructure (Supabase pgvector) adds 10 to 25 USD per month. Most apps stay under 500 USD per month at moderate scale.
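A back-of-the-envelope estimate makes these ranges easier to check against your own traffic. The function below is a sketch with a hypothetical name (`monthlyCostUsd`); prices are parameters because they change often, and the figures in the note that follows are assumptions for illustration, not a quote from any provider.

```typescript
// Estimated monthly inference spend in USD. Input and output tokens are
// priced separately, as most providers do.
function monthlyCostUsd(
  callsPerMonth: number,
  inputTokensPerCall: number,
  outputTokensPerCall: number,
  usdPerMillionInput: number,
  usdPerMillionOutput: number,
): number {
  const inputCost =
    (callsPerMonth * inputTokensPerCall * usdPerMillionInput) / 1e6;
  const outputCost =
    (callsPerMonth * outputTokensPerCall * usdPerMillionOutput) / 1e6;
  return inputCost + outputCost;
}
```

With 10,000 calls per month, 1,500 input and 500 output tokens per call, and assumed prices of 2.50 and 10 USD per million tokens, the estimate is 87.50 USD, which sits inside the range quoted above. RAG answers push the input side up, since every retrieved chunk is billed as input tokens.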

Ready to add AI to your product?

If you want a secure, scalable AI integration built into your existing web app, reach out. I have shipped production RAG systems including VistaChat and DuChat. See my services for the full scope.