Most "build a chatbot" tutorials are glorified curl wrappers — call the OpenAI API, print the response, call it done. That's not a chatbot. A real chatbot needs conversation memory that doesn't blow up your context window, retrieval so it doesn't hallucinate, streaming so users aren't staring at a spinner for 10 seconds, and guardrails so it doesn't go off the rails.

I've built several production chatbots with LangChain, and this guide covers what I actually do — from a basic chain to a deployment-ready setup.

What you'll build:

Conversational AI chatbot with LangChain and Python
Persistent conversation memory
RAG pipeline grounded in your own documents
Streaming responses
FastAPI deployment

Why LangChain?

LangChain abstracts the boilerplate of chaining prompts, managing memory, and wiring up retrieval. For a simple Q&A bot, you don't need it; just call the API directly. But once you need memory, retrieval, streaming, and multi-provider support, LangChain saves about 60-70% of the integration code you'd write yourself. Treat that figure as a rough estimate from my own projects, not a benchmark.

The tradeoff is another dependency and some abstraction overhead. Worth it in my experience for anything beyond a single-prompt wrapper.

Prerequisites

Python 3.11+
An OpenAI API key (or Anthropic, Google, etc.)
Basic async Python familiarity

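pip install langchain langchain-openai langchain-community faiss-cpu python-dotenv fastapi uvicorn

The imports in this guide follow the LangChain 0.3 package layout. If pip pulls a newer major release and the langchain.chains imports in Step 3 fail, pinning "langchain<1.0" is the simplest fix.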

Put your key in a .env file in the project root so load_dotenv() can find it:

OPENAI_API_KEY=sk-your-key-here

Step 1: Basic Conversational Chain

The simplest possible chatbot — takes input, returns a response with history:

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.messages import HumanMessage, AIMessage

load_dotenv()

llm = ChatOpenAI(model="gpt-4o", temperature=0.7)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Be concise and direct."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}"),
])

chain = prompt | llm

# Simple in-memory history
history = []

def chat(user_input: str) -> str:
    response = chain.invoke({"input": user_input, "history": history})
    history.append(HumanMessage(content=user_input))
    history.append(AIMessage(content=response.content))
    return response.content

This works, but has problems. History grows forever (you'll blow past the context window), it's not persistent, and there's no retrieval. Let's fix each one.

Step 2: Memory That Actually Works

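The most practical approach for production is a sliding window of the last 10-20 messages, with session isolation. The code below handles session isolation by keying each history on a session_id; the window itself still has to be enforced by trimming the stored messages before they are sent to the model.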

from langchain_core.chat_history import InMemoryChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory

store = {}

def get_session_history(session_id: str):
    if session_id not in store:
        store[session_id] = InMemoryChatMessageHistory()
    return store[session_id]

chain_with_history = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history",
)

response = chain_with_history.invoke(
    {"input": "What's the best way to deploy a Python app?"},
    config={"configurable": {"session_id": "user-123"}},
)
print(response.content)
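RunnableWithMessageHistory stores every message it sees, so the sliding window has to be applied separately. Here is a minimal sketch of the windowing logic, assuming messages expose a `type` attribute of "human" or "ai" the way LangChain message objects do. `Msg` and `sliding_window` are hypothetical names used only for illustration:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Msg:
    """Stand-in for a chat message (hypothetical; mirrors the type/content fields)."""
    type: str      # "human" or "ai"
    content: str


def sliding_window(messages: List[Msg], max_messages: int = 20) -> List[Msg]:
    """Keep at most the last `max_messages`, starting on a human turn."""
    if max_messages <= 0:
        return []
    window = messages[-max_messages:]
    # Drop leading AI replies whose question fell outside the window.
    while window and window[0].type != "human":
        window = window[1:]
    return window


conversation = [
    Msg("human", "q1"), Msg("ai", "a1"),
    Msg("human", "q2"), Msg("ai", "a2"),
    Msg("human", "q3"), Msg("ai", "a3"),
]
recent = sliding_window(conversation, max_messages=3)
print([m.content for m in recent])  # ['q3', 'a3']
```

In the setup above, the same idea could be applied to the session history's message list before each call. Recent versions of langchain_core also ship a trim_messages helper that trims by token count, which may be a better fit when message lengths vary a lot.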

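For production, swap InMemoryChatMessageHistory for a Redis- or Postgres-backed history so conversations survive restarts. The Redis version below also needs the redis package (pip install redis):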

from langchain_community.chat_message_histories import RedisChatMessageHistory

def get_session_history(session_id: str):
    return RedisChatMessageHistory(session_id, url="redis://localhost:6379")

Step 3: RAG — Stop the Hallucinating

A chatbot relying only on training data will make stuff up. RAG fixes this by fetching relevant documents before generating. This is the single most impactful improvement you can make to accuracy.

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter  # installed as a dependency of langchain
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain

# 1. Load your documents
loader = DirectoryLoader("./docs", glob="**/*.md", loader_cls=TextLoader)
documents = loader.load()

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = splitter.split_documents(documents)

# 3. Create vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(splits, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# 4. Build RAG chain
rag_prompt = ChatPromptTemplate.from_messages([
    ("system", """Answer the user's question based on the context below.
If the context doesn't contain relevant info, say so honestly.

Context: {context}"""),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}"),
])

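document_chain = create_stuff_documents_chain(llm, rag_prompt)
rag_chain = create_retrieval_chain(retriever, document_chain)

# The prompt requires "history", so pass it explicitly (an empty list on the first turn).
# create_retrieval_chain returns a dict; the generated text is under "answer".
# The question here is only an example.
result = rag_chain.invoke({"input": "What do the docs say about API keys?", "history": []})
print(result["answer"])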

Some things tutorials rarely mention that I learned the hard way:

Chunk size matters a lot. 1000 characters is a solid starting point. Too small = lost context. Too large = diluted relevance.
Overlap prevents cutting sentences in half. 200 chars is a good default.
Embedding model choice affects cost. text-embedding-3-small is ~80% cheaper than text-embedding-3-large with minimal quality loss for most cases.
FAISS is fine for prototyping. Switch to Pinecone, Weaviate, or pgvector for production over 100K documents.
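To make the chunk size and overlap trade-off concrete, here is a simplified fixed-width splitter. It is not how RecursiveCharacterTextSplitter works internally (that one prefers to break on paragraph and sentence separators). `split_with_overlap` is a hypothetical helper that only shows how the two numbers interact:

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-width chunks; each chunk repeats the tail of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in split_with_overlap(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

With the defaults from this guide (1000 and 200), the window advances 800 characters at a time, so you embed roughly 25% more chunks than you would with no overlap. That is the cost of not cutting sentences in half.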

Step 4: Streaming

Users hate waiting 5-10 seconds for a response to appear all at once. Streaming tokens as they're generated makes the chatbot feel responsive. This isn't optional for production.

from langchain_core.callbacks import StreamingStdOutCallbackHandler  # note the capital "O" in StdOut

streaming_llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0.7,
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()],  # prints tokens to stdout as they arrive
)

# For programmatic streaming (e.g., SSE to frontend)
async def stream_response(user_input: str, session_id: str):
    async for chunk in chain_with_history.astream(
        {"input": user_input},
        config={"configurable": {"session_id": session_id}},
    ):
        if hasattr(chunk, "content") and chunk.content:
            yield chunk.content
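An async generator like stream_response has to be driven from an event loop. Below is a small sketch of consuming one from a script while keeping the full reply for logging. `fake_stream` and `collect` are hypothetical names; `fake_stream` is a stand-in so the example runs without an API key:

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(text: str) -> AsyncIterator[str]:
    """Stand-in for stream_response: yields the text word by word."""
    for word in text.split(" "):
        await asyncio.sleep(0)  # hand control back to the event loop, as a network wait would
        yield word + " "


async def collect(stream: AsyncIterator[str]) -> str:
    """Print tokens as they arrive and return the assembled reply."""
    parts: List[str] = []
    async for token in stream:
        print(token, end="", flush=True)
        parts.append(token)
    print()
    return "".join(parts)


reply = asyncio.run(collect(fake_stream("streaming feels faster")))
```

Swapping fake_stream(...) for stream_response(user_input, session_id) should work the same way, since both are async generators of strings.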

Step 5: FastAPI Server

Make it accessible over HTTP with server-sent events for streaming:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    message: str
    session_id: str

@app.post("/chat")
async def chat_endpoint(req: ChatRequest):
    async def generate():
        async for token in stream_response(req.message, req.session_id):
            yield f"data: {token}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(generate(), media_type="text/event-stream")

@app.get("/health")
async def health():
    return {"status": "ok"}

Save the code above as main.py and start the server:

uvicorn main:app --host 0.0.0.0 --port 8000
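One detail the endpoint glosses over: in the server-sent events format a blank line ends an event, so a token that itself contains a newline would be split or truncated on the client. Here is a hedged sketch of a framing helper (`sse_event` is a hypothetical name) that emits one `data:` line per line of payload, which is how the format represents multi-line data:

```python
def sse_event(data: str) -> str:
    """Frame a string as one server-sent event, one `data:` line per payload line."""
    # Normalize line endings first so "\r\n" and "\r" cannot end the event early.
    lines = data.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    return "".join(f"data: {line}\n" for line in lines) + "\n"


print(repr(sse_event("Hello")))
print(repr(sse_event("line one\nline two")))
```

In generate(), yielding sse_event(token) would take the place of the f-string. JSON-encoding each token before sending it is another common option if the frontend already parses JSON.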

Pitfalls I've Hit in Production

1. Context overflow. Even with gpt-4o's 128K context, long conversations plus retrieved docs stack up fast. Always cap the history length and the number of retrieved chunks.

2. Rate limits. OpenAI will rate-limit you. Use LangChain's retry logic:

llm = ChatOpenAI(model="gpt-4o", max_retries=3, request_timeout=30)

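3. Cost surprise. A RAG chatbot retrieving 4 chunks of 1000 tokens plus 20 messages of history sends ~8-10K tokens per request. (The Step 3 splitter measures chunks in characters, so its 1000-character chunks are smaller than the 1000-token chunks assumed here.) At GPT-4o list prices at the time of writing, that's $0.03-0.05 per turn. For 10K messages/day, that's $300-500 per day, or roughly $9,000-15,000 per month. Track your usage.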

4. No evaluation. You can't improve what you don't measure. Log every interaction and review response quality regularly. LangSmith is excellent for this.
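The cost estimate in pitfall 3 is easy to reproduce and adapt to your own traffic. The sketch below assumes placeholder per-token prices and a 300-token reply; neither comes from a price list, and `Pricing` and `cost_per_turn` are hypothetical names:

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """USD per one million tokens. Placeholder values, not current list prices."""
    input_per_million: float
    output_per_million: float


def cost_per_turn(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost of a single request/response pair in USD."""
    return (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000


# Assumed numbers for illustration only; substitute your provider's real prices.
assumed = Pricing(input_per_million=5.00, output_per_million=15.00)
per_turn = cost_per_turn(input_tokens=9_000, output_tokens=300, pricing=assumed)
per_day = per_turn * 10_000   # 10K messages per day
per_month = per_day * 30      # 30-day month

print(f"per turn:  ${per_turn:.4f}")
print(f"per day:   ${per_day:,.2f}")
print(f"per month: ${per_month:,.2f}")
```

Input tokens dominate the bill in this setup, so trimming history and retrieving fewer chunks is usually the first lever to pull.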

LangChain vs Building from Scratch

For a simple Q&A bot, you don't need LangChain. But once you need memory + retrieval + streaming + multi-provider support, it saves massive integration effort. The tradeoff is added dependency and abstraction.

Alternatives: LlamaIndex for pure RAG, Haystack for document processing, or raw API calls with your own thin wrapper if you want full control.


Production Checklist

Before shipping:

Rate limiting on your endpoints (not just the LLM provider)
Input validation — sanitize messages, cap at 2000-4000 chars
Content filtering — moderation chain or OpenAI's moderation API
Logging and tracing — LangSmith or your own observability
Graceful degradation — what happens when the LLM is down?
Cost alerts — billing thresholds with your provider
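As one example from the checklist, input validation can be a small function that runs before a message reaches the chain. This is a minimal sketch under stated assumptions: the 4000-character cap is the upper end of the range above, and `validate_message` and `InvalidMessage` are hypothetical names:

```python
import unicodedata

MAX_MESSAGE_CHARS = 4000  # upper end of the 2000-4000 range from the checklist


class InvalidMessage(ValueError):
    """Raised when a user message fails validation."""


def validate_message(raw: str, max_chars: int = MAX_MESSAGE_CHARS) -> str:
    """Return a cleaned message or raise InvalidMessage."""
    # Strip control characters (Unicode category Cc) except newline and tab,
    # then trim surrounding whitespace.
    cleaned = "".join(
        ch for ch in raw if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    ).strip()
    if not cleaned:
        raise InvalidMessage("message is empty")
    if len(cleaned) > max_chars:
        raise InvalidMessage(f"message exceeds {max_chars} characters")
    return cleaned


print(validate_message("  hello\x00 world \n"))  # hello world
```

In the FastAPI handler this would run on req.message, with InvalidMessage mapped to a 4xx response. Pydantic field constraints on ChatRequest are an alternative if you prefer to validate at the schema level.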

Wrapping Up

We went from a basic wrapper to a production chatbot with memory, RAG, streaming, and a REST API. LangChain handles the plumbing so you can focus on what matters — your data, your prompts, your UX.

The code here is a starting point. You'll tune chunk sizes, experiment with retrieval strategies, and iterate on system prompts. LangChain makes all of that swappable without rewriting core logic.

Start simple, measure everything, add complexity only when needed.


Building an AI Chatbot With LangChain: Practical Developer Guide
