Building an AI Chatbot With LangChain: Practical Developer Guide
Most "build a chatbot" tutorials are glorified curl wrappers — call the OpenAI API, print the response, call it done. That's not a chatbot. A real chatbot needs conversation memory that doesn't blow up your context window, retrieval so it doesn't hallucinate, streaming so users aren't staring at a spinner for 10 seconds, and guardrails so it doesn't go off the rails.
I've built several production chatbots with LangChain, and this guide covers what I actually do — from a basic chain to a deployment-ready setup.
What you'll build:
- Conversational AI chatbot with LangChain and Python
- Persistent conversation memory
- RAG pipeline grounded in your own documents
- Streaming responses
- FastAPI deployment
Why LangChain?
LangChain abstracts the boilerplate of chaining prompts, managing memory, and wiring up retrieval. For a simple Q&A bot, you don't need it; just call the API directly. But once you need memory, retrieval, streaming, and multi-provider support, LangChain saves a large share of the integration code you'd write yourself (60-70% by my rough estimate).
The tradeoff is another dependency and some abstraction overhead. Worth it in my experience for anything beyond a single-prompt wrapper.
Prerequisites
- Python 3.11+
- An OpenAI API key (or Anthropic, Google, etc.)
- Basic async Python familiarity
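Install the dependencies:
pip install langchain langchain-openai langchain-community faiss-cpu python-dotenv fastapi uvicorn
Then store your API key in a .env file in the project root, where load_dotenv() will pick it up:
OPENAI_API_KEY=sk-your-key-here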
Step 1: Basic Conversational Chain
The simplest possible chatbot — takes input, returns a response with history:
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.messages import HumanMessage, AIMessage
load_dotenv()
llm = ChatOpenAI(model="gpt-4o", temperature=0.7)
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Be concise and direct."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}"),
])
chain = prompt | llm
# Simple in-memory history
history = []
def chat(user_input: str) -> str:
    response = chain.invoke({"input": user_input, "history": history})
    history.append(HumanMessage(content=user_input))
    history.append(AIMessage(content=response.content))
    return response.content
This works, but has problems. History grows forever (you'll blow past the context window), it's not persistent, and there's no retrieval. Let's fix each one.
Step 2: Memory That Actually Works
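The most practical approach for production: a sliding window of the last 10-20 messages, with session isolation. The code below handles the session isolation; trimming each session to the window still has to happen before the history reaches the model.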
from langchain_core.chat_history import InMemoryChatMessageHistory
from langchain_core.runnables.history import RunnableWithMessageHistory
store = {}
def get_session_history(session_id: str):
    # One history object per session, created on first use
    if session_id not in store:
        store[session_id] = InMemoryChatMessageHistory()
    return store[session_id]
chain_with_history = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history",
)
response = chain_with_history.invoke(
    {"input": "What's the best way to deploy a Python app?"},
    config={"configurable": {"session_id": "user-123"}},
)
print(response.content)
For production, swap InMemoryChatMessageHistory for a Redis- or Postgres-backed history so conversations survive restarts:
# Requires the redis package (pip install redis)
from langchain_community.chat_message_histories import RedisChatMessageHistory
def get_session_history(session_id: str):
    return RedisChatMessageHistory(session_id, url="redis://localhost:6379")
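The Step 2 snippets isolate sessions but leave the window itself unimplemented. Here is a minimal sketch of the sliding-window idea in plain Python, assuming messages are (role, content) tuples standing in for HumanMessage and AIMessage. The sliding_window function and its max_messages parameter are names I made up for illustration, not LangChain APIs.

```python
from typing import List, Tuple

# (role, content); a hypothetical stand-in for HumanMessage / AIMessage
Message = Tuple[str, str]


def sliding_window(messages: List[Message], max_messages: int = 20) -> List[Message]:
    """Keep at most the last max_messages messages, starting on a human turn."""
    if max_messages <= 0:
        return []
    window = messages[-max_messages:]
    # Drop leading non-human turns so the window never opens with an orphaned reply
    while window and window[0][0] != "human":
        window = window[1:]
    return window


# Example: 30 alternating turns, capped to a window of 5
history = [("human" if i % 2 == 0 else "ai", f"m{i}") for i in range(30)]
recent = sliding_window(history, max_messages=5)
```

In a real chain you would apply the same trimming to the session's messages before they fill the history placeholder. langchain_core also ships a trim_messages helper that may fit here; check its current documentation for the exact parameters.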
Step 3: RAG — Stop the Hallucinating
A chatbot relying only on training data will make stuff up. RAG fixes this by fetching relevant documents before generating. This is the single most impactful improvement you can make to accuracy.
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain
# 1. Load your documents
loader = DirectoryLoader("./docs", glob="**/*.md", loader_cls=TextLoader)
documents = loader.load()
# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = splitter.split_documents(documents)
# 3. Create vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(splits, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
# 4. Build RAG chain
rag_prompt = ChatPromptTemplate.from_messages([
    ("system", """Answer the user's question based on the context below.
If the context doesn't contain relevant info, say so honestly.
Context: {context}"""),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}"),
])
document_chain = create_stuff_documents_chain(llm, rag_prompt)
rag_chain = create_retrieval_chain(retriever, document_chain)
Some things tutorials rarely mention that I learned the hard way:
- Chunk size matters a lot. 1000 characters is a solid starting point. Too small = lost context. Too large = diluted relevance.
- Overlap prevents cutting sentences in half. 200 chars is a good default.
- Embedding model choice affects cost. text-embedding-3-small is ~80% cheaper than text-embedding-3-large with minimal quality loss for most cases.
- FAISS is fine for prototyping. Switch to Pinecone, Weaviate, or pgvector for production over 100K documents.
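To make the chunk size and overlap numbers concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers natural separators such as paragraphs and sentences); chunk_text is a hypothetical helper that only shows how the two parameters interact.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into character windows where neighbours share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each window advances
    # Stop once the remaining tail is already covered by the previous window
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


# Example: a 2600-character document with the defaults from Step 3
sample = "".join(chr(97 + i % 26) for i in range(2600))
chunks = chunk_text(sample)
```

With these defaults each window advances 800 characters, so the last 200 characters of one chunk reappear at the start of the next. That repeated tail is what keeps a sentence cut at a boundary intact in at least one chunk.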
Step 4: Streaming
Users hate waiting 5-10 seconds for a response to appear all at once. Streaming tokens as they're generated makes the chatbot feel responsive. This isn't optional for production.
from langchain_core.callbacks import StreamingStdOutCallbackHandler
# Prints tokens to stdout as they arrive (handy for local debugging)
streaming_llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0.7,
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()],
)
# For programmatic streaming (e.g., SSE to frontend)
async def stream_response(user_input: str, session_id: str):
    async for chunk in chain_with_history.astream(
        {"input": user_input},
        config={"configurable": {"session_id": session_id}},
    ):
        # Skip empty chunks (e.g., metadata-only events)
        if hasattr(chunk, "content") and chunk.content:
            yield chunk.content
Step 5: FastAPI Server
Make it accessible over HTTP with server-sent events for streaming:
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
app = FastAPI()
class ChatRequest(BaseModel):
    message: str
    session_id: str
@app.post("/chat")
async def chat_endpoint(req: ChatRequest):
    async def generate():
        # Emit each token as one SSE event, then a sentinel so the client knows to stop
        async for token in stream_response(req.message, req.session_id):
            yield f"data: {token}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
@app.get("/health")
async def health():
    return {"status": "ok"}
Save the server code as main.py and start it:
uvicorn main:app --host 0.0.0.0 --port 8000
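One thing to watch with raw "data: {token}" framing: a token that contains a newline splits the SSE event in two. Below is a minimal sketch of one way around that, assuming you are free to choose the payload format. It JSON-encodes each token so the payload always stays on a single data line. The sse_event and parse_sse helpers and the "token" key are my own illustrative names, and the parser assumes each item it receives is one complete event.

```python
import json
from typing import Iterable, Iterator


def sse_event(token: str) -> str:
    """Wrap one token as a single SSE event; JSON escapes embedded newlines."""
    return f"data: {json.dumps({'token': token})}\n\n"


def parse_sse(events: Iterable[str]) -> Iterator[str]:
    """Yield tokens from complete SSE events until the [DONE] sentinel."""
    for event in events:
        payload = event.removeprefix("data: ").strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)["token"]


# Round trip: a token with a leading newline survives intact
stream = [sse_event("Hello"), sse_event("\nworld"), "data: [DONE]\n\n"]
text = "".join(parse_sse(stream))
```

On the server side this would mean yielding sse_event(token) inside generate() instead of the raw f-string, with the frontend decoding each payload as JSON.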
Pitfalls I've Hit in Production
1. Context overflow. Even with gpt-4o's 128K context, long conversations plus retrieved docs stack up fast. Always cap history and limit how many chunks you retrieve.
2. Rate limits. OpenAI will rate-limit you. Use LangChain's retry logic:
llm = ChatOpenAI(model="gpt-4o", max_retries=3, request_timeout=30)
3. Cost surprise. A RAG chatbot retrieving 4 chunks of 1000 tokens plus 20 messages of history sends ~8-10K tokens per request. At GPT-4o pricing, that's $0.03-0.05 per turn. For 10K messages/day, that's $300-500 per day, or roughly $9,000-15,000 per month. Track your usage.
4. No evaluation. You can't improve what you don't measure. Log every interaction and review response quality regularly. LangSmith is excellent for this.
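For the cost pitfall, it helps to keep per-turn, per-day, and per-month figures separate. Here is a small back-of-the-envelope sketch. The per-million-token prices and the 500 output tokens per turn are placeholder assumptions, not quoted rates, so plug in your provider's current pricing. estimate_cost is a hypothetical helper name.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    turns_per_day: int,
    input_price_per_m: float = 2.50,    # assumed USD per 1M input tokens
    output_price_per_m: float = 10.00,  # assumed USD per 1M output tokens
    days_per_month: int = 30,
) -> dict[str, float]:
    """Estimate chatbot spend per turn, per day, and per month in USD."""
    per_turn = (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000
    per_day = per_turn * turns_per_day
    return {
        "per_turn": per_turn,
        "per_day": per_day,
        "per_month": per_day * days_per_month,
    }


# Example: 10K input tokens and an assumed 500 output tokens per turn, 10K turns/day
cost = estimate_cost(input_tokens=10_000, output_tokens=500, turns_per_day=10_000)
```

Running this against your real token logs, rather than guessed averages, is the quickest way to see whether trimming history or retrieving fewer chunks is worth the effort.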
LangChain vs Building from Scratch
For a simple Q&A bot, you don't need LangChain. But once you need memory + retrieval + streaming + multi-provider support, it saves massive integration effort. The tradeoff is added dependency and abstraction.
Alternatives: LlamaIndex for pure RAG, Haystack for document processing, or raw API calls with your own thin wrapper if you want full control.
Production Checklist
Before shipping:
- Rate limiting on your endpoints (not just the LLM provider)
- Input validation — sanitize messages, cap at 2000-4000 chars
- Content filtering — moderation chain or OpenAI's moderation API
- Logging and tracing — LangSmith or your own observability
- Graceful degradation — what happens when the LLM is down?
- Cost alerts — billing thresholds with your provider
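As one concrete take on the input validation item, here is a minimal sketch that strips control characters and enforces a length cap before a message reaches the chain. validate_message and MAX_MESSAGE_CHARS are illustrative names, and the 4000-character limit is just the upper end of the range above, not a recommendation for every app.

```python
import unicodedata

MAX_MESSAGE_CHARS = 4000  # assumed cap; upper end of the 2000-4000 range


def validate_message(raw: str, max_chars: int = MAX_MESSAGE_CHARS) -> str:
    """Return a cleaned message, or raise ValueError if it is empty or too long."""
    # Remove control characters but keep newlines and tabs
    cleaned = "".join(
        ch for ch in raw if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    ).strip()
    if not cleaned:
        raise ValueError("message is empty")
    if len(cleaned) > max_chars:
        raise ValueError(f"message exceeds {max_chars} characters")
    return cleaned


# Example: a null byte and surrounding whitespace are stripped
clean = validate_message("  hi\x00 there \n")
```

In the FastAPI endpoint you would call this on req.message and turn a ValueError into a 4xx response. This only covers shape and size; content moderation is a separate check.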
Wrapping Up
We went from a basic wrapper to a production chatbot with memory, RAG, streaming, and a REST API. LangChain handles the plumbing so you can focus on what matters — your data, your prompts, your UX.
The code here is a starting point. You'll tune chunk sizes, experiment with retrieval strategies, and iterate on system prompts. LangChain makes all of that swappable without rewriting core logic.
Start simple, measure everything, add complexity only when needed.