MervCodes

Tech Reviews From A Programmer

Running Local LLMs With Ollama: Developer Setup Guide

I started running LLMs locally about a year ago, mostly because I was tired of burning through API credits while iterating on prompts. What surprised me was how practical it turned out to be — not just for saving money, but for privacy, offline access, and having a model that responds instantly without rate limits or latency spikes.

TL;DR: Ollama lets you run models like Llama 3.1, Codestral, Qwen 2.5, and Gemma 2 locally with a single command. It exposes an OpenAI-compatible API on localhost:11434, works on macOS/Linux/Windows, and needs as little as 8GB RAM for smaller models. This guide covers installation, model selection, API integration, and real-world dev workflows.

Why Run LLMs Locally?

Three problems that cloud APIs can't solve: privacy, cost, and availability.

When you're working with proprietary codebases or client data, sending every prompt to OpenAI or Anthropic might not be an option. Running locally means your data stays on your machine. Period.

Cost matters too. I was making hundreds of API calls a day while testing prompts and iterating on chains. Those tokens add up fast. A local model costs exactly $0 per token after the download.

And availability is simple: no rate limits, no outages, no latency spikes. Your model runs when your machine runs.

The tradeoff is capability. Local models (7B-70B parameters) won't match Claude or GPT-4o on complex reasoning. But for code completion, summarization, structured output, and simple chat — they're genuinely good enough for daily use.

What Hardware Do You Actually Need?

Ollama uses your system RAM (or VRAM with a GPU). Here's the honest breakdown:

  • 7B-8B models (Llama 3.1 8B, Qwen 2.5 7B): 8GB RAM minimum, 16GB comfortable. Any modern MacBook handles these.
  • 13B-14B models (Qwen 2.5 14B): 16GB minimum. Expect ~10-20 tokens/sec on M1/M2.
  • 22B-70B models (Codestral 22B, Llama 3.1 70B): 32-64GB RAM. Serious hardware, but noticeably better output.
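
If you want to sanity-check those numbers for a model that isn't on the list, a back-of-the-envelope estimate gets you close. This is a rough sketch, not how Ollama itself computes anything: the 4.5 bits per weight (roughly a 4-bit quantization plus metadata) and the flat 1.5GB allowance for context cache and runtime are my assumptions.

```python
def estimate_model_memory_gb(
    params_billions: float,
    bits_per_weight: float = 4.5,   # assumption: ~4-bit quantization incl. metadata
    overhead_gb: float = 1.5,       # assumption: flat allowance for KV cache + runtime
) -> float:
    """Rough RAM/VRAM needed to load a model: weights plus a fixed overhead."""
    # 1e9 params * bits / 8 bits-per-byte = bytes; dividing by 1e9 gives GB,
    # so the billions cancel out.
    weights_gb = params_billions * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 1)


if __name__ == "__main__":
    for size in (8, 14, 22, 70):
        print(f"{size}B model: roughly {estimate_model_memory_gb(size)} GB")
```

Leave headroom on top of the estimate for your OS, editor, and browser. A model that only just fits will push the machine into swap.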

Apple Silicon Macs are surprisingly great for this. The unified memory architecture means your 32GB M2 Max can run a 34B model at reasonable speeds without a discrete GPU. On Linux, an NVIDIA GPU with 12GB+ VRAM gives faster inference via CUDA.

Installing Ollama

Takes under a minute on any platform.

macOS (Homebrew):

brew install ollama

macOS/Linux (direct):

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download from ollama.com or use WSL2.

Start the server:

ollama serve

On macOS, the desktop app starts it automatically. On Linux, it runs as a systemd service. Verify:

curl http://localhost:11434
# Should return: Ollama is running
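
The same check is handy from a script, for example before a test suite that depends on the local model. Here is a minimal sketch using only the standard library; the function name and the 2-second timeout are my own choices.

```python
import urllib.error
import urllib.request


def is_ollama_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if the server root answers with the 'Ollama is running' banner."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and "Ollama is running" in body
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure: treat all of them as "not running".
        return False


if __name__ == "__main__":
    print("Ollama reachable:", is_ollama_running())
```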

Your First Model

Download a model:

ollama pull llama3.1

This grabs the 8B variant by default (~4.7GB). Chat with it:

ollama run llama3.1

That's it. Local LLM running. No API keys, no accounts, no billing.

Which Model Should You Use?

I've tried a lot of models. Here are my go-tos as of mid-2026:

For coding tasks:

ollama pull qwen2.5-coder:7b

Best code model in the 7B class, hands down. Handles TypeScript, Python, Go, and Rust well.

For code completion/inline suggestions:

ollama pull codestral:22b

Mistral's 22B hits a sweet spot: significantly better than 7B, and still runnable on 16GB with quantization, though it is a tight fit with other apps open.

For general chat and reasoning:

ollama pull llama3.1:70b

If you've got the RAM, 70B is remarkably capable. Closest you'll get to cloud quality locally.

List what you have:

ollama list
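
The same information is available over HTTP from the /api/tags endpoint, which is useful if a script needs to know what is installed. Below is a small sketch that formats that response. The helper names are mine, and the sample payload uses made-up sizes purely for illustration.

```python
import json
import urllib.request


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of locally installed models (needs a running server)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


def summarize_models(tags_payload: dict) -> list[str]:
    """Turn an /api/tags response into 'name  size' lines, largest model first."""
    models = sorted(tags_payload.get("models", []), key=lambda m: m.get("size", 0), reverse=True)
    return [f"{m.get('name', '?'):<24} {m.get('size', 0) / 1e9:6.1f} GB" for m in models]


if __name__ == "__main__":
    # Illustrative payload; swap in fetch_tags() when the server is running.
    sample = {
        "models": [
            {"name": "llama3.1:latest", "size": 4_700_000_000},
            {"name": "llama3.1:70b", "size": 40_000_000_000},
        ]
    }
    for line in summarize_models(sample):
        print(line)
```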

Using the API

This is where Ollama really shines for developers. The server at localhost:11434 exposes a native REST API under /api and an OpenAI-compatible one under /v1, which means most existing tools and libraries work: just change the base URL.

Basic API Call

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Write a TypeScript function that debounces any async function",
  "stream": false
}'
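
The call above sets "stream": false to get one JSON object back. Leave it out and /api/generate streams newline-delimited JSON instead, one object per chunk, each with a "response" fragment and a "done" flag. Here is a sketch of stitching those chunks together; the function name is mine, and the sample lines stand in for a real HTTP response body.

```python
import json
from typing import Iterable, Union


def collect_stream(lines: Iterable[Union[bytes, str]]) -> str:
    """Join the 'response' fragments of a streamed /api/generate reply."""
    parts = []
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk carries stats, not more text
    return "".join(parts)


if __name__ == "__main__":
    sample = [
        b'{"response": "Hello", "done": false}\n',
        b'{"response": ", world", "done": false}\n',
        b'{"response": "", "done": true}\n',
    ]
    print(collect_stream(sample))
```

With a real request you would iterate over the response body line by line and print each fragment as it arrives, which is what makes local models feel responsive even when total generation time is long.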

OpenAI-Compatible Chat

curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "qwen2.5-coder:7b",
  "messages": [
    {"role": "system", "content": "You are a senior TypeScript developer."},
    {"role": "user", "content": "Write a retry wrapper with exponential backoff."}
  ]
}'

Node.js

const response = await fetch("http://localhost:11434/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "qwen2.5-coder:7b",
    messages: [
      { role: "user", content: "Explain this error: Cannot read properties of undefined" }
    ],
    temperature: 0.2,
  }),
});

const data = await response.json();
console.log(data.choices[0].message.content);

Python

import requests

response = requests.post("http://localhost:11434/v1/chat/completions", json={
    "model": "llama3.1",
    "messages": [
        {"role": "user", "content": "Generate a SQL query to find duplicate emails in a users table"}
    ],
    "temperature": 0.1,
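}, timeout=120)  # local generation can be slow; fail instead of hanging forever
response.raise_for_status()  # surface HTTP errors such as an unknown model name

print(response.json()["choices"][0]["message"]["content"])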

Because the API is OpenAI-compatible, you can use the official OpenAI SDK by pointing it at localhost:

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama", // required by the SDK but not used
});

const completion = await client.chat.completions.create({
  model: "qwen2.5-coder:7b",
  messages: [{ role: "user", content: "Refactor this function to use async/await" }],
});

console.log(completion.choices[0].message.content);

Integrating Into Your Workflow

VS Code

The Continue extension is the most mature Ollama integration for VS Code — tab completion, inline chat, context-aware generation, all powered by your local model. Set the provider to Ollama, pick your model, and you've got a local Copilot alternative.

Docker

If you want to keep your dev environment clean or run Ollama on a shared server:

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:

Drop the deploy block if you don't have an NVIDIA GPU (GPU passthrough also requires the NVIDIA Container Toolkit on the host). Same API endpoint at localhost:11434.

AI Coding Tools

Many AI code assistants now support local model backends. This is great when you want AI assistance but can't send code to external APIs — common in enterprise environments or when working with client projects.

Custom Models with Modelfiles

Ollama's Modelfile lets you create custom configurations — like a Dockerfile for LLMs. Lock in system prompts, tweak parameters, build task-specific models.

FROM qwen2.5-coder:7b

PARAMETER temperature 0.1
PARAMETER top_p 0.9
PARAMETER num_ctx 8192

SYSTEM """
You are a senior full-stack developer. When asked to write code:
- Use TypeScript by default
- Include error handling
- Add brief comments for non-obvious logic
- Prefer functional patterns over classes
"""

Build and run:

ollama create my-coder -f Modelfile
ollama run my-coder

I keep a few of these for different tasks — one for code review, one for writing tests, one for docs. Each has baked-in context so every prompt starts from a useful baseline.
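
If you maintain several of these, generating them from one place keeps the parameters consistent. This is a minimal sketch of that idea. The profile names and system prompts are hypothetical examples, and the helper only emits the FROM, PARAMETER, and SYSTEM instructions used above.

```python
def render_modelfile(base: str, system: str, **params) -> str:
    """Render Modelfile text with FROM, PARAMETER and SYSTEM instructions.

    Assumes the system prompt contains no triple quotes of its own.
    """
    lines = [f"FROM {base}", ""]
    lines += [f"PARAMETER {name} {value}" for name, value in params.items()]
    lines += ["", 'SYSTEM """', system.strip(), '"""', ""]
    return "\n".join(lines)


# Hypothetical task profiles: model name -> system prompt.
PROFILES = {
    "my-reviewer": "You are a strict code reviewer. Point out bugs before style issues.",
    "my-test-writer": "You write unit tests. Cover edge cases and error paths.",
}

if __name__ == "__main__":
    for name, system in PROFILES.items():
        text = render_modelfile("qwen2.5-coder:7b", system, temperature=0.1, num_ctx=8192)
        print(f"# --- {name} ---")
        print(text)
```

Write each result to its own file and pass it to ollama create with -f, exactly as in the build step above.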

Common Issues and Fixes

System becomes unresponsive: You're running a model too large for your RAM. The system starts swapping and everything grinds to a halt. Drop to a smaller model or a more aggressively quantized tag (for example a q4_0 variant; exact tag names differ per model, so check the model's tag list on ollama.com).

Context window too small: The default context window is small (2048 tokens in many Ollama releases, somewhat larger in newer ones), which isn't enough for most code tasks. Set it per-request:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "...",
  "options": { "num_ctx": 8192 }
}'

Or bake it into a Modelfile.
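
A bigger context window costs memory, so it helps to size it to the prompt instead of always asking for the maximum. The sketch below picks num_ctx from the prompt length. The 4-characters-per-token ratio is a rough heuristic of mine and not a real tokenizer, and the floor, ceiling, and reply budget are arbitrary defaults you should tune.

```python
def pick_num_ctx(
    prompt: str,
    reply_tokens: int = 1024,      # assumption: room reserved for the answer
    chars_per_token: float = 4.0,  # assumption: crude average for English and code
    floor: int = 2048,
    ceiling: int = 32768,
) -> int:
    """Smallest power-of-two window (from floor up to ceiling) that fits prompt + reply."""
    needed = int(len(prompt) / chars_per_token) + reply_tokens
    ctx = floor
    while ctx < needed and ctx < ceiling:
        ctx *= 2
    return ctx


def build_generate_request(model: str, prompt: str) -> dict:
    """Request body for /api/generate with a context window sized to the prompt."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": pick_num_ctx(prompt)},
    }


if __name__ == "__main__":
    body = build_generate_request("llama3.1", "Summarize this file:\n" + "x = 1\n" * 3000)
    print(body["options"])
```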

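Port conflicts: If something else already uses 11434, start the server on another port. Binding to 127.0.0.1 keeps it reachable only from your own machine, whereas 0.0.0.0 would expose it to the whole network:

OLLAMA_HOST=127.0.0.1:11435 ollama serve

Set the same OLLAMA_HOST for the ollama CLI, and update the base URL in your API clients.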

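Disk space: Models live in ~/.ollama/models by default (on Linux installs that run as a systemd service, under /usr/share/ollama/.ollama/models). A 70B model takes 40GB+. Clean up unused ones: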

ollama rm llama3.1:70b
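
To see how much space the model store is using before you start deleting things, you can total the files under the models directory. This is a small sketch that assumes the default ~/.ollama/models location mentioned above; adjust the path if your install stores models elsewhere.

```python
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Total size of all regular files below root."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    models_dir = Path.home() / ".ollama" / "models"  # assumption: default location
    if models_dir.exists():
        print(f"{models_dir}: {dir_size_bytes(models_dir) / 1e9:.1f} GB")
    else:
        print(f"{models_dir} not found; is Ollama installed for this user?")
```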

When to Use Local vs Cloud

Local models aren't a cloud replacement — they're complementary. Here's how I split it:

Local: code completion, boilerplate, commit messages, simple refactoring, structured output, anything involving sensitive code.

Cloud: complex multi-step reasoning, large codebase analysis, nuanced code review, tasks where quality matters more than speed or cost.

Start with local, reach for cloud when the output isn't good enough. You'll be surprised how often local handles it fine.
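
That local-first habit can also live in code. The sketch below illustrates the routing idea only. The local, cloud, and good_enough callables are placeholders you would supply yourself, and the allow_cloud switch is my addition so that sensitive prompts never leave the machine.

```python
from typing import Callable, Tuple


def answer_local_first(
    prompt: str,
    local: Callable[[str], str],
    cloud: Callable[[str], str],
    good_enough: Callable[[str], bool],
    allow_cloud: bool = True,
) -> Tuple[str, str]:
    """Try the local model first; fall back to cloud only if allowed and needed.

    Returns (source, answer), where source is "local" or "cloud".
    """
    draft = None
    try:
        draft = local(prompt)
    except OSError:
        # Local server down or timed out. Without cloud there is nothing to fall back to.
        if not allow_cloud:
            raise
    if draft is not None and (good_enough(draft) or not allow_cloud):
        return "local", draft
    return "cloud", cloud(prompt)


if __name__ == "__main__":
    # Stand-in callables so the sketch runs without any server.
    fake_local = lambda p: "short answer"
    fake_cloud = lambda p: "a longer, more careful answer"
    long_enough = lambda text: len(text) > 20  # toy quality check

    print(answer_local_first("Explain closures", fake_local, fake_cloud, long_enough))
    print(answer_local_first("Explain closures", fake_local, fake_cloud, long_enough, allow_cloud=False))
```

The quality check is the hard part in practice. Simple checks such as "does the output parse as JSON" or "do the generated tests run" tend to work better than trying to score prose.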

Wrapping Up

Ollama has made local LLM inference genuinely practical. Five-minute setup, OpenAI-compatible API, and Apple Silicon performance that's good enough for real-time use. Start with qwen2.5-coder:7b if you have 16GB RAM, or llama3.1 for general tasks.

The ecosystem moves fast — new models drop monthly with meaningful quality improvements. Once Ollama's running, staying current is just ollama pull model-name away.
