$docker.recipes
14 min read · Updated December 2025

Running AI/ML Models Locally with Docker Compose

How to run large language models, image generators, and other AI tools on your own hardware using Docker Compose — with GPU passthrough and practical examples.

ai · machine-learning · llm · docker-compose · gpu

01Why Run AI Models Locally?

The AI landscape has changed dramatically. Models that required a data center two years ago now run on consumer hardware. Llama 3, Mistral, Stable Diffusion, and Whisper can all run on a desktop GPU or even a modern CPU. And with Docker Compose, setting them up is as simple as any other self-hosted service.

I started running local AI models in 2024 when I realized I was spending $40/month on API calls for tasks I could handle locally: summarizing documents, generating code suggestions, transcribing meeting notes, and creating images for blog posts. Now I run Ollama for LLMs and Stable Diffusion for images, and the experience is faster (no network latency) and more private (nothing leaves my machine).

This guide covers the practical setup for running AI/ML models with Docker Compose, including GPU passthrough configuration.

02Hardware Requirements

AI model performance depends heavily on your hardware:

For LLMs (text generation):
- CPU-only: Modern CPUs can run 7B parameter models (Llama 3 8B, Mistral 7B) at 5-15 tokens/second. Usable for casual interaction, slow for heavy workloads.
- NVIDIA GPU (8GB+ VRAM): Run 7-13B models at 30-80 tokens/second. An RTX 3060 12GB or RTX 4060 Ti 16GB is the sweet spot.
- NVIDIA GPU (24GB+ VRAM): Run 70B models quantized. An RTX 3090 or RTX 4090 opens up the most capable open models.
- Apple Silicon: M1 Pro/Max/Ultra and M2/M3/M4 chips have unified memory that works well with LLMs. A MacBook Pro with 32GB RAM can run 13B models smoothly.

For Image Generation (Stable Diffusion):
- Minimum: NVIDIA GPU with 6GB VRAM (slow, small images)
- Recommended: NVIDIA GPU with 12GB+ VRAM for 512x512 and above
- Apple Silicon: Works via the MPS backend; slower than NVIDIA but functional

For Speech-to-Text (Whisper):
- CPU: Works but slow. A 1-hour audio file takes 15-30 minutes to transcribe.
- GPU: The same file transcribes in 1-3 minutes.
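A quick way to sanity-check these tiers is to estimate a model's weight footprint: memory ≈ parameters × bits-per-weight / 8, plus roughly 20% overhead for the KV cache and runtime buffers. The 20% figure is a rule of thumb, not an exact number, and the sketch below uses Llama 3 8B at 4-bit quantization purely as an example:

```shell
# Rough VRAM estimate: parameters (billions) x bits-per-weight / 8,
# plus ~20% overhead for KV cache and runtime buffers (rule of thumb).
params_billion=8   # e.g. Llama 3 8B
bits=4             # 4-bit quantization, common for local inference
awk -v p="$params_billion" -v b="$bits" \
  'BEGIN { printf "~%.1f GB VRAM for an %dB model at %d-bit\n", p*b/8*1.2, p, b }'
# → ~4.8 GB VRAM for an 8B model at 4-bit
```

That estimate lines up with the roughly 5GB that Ollama's default 4-bit llama3.1:8b actually uses; swap in 70 and 4 to see why a quantized 70B model wants a 24GB-class card.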

03GPU Passthrough Setup

Docker needs the NVIDIA Container Toolkit to access GPUs. Install it once on your host:
[terminal]
# Install NVIDIA Container Toolkit (Ubuntu/Debian)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt update && sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify GPU is accessible from Docker
docker run --rm --gpus all nvidia/cuda:12.3.1-base-ubuntu22.04 nvidia-smi
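Before wiring GPUs into Compose, a quick pre-flight check on the host can save debugging time. This sketch only reports whether an NVIDIA driver is visible at all; nvidia-smi ships with the driver, so its absence usually means passthrough will fail regardless of the toolkit:

```shell
# Pre-flight: is the NVIDIA driver visible on this host?
if command -v nvidia-smi >/dev/null 2>&1; then
  echo "NVIDIA driver found: $(nvidia-smi --query-gpu=name --format=csv,noheader | head -n1)"
else
  echo "No NVIDIA driver detected; containers will fall back to CPU"
fi
```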

04Running LLMs with Ollama

Ollama is the easiest way to run large language models locally. It manages model downloads, quantization, and provides an OpenAI-compatible API:
[docker-compose.yml]
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "11434:11434"

  # Open WebUI - ChatGPT-like interface for Ollama
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    volumes:
      - webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    ports:
      - "3000:8080"
    depends_on:
      - ollama

volumes:
  ollama_data:
  webui_data:

After starting, pull a model: docker exec ollama ollama pull llama3.1:8b. The 8B model uses about 5GB of VRAM and is excellent for general tasks.
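Since Ollama listens for HTTP on port 11434, the rest of your tooling can talk to it with plain curl. The sketch below just builds and prints a request body for Ollama's /api/generate endpoint; the prompt is a placeholder, and "stream": false asks for a single JSON response instead of a token stream:

```shell
# Request body for Ollama's /api/generate endpoint.
body='{"model": "llama3.1:8b", "prompt": "Summarize Docker Compose in one sentence.", "stream": false}'
echo "$body"
```

With the stack above running, send it with `curl -s http://localhost:11434/api/generate -d "$body"`. Ollama also exposes an OpenAI-compatible endpoint at /v1/chat/completions for tools that expect the OpenAI API shape.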

05Practical Use Cases I Actually Use

Beyond chatting with an AI, here are the local AI workflows that have become part of my daily routine:

Code review assistant: I pipe git diffs to Ollama for code review suggestions. It catches obvious bugs, suggests naming improvements, and flags potential security issues. Not as good as a human reviewer, but great for solo projects.

Document summarization: Feed long PDFs or articles to a local LLM and get concise summaries. I use this for research papers and lengthy documentation.

Meeting transcription: Whisper transcribes meeting recordings to text, then I feed the transcript to an LLM for a summary with action items. The entire pipeline runs locally — no sensitive meeting content ever leaves my machine.

Image generation: Stable Diffusion for blog post illustrations, social media images, and design mockups. The quality of SDXL models is remarkable for free, local generation.

Browse our ai-ml category for Docker Compose configurations of Ollama, Open WebUI, Stable Diffusion, Whisper, and other AI tools. Each configuration includes GPU passthrough settings and recommended model sizes for different hardware tiers.
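The code-review workflow fits in a small shell function. This is a sketch rather than a turnkey tool: the `review` name and prompt wording are my own, and it assumes the ollama container from section 04 is running with llama3.1:8b pulled. Ollama's CLI appends piped stdin to the prompt, so the diff lands after the instruction:

```shell
# Hypothetical helper: pipe a git diff into a local model for review.
review() {
  git diff "$@" | docker exec -i ollama \
    ollama run llama3.1:8b \
    "Review this diff. Flag bugs, naming problems, and security issues:"
}
```

Call it as `review --staged` before committing, or `review HEAD~1` to review everything changed since the previous commit.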

About the Author

Frank Pegasus

DevOps engineer and self-hosting enthusiast with over a decade of experience running containerized workloads in production. Creator of docker.recipes.