LocalAI API Server
Self-hosted OpenAI-compatible API for running LLMs, image generation, and audio transcription locally with GPU acceleration.
Overview
LocalAI is an open-source, self-hosted alternative to OpenAI's API that enables organizations and developers to run large language models, image generation, and audio transcription services entirely on their own infrastructure. Originally created to address privacy concerns and cost limitations of cloud-based AI services, LocalAI provides complete OpenAI API compatibility while supporting multiple model formats including GGUF, ONNX, and PyTorch models from sources like Hugging Face and the LocalAI model gallery.
This LocalAI deployment uses NVIDIA GPU acceleration through CUDA 12 support, which can cut inference times for larger language models from minutes to seconds and makes real-time use practical. The containerized setup eliminates complex dependency management while providing persistent model storage and configurable resource allocation.
This configuration is ideal for organizations requiring AI capabilities without external dependencies, privacy-conscious developers, researchers needing customizable AI infrastructure, and teams building AI-powered applications that demand predictable costs and latency. The OpenAI-compatible endpoints mean existing applications can switch to LocalAI with minimal code changes, making it valuable for both new projects and migrating from commercial AI services.
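Because the endpoints mirror OpenAI's, pointing an existing OpenAI-SDK application at a local instance is often just a configuration change. The sketch below is a minimal illustration using environment variables; the variable names (OPENAI_BASE_URL, OPENAI_API_KEY) are those recognized by recent official OpenAI SDKs and may differ for other clients, and LocalAI typically accepts any key value unless API-key authentication has been configured.
terminal
# Point an OpenAI-SDK-based app at the local server instead of api.openai.com
export OPENAI_BASE_URL="http://localhost:8080/v1"
export OPENAI_API_KEY="sk-local-anything"   # placeholder; not validated unless auth is enabled

# Quick smoke test against the same base URL the SDK will use
curl -s http://localhost:8080/v1/models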
Key Features
- Complete OpenAI API compatibility with /v1/chat/completions, /v1/embeddings, and /v1/audio endpoints (example requests follow this list)
- NVIDIA GPU acceleration with CUDA 12 support for dramatically improved inference performance
- Multi-format model support including GGUF, ONNX, PyTorch, and direct Hugging Face model loading
- Built-in model gallery with curated, optimized models for various use cases and languages
- Dynamic model loading and unloading to optimize GPU memory usage across multiple models
- Real-time audio transcription with Whisper model integration and streaming capabilities
- Image generation support with Stable Diffusion models and customizable generation parameters
- Thread-based processing configuration for optimal CPU utilization alongside GPU acceleration
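To make the endpoint compatibility concrete, here is a sketch of two typical requests; the model names (your-llm, your-whisper-model) and sample.wav are placeholders for whatever you have installed, and the request bodies follow the OpenAI API shapes that LocalAI mirrors.
terminal
# Chat completion against a locally installed model (name is a placeholder)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "your-llm",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "temperature": 0.7
      }'

# Audio transcription with an installed Whisper-family model (names are placeholders)
curl http://localhost:8080/v1/audio/transcriptions \
  -F file=@sample.wav \
  -F model=your-whisper-model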
Common Use Cases
- Private AI chatbot development for customer support without sending data to external APIs
- Local code completion and documentation generation for development teams in air-gapped environments
- Research institutions running custom fine-tuned models for specialized domains like medical or legal text
- Content creation workflows requiring consistent AI-generated text, images, and audio transcription
- Edge AI deployments in manufacturing or healthcare where low latency and data privacy are critical
- Cost optimization for high-volume AI applications currently using expensive cloud-based APIs
- Multi-tenant AI services where different clients require isolated model serving with custom configurations
Prerequisites
- NVIDIA GPU with at least 8GB VRAM for running medium-sized language models effectively
- NVIDIA Docker runtime and drivers installed on the host system for GPU passthrough (a quick verification snippet follows this list)
- Minimum 16GB system RAM plus additional memory for model caching and processing
- At least 50GB available disk space for storing multiple models and generated content
- Docker Compose v2.0 or higher with GPU support enabled
- Understanding of AI model formats (GGUF, ONNX) and parameter tuning for optimal performance
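Before bringing the stack up, it is worth confirming GPU passthrough actually works. A minimal check, assuming the NVIDIA container toolkit is installed and the CUDA base image tag shown is available, looks like this:
terminal
# Driver visible on the host
nvidia-smi

# GPU reachable from inside a container (image tag is only an example)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi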
For development & testing. Review security settings, change default credentials, and test thoroughly before production use.
docker-compose.yml
docker-compose.yml
services:
  localai:
    image: localai/localai:${LOCALAI_IMAGE:-latest-gpu-nvidia-cuda-12}
    container_name: localai
    restart: unless-stopped
    ports:
      - "${LOCALAI_PORT:-8080}:8080"
    environment:
      - MODELS_PATH=/models
      - DEBUG=${DEBUG:-false}
      - THREADS=${THREADS:-4}
    volumes:
      - localai_models:/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  localai_models:

.env Template
.env
# LocalAI Configuration
LOCALAI_PORT=8080
DEBUG=false
THREADS=4

# Image variants: latest-gpu-nvidia-cuda-12, latest-gpu-nvidia-cuda-11, latest (CPU only)
LOCALAI_IMAGE=latest-gpu-nvidia-cuda-12

Usage Notes
- Access the LocalAI API at http://localhost:8080
- OpenAI-compatible endpoints: /v1/chat/completions, /v1/embeddings, etc.
- Download models: curl http://localhost:8080/models/apply -d '{"url": "model-url"}' (a fuller example follows this list)
- For CPU-only use, change LOCALAI_IMAGE to 'latest'
- Gallery of models at https://localai.io/models/
- Drop GGUF models directly into the models volume
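As a slightly fuller sketch of the model-download note above: the /models/apply call below uses a placeholder URL (the gallery at https://localai.io/models/ lists real entries, and the exact payload format can vary between LocalAI versions), and /v1/models confirms what is currently installed.
terminal
# Ask LocalAI to fetch and install a model (placeholder URL)
curl http://localhost:8080/models/apply \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/path/to/model-definition.yaml"}'

# List installed models through the OpenAI-compatible endpoint
curl http://localhost:8080/v1/models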
Quick Start
terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  localai:
    image: localai/localai:${LOCALAI_IMAGE:-latest-gpu-nvidia-cuda-12}
    container_name: localai
    restart: unless-stopped
    ports:
      - "${LOCALAI_PORT:-8080}:8080"
    environment:
      - MODELS_PATH=/models
      - DEBUG=${DEBUG:-false}
      - THREADS=${THREADS:-4}
    volumes:
      - localai_models:/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  localai_models:
EOF

# 2. Create the .env file
cat > .env << 'EOF'
# LocalAI Configuration
LOCALAI_PORT=8080
DEBUG=false
THREADS=4

# Image variants: latest-gpu-nvidia-cuda-12, latest-gpu-nvidia-cuda-11, latest (CPU only)
LOCALAI_IMAGE=latest-gpu-nvidia-cuda-12
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f
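Once the container is up (the first start of a GPU image can take a while), a quick check like this confirms the API is reachable; it assumes no API-key authentication has been configured.
terminal
# Should return a JSON list of installed models (possibly empty on a fresh install)
curl -s http://localhost:8080/v1/models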
One-Liner
Run this command to download and set up the recipe in one step:
terminal
curl -fsSL https://docker.recipes/api/recipes/localai-stack/run | bash

Troubleshooting
- NVIDIA GPU not detected in container: Verify nvidia-docker2 is installed and Docker daemon restarted after installation
- Models failing to load with CUDA out of memory errors: Switch to a smaller or more heavily quantized model, or reduce how many layers are offloaded to the GPU in the model's configuration; lowering THREADS only affects CPU threads and does not free VRAM
- API returning 404 for model endpoints: Ensure models are properly downloaded to /models volume and model names match exactly
- Slow inference despite GPU acceleration: Check GPU utilization with nvidia-smi and verify the model is actually using the GPU backend (see the snippet after this list)
- Container fails to start with permission denied on models volume: Ensure Docker has write permissions to the mounted volume directory
- Model download failing from gallery: Verify internet connectivity and try downloading models manually using curl commands
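For the slow-inference case above, a rough way to confirm the GPU is actually being used is to watch utilization while sending a request and to look for GPU/CUDA backend messages in the logs; the exact log wording varies between LocalAI versions.
terminal
# Watch GPU utilization while a test request is running
watch -n 1 nvidia-smi

# Look for CUDA/GPU backend initialization messages from the container
docker compose logs localai | grep -iE "cuda|gpu"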