LocalAI API Server
Self-hosted OpenAI-compatible API for running LLMs, image generation, and audio transcription locally with GPU acceleration.
Overview
LocalAI is an open-source, self-hosted alternative to OpenAI's API that enables organizations and developers to run large language models, image generation, and audio transcription services entirely on their own infrastructure. Originally created to address privacy concerns and cost limitations of cloud-based AI services, LocalAI provides complete OpenAI API compatibility while supporting multiple model formats including GGUF, ONNX, and PyTorch models from sources like Hugging Face and the LocalAI model gallery.
This LocalAI deployment uses NVIDIA GPU acceleration through CUDA 12 support, which can cut inference times for larger language models from minutes to seconds and makes real-time use practical. The containerized setup eliminates complex dependency management while providing persistent model storage and configurable resource allocation.
This configuration is ideal for organizations requiring AI capabilities without external dependencies, privacy-conscious developers, researchers needing customizable AI infrastructure, and teams building AI-powered applications that demand predictable costs and latency. The OpenAI-compatible endpoints mean existing applications can switch to LocalAI with minimal code changes, making it valuable for both new projects and migrating from commercial AI services.
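Because the endpoints mirror OpenAI's, pointing an existing OpenAI-SDK application at a local instance is often just a configuration change. The sketch below is a minimal illustration using environment variables; the variable names (OPENAI_BASE_URL, OPENAI_API_KEY) are those recognized by recent official OpenAI SDKs and may differ for other clients, and LocalAI typically accepts any key value unless API-key authentication has been configured.
terminal
# Point an OpenAI-SDK-based app at the local server instead of api.openai.com
export OPENAI_BASE_URL="http://localhost:8080/v1"
export OPENAI_API_KEY="sk-local-anything"   # placeholder; not validated unless auth is enabled

# Quick smoke test against the same base URL the SDK will use
curl -s http://localhost:8080/v1/models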
Key Features
- Complete OpenAI API compatibility with /v1/chat/completions, /v1/embeddings, and /v1/audio endpoints (example requests follow this list)
- NVIDIA GPU acceleration with CUDA 12 support for dramatically improved inference performance
- Multi-format model support including GGUF, ONNX, PyTorch, and direct Hugging Face model loading
- Built-in model gallery with curated, optimized models for various use cases and languages
- Dynamic model loading and unloading to optimize GPU memory usage across multiple models
- Real-time audio transcription with Whisper model integration and streaming capabilities
- Image generation support with Stable Diffusion models and customizable generation parameters
- Thread-based processing configuration for optimal CPU utilization alongside GPU acceleration
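To make the endpoint compatibility concrete, here is a sketch of two typical requests; the model names (your-llm, your-whisper-model) and sample.wav are placeholders for whatever you have installed, and the request bodies follow the OpenAI API shapes that LocalAI mirrors.
terminal
# Chat completion against a locally installed model (name is a placeholder)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "your-llm",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "temperature": 0.7
      }'

# Audio transcription with an installed Whisper-family model (names are placeholders)
curl http://localhost:8080/v1/audio/transcriptions \
  -F file=@sample.wav \
  -F model=your-whisper-model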
Common Use Cases
- Private AI chatbot development for customer support without sending data to external APIs
- Local code completion and documentation generation for development teams in air-gapped environments
- Research institutions running custom fine-tuned models for specialized domains like medical or legal text
- Content creation workflows requiring consistent AI-generated text, images, and audio transcription
- Edge AI deployments in manufacturing or healthcare where low latency and data privacy are critical
- Cost optimization for high-volume AI applications currently using expensive cloud-based APIs
- Multi-tenant AI services where different clients require isolated model serving with custom configurations
Prerequisites
- NVIDIA GPU with at least 8GB VRAM for running medium-sized language models effectively
- NVIDIA Docker runtime and drivers installed on the host system for GPU passthrough (a quick verification snippet follows this list)
- Minimum 16GB system RAM plus additional memory for model caching and processing
- At least 50GB available disk space for storing multiple models and generated content
- Docker Compose v2.0 or higher with GPU support enabled
- Understanding of AI model formats (GGUF, ONNX) and parameter tuning for optimal performance
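Before bringing the stack up, it is worth confirming GPU passthrough actually works. A minimal check, assuming the NVIDIA container toolkit is installed and the CUDA base image tag shown is available, looks like this:
terminal
# Driver visible on the host
nvidia-smi

# GPU reachable from inside a container (image tag is only an example)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi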
For development & testing. Review security settings, change default credentials, and test thoroughly before production use.
docker-compose.yml
docker-compose.yml
services:
  localai:
    image: localai/localai:${LOCALAI_IMAGE:-latest-gpu-nvidia-cuda-12}
    container_name: localai
    restart: unless-stopped
    ports:
      - "${LOCALAI_PORT:-8080}:8080"
    environment:
      - MODELS_PATH=/models
      - DEBUG=${DEBUG:-false}
      - THREADS=${THREADS:-4}
    volumes:
      - localai_models:/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  localai_models:

.env Template
.env
# LocalAI Configuration
LOCALAI_PORT=8080
DEBUG=false
THREADS=4

# Image variants: latest-gpu-nvidia-cuda-12, latest-gpu-nvidia-cuda-11, latest (CPU only)
LOCALAI_IMAGE=latest-gpu-nvidia-cuda-12

Usage Notes
- Access the LocalAI API at http://localhost:8080
- OpenAI-compatible endpoints: /v1/chat/completions, /v1/embeddings, etc.
- Download models: curl http://localhost:8080/models/apply -d '{"url": "model-url"}' (a fuller example follows this list)
- For CPU-only use, change LOCALAI_IMAGE to 'latest'
- Gallery of models at https://localai.io/models/
- Drop GGUF models directly into the models volume
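As a slightly fuller sketch of the model-download note above: the /models/apply call below uses a placeholder URL (the gallery at https://localai.io/models/ lists real entries, and the exact payload format can vary between LocalAI versions), and /v1/models confirms what is currently installed.
terminal
# Ask LocalAI to fetch and install a model (placeholder URL)
curl http://localhost:8080/models/apply \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/path/to/model-definition.yaml"}'

# List installed models through the OpenAI-compatible endpoint
curl http://localhost:8080/v1/models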
Quick Start
terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  localai:
    image: localai/localai:${LOCALAI_IMAGE:-latest-gpu-nvidia-cuda-12}
    container_name: localai
    restart: unless-stopped
    ports:
      - "${LOCALAI_PORT:-8080}:8080"
    environment:
      - MODELS_PATH=/models
      - DEBUG=${DEBUG:-false}
      - THREADS=${THREADS:-4}
    volumes:
      - localai_models:/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  localai_models:
EOF

# 2. Create the .env file
cat > .env << 'EOF'
# LocalAI Configuration
LOCALAI_PORT=8080
DEBUG=false
THREADS=4

# Image variants: latest-gpu-nvidia-cuda-12, latest-gpu-nvidia-cuda-11, latest (CPU only)
LOCALAI_IMAGE=latest-gpu-nvidia-cuda-12
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f
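Once the container is up (the first start of a GPU image can take a while), a quick check like this confirms the API is reachable; it assumes no API-key authentication has been configured.
terminal
# Should return a JSON list of installed models (possibly empty on a fresh install)
curl -s http://localhost:8080/v1/models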
One-Liner
Run this command to download and set up the recipe in one step:
terminal
curl -fsSL https://docker.recipes/api/recipes/localai-stack/run | bash

Troubleshooting
- NVIDIA GPU not detected in container: Verify nvidia-docker2 is installed and Docker daemon restarted after installation
- Models failing to load with CUDA out of memory errors: Switch to a smaller or more heavily quantized model, or reduce how many layers are offloaded to the GPU in the model's configuration; lowering THREADS only affects CPU threads and does not free VRAM
- API returning 404 for model endpoints: Ensure models are properly downloaded to /models volume and model names match exactly
- Slow inference despite GPU acceleration: Check GPU utilization with nvidia-smi and verify the model is actually using the GPU backend (see the snippet after this list)
- Container fails to start with permission denied on models volume: Ensure Docker has write permissions to the mounted volume directory
- Model download failing from gallery: Verify internet connectivity and try downloading models manually using curl commands
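For the slow-inference case above, a rough way to confirm the GPU is actually being used is to watch utilization while sending a request and to look for GPU/CUDA backend messages in the logs; the exact log wording varies between LocalAI versions.
terminal
# Watch GPU utilization while a test request is running
watch -n 1 nvidia-smi

# Look for CUDA/GPU backend initialization messages from the container
docker compose logs localai | grep -iE "cuda|gpu"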