vLLM
High-throughput LLM serving with PagedAttention.
Overview
vLLM is a high-performance inference engine developed at UC Berkeley that speeds up LLM serving with its PagedAttention algorithm. Created to address the memory bottlenecks of transformer-based language model serving, vLLM manages attention key-value tensors in small blocks using virtual-memory-style paging borrowed from operating systems. This enables up to 24x higher serving throughput than serving directly with Hugging Face Transformers while producing identical outputs.
vLLM runs as an OpenAI-compatible API server and can serve most popular transformer architectures from the Hugging Face Hub with minimal configuration. The system manages GPU memory by storing attention keys and values in non-contiguous blocks, eliminating the need to pre-allocate large contiguous memory chunks per request. This sharply reduces memory fragmentation and allows larger batch sizes, which is where most of the throughput gain for LLM inference workloads comes from.
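For example, once the stack from this recipe is running, any OpenAI-style client can talk to it directly. The sketch below assumes the default model from the .env template and the published port 8000; adjust both to match your deployment.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
        "max_tokens": 128
      }'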
This configuration is ideal for AI researchers, ML engineers, and organizations requiring high-performance LLM serving infrastructure. The setup supports everything from small 7B parameter models on single GPUs to massive models requiring tensor parallelism across multiple GPUs. With built-in support for quantization techniques like AWQ, GPTQ, and SqueezeLLM, vLLM enables deployment of large models even on hardware with limited VRAM while maintaining excellent performance characteristics.
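As a rough sketch (the --quantization and --tensor-parallel-size flags are standard vLLM server options, but the model IDs below are only illustrative), larger or quantized models are selected by changing the command line of the vllm service in the compose file:

# Single GPU with an AWQ-quantized model (illustrative model ID):
command: --model TheBloke/Llama-2-13B-chat-AWQ --quantization awq --port 8000

# Or shard a full-precision model across 4 GPUs:
command: --model meta-llama/Llama-2-70b-chat-hf --tensor-parallel-size 4 --port 8000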
Key Features
- PagedAttention algorithm delivering up to 24x higher throughput than standard transformer implementations
- OpenAI-compatible API endpoints supporting chat completions, completions, and embeddings (see the quick check after this list)
- Tensor parallelism support for distributing large models across multiple GPUs using --tensor-parallel-size
- Integrated quantization support for AWQ, GPTQ, and SqueezeLLM to reduce VRAM requirements
- Continuous batching of incoming requests for efficient, high-utilization request handling
- Native Hugging Face Hub integration with automatic model downloading and caching
- GPU memory optimization through non-contiguous KV cache management eliminating fragmentation
- Support for popular model architectures including LLaMA, Mistral, CodeLlama, and Vicuna
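As a quick check once the container is up (see the Quick Start below), listing the served models confirms the API is reachable; the exact JSON shape may vary between vLLM releases.

# The response should include the MODEL_NAME configured in .env
curl http://localhost:8000/v1/models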
Common Use Cases
- High-throughput chatbot backends requiring concurrent handling of hundreds of user requests
- Code generation services needing fast response times for developer IDE integrations
- Research environments comparing multiple LLM outputs with consistent serving infrastructure
- Production AI applications requiring OpenAI API compatibility without vendor lock-in
- Multi-tenant SaaS platforms serving different models to various customer segments
- Real-time content generation systems for marketing, writing assistance, or creative applications
- Edge AI deployments where memory efficiency is critical for running large models on limited hardware
Prerequisites
- NVIDIA GPU with CUDA support and sufficient VRAM for your target model (minimum 8GB for 7B models)
- Docker with NVIDIA Container Runtime installed and configured for GPU access (verification command after this list)
- Hugging Face account and token for accessing gated models like LLaMA-2 or Code Llama
- At least 32GB system RAM recommended for optimal performance with larger models
- Understanding of transformer model architectures and parameter counts for appropriate resource allocation
- Familiarity with OpenAI API format for integration with existing applications
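Before pulling the vLLM image, it is worth confirming that containers can see the GPU at all; the CUDA image tag below is only an example, and any CUDA-enabled image you already have will do.

# Should print the driver version and list your GPUs; if it fails, fix the NVIDIA runtime first
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi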
For development & testing. Review security settings, change default credentials, and test thoroughly before production use.
docker-compose.yml
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
    restart: unless-stopped
    command: --model ${MODEL_NAME} --port 8000
    environment:
      HUGGING_FACE_HUB_TOKEN: ${HF_TOKEN}
    volumes:
      - vllm_cache:/root/.cache/huggingface
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  vllm_cache:

.env Template
.env
MODEL_NAME=meta-llama/Llama-2-7b-chat-hf
HF_TOKEN=your_huggingface_token

Usage Notes
- Docs: https://docs.vllm.ai/
- OpenAI-compatible API at http://localhost:8000/v1/chat/completions (streaming example after these notes)
- PagedAttention enables up to 24x higher throughput than HF Transformers
- Tensor parallelism: --tensor-parallel-size N for multi-GPU
- Supports AWQ, GPTQ, and SqueezeLLM quantization for reduced VRAM
- HF token required for gated models like Llama-2
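Clients that expect OpenAI-style streaming can request it in the usual way; this sketch assumes the default model from the .env template.

# Stream tokens as server-sent events instead of waiting for the full response
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
        "stream": true
      }'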
Quick Start
terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
    restart: unless-stopped
    command: --model ${MODEL_NAME} --port 8000
    environment:
      HUGGING_FACE_HUB_TOKEN: ${HF_TOKEN}
    volumes:
      - vllm_cache:/root/.cache/huggingface
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  vllm_cache:
EOF

# 2. Create the .env file
cat > .env << 'EOF'
MODEL_NAME=meta-llama/Llama-2-7b-chat-hf
HF_TOKEN=your_huggingface_token
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f

One-Liner
Run this command to download and set up the recipe in one step:
terminal
curl -fsSL https://docker.recipes/api/recipes/vllm/run | bash

Troubleshooting
- CUDA out of memory errors: Reduce model size, enable quantization with --quantization awq, or increase tensor parallelism across multiple GPUs
- Model download failures: Verify HF_TOKEN environment variable is set correctly and has access to the requested model repository
- Slow inference despite high-end hardware: Check GPU utilization and consider adjusting the --max-num-batched-tokens or --max-num-seqs parameters (see the tuning sketch after this list)
- Container fails to start with GPU detection errors: Ensure nvidia-container-runtime is properly installed and Docker daemon restarted
- OpenAI API compatibility issues: Verify client libraries support the /v1/chat/completions endpoint format and adjust request formatting
- High memory usage during model loading: Use --load-format pt for faster loading or --download-dir to specify custom model cache location
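As a hedged example of that tuning (these flags exist in recent vLLM releases, but names and defaults can shift between versions), the compose command can be extended with explicit batching and memory limits:

command: >
  --model ${MODEL_NAME} --port 8000
  --gpu-memory-utilization 0.90
  --max-num-seqs 128
  --max-model-len 4096

Lowering --gpu-memory-utilization or --max-model-len is often the quickest remedy for CUDA out-of-memory errors on smaller GPUs.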