docker.recipes

vLLM

intermediate

High-throughput LLM serving with PagedAttention.

Overview

vLLM is a high-performance inference engine developed at UC Berkeley that tackles the memory bottlenecks of transformer-based LLM serving through its PagedAttention algorithm. PagedAttention manages attention key-value tensors with virtual-memory-style paging: keys and values are stored in non-contiguous memory blocks, so large contiguous buffers no longer need to be pre-allocated per request. This sharply reduces memory fragmentation, allows larger batch sizes, and delivers up to 24x higher serving throughput than Hugging Face Transformers while producing identical outputs.

vLLM runs as an OpenAI-compatible API server and can serve most transformer-based language models from the Hugging Face Hub with minimal configuration changes.

This configuration is ideal for AI researchers, ML engineers, and organizations that need high-performance LLM serving infrastructure. The setup scales from small 7B-parameter models on a single GPU to large models that require tensor parallelism across multiple GPUs. With built-in support for quantization techniques such as AWQ, GPTQ, and SqueezeLLM, vLLM can deploy large models even on hardware with limited VRAM while maintaining strong performance.
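
As a quick illustration of the OpenAI-compatible interface, the request below sends a chat completion to a locally running vLLM container. It assumes the compose stack from this recipe is up on port 8000 and that MODEL_NAME is the Llama-2 chat model from the .env template; substitute whatever model you actually serve.

terminal
# Example chat completion against the local vLLM server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'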

Key Features

  • PagedAttention algorithm delivering up to 24x higher throughput than standard transformer implementations
  • OpenAI-compatible API endpoints supporting chat completions, completions, and embeddings
  • Tensor parallelism support for distributing large models across multiple GPUs using --tensor-parallel-size (see the compose snippet after this list)
  • Integrated quantization support for AWQ, GPTQ, and SqueezeLLM to reduce VRAM requirements
  • Dynamic batching with continuous batching optimization for improved request handling efficiency
  • Native Hugging Face Hub integration with automatic model downloading and caching
  • GPU memory optimization through non-contiguous KV cache management eliminating fragmentation
  • Support for popular model architectures including LLaMA, Mistral, CodeLlama, and Vicuna
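
The parallelism and quantization flags above are plain command-line arguments to the container. As an illustrative sketch (values are examples, not recommendations, and --quantization awq requires AWQ-quantized weights), a two-GPU deployment could amend the recipe's command like this:

docker-compose.yml
services:
  vllm:
    # Example only: split an AWQ-quantized model across two GPUs
    command: --model ${MODEL_NAME} --port 8000 --tensor-parallel-size 2 --quantization awq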

Common Use Cases

  • High-throughput chatbot backends requiring concurrent handling of hundreds of user requests
  • Code generation services needing fast response times for developer IDE integrations
  • Research environments comparing multiple LLM outputs with consistent serving infrastructure
  • Production AI applications requiring OpenAI API compatibility without vendor lock-in
  • Multi-tenant SaaS platforms serving different models to various customer segments
  • Real-time content generation systems for marketing, writing assistance, or creative applications
  • Edge AI deployments where memory efficiency is critical for running large models on limited hardware

Prerequisites

  • NVIDIA GPU with CUDA support and sufficient VRAM for your target model (a 7B model typically needs about 16GB in fp16; roughly 8GB is workable with 4-bit quantization such as AWQ)
  • Docker with NVIDIA Container Runtime installed and configured for GPU access (a quick sanity check is shown after this list)
  • Hugging Face account and token for accessing gated models like LLaMA-2 or Code Llama
  • At least 32GB system RAM recommended for optimal performance with larger models
  • Understanding of transformer model architectures and parameter counts for appropriate resource allocation
  • Familiarity with OpenAI API format for integration with existing applications
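
Before pulling the vLLM image, it is worth confirming that containers can actually reach the GPU. A minimal sanity check, assuming the NVIDIA Container Toolkit is installed on the host:

terminal
# Driver visible on the host?
nvidia-smi

# Can Docker pass the GPU into a container?
docker run --rm --gpus all ubuntu nvidia-smi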

For development & testing. Review security settings, change default credentials, and test thoroughly before production use.

docker-compose.yml

docker-compose.yml
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
    restart: unless-stopped
    command: --model ${MODEL_NAME} --port 8000
    environment:
      HUGGING_FACE_HUB_TOKEN: ${HF_TOKEN}
    volumes:
      - vllm_cache:/root/.cache/huggingface
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  vllm_cache:

.env Template

.env
MODEL_NAME=meta-llama/Llama-2-7b-chat-hf
HF_TOKEN=your_huggingface_token

Usage Notes

  1. Docs: https://docs.vllm.ai/
  2. OpenAI-compatible API at http://localhost:8000/v1/chat/completions (a quick check is shown after this list)
  3. PagedAttention enables up to 24x higher throughput than HF Transformers
  4. Tensor parallelism: --tensor-parallel-size N for multi-GPU
  5. Supports AWQ, GPTQ, SqueezeLLM quantization for reduced VRAM
  6. HF token required for gated models like Llama-2
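
Once the logs show the model has finished loading, a quick way to confirm the server is responding is to list the models it serves; this assumes the default port mapping from the compose file:

terminal
# List the models currently being served
curl http://localhost:8000/v1/models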

Quick Start

terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
    restart: unless-stopped
    command: --model ${MODEL_NAME} --port 8000
    environment:
      HUGGING_FACE_HUB_TOKEN: ${HF_TOKEN}
    volumes:
      - vllm_cache:/root/.cache/huggingface
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  vllm_cache:
EOF

# 2. Create the .env file
cat > .env << 'EOF'
MODEL_NAME=meta-llama/Llama-2-7b-chat-hf
HF_TOKEN=your_huggingface_token
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f

One-Liner

Run this command to download and set up the recipe in one step:

terminal
curl -fsSL https://docker.recipes/api/recipes/vllm/run | bash

Troubleshooting

  • CUDA out of memory errors: Reduce model size, enable quantization with --quantization awq, or increase tensor parallelism across multiple GPUs
  • Model download failures: Verify HF_TOKEN environment variable is set correctly and has access to the requested model repository
  • Slow inference despite high-end hardware: Check GPU utilization and consider adjusting the --max-num-batched-tokens or --max-num-seqs parameters (a compose example with these flags follows this list)
  • Container fails to start with GPU detection errors: Ensure nvidia-container-runtime is properly installed and the Docker daemon has been restarted
  • OpenAI API compatibility issues: Verify client libraries support the /v1/chat/completions endpoint format and adjust request formatting
  • High memory usage during model loading: Use --load-format pt for faster loading or --download-dir to specify custom model cache location
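
To make the tuning advice above concrete, here is a sketch of how such flags slot into the recipe's compose command. The values are illustrative starting points only, and --gpu-memory-utilization is an additional vLLM option (not mentioned above) that caps how much VRAM the engine reserves:

docker-compose.yml
services:
  vllm:
    # Illustrative tuning values, not recommendations: adjust to your model and GPU
    command: >
      --model ${MODEL_NAME} --port 8000
      --gpu-memory-utilization 0.90
      --max-num-seqs 64
      --max-num-batched-tokens 8192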
