vLLM
High-throughput LLM serving with PagedAttention.
Overview
vLLM is a high-performance inference engine developed at UC Berkeley that speeds up LLM serving with its PagedAttention algorithm. Created to address the memory bottlenecks of transformer-based language model serving, vLLM manages attention key-value tensors in small blocks using virtual-memory-style paging borrowed from operating systems. This enables up to 24x higher serving throughput than serving directly with Hugging Face Transformers while producing identical outputs.
vLLM runs as an OpenAI-compatible API server and can serve most popular transformer architectures from the Hugging Face Hub with minimal configuration. The system manages GPU memory by storing attention keys and values in non-contiguous blocks, eliminating the need to pre-allocate large contiguous memory chunks per request. This sharply reduces memory fragmentation and allows larger batch sizes, which is where most of the throughput gain for LLM inference workloads comes from.
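For example, once the stack from this recipe is running, any OpenAI-style client can talk to it directly. The sketch below assumes the default model from the .env template and the published port 8000; adjust both to match your deployment.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
        "max_tokens": 128
      }'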
This configuration is ideal for AI researchers, ML engineers, and organizations requiring high-performance LLM serving infrastructure. The setup supports everything from small 7B parameter models on single GPUs to massive models requiring tensor parallelism across multiple GPUs. With built-in support for quantization techniques like AWQ, GPTQ, and SqueezeLLM, vLLM enables deployment of large models even on hardware with limited VRAM while maintaining excellent performance characteristics.
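As a rough sketch (the --quantization and --tensor-parallel-size flags are standard vLLM server options, but the model IDs below are only illustrative), larger or quantized models are selected by changing the command line of the vllm service in the compose file:

# Single GPU with an AWQ-quantized model (illustrative model ID):
command: --model TheBloke/Llama-2-13B-chat-AWQ --quantization awq --port 8000

# Or shard a full-precision model across 4 GPUs:
command: --model meta-llama/Llama-2-70b-chat-hf --tensor-parallel-size 4 --port 8000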
Key Features
- PagedAttention algorithm delivering up to 24x higher throughput than standard transformer implementations
- OpenAI-compatible API endpoints supporting chat completions, completions, and embeddings (see the quick check after this list)
- Tensor parallelism support for distributing large models across multiple GPUs using --tensor-parallel-size
- Integrated quantization support for AWQ, GPTQ, and SqueezeLLM to reduce VRAM requirements
- Continuous batching of incoming requests for efficient, high-utilization request handling
- Native Hugging Face Hub integration with automatic model downloading and caching
- GPU memory optimization through non-contiguous KV cache management eliminating fragmentation
- Support for popular model architectures including LLaMA, Mistral, CodeLlama, and Vicuna
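As a quick check once the container is up (see the Quick Start below), listing the served models confirms the API is reachable; the exact JSON shape may vary between vLLM releases.

# The response should include the MODEL_NAME configured in .env
curl http://localhost:8000/v1/models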
Common Use Cases
- High-throughput chatbot backends requiring concurrent handling of hundreds of user requests
- Code generation services needing fast response times for developer IDE integrations
- Research environments comparing multiple LLM outputs with consistent serving infrastructure
- Production AI applications requiring OpenAI API compatibility without vendor lock-in
- Multi-tenant SaaS platforms serving different models to various customer segments
- Real-time content generation systems for marketing, writing assistance, or creative applications
- Edge AI deployments where memory efficiency is critical for running large models on limited hardware
Prerequisites
- NVIDIA GPU with CUDA support and sufficient VRAM for your target model (minimum 8GB for 7B models)
- Docker with NVIDIA Container Runtime installed and configured for GPU access (verification command after this list)
- Hugging Face account and token for accessing gated models like LLaMA-2 or Code Llama
- At least 32GB system RAM recommended for optimal performance with larger models
- Understanding of transformer model architectures and parameter counts for appropriate resource allocation
- Familiarity with OpenAI API format for integration with existing applications
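Before pulling the vLLM image, it is worth confirming that containers can see the GPU at all; the CUDA image tag below is only an example, and any CUDA-enabled image you already have will do.

# Should print the driver version and list your GPUs; if it fails, fix the NVIDIA runtime first
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi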
For development & testing. Review security settings, change default credentials, and test thoroughly before production use.
docker-compose.yml
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
    restart: unless-stopped
    command: --model ${MODEL_NAME} --port 8000
    environment:
      HUGGING_FACE_HUB_TOKEN: ${HF_TOKEN}
    volumes:
      - vllm_cache:/root/.cache/huggingface
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  vllm_cache:

.env Template
.env
MODEL_NAME=meta-llama/Llama-2-7b-chat-hf
HF_TOKEN=your_huggingface_token

Usage Notes
- Docs: https://docs.vllm.ai/
- OpenAI-compatible API at http://localhost:8000/v1/chat/completions (streaming example after these notes)
- PagedAttention enables up to 24x higher throughput than HF Transformers
- Tensor parallelism: --tensor-parallel-size N for multi-GPU
- Supports AWQ, GPTQ, and SqueezeLLM quantization for reduced VRAM
- HF token required for gated models like Llama-2
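Clients that expect OpenAI-style streaming can request it in the usual way; this sketch assumes the default model from the .env template.

# Stream tokens as server-sent events instead of waiting for the full response
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-2-7b-chat-hf",
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
        "stream": true
      }'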
Quick Start
terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
    restart: unless-stopped
    command: --model ${MODEL_NAME} --port 8000
    environment:
      HUGGING_FACE_HUB_TOKEN: ${HF_TOKEN}
    volumes:
      - vllm_cache:/root/.cache/huggingface
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  vllm_cache:
EOF

# 2. Create the .env file
cat > .env << 'EOF'
MODEL_NAME=meta-llama/Llama-2-7b-chat-hf
HF_TOKEN=your_huggingface_token
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f

One-Liner
Run this command to download and set up the recipe in one step:
terminal
curl -fsSL https://docker.recipes/api/recipes/vllm/run | bash

Troubleshooting
- CUDA out of memory errors: Reduce model size, enable quantization with --quantization awq, or increase tensor parallelism across multiple GPUs
- Model download failures: Verify HF_TOKEN environment variable is set correctly and has access to the requested model repository
- Slow inference despite high-end hardware: Check GPU utilization and consider adjusting the --max-num-batched-tokens or --max-num-seqs parameters (see the tuning sketch after this list)
- Container fails to start with GPU detection errors: Ensure nvidia-container-runtime is properly installed and Docker daemon restarted
- OpenAI API compatibility issues: Verify client libraries support the /v1/chat/completions endpoint format and adjust request formatting
- High memory usage during model loading: Use --load-format pt for faster loading or --download-dir to specify custom model cache location
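As a hedged example of that tuning (these flags exist in recent vLLM releases, but names and defaults can shift between versions), the compose command can be extended with explicit batching and memory limits:

command: >
  --model ${MODEL_NAME} --port 8000
  --gpu-memory-utilization 0.90
  --max-num-seqs 128
  --max-model-len 4096

Lowering --gpu-memory-utilization or --max-model-len is often the quickest remedy for CUDA out-of-memory errors on smaller GPUs.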