HuggingFace Text Generation Inference
High-performance text generation server for LLMs.
Overview
HuggingFace Text Generation Inference (TGI) is a production-grade inference server specifically designed for serving large language models at scale. Developed by Hugging Face, TGI transforms resource-intensive LLMs into high-performance APIs that can handle many concurrent requests through techniques such as continuous batching, tensor parallelism, and quantization. The server supports popular model architectures including Llama, Mistral, CodeLlama, and Falcon while providing OpenAI-compatible endpoints for easy integration.
This Docker deployment creates a GPU-accelerated inference server that automatically downloads and caches your chosen model from the Hugging Face Hub. TGI handles the orchestration of model loading, memory management, and request batching while exposing simple REST endpoints for text generation. The server applies optimization strategies including FlashAttention, custom CUDA kernels, and reduced-precision (fp16/bf16) execution to maximize throughput while minimizing latency.
This configuration is perfect for AI startups, research teams, and enterprises needing to deploy LLMs in production without the complexity of building inference infrastructure from scratch. The setup provides enterprise-grade performance with features like health monitoring, graceful scaling, and support for both chat and completion endpoints, making it suitable for everything from chatbots to code generation services.
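To make the "simple REST endpoints" concrete: once a deployment like the one below is running on localhost:8080, TGI's native generate route accepts a prompt and generation parameters in a single JSON body. A minimal sketch (the prompt and token limit are arbitrary examples):
terminal
# Send a prompt to TGI's native generate endpoint and read back the completion
curl http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"inputs": "What is Docker?", "parameters": {"max_new_tokens": 64}}'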
Key Features
- Continuous batching for maximum GPU utilization and reduced latency per request
- Quantization support including GPTQ, AWQ, and EETQ for memory-efficient inference
- OpenAI-compatible API endpoints supporting both /v1/completions and /v1/chat/completions (see the example request after this list)
- Automatic FlashAttention integration for memory-efficient attention computation
- Tensor parallelism for distributing large models across multiple GPUs
- Built-in token streaming for real-time text generation responses
- Custom CUDA kernels optimized for popular model architectures like Llama and Mistral
- Automatic model downloading and caching from Hugging Face Hub with authentication support
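As an illustration of the OpenAI-compatible and streaming features above, the requests below are a sketch assuming the compose stack from this page is up on port 8080 and the loaded model ships a chat template (instruct/chat models generally do). TGI serves whatever model was passed via --model-id, so the "model" field in the payload is effectively just a label:
terminal
# Non-streaming chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "tgi", "messages": [{"role": "user", "content": "Explain continuous batching in one sentence."}], "max_tokens": 128}'

# Streaming variant: tokens arrive as server-sent events
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "tgi", "messages": [{"role": "user", "content": "Write a haiku about GPUs."}], "stream": true}'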
Common Use Cases
- AI-powered customer support chatbots requiring low-latency responses at scale
- Code generation services for developer tools and IDE integrations
- Content creation platforms needing high-throughput text generation capabilities
- Research environments testing and comparing different LLM architectures
- Enterprise applications requiring on-premises LLM deployment for data privacy
- Educational platforms providing AI tutoring with custom fine-tuned models
- SaaS applications integrating text generation features without external API dependencies
Prerequisites
- NVIDIA GPU with at least 16GB VRAM for 7B parameter models (or 8GB for quantized versions)
- NVIDIA Container Toolkit installed and configured for GPU access in Docker (a quick verification command is sketched after this list)
- Valid Hugging Face account and API token for accessing gated models like Llama or Mistral
- Sufficient disk space for model caching (7B models require ~15GB, 13B models ~25GB)
- Docker Compose version 2.3+ with GPU device support enabled
- Basic understanding of LLM terminology and HTTP API integration
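Before starting the stack, it is worth confirming that containers can actually see the GPU, as mentioned in the Container Toolkit item above. A minimal check (the CUDA image tag is only an example; any CUDA base image will do):
terminal
# Should print the same table as running nvidia-smi on the host;
# if it errors, the NVIDIA Container Toolkit is not wired into Docker yet
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi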
For development & testing. Review security settings, change default credentials, and test thoroughly before production use.
docker-compose.yml
services:
  tgi:
    image: ghcr.io/huggingface/text-generation-inference:latest
    container_name: tgi
    restart: unless-stopped
    command: --model-id ${MODEL_ID}
    environment:
      HUGGING_FACE_HUB_TOKEN: ${HF_TOKEN}
    volumes:
      - tgi_cache:/data
    ports:
      - "8080:80"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  tgi_cache:

.env Template
.env
MODEL_ID=tiiuae/falcon-7b-instruct
HF_TOKEN=your_huggingface_token

Usage Notes
- Docs: https://huggingface.co/docs/text-generation-inference/
- API at http://localhost:8080 - OpenAI-compatible /v1/chat/completions
- HF token required for gated models (Llama, Mistral, etc.)
- Supports continuous batching and quantization (GPTQ, AWQ, EETQ)
- Health check: curl http://localhost:8080/health (a readiness-wait loop is sketched after this list)
- GPU required - a 7B model needs ~16GB VRAM; use a quantized variant to fit in less
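Model loading can take a while on first start, so scripts should wait for the health endpoint before sending traffic. A small sketch using the /health and /info routes mentioned above:
terminal
# Poll until the model has finished loading, then print model metadata
until curl -sf http://localhost:8080/health > /dev/null; do
  echo "waiting for TGI to load the model..."
  sleep 5
done
curl -s http://localhost:8080/info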
Quick Start
terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  tgi:
    image: ghcr.io/huggingface/text-generation-inference:latest
    container_name: tgi
    restart: unless-stopped
    command: --model-id ${MODEL_ID}
    environment:
      HUGGING_FACE_HUB_TOKEN: ${HF_TOKEN}
    volumes:
      - tgi_cache:/data
    ports:
      - "8080:80"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  tgi_cache:
EOF

# 2. Create the .env file
cat > .env << 'EOF'
MODEL_ID=tiiuae/falcon-7b-instruct
HF_TOKEN=your_huggingface_token
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f

One-Liner
Run this command to download and set up the recipe in one step:
terminal
curl -fsSL https://docker.recipes/api/recipes/huggingface-tgi/run | bash

Troubleshooting
- Container exits with 'CUDA out of memory' error: Reduce model size, enable quantization with the --quantize flag (see the override sketch after this list), or use a smaller model variant
- Model download fails with 403 Forbidden: Verify HF_TOKEN is set correctly and that your account has access to the requested model
- GPU not detected during startup: Ensure nvidia-container-toolkit is installed and the Docker daemon has been restarted after installation
- API returns 'Model not loaded' error: Check container logs for download progress; some large models take 10-15 minutes to initialize
- High memory usage during model loading: Add the --max-batch-prefill-tokens parameter to limit how many tokens are prefilled per batch
- Connection refused on port 8080: Wait for the health check to pass at the /health endpoint before sending requests; initialization can take several minutes
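The flags above go on the container command. One way to add them without editing the main file is a Compose override, relying on Compose's default docker-compose.override.yml merging; the flag values here are illustrative (--quantize accepts backends such as eetq, gptq, awq, or bitsandbytes, depending on the checkpoint):
terminal
# Enable on-the-fly EETQ quantization and cap the prefill batch size
cat > docker-compose.override.yml << 'EOF'
services:
  tgi:
    command: --model-id ${MODEL_ID} --quantize eetq --max-batch-prefill-tokens 2048
EOF
docker compose up -d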