docker.recipes

HuggingFace Text Generation Inference

intermediate

High-performance text generation server for LLMs.

Overview

HuggingFace Text Generation Inference (TGI) is a production-grade inference server designed for serving large language models at scale. Developed by Hugging Face, TGI turns resource-intensive LLMs into high-performance APIs that can handle thousands of concurrent requests through techniques such as continuous batching, tensor parallelism, and quantization. The server supports popular model architectures including Llama, Mistral, CodeLlama, and Falcon, and provides OpenAI-compatible endpoints for easy integration.

This Docker deployment creates a GPU-accelerated inference server that automatically downloads and caches your chosen model from the Hugging Face Hub. TGI handles the orchestration of model loading, memory management, and request batching while exposing simple REST endpoints for text generation. It also applies optimizations such as FlashAttention, custom CUDA kernels, and automatic mixed precision to maximize throughput while minimizing latency.

This configuration suits AI startups, research teams, and enterprises that need to deploy LLMs in production without building inference infrastructure from scratch. The setup provides health monitoring, graceful scaling, and support for both chat and completion endpoints, making it suitable for everything from chatbots to code generation services.

Key Features

  • Continuous batching for maximum GPU utilization and reduced latency per request
  • Quantization support including GPTQ, AWQ, and EETQ for memory-efficient inference
  • OpenAI-compatible API endpoints supporting both /v1/completions and /v1/chat/completions
  • Automatic FlashAttention integration for memory-efficient attention computation
  • Tensor parallelism for distributing large models across multiple GPUs
  • Built-in token streaming for real-time text generation responses (see the example after this list)
  • Custom CUDA kernels optimized for popular model architectures like Llama and Mistral
  • Automatic model downloading and caching from Hugging Face Hub with authentication support
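
To see the streaming feature in action once the stack from this page is running, the sketch below calls TGI's native /generate_stream route with curl. The route name and the "inputs"/"parameters" request shape are assumptions based on TGI's own documentation rather than something defined in this recipe, so confirm them against the docs linked in the Usage Notes for the image version you pull.

terminal
# Hedged sketch: stream tokens from TGI on localhost:8080 (port mapping from the
# compose file below). -N disables curl buffering so tokens print as they arrive.
curl -N http://localhost:8080/generate_stream \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "Explain continuous batching in one sentence.", "parameters": {"max_new_tokens": 64}}'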

Common Use Cases

  • AI-powered customer support chatbots requiring low-latency responses at scale
  • Code generation services for developer tools and IDE integrations
  • Content creation platforms needing high-throughput text generation capabilities
  • Research environments testing and comparing different LLM architectures
  • Enterprise applications requiring on-premises LLM deployment for data privacy
  • Educational platforms providing AI tutoring with custom fine-tuned models
  • SaaS applications integrating text generation features without external API dependencies

Prerequisites

  • NVIDIA GPU with at least 16GB VRAM for 7B parameter models (or 8GB for quantized versions)
  • NVIDIA Container Toolkit installed and configured for GPU access in Docker (verification commands after this list)
  • Valid Hugging Face account and API token for accessing gated models like Llama or Mistral
  • Sufficient disk space for model caching (7B models require ~15GB, 13B models ~25GB)
  • Docker Compose version 2.3+ with GPU device support enabled
  • Basic understanding of LLM terminology and HTTP API integration
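
Before starting the stack it is worth confirming that Docker can actually reach the GPU. A minimal check, assuming the NVIDIA Container Toolkit is already installed; the CUDA image tag is only an example and should match your installed driver:

terminal
# Host-level check: the driver and GPU should be visible
nvidia-smi

# Container-level check: swap the CUDA base image tag for one compatible
# with your driver version (the tag here is an example, not a requirement)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi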

For development & testing: review security settings, change default credentials, and test thoroughly before production use.

docker-compose.yml

docker-compose.yml
services:
  tgi:
    image: ghcr.io/huggingface/text-generation-inference:latest
    container_name: tgi
    restart: unless-stopped
    command: --model-id ${MODEL_ID}
    environment:
      HUGGING_FACE_HUB_TOKEN: ${HF_TOKEN}
    volumes:
      - tgi_cache:/data
    ports:
      - "8080:80"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  tgi_cache:
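
For the multi-GPU tensor parallelism mentioned under Key Features, TGI's launcher takes a shard count on the command line. A hedged excerpt of how the command could be extended, assuming the --num-shard flag is available in the image version you pull (verify against the container's --help output):

docker-compose.yml (excerpt)
    # Assumption: --num-shard splits the model across two GPUs; adjust the count
    # to the number of GPUs exposed to the container.
    command: --model-id ${MODEL_ID} --num-shard 2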

.env Template

.env
MODEL_ID=tiiuae/falcon-7b-instruct
HF_TOKEN=your_huggingface_token
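
If you want to verify the token before the container tries to use it, one option is to query the Hugging Face Hub's whoami endpoint. The endpoint path is an assumption based on the public Hub API, not something this recipe defines:

terminal
# Should return your account details when the token is valid (replace the
# placeholder with the real token from your .env file)
curl -s -H "Authorization: Bearer your_huggingface_token" https://huggingface.co/api/whoami-v2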

Usage Notes

  1. Docs: https://huggingface.co/docs/text-generation-inference/
  2. API at http://localhost:8080 - OpenAI-compatible /v1/chat/completions (example after this list)
  3. HF token required for gated models (Llama, Mistral, etc.)
  4. Supports continuous batching and quantization (GPTQ, AWQ, EETQ)
  5. Health check: curl http://localhost:8080/health
  6. GPU required - a 7B model needs ~16GB VRAM; use a quantized variant for less
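
A minimal request against the OpenAI-compatible chat endpoint from note 2, assuming the container is up and healthy on port 8080. The "model" value is a placeholder: TGI serves whichever MODEL_ID it was launched with, so the field is assumed not to select a model here.

terminal
# Example chat completion (payload follows the OpenAI chat schema)
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "tgi",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "max_tokens": 64
  }'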

Quick Start

terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  tgi:
    image: ghcr.io/huggingface/text-generation-inference:latest
    container_name: tgi
    restart: unless-stopped
    command: --model-id ${MODEL_ID}
    environment:
      HUGGING_FACE_HUB_TOKEN: ${HF_TOKEN}
    volumes:
      - tgi_cache:/data
    ports:
      - "8080:80"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  tgi_cache:
EOF

# 2. Create the .env file
cat > .env << 'EOF'
MODEL_ID=tiiuae/falcon-7b-instruct
HF_TOKEN=your_huggingface_token
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f
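
First start-up includes downloading and loading the model, which can take several minutes (see Troubleshooting below), so a small wait loop against the /health endpoint avoids sending requests to a server that is not ready yet. A sketch:

terminal
# 5. Wait until TGI reports healthy before sending generation requests
until curl -sf http://localhost:8080/health > /dev/null; do
  echo "waiting for TGI to finish loading the model..."
  sleep 10
done
echo "TGI is ready"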

One-Liner

Run this command to download and set up the recipe in one step:

terminal
curl -fsSL https://docker.recipes/api/recipes/huggingface-tgi/run | bash

Troubleshooting

  • Container exits with 'CUDA out of memory' error: Reduce model size, enable quantization with the --quantize flag (see the example after this list), or use a smaller model variant
  • Model download fails with 403 Forbidden: Verify HF_TOKEN is set correctly and that your account has access to the requested model
  • GPU not detected during startup: Ensure nvidia-container-toolkit is installed and the Docker daemon is restarted after installation
  • API returns 'Model not loaded' error: Check container logs for download progress; some large models take 10-15 minutes to initialize
  • High memory usage during inference: Add the --max-batch-prefill-tokens parameter to limit how many prompt tokens are processed per batch
  • Connection refused on port 8080: Wait for the health check at the /health endpoint to pass before sending requests; initialization can take several minutes
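
For the out-of-memory and high-memory cases above, the --quantize and --max-batch-prefill-tokens launcher flags are passed through the compose command line. A hedged excerpt, assuming an AWQ-quantized checkpoint is being served; the accepted --quantize values depend on the TGI release, so confirm them with the image's --help output:

docker-compose.yml (excerpt)
    # Assumption: serve an AWQ-quantized model and cap prefill tokens per batch
    # to lower peak GPU memory; values here are illustrative, not prescriptive.
    command: --model-id ${MODEL_ID} --quantize awq --max-batch-prefill-tokens 2048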

