docker.recipes

Text Generation WebUI

intermediate

Gradio web UI for running large language models like LLaMA, GPT-J, and others locally

Overview

Text Generation WebUI, developed by oobabooga, is a Gradio-based web interface for running large language models such as LLaMA, GPT-J, Alpaca, and Vicuna locally on your own hardware. Created as an open-source alternative to cloud-based AI services, it has become a go-to solution for researchers, developers, and AI enthusiasts who need full control over their language model deployments without the privacy concerns or usage limits of commercial APIs.

The WebUI supports multiple model formats, including GGUF, GPTQ, AWQ, and EXL2, so quantized models can run efficiently on consumer-grade GPUs. This Docker configuration uses NVIDIA GPU acceleration for fast inference while keeping the flexibility to experiment with different models and parameters. The stack combines the WebUI's interactive interface with REST API endpoints, enabling both chat sessions and programmatic access from external applications.

Running the stack locally gives you complete ownership of your AI infrastructure: no per-token costs, full data privacy, and the freedom to fine-tune models with custom datasets. This setup is particularly valuable for teams building AI-powered applications, researchers experimenting with language models, and privacy-conscious users who must process sensitive text without sending it to third-party services.

Key Features

  • Multi-format model support including GGUF, GPTQ, AWQ, and EXL2 quantized models for efficient GPU memory usage
  • Dual interface design with Gradio web UI for interactive use and REST API endpoints for programmatic integration
  • Advanced sampling parameters control including temperature, top-p, top-k, and repetition penalty adjustments
  • Character card system for roleplay scenarios with persistent conversation contexts and custom personas
  • LoRA (Low-Rank Adaptation) support for loading and switching between fine-tuned model variants
  • Extension system allowing custom plugins for specialized text processing and generation tasks
  • GPU memory optimization with automatic model sharding across multiple NVIDIA GPUs
  • Preset management for saving and sharing optimal generation parameters for specific use cases

Common Use Cases

  • AI research labs running experiments with different language models and fine-tuning approaches
  • Software development teams integrating local LLM capabilities into applications without API dependencies
  • Content creators and writers using AI assistance for brainstorming, editing, and creative writing projects
  • Educational institutions teaching AI concepts while maintaining data privacy and controlling costs
  • Healthcare organizations processing sensitive patient data with HIPAA compliance requirements
  • Legal firms analyzing documents and generating legal content without exposing confidential information
  • Game developers creating dynamic NPC dialogue systems and procedural narrative content

Prerequisites

  • NVIDIA GPU with at least 8GB VRAM for running 7B parameter models, 16GB+ recommended for larger models
  • Docker and Docker Compose with NVIDIA Container Toolkit installed and configured
  • CUDA 11.8 or newer drivers compatible with your GPU hardware
  • At least 32GB system RAM for loading large models and handling inference operations
  • 50GB+ free disk space for model storage, as popular models range from 4GB to 30GB each
  • Understanding of language model concepts and familiarity with HuggingFace model repository structure
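Before bringing up the stack, it is worth confirming that Docker can actually reach the GPU. A quick sanity check, assuming the NVIDIA Container Toolkit is installed (the CUDA image tag below is just an example; use any tag compatible with your driver):

```shell
# Run nvidia-smi inside a throwaway CUDA container; if this prints
# your GPU table, the Container Toolkit is working and the WebUI
# container will be able to see the GPU as well
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```

If this fails with an error about the `--gpus` flag or a missing runtime, revisit the NVIDIA Container Toolkit installation and restart the Docker daemon before continuing.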

For development & testing. Review security settings, change default credentials, and test thoroughly before production use.

docker-compose.yml

docker-compose.yml
services:
  text-generation-webui:
    image: ghcr.io/oobabooga/text-generation-webui:latest
    container_name: text-generation-webui
    restart: unless-stopped
    ports:
      - "${WEBUI_PORT:-7860}:7860"
      - "${API_PORT:-5000}:5000"
      - "${API_STREAM_PORT:-5005}:5005"
    volumes:
      - ./models:/app/models
      - ./loras:/app/loras
      - ./prompts:/app/prompts
      - ./presets:/app/presets
      - ./characters:/app/characters
      - ./extensions:/app/extensions
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - CLI_ARGS=${CLI_ARGS:---listen --api}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

.env Template

.env
# Text Generation WebUI Configuration
WEBUI_PORT=7860
API_PORT=5000
API_STREAM_PORT=5005

# CLI arguments
# --listen: Listen on all interfaces
# --api: Enable API
# --extensions: Load extensions
CLI_ARGS=--listen --api

Usage Notes

  1. Requires an NVIDIA GPU with CUDA support
  2. WebUI at http://localhost:7860
  3. API endpoint at http://localhost:5000
  4. Download models from Hugging Face into the ./models folder
  5. Supports GGUF, GPTQ, AWQ, and EXL2 formats
  6. Character cards for roleplay go in ./characters
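With `--api` enabled, recent builds of the WebUI expose an OpenAI-compatible HTTP API on port 5000. A minimal sketch of a chat completion request, assuming a model has already been loaded through the UI (the prompt and parameter values are placeholders):

```shell
# Send a chat completion request to the local API; requires a model
# to be loaded in the WebUI first, otherwise the server returns an error
curl -s http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Write a haiku about Docker."}],
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

Because the API mimics the OpenAI schema, most OpenAI client libraries can be pointed at `http://localhost:5000/v1` by overriding their base URL.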

Quick Start

terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  text-generation-webui:
    image: ghcr.io/oobabooga/text-generation-webui:latest
    container_name: text-generation-webui
    restart: unless-stopped
    ports:
      - "${WEBUI_PORT:-7860}:7860"
      - "${API_PORT:-5000}:5000"
      - "${API_STREAM_PORT:-5005}:5005"
    volumes:
      - ./models:/app/models
      - ./loras:/app/loras
      - ./prompts:/app/prompts
      - ./presets:/app/presets
      - ./characters:/app/characters
      - ./extensions:/app/extensions
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - CLI_ARGS=${CLI_ARGS:---listen --api}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
EOF

# 2. Create the .env file
cat > .env << 'EOF'
# Text Generation WebUI Configuration
WEBUI_PORT=7860
API_PORT=5000
API_STREAM_PORT=5005

# CLI arguments
# --listen: Listen on all interfaces
# --api: Enable API
# --extensions: Load extensions
CLI_ARGS=--listen --api
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f
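The stack starts with no model loaded, so the next step is fetching one into the mounted `./models` directory. One way to do this, using the Hugging Face CLI (the repository and file names below are illustrative; pick any GGUF model that fits your VRAM):

```shell
# Install the Hugging Face CLI and download a quantized GGUF model
# into the ./models directory mounted by the compose file
pip install -U "huggingface_hub[cli]"
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir ./models
```

After the download completes, the model appears in the WebUI's Model tab, where it can be loaded without restarting the container.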

One-Liner

Run this command to download and set up the recipe in one step:

terminal
curl -fsSL https://docker.recipes/api/recipes/text-generation-webui/run | bash

Troubleshooting

  • CUDA out of memory errors: Reduce model size, enable CPU offloading in settings, or use more aggressive quantization like 4-bit GPTQ models
  • Models fail to load with 'unsupported format' error: Ensure model files are in supported formats (GGUF, safetensors) and not corrupted during download
  • WebUI shows 'No CUDA devices available': Verify the NVIDIA Container Toolkit is installed, restart the Docker daemon, and check the NVIDIA_VISIBLE_DEVICES environment variable
  • API requests timeout or return empty responses: Adjust the CLI_ARGS to include proper API flags and increase timeout values in client applications
  • Character cards not appearing in interface: Ensure JSON character files are properly formatted and placed in the mounted characters directory with correct permissions
  • Extensions not loading properly: Check extension compatibility with current WebUI version and verify Python dependencies are installed in container


Components

oobabooga/text-generation-webui

Tags

#ai #llm #text-generation #llama #gpt #gpu

Category

AI & Machine Learning