Whisper (OpenAI Speech-to-Text)
OpenAI's Whisper automatic speech recognition model with web API
Overview
Whisper is OpenAI's groundbreaking automatic speech recognition (ASR) system that was trained on 680,000 hours of multilingual and multitask supervised data collected from the web. Released as an open-source model in September 2022, Whisper demonstrates remarkable robustness to accents, background noise, and technical language while supporting transcription in 99 languages and translation from those languages into English. The model approaches human-level robustness and accuracy on English speech recognition tasks.
This Docker stack combines the original OpenAI Whisper implementation with an optional faster-whisper alternative, both wrapped in RESTful web services. The onerahmet/openai-whisper-asr-webservice provides a complete HTTP API around Whisper models, while faster-whisper offers a more efficient implementation using CTranslate2 for improved inference speed and reduced memory usage. Both services expose OpenAPI-compliant endpoints that accept audio files and return high-quality transcriptions with timestamps and confidence scores.
Developers building voice-enabled applications, content creators processing multilingual audio content, and researchers working with speech data will find this stack invaluable. The containerized approach eliminates the complexity of managing Python environments, CUDA dependencies, and model downloads while providing production-ready APIs that can handle concurrent transcription requests. The inclusion of both standard and optimized Whisper implementations allows teams to choose between maximum compatibility and performance optimization based on their specific requirements.
Key Features
- Multiple Whisper model sizes from tiny (~39M parameters) to large (~1.55B parameters) with configurable accuracy-speed tradeoffs
- Support for 99 languages including English, Spanish, French, German, Chinese, Japanese, Arabic, and many others
- RESTful API with file upload endpoints accepting WAV, MP3, FLAC, M4A, and other common audio formats (see the example requests after this list)
- OpenAPI/Swagger documentation interface for interactive API testing and integration
- Automatic model caching with persistent volume storage to avoid repeated downloads
- GPU acceleration support with NVIDIA CUDA for significantly faster transcription processing
- Timestamp-accurate transcription output with word-level timing information
- Alternative faster-whisper implementation offering up to 4x faster inference with reduced memory usage
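As a sketch of how these features combine in practice, the requests below ask for word-level timing and for translation into English. The audio_file form field and the output, word_timestamps, and task query parameters reflect the webservice's /asr schema in recent releases; confirm the exact names against the Swagger UI at /docs for your image version, and treat episode.mp3 and interview_es.mp3 as placeholder files.
terminal
# Transcribe to JSON with word-level timing
# (parameter names assume the webservice's /asr schema; verify via /docs)
curl -F "audio_file=@episode.mp3" \
  "http://localhost:9000/asr?output=json&word_timestamps=true"

# Translate non-English speech directly into English text
curl -F "audio_file=@interview_es.mp3" \
  "http://localhost:9000/asr?task=translate&output=txt"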
Common Use Cases
- Podcast and video content creators automating subtitle generation and show notes creation (see the batch example after this list)
- Customer service teams transcribing call recordings for quality assurance and sentiment analysis
- Medical professionals converting patient consultations and dictated notes into searchable text records
- Educational institutions creating accessible transcripts for lectures and online course content
- Legal firms processing depositions, court recordings, and client interview documentation
- Journalists and researchers transcribing multilingual interviews and field recordings
- Voice-enabled application developers integrating speech-to-text functionality into mobile and web apps
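For the subtitle-generation case, a plain shell loop over the API is often enough. This is a minimal sketch assuming the service is running on the default port and the audio_file field name from the /asr schema; adjust the glob and output format to your material.
terminal
# Write an .srt subtitle file next to every .mp3 in the current directory
for f in *.mp3; do
  curl -sf -F "audio_file=@$f" \
    "http://localhost:9000/asr?output=srt" -o "${f%.mp3}.srt"
done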
Prerequisites
- NVIDIA GPU with roughly 10GB VRAM for the large models (about 1GB suffices for the tiny and base models, per OpenAI's published requirements)
- Docker Engine 20.10+ with NVIDIA Container Toolkit installed for GPU support (see the verification command after this list)
- Minimum 8GB system RAM (16GB recommended for concurrent transcription requests)
- At least 5GB free disk space for model storage (varies by selected Whisper model size)
- Port 9000 available for Whisper API service (8000 for faster-whisper alternative)
- Basic understanding of REST API concepts and audio file format requirements
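Before starting the stack, it is worth confirming that Docker can actually see the GPU. The command below is the standard NVIDIA Container Toolkit smoke test; swap the CUDA image tag for one that matches your installed driver if the pull or run fails.
terminal
# nvidia-smi should print your GPU table if the toolkit is configured
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi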
For development & testing: review security settings, change default credentials, and test thoroughly before production use.
docker-compose.yml
services:
  whisper:
    image: onerahmet/openai-whisper-asr-webservice:latest
    container_name: whisper
    restart: unless-stopped
    ports:
      - "${WHISPER_PORT:-9000}:9000"
    environment:
      - ASR_MODEL=${ASR_MODEL:-base}
      - ASR_ENGINE=${ASR_ENGINE:-openai_whisper}
    volumes:
      - ./models:/root/.cache/whisper
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  # Alternative: Faster Whisper (more efficient)
  # faster-whisper:
  #   image: fedirz/faster-whisper-server:latest
  #   container_name: faster-whisper
  #   restart: unless-stopped
  #   ports:
  #     - "8000:8000"
  #   environment:
  #     - WHISPER__MODEL=base
  #     - WHISPER__DEVICE=cuda
  #   deploy:
  #     resources:
  #       reservations:
  #         devices:
  #           - driver: nvidia
  #             count: all
  #             capabilities: [gpu]
.env Template
.env
# Whisper Configuration
WHISPER_PORT=9000

# Model size: tiny, base, small, medium, large, large-v2, large-v3
ASR_MODEL=base

# Engine: openai_whisper or faster_whisper
ASR_ENGINE=openai_whisper
Usage Notes
- API endpoint at http://localhost:9000
- Upload audio files via POST /asr (example below)
- Larger models = better accuracy but more VRAM
- GPU recommended for faster transcription
- Supports 99 languages
- OpenAPI docs at http://localhost:9000/docs
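A minimal first call, assuming the default port and the audio_file field name from the service's /asr schema (recording.wav is a placeholder file):
terminal
# Plain-text transcription of a single file
curl -F "audio_file=@recording.wav" "http://localhost:9000/asr?output=txt"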
Quick Start
terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  whisper:
    image: onerahmet/openai-whisper-asr-webservice:latest
    container_name: whisper
    restart: unless-stopped
    ports:
      - "${WHISPER_PORT:-9000}:9000"
    environment:
      - ASR_MODEL=${ASR_MODEL:-base}
      - ASR_ENGINE=${ASR_ENGINE:-openai_whisper}
    volumes:
      - ./models:/root/.cache/whisper
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  # Alternative: Faster Whisper (more efficient)
  # faster-whisper:
  #   image: fedirz/faster-whisper-server:latest
  #   container_name: faster-whisper
  #   restart: unless-stopped
  #   ports:
  #     - "8000:8000"
  #   environment:
  #     - WHISPER__MODEL=base
  #     - WHISPER__DEVICE=cuda
  #   deploy:
  #     resources:
  #       reservations:
  #         devices:
  #           - driver: nvidia
  #             count: all
  #             capabilities: [gpu]
EOF

# 2. Create the .env file
cat > .env << 'EOF'
# Whisper Configuration
WHISPER_PORT=9000

# Model size: tiny, base, small, medium, large, large-v2, large-v3
ASR_MODEL=base

# Engine: openai_whisper or faster_whisper
ASR_ENGINE=openai_whisper
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f
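On first start the container downloads the selected model into ./models, so the API can take a minute or two to come up. A quick readiness check, assuming the default port:
terminal
# The Swagger UI returns HTTP 200 once the service is ready
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9000/docs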
One-Liner
Run this command to download and set up the recipe in one step:
terminal
curl -fsSL https://docker.recipes/api/recipes/whisper/run | bash
Troubleshooting
- CUDA out of memory errors: Reduce ASR_MODEL to a smaller size (tiny, base, small) or use a GPU with more VRAM
- Models downloading repeatedly on restart: Ensure ./models volume mount has proper write permissions for container user
- Transcription requests timing out: Increase client timeout settings as large audio files can take several minutes to process
- API returns 422 validation errors: Verify audio file format is supported and file size is under service limits
- Container fails to start with GPU errors: Confirm NVIDIA Container Toolkit installation and docker-compose GPU configuration syntax
- Poor transcription quality on noisy audio: Preprocess audio files to reduce background noise (see the ffmpeg sketch below) or use a larger Whisper model for better accuracy
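For the noisy-audio case, a simple ffmpeg pass before upload often helps. A sketch with illustrative filter cutoffs rather than tuned values (noisy_interview.mp3 is a placeholder file):
terminal
# Downmix to mono 16 kHz (the rate Whisper resamples to anyway) and
# band-pass roughly the speech range; tune the cutoffs to your material
ffmpeg -i noisy_interview.mp3 -ac 1 -ar 16000 \
  -af "highpass=f=200,lowpass=f=3000" cleaned.wav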
Components
whisper, faster-whisper
Tags
#ai #speech-to-text #transcription #audio #whisper
Category
AI & Machine Learning