docker.recipes

Whisper (OpenAI Speech-to-Text)

intermediate

OpenAI's Whisper automatic speech recognition model with web API

Overview

Whisper is OpenAI's groundbreaking automatic speech recognition (ASR) system, trained on 680,000 hours of multilingual and multitask supervised data collected from the web. Released as an open-source model in September 2022, Whisper demonstrates remarkable robustness to accents, background noise, and technical language while supporting transcription in 99 languages and translation from those languages into English. The model approaches human-level robustness and accuracy on English speech recognition tasks.

This Docker stack combines the original OpenAI Whisper implementation with an optional faster-whisper alternative, both wrapped in RESTful web services. The onerahmet/openai-whisper-asr-webservice provides a complete HTTP API around Whisper models, while faster-whisper offers a more efficient implementation using CTranslate2 for improved inference speed and reduced memory usage. Both services expose OpenAPI-compliant endpoints that accept audio files and return high-quality transcriptions with timestamps and confidence scores.

Developers building voice-enabled applications, content creators processing multilingual audio content, and researchers working with speech data will find this stack invaluable. The containerized approach eliminates the complexity of managing Python environments, CUDA dependencies, and model downloads while providing production-ready APIs that can handle concurrent transcription requests. The inclusion of both standard and optimized Whisper implementations lets teams choose between maximum compatibility and performance optimization based on their specific requirements.

Key Features

  • Multiple Whisper model sizes from tiny (39M parameters) to large (1550M parameters) with configurable accuracy-speed tradeoffs
  • Support for 99 languages including English, Spanish, French, German, Chinese, Japanese, Arabic, and many others
  • RESTful API with file upload endpoints accepting WAV, MP3, FLAC, M4A, and other common audio formats
  • OpenAPI/Swagger documentation interface for interactive API testing and integration
  • Automatic model caching with persistent volume storage to avoid repeated downloads
  • GPU acceleration support with NVIDIA CUDA for significantly faster transcription processing
  • Timestamp-accurate transcription output with word-level timing information
  • Alternative faster-whisper implementation offering up to 4x faster inference with significantly lower memory usage

Common Use Cases

  • Podcast and video content creators automating subtitle generation and show notes creation
  • Customer service teams transcribing call recordings for quality assurance and sentiment analysis
  • Medical professionals converting patient consultations and dictated notes into searchable text records
  • Educational institutions creating accessible transcripts for lectures and online course content
  • Legal firms processing depositions, court recordings, and client interview documentation
  • Journalists and researchers transcribing multilingual interviews and field recordings
  • Voice-enabled application developers integrating speech-to-text functionality into mobile and web apps

Prerequisites

  • NVIDIA GPU with roughly 10GB VRAM for the large models (about 1GB suffices for base); CPU-only operation works but is much slower
  • Docker Engine 20.10+ with NVIDIA Container Toolkit installed for GPU support
  • Minimum 8GB system RAM (16GB recommended for concurrent transcription requests)
  • At least 5GB free disk space for model storage (varies by selected Whisper model size)
  • Port 9000 available for Whisper API service (8000 for faster-whisper alternative)
  • Basic understanding of REST API concepts and audio file format requirements
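Before bringing the stack up, it is worth confirming that Docker can actually see the GPU. A quick sanity check, assuming the NVIDIA Container Toolkit is installed (the CUDA image tag below is just an example):

```shell
# Sanity-check GPU wiring before starting the stack.
check_gpu() {
  # The nvidia runtime should be registered with the Docker daemon...
  docker info 2>/dev/null | grep -qi nvidia || return 1
  # ...and a throwaway CUDA container should be able to run nvidia-smi.
  docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
}

# check_gpu   # prints the nvidia-smi device table when everything is wired up
```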

For development & testing. Review security settings, change default credentials, and test thoroughly before production use. See Terms

docker-compose.yml

docker-compose.yml
services:
  whisper:
    image: onerahmet/openai-whisper-asr-webservice:latest
    container_name: whisper
    restart: unless-stopped
    ports:
      - "${WHISPER_PORT:-9000}:9000"
    environment:
      - ASR_MODEL=${ASR_MODEL:-base}
      - ASR_ENGINE=${ASR_ENGINE:-openai_whisper}
    volumes:
      - ./models:/root/.cache/whisper
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  # Alternative: Faster Whisper (more efficient)
  # faster-whisper:
  #   image: fedirz/faster-whisper-server:latest
  #   container_name: faster-whisper
  #   restart: unless-stopped
  #   ports:
  #     - "8000:8000"
  #   environment:
  #     - WHISPER__MODEL=base
  #     - WHISPER__DEVICE=cuda
  #   deploy:
  #     resources:
  #       reservations:
  #         devices:
  #           - driver: nvidia
  #             count: all
  #             capabilities: [gpu]

.env Template

.env
# Whisper Configuration
WHISPER_PORT=9000

# Model size: tiny, base, small, medium, large, large-v2, large-v3
ASR_MODEL=base

# Engine: openai_whisper or faster_whisper
ASR_ENGINE=openai_whisper
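Switching models later only takes an .env edit plus a container recreate; the new weights download into ./models on first use. A hypothetical helper (the sed pattern assumes the template above):

```shell
# Hypothetical helper: point ASR_MODEL at a new size and recreate the container.
set_model() {
  sed -i "s/^ASR_MODEL=.*/ASR_MODEL=$1/" .env
  docker compose up -d --force-recreate whisper
}

# set_model small   # downloads the small model into ./models on first request
```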

Usage Notes

  1. API endpoint at http://localhost:9000
  2. Upload audio files via POST /asr
  3. Larger models = better accuracy but more VRAM
  4. GPU recommended for faster transcription
  5. Supports 99 languages
  6. OpenAPI docs at http://localhost:9000/docs
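The notes above can be exercised with plain curl. A minimal sketch, assuming the webservice's multipart field name (audio_file) and query parameters (task, output):

```shell
# Default to the standard port; override BASE_URL if you changed WHISPER_PORT.
BASE_URL="${BASE_URL:-http://localhost:9000}"

transcribe() {
  # POST a local audio file to /asr and print the transcript.
  # Swap output=txt for output=json to get segment timestamps,
  # or output=srt / output=vtt for ready-made subtitle files.
  curl -s -F "audio_file=@$1" \
    "${BASE_URL}/asr?task=transcribe&output=txt"
}

# transcribe interview.mp3
```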

Quick Start

terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  whisper:
    image: onerahmet/openai-whisper-asr-webservice:latest
    container_name: whisper
    restart: unless-stopped
    ports:
      - "${WHISPER_PORT:-9000}:9000"
    environment:
      - ASR_MODEL=${ASR_MODEL:-base}
      - ASR_ENGINE=${ASR_ENGINE:-openai_whisper}
    volumes:
      - ./models:/root/.cache/whisper
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  # Alternative: Faster Whisper (more efficient)
  # faster-whisper:
  #   image: fedirz/faster-whisper-server:latest
  #   container_name: faster-whisper
  #   restart: unless-stopped
  #   ports:
  #     - "8000:8000"
  #   environment:
  #     - WHISPER__MODEL=base
  #     - WHISPER__DEVICE=cuda
  #   deploy:
  #     resources:
  #       reservations:
  #         devices:
  #           - driver: nvidia
  #             count: all
  #             capabilities: [gpu]
EOF

# 2. Create the .env file
cat > .env << 'EOF'
# Whisper Configuration
WHISPER_PORT=9000

# Model size: tiny, base, small, medium, large, large-v2, large-v3
ASR_MODEL=base

# Engine: openai_whisper or faster_whisper
ASR_ENGINE=openai_whisper
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f
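Once the container is up, a quick smoke test helps before wiring anything against the API; the /docs path is the service's Swagger UI:

```shell
# Report the HTTP status of the OpenAPI docs page; 200 means the API is up.
health_check() {
  curl -s -o /dev/null -w "%{http_code}" \
    "http://localhost:${WHISPER_PORT:-9000}/docs"
}

# health_check   # expect 200 once the model has finished loading
```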

One-Liner

Run this command to download and set up the recipe in one step:

terminal
curl -fsSL https://docker.recipes/api/recipes/whisper/run | bash

Troubleshooting

  • CUDA out of memory errors: Reduce ASR_MODEL to smaller size (tiny, base, small) or increase GPU memory allocation
  • Models downloading repeatedly on restart: Ensure ./models volume mount has proper write permissions for container user
  • Transcription requests timing out: Increase client timeout settings as large audio files can take several minutes to process
  • API returns 422 validation errors: Verify audio file format is supported and file size is under service limits
  • Container fails to start with GPU errors: Confirm NVIDIA Container Toolkit installation and docker-compose GPU configuration syntax
  • Poor transcription quality on noisy audio: Preprocess audio files to reduce background noise or use larger Whisper model for better accuracy
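For the noisy-audio case, a pre-pass with ffmpeg can help: Whisper works on 16 kHz mono internally, and a gentle high-pass filter removes low-frequency rumble before upload. A sketch (the filter cutoff is just a starting point):

```shell
# Resample to 16 kHz mono and apply a high-pass filter before uploading.
preprocess() {
  ffmpeg -y -i "$1" -ar 16000 -ac 1 -af "highpass=f=100" "${1%.*}_clean.wav"
}

# preprocess field-recording.mp3   # writes field-recording_clean.wav
```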

Community Notes


Download Recipe Kit

Get all files in a ready-to-deploy package

Includes docker-compose.yml, .env template, README, and license

Components

whisper, faster-whisper

Tags

#ai #speech-to-text #transcription #audio #whisper

Category

AI & Machine Learning