BentoML
ML model serving and deployment framework.
Overview
BentoML is an open-source machine learning model serving framework that simplifies the deployment of ML models to production environments. Originally developed by Atalaya Tech and first released in 2019, BentoML addresses the critical gap between model development and deployment by providing a unified platform for packaging, versioning, and serving ML models from various frameworks including scikit-learn, PyTorch, TensorFlow, and XGBoost. The framework emphasizes performance optimization through features like adaptive batching, model parallelization, and automatic scaling capabilities.
This Docker deployment creates a containerized BentoML model server that can host and serve multiple ML models simultaneously. The configuration establishes a persistent environment where models can be built, packaged into 'bentos' (deployment artifacts), and served through REST APIs with automatic OpenAPI documentation generation. BentoML handles the complex infrastructure concerns like request batching, concurrent processing, and resource management while providing a clean Python API for model integration.
Data scientists and ML engineers working in production environments will find this setup particularly valuable when transitioning from Jupyter notebooks to scalable model serving infrastructure. The containerized approach ensures consistent model behavior across different deployment targets while BentoML's built-in monitoring and logging capabilities provide visibility into model performance and usage patterns in production workloads.
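To make the workflow concrete, the sketch below shows roughly what a BentoML service definition looks like under the 1.x Python API; the model tag iris_clf, the service name, and the classify endpoint are illustrative placeholders, not part of this recipe.
service.py
import numpy as np

import bentoml
from bentoml.io import NumpyNdarray

# Load a previously saved model as a runner (runners handle batching and scaling)
iris_runner = bentoml.sklearn.get("iris_clf:latest").to_runner()

# A Service groups runners and exposes REST endpoints with OpenAPI docs
svc = bentoml.Service("iris_classifier", runners=[iris_runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(input_array: np.ndarray) -> np.ndarray:
    # Each POST to /classify is routed here; the runner executes inference
    return iris_runner.predict.run(input_array)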
Key Features
- Multi-framework model support with unified APIs for scikit-learn, PyTorch, TensorFlow, XGBoost, and other popular ML libraries
- Adaptive micro-batching that automatically groups individual requests to optimize GPU utilization and inference throughput (see the save_model sketch after this list)
- Built-in model versioning and artifact management with immutable bento packaging for reproducible deployments
- Automatic OpenAPI schema generation with interactive Swagger UI for testing and documentation of model endpoints
- High-performance async serving architecture with configurable worker processes and resource allocation
- Custom runner framework for advanced model serving patterns including ensemble models and multi-stage pipelines
- Integrated metrics collection and logging with support for Prometheus monitoring and distributed tracing
- Production-ready features including health checks, graceful shutdowns, and automatic request timeout handling
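The adaptive micro-batching mentioned above is enabled per model signature when the model is saved; below is a hedged sketch of saving a scikit-learn model with a batchable predict signature (the iris_clf name and toy training code are placeholders, not part of this recipe).
import bentoml
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Train a toy model (stand-in for your real training code)
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier().fit(X, y)

# Mark the predict signature as batchable so the server can group
# concurrent requests along batch dimension 0 into one inference call
bentoml.sklearn.save_model(
    "iris_clf",
    clf,
    signatures={"predict": {"batchable": True, "batch_dim": 0}},
)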
Common Use Cases
- ML model serving for recommendation engines in e-commerce platforms requiring low-latency inference
- Computer vision model deployment for real-time image classification and object detection applications
- NLP model hosting for chatbots, sentiment analysis, and text processing services in enterprise applications
- Financial services fraud detection models requiring high-throughput batch processing and real-time scoring
- A/B testing environments for comparing multiple model versions with traffic splitting capabilities
- Edge deployment preparation where models need containerized packaging for Kubernetes or cloud-native platforms
- Research environments requiring rapid prototyping and deployment of experimental ML models with version control
Prerequisites
- Docker Engine 20.10+ and Docker Compose v2 with at least 4GB available RAM for model loading and inference
- Basic understanding of machine learning model serving concepts and REST API consumption patterns
- Python development environment for building and testing models before containerized deployment
- Port 3000 available on the host system for the BentoML API server and Swagger documentation interface
- Familiarity with at least one supported ML framework (scikit-learn, PyTorch, TensorFlow) for model integration
- Understanding of Docker volume management for persistent model storage and bento artifact organization
For development & testing. Review security settings, change default credentials, and test thoroughly before production use.
docker-compose.yml
services:
  bentoml:
    image: bentoml/model-server:latest
    container_name: bentoml
    restart: unless-stopped
    volumes:
      - bentoml_home:/home/bentoml
      - ./bentos:/bentos
    ports:
      - "3000:3000"
    environment:
      BENTOML_HOME: /home/bentoml

volumes:
  bentoml_home:
.env Template
.env
# Build bento with: bentoml build
Usage Notes
- Docs: https://docs.bentoml.org/
- API at http://localhost:3000, Swagger UI at http://localhost:3000/docs
- Build bento: bentoml build creates a bento from service.py + bentofile.yaml (see the bentofile.yaml sketch after these notes)
- Containerize: bentoml containerize my_service:latest
- Save models: bentoml.sklearn.save_model('model', trained_model)
- Adaptive batching and auto-scaling built-in for production
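The build note above refers to a bentofile.yaml; a minimal sketch for the hypothetical iris service from the Overview is shown below (the entry point, included files, and package list are assumptions, not part of this recipe).
bentofile.yaml
service: "service:svc"   # module:variable path to the bentoml.Service object
include:
  - "service.py"         # source files packaged into the bento
python:
  packages:              # pip dependencies installed inside the bento
    - scikit-learn
    - numpy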
Quick Start
terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  bentoml:
    image: bentoml/model-server:latest
    container_name: bentoml
    restart: unless-stopped
    volumes:
      - bentoml_home:/home/bentoml
      - ./bentos:/bentos
    ports:
      - "3000:3000"
    environment:
      BENTOML_HOME: /home/bentoml

volumes:
  bentoml_home:
EOF

# 2. Create the .env file
cat > .env << 'EOF'
# Build bento with: bentoml build
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f
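Once a bento is built and served on port 3000, the endpoint can be exercised with curl; the classify route below assumes the hypothetical service sketched in the Overview.
terminal
# Call the (hypothetical) classify endpoint with one feature vector
curl -X POST http://localhost:3000/classify \
  -H 'Content-Type: application/json' \
  -d '[[5.1, 3.5, 1.4, 0.2]]'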
One-Liner
Run this command to download and set up the recipe in one step:
terminal
curl -fsSL https://docker.recipes/api/recipes/bentoml/run | bash
Troubleshooting
- ImportError: No module named 'bentoml': Ensure your model building environment has BentoML installed with pip install bentoml before creating bento packages
- Port 3000 already in use: Change the port mapping in docker-compose.yml from '3000:3000' to '3001:3000' or stop conflicting services
- Model loading timeout errors: Increase container memory allocation and add BENTOML_RUNNER_TIMEOUT environment variable with higher values for large models
- Permission denied accessing /bentos volume: Fix directory ownership with sudo chown -R 1000:1000 ./bentos or create the directory before container startup
- High memory usage during inference: Configure adaptive batching parameters in your service definition or limit concurrent requests with BENTOML_MAX_CONCURRENCY
- Swagger UI not displaying model schema: Verify your service.py includes proper input/output type annotations and rebuild the bento with bentoml build