Determined AI
Deep learning training platform with experiment tracking.
Overview
Determined AI is an open-source machine learning training platform designed to simplify and scale deep learning workflows. Originally developed by the startup Determined AI, which Hewlett Packard Enterprise acquired in 2021, it addresses the complexities of distributed training, experiment management, and hyperparameter optimization that ML teams face. The platform handles checkpointing, fault tolerance, and resource allocation automatically, and its experiment tracking is comparable in scope to commercial tools such as Weights & Biases or Neptune.
This stack combines Determined's master-agent architecture with PostgreSQL as the metadata store. The determined-master service orchestrates experiments, manages the web interface, and stores all experiment metadata, metrics, and configurations in PostgreSQL. The determined-agent handles actual workload execution, automatically pulling training code and managing containerized experiments. PostgreSQL's ACID compliance and JSON support make it ideal for storing complex experiment configurations, hyperparameter spaces, and time-series metrics data that Determined generates.
This configuration targets ML engineering teams and research organizations running serious deep learning workloads who need reproducible experiments and efficient resource utilization. Unlike notebook-based workflows or simple training scripts, Determined enforces best practices around experiment versioning, data loading, and model checkpointing. The PostgreSQL backend ensures experiment history survives system restarts and provides robust querying capabilities for analyzing training runs across multiple projects.
Key Features
- Adaptive hyperparameter search algorithms (ASHA, Population Based Training) that intelligently terminate poor-performing trials early
- Automatic distributed training with built-in support for data parallelism and model parallelism across multiple GPUs
- Fault-tolerant training with automatic checkpointing and resume capabilities when nodes fail
- Fair-share cluster scheduling with preemption support for multi-user environments
- Web-based experiment comparison interface with real-time metrics visualization and hyperparameter analysis
- PostgreSQL-backed experiment metadata storage with full JSON support for complex configuration tracking
- Docker-based experiment isolation ensuring reproducible training environments
- CLI-driven workflow supporting both interactive development and CI/CD integration
Common Use Cases
- Computer vision teams training large models (ResNet, EfficientNet, Vision Transformers) requiring distributed GPU training
- NLP research groups running transformer fine-tuning experiments with extensive hyperparameter sweeps
- Autonomous vehicle companies managing hundreds of simultaneous model training jobs across GPU clusters
- Pharmaceutical research using deep learning for drug discovery with complex molecular property prediction models
- Financial institutions training fraud detection models requiring rigorous experiment tracking for regulatory compliance
- Academic research labs needing shared GPU resources with fair scheduling among multiple PhD students and projects
- MLOps teams transitioning from ad-hoc training scripts to production-grade experiment management platforms
Prerequisites
- Minimum 4GB RAM for PostgreSQL and Determined master services combined
- Docker daemon with the NVIDIA GPU runtime (NVIDIA Container Toolkit, formerly packaged as nvidia-docker2) if training GPU-accelerated models
- Understanding of machine learning training loops and familiarity with PyTorch or TensorFlow
- Network access to pull training data and Docker images during experiment execution
- Basic knowledge of YAML configuration files for defining Determined experiment specifications
- Port 8080 available for the Determined web interface and API access
For development & testing. Review security settings, change default credentials, and test thoroughly before production use.
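One way to act on the advice about default credentials is to keep the database password out of the compose file. The sketch below generates a random password and writes it to .env. The variable name DET_DB_PASSWORD is an assumption chosen for illustration, not something this recipe defines, and the compose file would have to reference it (for example DET_DB_PASSWORD: ${DET_DB_PASSWORD} on the master and POSTGRES_PASSWORD: ${DET_DB_PASSWORD} on postgres) for it to take effect.

```shell
# Generate 32 hex characters from the kernel's random source.
# (Assumes head, od, and tr are available, as on most Linux hosts.)
password="$(head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n')"

# Restrict permissions on files created from here on, since .env holds a secret.
umask 077

# Write the .env file that Docker Compose reads for variable substitution.
# DET_DB_PASSWORD is a hypothetical variable name used only in this sketch.
printf 'DET_DB_PASSWORD=%s\n' "$password" > .env

echo "Wrote .env with a generated database password"
```

Note that the postgres image applies POSTGRES_PASSWORD only when the data volume is first initialized, so changing the value later does not change the password of an existing database.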
docker-compose.yml
services:
  determined-master:
    image: determinedai/determined-master:latest
    container_name: determined-master
    restart: unless-stopped
    environment:
      DET_DB_HOST: postgres
      DET_DB_NAME: determined
      DET_DB_USER: determined
      DET_DB_PASSWORD: determined
    ports:
      - "8080:8080"
    depends_on:
      - postgres
    networks:
      - determined

  determined-agent:
    image: determinedai/determined-agent:latest
    container_name: determined-agent
    restart: unless-stopped
    environment:
      DET_MASTER_HOST: determined-master
      DET_MASTER_PORT: 8080
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    depends_on:
      - determined-master
    networks:
      - determined

  postgres:
    image: postgres:16-alpine
    container_name: determined-postgres
    environment:
      POSTGRES_DB: determined
      POSTGRES_USER: determined
      POSTGRES_PASSWORD: determined
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - determined

volumes:
  postgres_data:

networks:
  determined:
    driver: bridge
.env Template
.env
# Configure via web UI
Usage Notes
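- Docs: https://docs.determined.ai/
- Web UI at http://localhost:8080 - default login: admin (no password); newer Determined releases may instead require an initial password to be set in the master configuration, so check the docs for your version
- CLI: pip install determined, then det experiment create config.yaml . (the trailing dot is the directory containing your training code)
- Distributed training with automatic checkpointing and fault tolerance
- Hyperparameter search: grid, random, adaptive (ASHA, PBT)
- GPU cluster management with fair-share scheduling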
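The det experiment create command in the notes above expects an experiment configuration file. The following is a minimal sketch of one for an adaptive (ASHA) search, written in the same heredoc style as the Quick Start. The experiment name, entrypoint, hyperparameter names, and metric name are hypothetical, and the accepted searcher fields differ between Determined releases (older releases also expect a training length under the searcher), so treat the field list as an assumption to verify against the docs for your version.

```shell
# Write a hypothetical experiment configuration for an adaptive ASHA search.
cat > config.yaml << 'EOF'
name: example-asha-search            # hypothetical experiment name
entrypoint: python3 train.py         # hypothetical training script in the uploaded directory
hyperparameters:
  learning_rate:
    type: log                        # sampled on a log scale, from 10^-4 to 10^-1
    base: 10
    minval: -4
    maxval: -1
  dropout:
    type: double
    minval: 0.1
    maxval: 0.5
searcher:
  name: adaptive_asha
  metric: validation_loss            # must match a metric the training code reports
  smaller_is_better: true
  max_trials: 16
  max_concurrent_trials: 2           # cap parallel trials to fit the available slots
resources:
  slots_per_trial: 1                 # slots (for example GPUs) used by each trial
max_restarts: 2                      # how many times a failed trial may be restarted
EOF
```

Running det experiment create config.yaml . from the directory that holds the training code would then submit the search. The CLI needs to know where the master is, which is commonly supplied through the DET_MASTER environment variable (for this stack, http://localhost:8080).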
Individual Services (3 services)
Copy individual services to mix and match with your existing compose files.
determined-master
  determined-master:
    image: determinedai/determined-master:latest
    container_name: determined-master
    restart: unless-stopped
    environment:
      DET_DB_HOST: postgres
      DET_DB_NAME: determined
      DET_DB_USER: determined
      DET_DB_PASSWORD: determined
    ports:
      - "8080:8080"
    depends_on:
      - postgres
    networks:
      - determined
determined-agent
  determined-agent:
    image: determinedai/determined-agent:latest
    container_name: determined-agent
    restart: unless-stopped
    environment:
      DET_MASTER_HOST: determined-master
      DET_MASTER_PORT: 8080
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    depends_on:
      - determined-master
    networks:
      - determined
postgres
  postgres:
    image: postgres:16-alpine
    container_name: determined-postgres
    environment:
      POSTGRES_DB: determined
      POSTGRES_USER: determined
      POSTGRES_PASSWORD: determined
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - determined
Quick Start
terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  determined-master:
    image: determinedai/determined-master:latest
    container_name: determined-master
    restart: unless-stopped
    environment:
      DET_DB_HOST: postgres
      DET_DB_NAME: determined
      DET_DB_USER: determined
      DET_DB_PASSWORD: determined
    ports:
      - "8080:8080"
    depends_on:
      - postgres
    networks:
      - determined

  determined-agent:
    image: determinedai/determined-agent:latest
    container_name: determined-agent
    restart: unless-stopped
    environment:
      DET_MASTER_HOST: determined-master
      DET_MASTER_PORT: 8080
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    depends_on:
      - determined-master
    networks:
      - determined

  postgres:
    image: postgres:16-alpine
    container_name: determined-postgres
    environment:
      POSTGRES_DB: determined
      POSTGRES_USER: determined
      POSTGRES_PASSWORD: determined
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - determined

volumes:
  postgres_data:

networks:
  determined:
    driver: bridge
EOF

# 2. Create the .env file
cat > .env << 'EOF'
# Configure via web UI
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f
One-Liner
Run this command to download and set up the recipe in one step:
terminal
curl -fsSL https://docker.recipes/api/recipes/determined-ai/run | bash
Troubleshooting
- Agent shows 'failed to connect to master' errors: Verify determined-master container is running and port 8080 is accessible from agent container
- Experiments stuck in 'QUEUED' state indefinitely: Check agent logs for Docker socket permissions and ensure /var/run/docker.sock is properly mounted
- PostgreSQL connection failures during master startup: Wait for postgres container to fully initialize before master starts, or add healthcheck dependencies
- Web UI shows 'Internal Server Error' on experiment pages: Check postgres container disk space and connection limits in PostgreSQL configuration
- Training experiments fail with 'image pull' errors: Ensure determined-agent has access to pull experiment Docker images from configured registries
- Hyperparameter search creates too many concurrent trials: Adjust 'max_concurrent_trials' in experiment configuration to match available resources
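The healthcheck suggestion for PostgreSQL startup failures could look like the sketch below, which writes a docker-compose.override.yml next to the main file (Docker Compose merges a file with that name automatically). The healthcheck timings are illustrative assumptions. If your Compose version does not merge the map-style depends_on with the list-style one in the main file, the same keys can be placed directly in docker-compose.yml instead.

```shell
# Write an override file that makes the master wait for a healthy database.
cat > docker-compose.override.yml << 'EOF'
services:
  postgres:
    healthcheck:
      # pg_isready ships with the official postgres image
      test: ["CMD-SHELL", "pg_isready -U determined -d determined"]
      interval: 5s
      timeout: 5s
      retries: 10
  determined-master:
    depends_on:
      postgres:
        condition: service_healthy
EOF
```

After docker compose up -d, the output of docker compose ps should report the postgres container as healthy before determined-master is started.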
Components
determined, postgres
Tags
#determined #training #deep-learning #experiments
Category
AI & Machine Learning