docker.recipes

Determined AI

advanced

Deep learning training platform with experiment tracking.

Overview

Determined AI is an open-source machine learning training platform designed to simplify and scale deep learning workflows. Originally developed by the startup Determined AI, which Hewlett Packard Enterprise acquired in 2021, it addresses the complexity of distributed training, experiment management, and hyperparameter optimization. The platform handles checkpointing, fault tolerance, and resource allocation automatically, and its experiment tracking covers much of the same ground as commercial tools such as Weights & Biases or Neptune.

This stack combines Determined's master-agent architecture with PostgreSQL as the metadata store. The determined-master service orchestrates experiments, serves the web interface and API, and stores all experiment metadata, metrics, and configurations in PostgreSQL. The determined-agent service executes the actual workloads: it connects to the master and launches containerized experiments through the host's Docker socket. PostgreSQL's ACID compliance and JSON support make it well suited to storing the experiment configurations, hyperparameter spaces, and time-series metrics that Determined generates.

This configuration targets ML engineering teams and research organizations that run sustained deep learning workloads and need reproducible experiments and efficient resource utilization. Unlike notebook-based workflows or standalone training scripts, Determined enforces consistent practices around experiment versioning, data loading, and model checkpointing. Because PostgreSQL data lives in a named volume, experiment history survives container restarts and can be queried to analyze training runs across projects.

Key Features

  • Adaptive hyperparameter search algorithms (ASHA, Population Based Training) that intelligently terminate poor-performing trials early
  • Automatic distributed training with built-in support for data parallelism and model parallelism across multiple GPUs
  • Fault-tolerant training with automatic checkpointing and resume capabilities when nodes fail
  • Fair-share cluster scheduling with preemption support for multi-user environments
  • Web-based experiment comparison interface with real-time metrics visualization and hyperparameter analysis
  • PostgreSQL-backed experiment metadata storage with full JSON support for complex configuration tracking
  • Docker-based experiment isolation ensuring reproducible training environments
  • CLI-driven workflow supporting both interactive development and CI/CD integration
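
To make the adaptive search feature concrete, here is a minimal sketch of an experiment configuration that uses Determined's adaptive_asha searcher. The experiment name, the entrypoint model_def:MyTrial, the hyperparameter ranges, and the metric name validation_loss are illustrative assumptions rather than part of this recipe. Searcher field names have also changed between Determined releases, so treat this as a starting point and check the experiment configuration reference for the version you deploy.

```yaml
# config.yaml: hypothetical adaptive hyperparameter search (sketch only)
name: adaptive-search-example        # assumed experiment name
entrypoint: model_def:MyTrial        # assumed module:class containing your trial code
hyperparameters:
  global_batch_size: 64              # constant, not searched
  learning_rate:
    type: log                        # sampled as base^x with x between minval and maxval
    base: 10
    minval: -4
    maxval: -1
  dropout:
    type: double
    minval: 0.1
    maxval: 0.5
searcher:
  name: adaptive_asha                # stops poorly performing trials early
  metric: validation_loss            # assumed metric reported by the trial
  smaller_is_better: true
  max_trials: 16                     # total trials the search may create
  max_length:
    batches: 1000                    # training length per trial; newer releases may express this differently
  max_concurrent_trials: 2           # keep at or below the slots your agents provide
resources:
  slots_per_trial: 1
```

Keeping max_concurrent_trials small is a reasonable default for a single-agent setup like this one, since each running trial occupies at least one slot.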

Common Use Cases

  • Computer vision teams training large models (ResNet, EfficientNet, Vision Transformers) requiring distributed GPU training
  • NLP research groups running transformer fine-tuning experiments with extensive hyperparameter sweeps
  • Autonomous vehicle companies managing hundreds of simultaneous model training jobs across GPU clusters
  • Pharmaceutical research using deep learning for drug discovery with complex molecular property prediction models
  • Financial institutions training fraud detection models requiring rigorous experiment tracking for regulatory compliance
  • Academic research labs needing shared GPU resources with fair scheduling among multiple PhD students and projects
  • MLOps teams transitioning from ad-hoc training scripts to production-grade experiment management platforms

Prerequisites

  • Minimum 4GB RAM for PostgreSQL and Determined master services combined
  • Docker daemon with access to GPU runtime (nvidia-docker2) if training GPU-accelerated models
  • Understanding of machine learning training loops and familiarity with PyTorch or TensorFlow
  • Network access to pull training data and Docker images during experiment execution
  • Basic knowledge of YAML configuration files for defining Determined experiment specifications
  • Port 8080 available for the Determined web interface and API access
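
For the GPU prerequisite, the agent container itself generally needs to see the GPUs. The fragment below is a sketch of how that could be requested with the Compose device reservation syntax. It assumes the NVIDIA container runtime is installed on the host, and it has not been verified against this exact recipe, so confirm that the agent reports GPU slots after applying it.

```yaml
# Sketch: extra keys for the determined-agent service to expose NVIDIA GPUs.
# Assumes the NVIDIA container runtime is installed and working on the host.
services:
  determined-agent:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all             # or an integer to limit the number of GPUs
              capabilities: [gpu]
```

Without GPUs, this fragment can be left out entirely and the agent will only offer CPU slots.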

For development & testing. Review security settings, change default credentials, and test thoroughly before production use.

docker-compose.yml

docker-compose.yml
services:
  determined-master:
    image: determinedai/determined-master:latest
    container_name: determined-master
    restart: unless-stopped
    environment:
      DET_DB_HOST: postgres
      DET_DB_NAME: determined
      DET_DB_USER: determined
      DET_DB_PASSWORD: determined
    ports:
      - "8080:8080"
    depends_on:
      - postgres
    networks:
      - determined

  determined-agent:
    image: determinedai/determined-agent:latest
    container_name: determined-agent
    restart: unless-stopped
    environment:
      DET_MASTER_HOST: determined-master
      DET_MASTER_PORT: 8080
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    depends_on:
      - determined-master
    networks:
      - determined

  postgres:
    image: postgres:16-alpine
    container_name: determined-postgres
    environment:
      POSTGRES_DB: determined
      POSTGRES_USER: determined
      POSTGRES_PASSWORD: determined
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - determined

volumes:
  postgres_data:

networks:
  determined:
    driver: bridge

.env Template

.env
# Configure via web UI
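
The compose file hardcodes the database password in two places, while the .env template is empty. One way to follow the advice about changing default credentials is to move the password into .env and reference it through Compose variable substitution. The sketch below assumes a variable named POSTGRES_PASSWORD, which is a choice made for this example and not something the recipe defines.

```yaml
# Sketch: read the database password from .env instead of hardcoding it.
# Assumes .env contains a line of the form: POSTGRES_PASSWORD=<your-secret>
services:
  determined-master:
    environment:
      # The :? form makes Compose fail with this message if the variable is unset or empty
      DET_DB_PASSWORD: ${POSTGRES_PASSWORD:?set POSTGRES_PASSWORD in .env}
  postgres:
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:?set POSTGRES_PASSWORD in .env}
```

Note that the postgres image applies POSTGRES_PASSWORD only when it initializes an empty data directory, so changing the value later does not change the password of an existing postgres_data volume.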

Usage Notes

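  1. Docs: https://docs.determined.ai/
  2. Web UI at http://localhost:8080 - default login: admin (no password). Recent releases may require an initial user password to be configured instead, so check the docs for the image version you pull.
  3. CLI: pip install determined, then det experiment create config.yaml . (the trailing dot is the directory containing your model code)
  4. Distributed training with automatic checkpointing and fault tolerance
  5. Hyperparameter search: grid, random, adaptive (ASHA, PBT)
  6. GPU cluster management with fair-share scheduling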
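
As a companion to the CLI step above, this is a minimal sketch of what a config.yaml for a single-trial experiment could look like, with the checkpointing and restart settings made explicit. The experiment name, the entrypoint model_def:MyTrial, the metric name, and the checkpoint path are assumptions for illustration. Field names follow the experiment configuration format of earlier Determined releases and may differ in the version you run.

```yaml
# config.yaml: hypothetical single-trial experiment (sketch only)
name: single-trial-example           # assumed experiment name
entrypoint: model_def:MyTrial        # assumed module:class in the directory passed to the CLI
hyperparameters:
  global_batch_size: 32
  learning_rate: 0.001
searcher:
  name: single                       # train one trial with fixed hyperparameters
  metric: validation_loss            # assumed metric reported by the trial
  smaller_is_better: true
  max_length:
    batches: 500                     # newer releases may express training length differently
max_restarts: 5                      # how many times a failed trial is restarted from its last checkpoint
checkpoint_storage:
  type: shared_fs
  host_path: /tmp/determined-checkpoints   # assumed path on the agent host
resources:
  slots_per_trial: 1
```

With shared_fs storage, checkpoints land on the agent host's filesystem, which is adequate for a single-machine test setup but not shared across hosts unless the path is a network mount.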

Individual Services (3 services)

Copy individual services to mix and match with your existing compose files.

determined-master
determined-master:
  image: determinedai/determined-master:latest
  container_name: determined-master
  restart: unless-stopped
  environment:
    DET_DB_HOST: postgres
    DET_DB_NAME: determined
    DET_DB_USER: determined
    DET_DB_PASSWORD: determined
  ports:
    - "8080:8080"
  depends_on:
    - postgres
  networks:
    - determined
determined-agent
determined-agent:
  image: determinedai/determined-agent:latest
  container_name: determined-agent
  restart: unless-stopped
  environment:
    DET_MASTER_HOST: determined-master
    DET_MASTER_PORT: 8080
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
  depends_on:
    - determined-master
  networks:
    - determined
postgres
postgres:
  image: postgres:16-alpine
  container_name: determined-postgres
  environment:
    POSTGRES_DB: determined
    POSTGRES_USER: determined
    POSTGRES_PASSWORD: determined
  volumes:
    - postgres_data:/var/lib/postgresql/data
  networks:
    - determined

Quick Start

terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  determined-master:
    image: determinedai/determined-master:latest
    container_name: determined-master
    restart: unless-stopped
    environment:
      DET_DB_HOST: postgres
      DET_DB_NAME: determined
      DET_DB_USER: determined
      DET_DB_PASSWORD: determined
    ports:
      - "8080:8080"
    depends_on:
      - postgres
    networks:
      - determined

  determined-agent:
    image: determinedai/determined-agent:latest
    container_name: determined-agent
    restart: unless-stopped
    environment:
      DET_MASTER_HOST: determined-master
      DET_MASTER_PORT: 8080
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    depends_on:
      - determined-master
    networks:
      - determined

  postgres:
    image: postgres:16-alpine
    container_name: determined-postgres
    environment:
      POSTGRES_DB: determined
      POSTGRES_USER: determined
      POSTGRES_PASSWORD: determined
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - determined

volumes:
  postgres_data:

networks:
  determined:
    driver: bridge
EOF

# 2. Create the .env file
cat > .env << 'EOF'
# Configure via web UI
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f

One-Liner

Run this command to download and set up the recipe in one step:

terminal
curl -fsSL https://docker.recipes/api/recipes/determined-ai/run | bash

Troubleshooting

  • Agent shows 'failed to connect to master' errors: Verify determined-master container is running and port 8080 is accessible from agent container
  • Experiments stuck in 'QUEUED' state indefinitely: Check agent logs for Docker socket permissions and ensure /var/run/docker.sock is properly mounted
  • PostgreSQL connection failures during master startup: Wait for postgres container to fully initialize before master starts, or add healthcheck dependencies
  • Web UI shows 'Internal Server Error' on experiment pages: Check postgres container disk space and connection limits in PostgreSQL configuration
  • Training experiments fail with 'image pull' errors: Ensure determined-agent has access to pull experiment Docker images from configured registries
  • Hyperparameter search creates too many concurrent trials: Adjust 'max_concurrent_trials' in experiment configuration to match available resources
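
The healthcheck suggestion for PostgreSQL startup failures can be sketched as follows. The fragment replaces the short depends_on list on determined-master with the long form, so the master waits until pg_isready succeeds. The interval, timeout, and retry values are arbitrary starting points, not tuned recommendations.

```yaml
# Sketch: make determined-master wait until PostgreSQL accepts connections.
services:
  postgres:
    healthcheck:
      # pg_isready ships with the postgres image; user and database match this recipe
      test: ["CMD-SHELL", "pg_isready -U determined -d determined"]
      interval: 5s
      timeout: 5s
      retries: 10
  determined-master:
    depends_on:
      postgres:
        condition: service_healthy   # start only after the healthcheck passes
```

Apply these keys inside the existing service definitions in docker-compose.yml, replacing the original depends_on entry for determined-master.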
