docker.recipes

DVC + Iterative Studio + MinIO

advanced

Data version control with experiment tracking and S3-compatible storage.

Overview

DVC (Data Version Control) is an open-source version control system designed for machine learning projects. It treats datasets and models as first-class citizens alongside code. Originally developed by Dmitry Petrov in 2017, DVC extends Git to handle large files and ML pipelines, enabling reproducible experiments and collaborative data science workflows. Traditional version control systems struggle with large binary files; DVC instead uses content-addressable storage, tracking small metadata files in Git while the actual data lives in a remote storage backend.

This stack combines DVC with Iterative Studio for experiment visualization and tracking, MinIO as S3-compatible object storage for data artifacts, and Gitea for Git repository hosting. Together they cover code versioning, data versioning, experiment tracking, and artifact storage in one workflow: data scientists version datasets alongside code, track experiment metrics and parameters, compare results in Iterative Studio's web interface, and store large model files and datasets in MinIO. The compose file below runs Gitea, PostgreSQL (Gitea's database), MinIO, and a one-shot bucket initializer. The DVC client is installed locally, and no Iterative Studio container is defined.

The combination suits teams moving from ad-hoc ML workflows to structured MLOps practices who want to avoid the complexity and cost of cloud-based platforms. Teams keep full control over their infrastructure while staying compatible with existing Git workflows and S3-based tools, which makes the stack a good fit for organizations with data sovereignty requirements or those seeking to avoid cloud vendor lock-in.
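To make the content-addressable idea concrete, here is a minimal sketch that derives a storage path from the MD5 digest of some data. The helper name content_path is hypothetical, and the two-character prefix layout only mirrors the general shape of a DVC cache; the exact layout differs between DVC versions. It assumes the md5sum utility from GNU coreutils is available.

```shell
#!/bin/sh
# Hypothetical helper: read data on stdin and print a content-addressed path.
# Identical content always maps to the same path, so duplicates are stored once.
content_path() {
    digest=$(md5sum | cut -d' ' -f1)                  # hex MD5 of the input
    prefix=$(printf '%s' "$digest" | cut -c1-2)       # first two hex characters
    rest=$(printf '%s' "$digest" | cut -c3-)          # remaining characters
    printf '%s/%s\n' "$prefix" "$rest"
}

a=$(printf 'hello' | content_path)
b=$(printf 'hello' | content_path)    # same bytes, same path
c=$(printf 'hello!' | content_path)   # one extra byte, different path
echo "$a"                             # prints 5d/41402abc4b2a76b9719d911017c592
```

Because the path depends only on the bytes, renaming or copying a dataset does not create a second stored object, which is where the deduplication benefit comes from.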

Key Features

  • Git-based pipeline definition with automatic dependency tracking and cache invalidation
  • S3-compatible data storage with MinIO providing high-performance object storage for datasets and model artifacts
  • Iterative Studio integration for experiment comparison, metric visualization, and model performance tracking
  • DVC remote storage configuration using MinIO as backend with automatic bucket initialization
  • Gitea-hosted repositories with built-in CI/CD actions for automated DVC pipeline execution
  • Content-addressable storage deduplication reducing storage costs for similar datasets
  • Pipeline reproducibility with locked dependencies and parameterized experiment configuration
  • Multi-stage ML pipeline support with automatic artifact caching between pipeline stages
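As an illustration of the pipeline features listed above, the sketch below registers a single training stage with dvc stage add. The file names (train.py, data/train.csv, model.pkl), the parameter train.epochs, and the function name define_train_stage are all hypothetical; it assumes the DVC client is installed and the current directory is an initialized DVC repository.

```shell
#!/bin/sh
# Hypothetical example: declare one pipeline stage in dvc.yaml.
# -d marks dependencies, -p a parameter read from params.yaml, -o an output.
define_train_stage() {
    dvc stage add -n train \
        -d train.py -d data/train.csv \
        -p train.epochs \
        -o model.pkl \
        python train.py
}

# Usage (inside a DVC repository):
#   define_train_stage
#   dvc repro    # re-runs only stages whose dependencies or parameters changed
```

The stage definition lands in dvc.yaml and the resolved hashes in dvc.lock, both of which are committed to Git so that others can reproduce the same run.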

Common Use Cases

  • Machine learning teams versioning datasets and models with full experiment reproducibility
  • Data science organizations requiring on-premises MLOps infrastructure for regulatory compliance
  • Research institutions managing large-scale dataset collections with collaborative access controls
  • MLOps teams implementing continuous integration for machine learning model training pipelines
  • Startups building ML products needing a cost-effective alternative to cloud-based MLOps platforms
  • Enterprise data science departments with data sovereignty requirements and air-gapped environments
  • Academic research groups sharing reproducible ML experiments and dataset lineage tracking

Prerequisites

  • Minimum 4GB RAM (2GB+ for MinIO, 512MB+ for Gitea, 1GB+ for DVC operations and Postgres)
  • Docker Engine 20.10+ and Docker Compose v2 for container orchestration
  • Available ports 3000, 2222, 9000, and 9001 for Gitea, SSH, MinIO API, and MinIO console
  • Basic understanding of Git workflows and machine learning experiment management
  • Python environment with DVC client installed for local repository operations
  • At least 10GB free disk space for initial data volumes and ML artifacts storage

For development & testing. Review security settings, change default credentials, and test thoroughly before production use.

docker-compose.yml

docker-compose.yml
services:
  gitea:
    image: gitea/gitea:latest
    environment:
      - USER_UID=1000
      - USER_GID=1000
      - GITEA__database__DB_TYPE=postgres
      - GITEA__database__HOST=postgres:5432
      - GITEA__database__NAME=gitea
      - GITEA__database__USER=${POSTGRES_USER}
      - GITEA__database__PASSWD=${POSTGRES_PASSWORD}
    volumes:
      - gitea-data:/data
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    ports:
      - "3000:3000"   # web UI
      - "2222:22"     # Git over SSH
    depends_on:
      - postgres
    networks:
      - dvc-network
    restart: unless-stopped

  postgres:
    image: postgres:15
    environment:
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=gitea
    volumes:
      - postgres-data:/var/lib/postgresql/data
    networks:
      - dvc-network
    restart: unless-stopped

  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    environment:
      - MINIO_ROOT_USER=${MINIO_ACCESS_KEY}
      - MINIO_ROOT_PASSWORD=${MINIO_SECRET_KEY}
    volumes:
      - minio-data:/data
    ports:
      - "9000:9000"   # S3 API
      - "9001:9001"   # web console
    networks:
      - dvc-network
    restart: unless-stopped

  # One-shot job that creates the dvc-storage bucket, then exits
  minio-init:
    image: minio/mc:latest
    entrypoint: >
      /bin/sh -c "
      sleep 5;
      mc alias set myminio http://minio:9000 ${MINIO_ACCESS_KEY} ${MINIO_SECRET_KEY};
      mc mb myminio/dvc-storage --ignore-existing;
      exit 0;
      "
    depends_on:
      - minio
    networks:
      - dvc-network

volumes:
  gitea-data:
  postgres-data:
  minio-data:

networks:
  dvc-network:
    driver: bridge

.env Template

.env
# PostgreSQL credentials for Gitea
POSTGRES_USER=gitea
POSTGRES_PASSWORD=secure_postgres_password

# MinIO for DVC storage
MINIO_ACCESS_KEY=dvcaccesskey
MINIO_SECRET_KEY=secure_minio_secret

# DVC remote config:
# dvc remote add -d myremote s3://dvc-storage
# dvc remote modify myremote endpointurl http://localhost:9000
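The commented commands in the template set the bucket and endpoint but not the credentials DVC needs to authenticate against MinIO. A minimal sketch of the full remote setup might look like the following. The function name configure_minio_remote is hypothetical; it assumes the DVC client with S3 support is installed (for example via pip install "dvc[s3]") and that it is run inside a Git repository where dvc init has already been executed.

```shell
#!/bin/sh
# Hypothetical helper: point a DVC remote at the MinIO service from this recipe.
configure_minio_remote() {
    remote_name="$1"    # e.g. myremote
    access_key="$2"     # value of MINIO_ACCESS_KEY from .env
    secret_key="$3"     # value of MINIO_SECRET_KEY from .env

    dvc remote add -d "$remote_name" s3://dvc-storage
    dvc remote modify "$remote_name" endpointurl http://localhost:9000
    # --local writes to .dvc/config.local, which is not committed to Git,
    # so the credentials stay out of the repository history.
    dvc remote modify --local "$remote_name" access_key_id "$access_key"
    dvc remote modify --local "$remote_name" secret_access_key "$secret_key"
}

# Usage with the template defaults:
#   configure_minio_remote myremote dvcaccesskey secure_minio_secret
```

Keeping the endpoint in the shared config and the keys in the local config lets teammates reuse the remote definition while supplying their own credentials.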

Usage Notes

  1. Gitea at http://localhost:3000
  2. MinIO console at http://localhost:9001
  3. Configure DVC with the S3 remote
  4. Push data with dvc push
  5. Track experiments with Git
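The last three steps can be sketched as a small helper that puts one data file under DVC control, commits the resulting pointer file to Git, and uploads the data to the default remote. The function name track_and_push and the example path are hypothetical; the sketch assumes the current directory is a Git repository with dvc init already run and a default remote configured.

```shell
#!/bin/sh
# Hypothetical helper: version one data file and upload it to the DVC remote.
track_and_push() {
    data_file="$1"    # e.g. data/train.csv

    # Moves the file into the DVC cache, writes <file>.dvc, and adds the
    # original path to a .gitignore in the same directory.
    dvc add "$data_file" || return 1

    # Only the small pointer file and the .gitignore go into Git.
    git add "$data_file.dvc" "$(dirname "$data_file")/.gitignore" || return 1
    git commit -m "Track $data_file with DVC" || return 1

    # Uploads the cached data to the default remote (the MinIO bucket here).
    dvc push || return 1
}

# Usage:
#   track_and_push data/train.csv
```

A teammate who clones the Git repository can then run dvc pull to fetch the matching data from MinIO.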

Individual Services (4 services)

Copy individual services to mix and match with your existing compose files.

gitea
gitea:
  image: gitea/gitea:latest
  environment:
    - USER_UID=1000
    - USER_GID=1000
    - GITEA__database__DB_TYPE=postgres
    - GITEA__database__HOST=postgres:5432
    - GITEA__database__NAME=gitea
    - GITEA__database__USER=${POSTGRES_USER}
    - GITEA__database__PASSWD=${POSTGRES_PASSWORD}
  volumes:
    - gitea-data:/data
    - /etc/timezone:/etc/timezone:ro
    - /etc/localtime:/etc/localtime:ro
  ports:
    - "3000:3000"
    - "2222:22"
  depends_on:
    - postgres
  networks:
    - dvc-network
  restart: unless-stopped
postgres
postgres:
  image: postgres:15
  environment:
    - POSTGRES_USER=${POSTGRES_USER}
    - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    - POSTGRES_DB=gitea
  volumes:
    - postgres-data:/var/lib/postgresql/data
  networks:
    - dvc-network
  restart: unless-stopped
minio
minio:
  image: minio/minio:latest
  command: server /data --console-address ":9001"
  environment:
    - MINIO_ROOT_USER=${MINIO_ACCESS_KEY}
    - MINIO_ROOT_PASSWORD=${MINIO_SECRET_KEY}
  volumes:
    - minio-data:/data
  ports:
    - "9000:9000"
    - "9001:9001"
  networks:
    - dvc-network
  restart: unless-stopped
minio-init
minio-init:
  image: minio/mc:latest
  entrypoint: >
    /bin/sh -c "
    sleep 5;
    mc alias set myminio http://minio:9000 ${MINIO_ACCESS_KEY} ${MINIO_SECRET_KEY};
    mc mb myminio/dvc-storage --ignore-existing;
    exit 0;
    "
  depends_on:
    - minio
  networks:
    - dvc-network

Quick Start

terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  gitea:
    image: gitea/gitea:latest
    environment:
      - USER_UID=1000
      - USER_GID=1000
      - GITEA__database__DB_TYPE=postgres
      - GITEA__database__HOST=postgres:5432
      - GITEA__database__NAME=gitea
      - GITEA__database__USER=${POSTGRES_USER}
      - GITEA__database__PASSWD=${POSTGRES_PASSWORD}
    volumes:
      - gitea-data:/data
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    ports:
      - "3000:3000"
      - "2222:22"
    depends_on:
      - postgres
    networks:
      - dvc-network
    restart: unless-stopped

  postgres:
    image: postgres:15
    environment:
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=gitea
    volumes:
      - postgres-data:/var/lib/postgresql/data
    networks:
      - dvc-network
    restart: unless-stopped

  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    environment:
      - MINIO_ROOT_USER=${MINIO_ACCESS_KEY}
      - MINIO_ROOT_PASSWORD=${MINIO_SECRET_KEY}
    volumes:
      - minio-data:/data
    ports:
      - "9000:9000"
      - "9001:9001"
    networks:
      - dvc-network
    restart: unless-stopped

  minio-init:
    image: minio/mc:latest
    entrypoint: >
      /bin/sh -c "
      sleep 5;
      mc alias set myminio http://minio:9000 ${MINIO_ACCESS_KEY} ${MINIO_SECRET_KEY};
      mc mb myminio/dvc-storage --ignore-existing;
      exit 0;
      "
    depends_on:
      - minio
    networks:
      - dvc-network

volumes:
  gitea-data:
  postgres-data:
  minio-data:

networks:
  dvc-network:
    driver: bridge
EOF

# 2. Create the .env file
cat > .env << 'EOF'
# PostgreSQL credentials for Gitea
POSTGRES_USER=gitea
POSTGRES_PASSWORD=secure_postgres_password

# MinIO for DVC storage
MINIO_ACCESS_KEY=dvcaccesskey
MINIO_SECRET_KEY=secure_minio_secret

# DVC remote config:
# dvc remote add -d myremote s3://dvc-storage
# dvc remote modify myremote endpointurl http://localhost:9000
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f

One-Liner

Run this command to download and set up the recipe in one step:

terminal
curl -fsSL https://docker.recipes/api/recipes/dvc-studio-complete/run | bash

Troubleshooting

  • DVC push fails with S3 credentials error: Verify MINIO_ACCESS_KEY and MINIO_SECRET_KEY environment variables match your DVC remote configuration
  • Gitea repository clone fails over SSH: Check that port 2222 is accessible and SSH keys are properly configured in Gitea user settings
  • MinIO bucket access denied during DVC operations: Ensure the minio-init container completed successfully and dvc-storage bucket was created
  • Iterative Studio cannot connect to Git repository: Verify Gitea is accessible from Studio and webhook URLs are properly configured for experiment tracking
  • DVC pipeline fails with dependency resolution errors: Check that all pipeline stages have correct dependencies defined and cached artifacts are not corrupted
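  • PostgreSQL connection refused during Gitea startup: The short-form depends_on used here only orders container startup and does not wait for PostgreSQL to accept connections. Give the postgres container time to finish initializing, or add a healthcheck to the postgres service (for example with pg_isready) and switch Gitea's depends_on to the long form with condition: service_healthy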
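Several of the issues above come down to a service not being ready yet. A small retry helper can make that visible before running DVC commands. This is a sketch, not part of the recipe: the function name wait_for is hypothetical, and the usage lines assume curl is installed and that MinIO exposes its liveness endpoint at /minio/health/live on the published API port.

```shell
#!/bin/sh
# Hypothetical helper: retry a command once per second until it succeeds
# or the given number of attempts is exhausted.
wait_for() {
    attempts="$1"
    shift
    i=0
    while [ "$i" -lt "$attempts" ]; do
        "$@" && return 0      # command succeeded, stop retrying
        i=$((i + 1))
        sleep 1
    done
    return 1                  # never succeeded within the attempt budget
}

# Usage:
#   wait_for 30 curl -fsS http://localhost:9000/minio/health/live
#   docker compose ps                  # check container state
#   docker compose logs minio-init     # confirm the bucket was created
```

If the MinIO check passes but dvc push still reports access denied, the minio-init logs are the next place to look, since the fixed five-second sleep in that container may have been too short on a slow host.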
