DVC + Iterative Studio + MinIO
Data version control with experiment tracking and S3-compatible storage.
Overview
DVC (Data Version Control) is an open-source version control system designed for machine learning projects. It treats datasets and models as first-class citizens alongside code. Originally developed by Dmitry Petrov in 2017, DVC extends Git to handle large files and complex ML pipelines, enabling reproducible experiments and collaborative data science workflows. Traditional version control systems struggle with large binary files; DVC instead uses content-addressable storage, keeping small metadata files in Git and the actual data in a remote storage backend.
This stack combines DVC with Iterative Studio for experiment visualization and tracking, MinIO as S3-compatible object storage for data artifacts, and Gitea (backed by PostgreSQL) for Git repository hosting. The compose file in this recipe defines the Gitea, PostgreSQL, MinIO, and bucket-initialization services; DVC runs as a client on your workstation, and Iterative Studio is not defined as a container here, so it has to be connected separately. Together, these components cover code versioning, data versioning, experiment tracking, and artifact storage in one workflow: data scientists version datasets alongside code, track experiment metrics and parameters, compare results in Iterative Studio's web interface, and store large model files and datasets in MinIO.
This combination is particularly useful for teams moving from ad-hoc ML workflows to structured MLOps practices without depending on a hosted cloud platform. Teams keep full control over their infrastructure while staying compatible with existing Git workflows and S3-based tools, which suits organizations with data sovereignty requirements or those seeking to avoid cloud vendor lock-in.
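To make the content-addressable idea concrete, here is a minimal sketch in plain shell. It is an illustration only, not DVC's implementation: `store_object` and the store layout are hypothetical, and it assumes GNU coreutils `md5sum` is available. A file is saved under a path derived from the hash of its bytes, so two files with identical content end up as one stored object.

```shell
# Store a file under a path derived from the MD5 of its content.
# Hypothetical helper: imitates the idea of a content-addressable cache,
# not DVC's exact on-disk format.
store_object() {
  src="$1"
  store="$2"
  digest=$(md5sum "$src" | cut -d' ' -f1)
  prefix=$(printf '%s' "$digest" | cut -c1-2)
  rest=$(printf '%s' "$digest" | cut -c3-)
  mkdir -p "$store/$prefix"
  # Identical content always maps to the same path, so a second copy is skipped.
  [ -e "$store/$prefix/$rest" ] || cp "$src" "$store/$prefix/$rest"
  printf '%s\n' "$digest"
}
```

Calling `store_object` on two copies of the same CSV prints the same digest twice and leaves a single file in the store, which is the property that lets a content-addressable cache avoid keeping duplicate data.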
Key Features
- Git-based pipeline definition with automatic dependency tracking and cache invalidation
- S3-compatible data storage with MinIO providing high-performance object storage for datasets and model artifacts
- Iterative Studio integration for experiment comparison, metric visualization, and model performance tracking
- DVC remote storage configuration using MinIO as backend with automatic bucket initialization
- Gitea-hosted repositories, with Gitea Actions available for automated DVC pipeline execution once a runner is registered (this compose file does not define one)
- Content-addressable storage deduplication reducing storage costs for similar datasets
- Pipeline reproducibility with locked dependencies and parameterized experiment configuration
- Multi-stage ML pipeline support with automatic artifact caching between pipeline stages
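As a rough sketch of the pipeline features listed above, the functions below define a two-stage pipeline with `dvc stage add` and run it with `dvc repro`. The script names (`prepare.py`, `train.py`), data paths, `model.pkl`, and `metrics.json` are placeholders assumed for illustration; the commands must be run inside a Git repository where `dvc init` has already been executed.

```shell
# Append two stages to dvc.yaml. All file names here are hypothetical.
define_pipeline() {
  # -d declares a dependency, -o a cached output.
  dvc stage add -n prepare \
    -d prepare.py -d data/raw.csv \
    -o data/prepared.csv \
    python prepare.py
  # -M declares a metrics file that stays in Git instead of the DVC cache.
  dvc stage add -n train \
    -d train.py -d data/prepared.csv \
    -o model.pkl -M metrics.json \
    python train.py
}

# Re-run only the stages whose dependencies changed since the last run.
run_pipeline() {
  dvc repro
}
```

After `define_pipeline`, commit `dvc.yaml` to Git; after `run_pipeline`, commit `dvc.lock` as well so the exact dependency and output hashes are recorded with the code.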
Common Use Cases
- Machine learning teams versioning datasets and models with full experiment reproducibility
- Data science organizations requiring on-premises MLOps infrastructure for regulatory compliance
- Research institutions managing large-scale dataset collections with collaborative access controls
- MLOps teams implementing continuous integration for machine learning model training pipelines
- Startups building ML products that need a cost-effective alternative to cloud-based MLOps platforms
- Enterprise data science departments with data sovereignty requirements and air-gapped environments
- Academic research groups sharing reproducible ML experiments and tracking dataset lineage
Prerequisites
- Minimum 4GB RAM (2GB+ for MinIO, 512MB+ for Gitea, 1GB+ for DVC operations and Postgres)
- Docker Engine 20.10+ and Docker Compose v2 for container orchestration
- Available ports 3000, 2222, 9000, and 9001 for Gitea, SSH, MinIO API, and MinIO console
- Basic understanding of Git workflows and machine learning experiment management
- Python environment with DVC client installed for local repository operations
- At least 10GB free disk space for initial data volumes and ML artifacts storage
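A quick way to check the tooling prerequisites is sketched below. The helper names are made up for this example, and the tool list (`docker`, `git`, `dvc`) is an assumption derived from the list above; it does not check RAM, disk space, or port availability.

```shell
# Print the name of every given command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Check the client-side tools this recipe assumes.
check_prerequisites() {
  missing=$(missing_tools docker git dvc)
  if [ -n "$missing" ]; then
    printf 'Missing tools:\n%s\n' "$missing"
    return 1
  fi
  # Compose v2 ships as a Docker CLI plugin, so test the subcommand too.
  if ! docker compose version >/dev/null 2>&1; then
    echo "Docker Compose v2 not found"
    return 1
  fi
  echo "All prerequisites found"
}
```

Run `check_prerequisites` before starting the stack; a non-zero exit status means something from the list is still missing.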
For development & testing. Review security settings, change default credentials, and test thoroughly before production use.
docker-compose.yml
services:
  gitea:
    image: gitea/gitea:latest
    environment:
      - USER_UID=1000
      - USER_GID=1000
      - GITEA__database__DB_TYPE=postgres
      - GITEA__database__HOST=postgres:5432
      - GITEA__database__NAME=gitea
      - GITEA__database__USER=${POSTGRES_USER}
      - GITEA__database__PASSWD=${POSTGRES_PASSWORD}
    volumes:
      - gitea-data:/data
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    ports:
      - "3000:3000"
      - "2222:22"
    depends_on:
      - postgres
    networks:
      - dvc-network
    restart: unless-stopped

  postgres:
    image: postgres:15
    environment:
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=gitea
    volumes:
      - postgres-data:/var/lib/postgresql/data
    networks:
      - dvc-network
    restart: unless-stopped

  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    environment:
      - MINIO_ROOT_USER=${MINIO_ACCESS_KEY}
      - MINIO_ROOT_PASSWORD=${MINIO_SECRET_KEY}
    volumes:
      - minio-data:/data
    ports:
      - "9000:9000"
      - "9001:9001"
    networks:
      - dvc-network
    restart: unless-stopped

  minio-init:
    image: minio/mc:latest
    entrypoint: >
      /bin/sh -c "
      sleep 5;
      mc alias set myminio http://minio:9000 ${MINIO_ACCESS_KEY} ${MINIO_SECRET_KEY};
      mc mb myminio/dvc-storage --ignore-existing;
      exit 0;
      "
    depends_on:
      - minio
    networks:
      - dvc-network

volumes:
  gitea-data:
  postgres-data:
  minio-data:

networks:
  dvc-network:
    driver: bridge
.env Template
.env
# DVC Studio
POSTGRES_USER=gitea
POSTGRES_PASSWORD=secure_postgres_password

# MinIO for DVC storage
MINIO_ACCESS_KEY=dvcaccesskey
MINIO_SECRET_KEY=secure_minio_secret

# DVC remote config:
# dvc remote add -d myremote s3://dvc-storage
# dvc remote modify myremote endpointurl http://localhost:9000
Usage Notes
- Gitea at http://localhost:3000
- MinIO console at http://localhost:9001
- Configure DVC with S3 remote
- Push data with dvc push
- Track experiments with Git
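The remote configuration and push steps can be sketched as the two functions below. The remote name `myremote`, the bucket `dvc-storage`, and the endpoint come from the .env comments in this recipe; the function names and the dataset path in the usage note are hypothetical. The sketch assumes you are inside a Git repository initialized with `dvc init`, and that `MINIO_ACCESS_KEY` and `MINIO_SECRET_KEY` are exported in your shell with the values from .env.

```shell
# Point DVC at the MinIO bucket created by the minio-init service.
configure_remote() {
  dvc remote add -d myremote s3://dvc-storage
  dvc remote modify myremote endpointurl http://localhost:9000
  # --local writes to .dvc/config.local, which DVC keeps out of Git,
  # so the credentials are not committed.
  dvc remote modify --local myremote access_key_id "$MINIO_ACCESS_KEY"
  dvc remote modify --local myremote secret_access_key "$MINIO_SECRET_KEY"
}

# Track one dataset with DVC, commit its pointer file, and upload the data.
push_dataset() {
  dataset="$1"
  dvc add "$dataset"
  git add "$dataset.dvc" "$(dirname "$dataset")/.gitignore"
  git commit -m "Track $dataset with DVC"
  dvc push
}
```

For example, `configure_remote` followed by `push_dataset data/train.csv` (a placeholder path) leaves `data/train.csv.dvc` in Git and the file contents in the `dvc-storage` bucket.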
Individual Services (4 services)
Copy individual services to mix and match with your existing compose files.
gitea
gitea:
  image: gitea/gitea:latest
  environment:
    - USER_UID=1000
    - USER_GID=1000
    - GITEA__database__DB_TYPE=postgres
    - GITEA__database__HOST=postgres:5432
    - GITEA__database__NAME=gitea
    - GITEA__database__USER=${POSTGRES_USER}
    - GITEA__database__PASSWD=${POSTGRES_PASSWORD}
  volumes:
    - gitea-data:/data
    - /etc/timezone:/etc/timezone:ro
    - /etc/localtime:/etc/localtime:ro
  ports:
    - "3000:3000"
    - "2222:22"
  depends_on:
    - postgres
  networks:
    - dvc-network
  restart: unless-stopped
postgres
postgres:
  image: postgres:15
  environment:
    - POSTGRES_USER=${POSTGRES_USER}
    - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    - POSTGRES_DB=gitea
  volumes:
    - postgres-data:/var/lib/postgresql/data
  networks:
    - dvc-network
  restart: unless-stopped
minio
minio:
  image: minio/minio:latest
  command: server /data --console-address ":9001"
  environment:
    - MINIO_ROOT_USER=${MINIO_ACCESS_KEY}
    - MINIO_ROOT_PASSWORD=${MINIO_SECRET_KEY}
  volumes:
    - minio-data:/data
  ports:
    - "9000:9000"
    - "9001:9001"
  networks:
    - dvc-network
  restart: unless-stopped
minio-init
minio-init:
  image: minio/mc:latest
  entrypoint: |
    /bin/sh -c " sleep 5; mc alias set myminio http://minio:9000 ${MINIO_ACCESS_KEY} ${MINIO_SECRET_KEY}; mc mb myminio/dvc-storage --ignore-existing; exit 0; "
  depends_on:
    - minio
  networks:
    - dvc-network
Quick Start
terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  gitea:
    image: gitea/gitea:latest
    environment:
      - USER_UID=1000
      - USER_GID=1000
      - GITEA__database__DB_TYPE=postgres
      - GITEA__database__HOST=postgres:5432
      - GITEA__database__NAME=gitea
      - GITEA__database__USER=${POSTGRES_USER}
      - GITEA__database__PASSWD=${POSTGRES_PASSWORD}
    volumes:
      - gitea-data:/data
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    ports:
      - "3000:3000"
      - "2222:22"
    depends_on:
      - postgres
    networks:
      - dvc-network
    restart: unless-stopped

  postgres:
    image: postgres:15
    environment:
      - POSTGRES_USER=${POSTGRES_USER}
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=gitea
    volumes:
      - postgres-data:/var/lib/postgresql/data
    networks:
      - dvc-network
    restart: unless-stopped

  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    environment:
      - MINIO_ROOT_USER=${MINIO_ACCESS_KEY}
      - MINIO_ROOT_PASSWORD=${MINIO_SECRET_KEY}
    volumes:
      - minio-data:/data
    ports:
      - "9000:9000"
      - "9001:9001"
    networks:
      - dvc-network
    restart: unless-stopped

  minio-init:
    image: minio/mc:latest
    entrypoint: >
      /bin/sh -c "
      sleep 5;
      mc alias set myminio http://minio:9000 ${MINIO_ACCESS_KEY} ${MINIO_SECRET_KEY};
      mc mb myminio/dvc-storage --ignore-existing;
      exit 0;
      "
    depends_on:
      - minio
    networks:
      - dvc-network

volumes:
  gitea-data:
  postgres-data:
  minio-data:

networks:
  dvc-network:
    driver: bridge
EOF

# 2. Create the .env file
cat > .env << 'EOF'
# DVC Studio
POSTGRES_USER=gitea
POSTGRES_PASSWORD=secure_postgres_password

# MinIO for DVC storage
MINIO_ACCESS_KEY=dvcaccesskey
MINIO_SECRET_KEY=secure_minio_secret

# DVC remote config:
# dvc remote add -d myremote s3://dvc-storage
# dvc remote modify myremote endpointurl http://localhost:9000
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f
One-Liner
Run this command to download and set up the recipe in one step:
terminal
curl -fsSL https://docker.recipes/api/recipes/dvc-studio-complete/run | bash
Troubleshooting
- DVC push fails with S3 credentials error: Verify MINIO_ACCESS_KEY and MINIO_SECRET_KEY environment variables match your DVC remote configuration
- Gitea repository clone fails over SSH: Check that port 2222 is accessible and SSH keys are properly configured in Gitea user settings
- MinIO bucket access denied during DVC operations: Ensure the minio-init container completed successfully and dvc-storage bucket was created
- Iterative Studio cannot connect to Git repository: Verify Gitea is accessible from Studio and webhook URLs are properly configured for experiment tracking
- DVC pipeline fails with dependency resolution errors: Check that all pipeline stages have correct dependencies defined and cached artifacts are not corrupted
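- PostgreSQL connection refused during Gitea startup: The short-form depends_on used here only waits for the postgres container to start, not for the database to accept connections. Give PostgreSQL time to finish initializing (the restart: unless-stopped policy restarts Gitea if it exits), or add a healthcheck to the postgres service and switch to the long-form depends_on with condition: service_healthy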
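Two of the checks above can be scripted. The sketch below assumes it is run from the directory containing docker-compose.yml; the function names are made up for this example, and `POSTGRES_USER` falls back to `gitea` (the value in the .env template) when it is not set in the shell.

```shell
# Poll PostgreSQL inside the postgres service until it accepts connections.
# Arguments: number of attempts (default 30) and delay in seconds (default 2).
wait_for_postgres() {
  attempts="${1:-30}"
  delay="${2:-2}"
  while [ "$attempts" -gt 0 ]; do
    if docker compose exec -T postgres pg_isready -U "${POSTGRES_USER:-gitea}" >/dev/null 2>&1; then
      echo "postgres is ready"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  echo "postgres did not become ready" >&2
  return 1
}

# Show whether the bucket-creation container ran and what it printed.
show_bucket_init() {
  docker compose ps -a minio-init
  docker compose logs minio-init
}
```

If `wait_for_postgres` succeeds but Gitea still reports a connection error, the credentials in .env are the next thing to check; if `show_bucket_init` shows an `mc` error, the 5-second sleep in the init container may have been too short for MinIO to start.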
Components
dvc, minio, gitea, iterative-studio
Tags
#dvc #data-versioning #experiments #ml-ops #git
Category
AI & Machine Learning