docker.recipes

Ray Cluster

advanced

Distributed computing framework for ML.

Overview

Ray is an open-source distributed computing framework developed by UC Berkeley's RISELab, designed to scale Python applications from single machines to large clusters. Ray provides a unified API for distributed computing that handles parallel processing, distributed training, hyperparameter tuning, and reinforcement learning workloads. The framework abstracts away the complexity of distributed systems while maintaining high performance through its actor-based model and efficient shared memory architecture.

This Ray cluster configuration establishes a distributed computing environment with a head node managing cluster coordination and worker nodes executing distributed tasks. The head node runs Ray's Global Control Store (GCS) and dashboard, while worker nodes connect automatically to form a cohesive computing cluster. Ray's architecture enables efficient task scheduling, automatic fault tolerance, and dynamic resource allocation across all nodes in the cluster.

Data scientists, ML engineers, and researchers working with computationally intensive Python workloads benefit most from Ray clusters. Organizations processing large datasets, training complex machine learning models, or running hyperparameter optimization experiments will find Ray's distributed capabilities essential. The framework particularly excels in environments requiring both high-throughput batch processing and low-latency model serving, making it valuable for production ML pipelines and research workflows.
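As a concrete illustration of the task model described above, the minimal sketch below turns an ordinary Python function into a remote task and fans it out across whatever cluster ray.init() connects to. It assumes Ray is installed locally (pip install ray) and is an example, not a production pattern.

python
import ray

# Connect to an existing Ray cluster if one is reachable (e.g. via RAY_ADDRESS),
# otherwise start a local single-node instance.
ray.init()

# Decorating a plain function makes it a remote task Ray can schedule on any node.
@ray.remote
def square(x):
    return x * x

# Launch eight tasks in parallel; futures are resolved with ray.get().
futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]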

Key Features

  • Distributed actor-based computing model with automatic task scheduling across cluster nodes (see the actor sketch after this list)
  • Ray Dashboard for real-time cluster monitoring, task visualization, and resource utilization tracking
  • Auto-scaling worker nodes with dynamic resource allocation and fault tolerance
  • Ray Serve integration for distributed model serving with automatic load balancing
  • Ray Tune support for distributed hyperparameter optimization and neural architecture search
  • Shared object store using Apache Arrow for zero-copy data sharing between processes
  • Ray Data for distributed data processing with support for ML preprocessing pipelines
  • Built-in support for reinforcement learning workloads through Ray RLlib integration
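The actor-based model named in the first feature above can be sketched in a few lines: decorating a class with ray.remote turns each instance into a stateful worker process that the cluster can place on any node. This is a minimal illustration assuming a reachable Ray runtime, not a tuned configuration.

python
import ray

ray.init()

# An actor is a stateful worker process; each method call executes on that process.
@ray.remote
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

counter = Counter.remote()  # create the actor somewhere in the cluster
results = ray.get([counter.increment.remote() for _ in range(5)])
print(results)  # [1, 2, 3, 4, 5]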

Common Use Cases

  • Distributed machine learning model training across multiple GPUs and nodes
  • Large-scale hyperparameter tuning for deep learning experiments with Ray Tune
  • High-throughput batch inference for computer vision and NLP models
  • Distributed data preprocessing and ETL pipelines for ML workflows
  • Reinforcement learning training environments with parallel episode collection
  • Real-time model serving with automatic scaling based on request volume
  • Monte Carlo simulations and financial risk modeling across distributed compute resources (see the sketch after this list)
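As an example of the last use case, a Monte Carlo estimate of pi parallelizes naturally with Ray tasks. The snippet below is a rough sketch with arbitrary sample counts, intended only to show how independent simulations fan out across workers.

python
import random
import ray

ray.init()

@ray.remote
def count_inside_circle(num_samples):
    # Count random points that land inside the unit quarter circle.
    inside = 0
    for _ in range(num_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside

num_tasks, samples_per_task = 16, 100_000  # example sizes, not a recommendation
counts = ray.get([count_inside_circle.remote(samples_per_task) for _ in range(num_tasks)])
print("pi is roughly", 4 * sum(counts) / (num_tasks * samples_per_task))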

Prerequisites

  • Minimum 8GB RAM per node (16GB+ recommended for ML workloads)
  • Docker Engine 20.10+ with Docker Compose v2 support
  • Network connectivity between nodes for cluster communication on port 6379
  • Python programming knowledge for Ray application development
  • Understanding of distributed computing concepts and parallel processing
  • For GPU workloads: NVIDIA Docker runtime and compatible GPU drivers (see the GPU task sketch after this list)
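If the GPU prerequisite above is met (NVIDIA runtime plus device reservations on the worker service), Ray schedules GPU work through resource requests on tasks and actors. The sketch below only demonstrates the num_gpus request; it assumes at least one worker actually advertises a GPU, otherwise the task will stay pending.

python
import os
import ray

ray.init()

# num_gpus=1 tells the scheduler to place this task only on a node with a free GPU.
@ray.remote(num_gpus=1)
def which_gpu():
    # Ray sets CUDA_VISIBLE_DEVICES so the task sees only its assigned GPU.
    return os.environ.get("CUDA_VISIBLE_DEVICES")

print(ray.get(which_gpu.remote()))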

Intended for development and testing. Review security settings, change default credentials, and test thoroughly before production use.

docker-compose.yml

docker-compose.yml
services:
  ray-head:
    image: rayproject/ray:latest
    container_name: ray-head
    command: ray start --head --dashboard-host=0.0.0.0 --block
    ports:
      - "8265:8265"
      - "6379:6379"
    networks:
      - ray

  ray-worker:
    image: rayproject/ray:latest
    command: ray start --address=ray-head:6379 --block
    deploy:
      replicas: 2
    depends_on:
      - ray-head
    networks:
      - ray

networks:
  ray:
    driver: bridge

.env Template

.env
# No additional config needed

Usage Notes

  1. Docs: https://docs.ray.io/
  2. Dashboard at http://localhost:8265 - cluster status and logs
  3. Connect from Python: ray.init('ray://localhost:10001') - note that the compose file above does not publish port 10001; add it to the ray-head ports if connecting from outside the Docker network (see the sketch after this list)
  4. Scale workers via deploy.replicas in the compose file
  5. Ray Serve for model serving, Ray Tune for hyperparameter tuning
  6. GPU support: add device reservations to the worker service
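Expanding on note 3, the sketch below connects from outside the cluster with Ray Client and prints what the head and workers registered. It assumes port 10001 has been added to the ray-head ports mapping (for example "10001:10001"), since the compose file above does not publish it.

python
import ray

# Ray Client: the head node's client server listens on port 10001 by default.
ray.init(address="ray://localhost:10001")

# Aggregate CPU/GPU/memory registered by the head node and both workers.
print(ray.cluster_resources())
print(len(ray.nodes()), "nodes have joined the cluster")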

Individual Services (2 services)

Copy individual services to mix and match with your existing compose files.

ray-head
ray-head:
  image: rayproject/ray:latest
  container_name: ray-head
  command: ray start --head --dashboard-host=0.0.0.0 --block
  ports:
    - "8265:8265"
    - "6379:6379"
  networks:
    - ray
ray-worker
ray-worker:
  image: rayproject/ray:latest
  command: ray start --address=ray-head:6379 --block
  deploy:
    replicas: 2
  depends_on:
    - ray-head
  networks:
    - ray

Quick Start

terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  ray-head:
    image: rayproject/ray:latest
    container_name: ray-head
    command: ray start --head --dashboard-host=0.0.0.0 --block
    ports:
      - "8265:8265"
      - "6379:6379"
    networks:
      - ray

  ray-worker:
    image: rayproject/ray:latest
    command: ray start --address=ray-head:6379 --block
    deploy:
      replicas: 2
    depends_on:
      - ray-head
    networks:
      - ray

networks:
  ray:
    driver: bridge
EOF

# 2. Create the .env file
cat > .env << 'EOF'
# No additional config needed
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f

One-Liner

Run this command to download and set up the recipe in one step:

terminal
curl -fsSL https://docker.recipes/api/recipes/ray/run | bash

Troubleshooting

  • ray-worker containers failing to start: Ensure ray-head container is fully initialized before workers attempt connection
  • Dashboard not accessible on port 8265: Check firewall settings and verify ray-head container has --dashboard-host=0.0.0.0 parameter
  • Workers not appearing in cluster: Verify network connectivity between containers and that the head node's GCS port 6379 is reachable (see the check after this list)
  • Out of memory errors during large computations: Increase Docker container memory limits or reduce Ray worker memory allocation
  • Task scheduling delays or failures: Monitor Ray Dashboard for resource bottlenecks and consider scaling worker replicas
  • Connection refused errors from Python clients: Ensure Ray client connects to correct head node address and GCS port is available
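For the "workers not appearing" case above, listing the nodes Ray knows about is a quick check. The sketch below can be run through Ray Client as shown earlier (assuming port 10001 is published on ray-head), or with a plain ray.init() from inside the ray-head container.

python
import ray

ray.init(address="ray://localhost:10001")  # assumes port 10001 is published on ray-head

# Each entry describes one node; a worker stuck at "dead" usually points to a networking issue.
for node in ray.nodes():
    print(node["NodeManagerAddress"], "alive" if node["Alive"] else "dead")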
