docker.recipes

Ray Distributed Computing

intermediate

Ray cluster for distributed ML and Python workloads.

Overview

Ray is an open-source distributed computing framework, originally developed at UC Berkeley's RISELab, that turns single-machine Python code into distributed applications without significant rewrites. The ray-head component serves as the cluster coordinator: it manages resource allocation and task scheduling, runs the global control store (GCS) that holds cluster-wide state, and hosts the web-based dashboard for monitoring cluster health and job execution. Ray workers connect to the head node to form a compute cluster that scales Python workloads across multiple machines, supporting everything from simple parallel processing to complex machine learning pipelines built on Ray's ecosystem libraries such as Ray Tune for hyperparameter optimization and Ray Serve for model deployment.

This containerized architecture lets teams build scalable distributed systems with familiar Python syntax: functions decorated with @ray.remote automatically become distributable across the cluster. The head node is the single point of coordination, maintaining global state and scheduling tasks onto available workers, while workers execute the actual computation and report results back. This setup reduces the operational complexity often associated with distributed frameworks such as Spark, making parallel processing accessible to Python developers without deep distributed-systems expertise.

Data scientists and ML engineers with computationally intensive workloads will find this stack particularly valuable: Ray excels at hyperparameter tuning, distributed training, and reinforcement learning scenarios that require dynamic task graphs. The containerized deployment provides reproducible distributed computing environments across development and production, while the shared-memory configuration optimizes performance for the data-heavy operations common in machine learning workflows.

Key Features

  • Built-in global control store (GCS) on the head node for distributed state management and coordination
  • Interactive Ray Dashboard with real-time cluster metrics, job monitoring, and resource utilization graphs
  • Automatic task scheduling with work-stealing algorithms for optimal resource utilization across workers
  • Native support for nested parallelism allowing @ray.remote functions to call other remote functions
  • Object store with automatic serialization handling for NumPy arrays, Pandas DataFrames, and custom Python objects
  • Dynamic actor model supporting stateful distributed classes that persist across multiple function calls
  • Built-in support for GPU scheduling and allocation when CUDA-enabled Ray images are used
  • Ray Client interface enabling remote cluster access from Jupyter notebooks and local Python scripts

Common Use Cases

  • Hyperparameter optimization for machine learning models using Ray Tune across multiple parameter combinations
  • Distributed training of deep learning models that don't fit on a single GPU or require data parallelism
  • Monte Carlo simulations and financial modeling requiring thousands of parallel scenario calculations
  • Reinforcement learning environments where multiple agents train simultaneously across distributed workers
  • Large-scale data preprocessing and ETL pipelines that benefit from parallel processing of data chunks
  • Computer vision workloads like batch image processing or video analysis across distributed compute resources
  • Scientific computing applications requiring parallel execution of CPU-intensive numerical computations

Prerequisites

  • Minimum 4GB RAM per container with additional 2GB shared memory allocation for efficient object sharing
  • Docker host with sufficient CPU cores to benefit from distributed processing (recommended 4+ cores)
  • Available ports 8265 for Ray Dashboard access and 10001 for Ray Client connections
  • Basic understanding of Python decorators and async programming concepts for effective Ray usage
  • Familiarity with distributed computing concepts like task scheduling and shared state management
  • Network connectivity between containers for head-node (GCS) communication on port 6379 and worker registration

For development & testing only. Review security settings, change any default credentials, and test thoroughly before production use.

docker-compose.yml

docker-compose.yml
services:
  ray-head:
    image: rayproject/ray:latest
    container_name: ray-head
    restart: unless-stopped
    ports:
      - "${RAY_DASHBOARD_PORT:-8265}:8265"
      - "${RAY_CLIENT_PORT:-10001}:10001"
    command: ray start --head --dashboard-host=0.0.0.0 --block
    shm_size: ${SHM_SIZE:-2gb}
    networks:
      - ray-network

  ray-worker-1:
    image: rayproject/ray:latest
    container_name: ray-worker-1
    restart: unless-stopped
    command: ray start --address=ray-head:6379 --block
    shm_size: ${SHM_SIZE:-2gb}
    depends_on:
      - ray-head
    networks:
      - ray-network

  ray-worker-2:
    image: rayproject/ray:latest
    container_name: ray-worker-2
    restart: unless-stopped
    command: ray start --address=ray-head:6379 --block
    shm_size: ${SHM_SIZE:-2gb}
    depends_on:
      - ray-head
    networks:
      - ray-network

networks:
  ray-network:
    driver: bridge

.env Template

.env
# Ray Cluster
RAY_DASHBOARD_PORT=8265
RAY_CLIENT_PORT=10001
SHM_SIZE=2gb

Usage Notes

  1. Docs: https://docs.ray.io/
  2. Ray Dashboard at http://localhost:8265 - jobs, actors, metrics
  3. Connect from Python: ray.init('ray://localhost:10001')
  4. Scale by adding more ray-worker-N services in the compose file
  5. Use the @ray.remote decorator for distributed functions
  6. Ray Tune, Ray Serve, and Ray Data for ML workflows

Individual Services (3 services)

Copy individual services to mix and match with your existing compose files.

ray-head
ray-head:
  image: rayproject/ray:latest
  container_name: ray-head
  restart: unless-stopped
  ports:
    - "${RAY_DASHBOARD_PORT:-8265}:8265"
    - "${RAY_CLIENT_PORT:-10001}:10001"
  command: ray start --head --dashboard-host=0.0.0.0 --block
  shm_size: ${SHM_SIZE:-2gb}
  networks:
    - ray-network
ray-worker-1
ray-worker-1:
  image: rayproject/ray:latest
  container_name: ray-worker-1
  restart: unless-stopped
  command: ray start --address=ray-head:6379 --block
  shm_size: ${SHM_SIZE:-2gb}
  depends_on:
    - ray-head
  networks:
    - ray-network
ray-worker-2
ray-worker-2:
  image: rayproject/ray:latest
  container_name: ray-worker-2
  restart: unless-stopped
  command: ray start --address=ray-head:6379 --block
  shm_size: ${SHM_SIZE:-2gb}
  depends_on:
    - ray-head
  networks:
    - ray-network

Quick Start

terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  ray-head:
    image: rayproject/ray:latest
    container_name: ray-head
    restart: unless-stopped
    ports:
      - "${RAY_DASHBOARD_PORT:-8265}:8265"
      - "${RAY_CLIENT_PORT:-10001}:10001"
    command: ray start --head --dashboard-host=0.0.0.0 --block
    shm_size: ${SHM_SIZE:-2gb}
    networks:
      - ray-network

  ray-worker-1:
    image: rayproject/ray:latest
    container_name: ray-worker-1
    restart: unless-stopped
    command: ray start --address=ray-head:6379 --block
    shm_size: ${SHM_SIZE:-2gb}
    depends_on:
      - ray-head
    networks:
      - ray-network

  ray-worker-2:
    image: rayproject/ray:latest
    container_name: ray-worker-2
    restart: unless-stopped
    command: ray start --address=ray-head:6379 --block
    shm_size: ${SHM_SIZE:-2gb}
    depends_on:
      - ray-head
    networks:
      - ray-network

networks:
  ray-network:
    driver: bridge
EOF

# 2. Create the .env file
cat > .env << 'EOF'
# Ray Cluster
RAY_DASHBOARD_PORT=8265
RAY_CLIENT_PORT=10001
SHM_SIZE=2gb
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f

One-Liner

Run this command to download and set up the recipe in one step:

terminal
curl -fsSL https://docker.recipes/api/recipes/ray-cluster/run | bash

Troubleshooting

  • Workers failing to connect with connection errors (e.g. 'redis.exceptions.ConnectionError' on older images) at startup: depends_on only orders container start, not readiness, so the head may still be initializing when workers first try to register; add a healthcheck to ray-head and gate workers with condition: service_healthy, or rely on restart: unless-stopped to retry until the head is ready
  • Out of shared memory errors during large object operations: Increase SHM_SIZE environment variable beyond default 2GB based on your data sizes
  • Ray Dashboard showing 'Cluster not found' or connection timeouts: Verify RAY_DASHBOARD_PORT is properly mapped and dashboard-host=0.0.0.0 is set in head node command
  • Tasks hanging indefinitely without completion: Check worker logs for memory exhaustion or increase worker memory limits to handle task requirements
  • Ray Client connection refused from external Python scripts: Confirm RAY_CLIENT_PORT mapping and use ray.init('ray://docker-host-ip:10001') with actual host IP
  • Object serialization failures with custom classes: Ensure all custom modules are available on worker nodes or use Ray's runtime_env feature to distribute dependencies



Components

ray-head, ray-worker

Tags

#ray #distributed #ml #python #cluster

Category

AI & Machine Learning