docker.recipes

Ray Distributed Computing

intermediate

Ray cluster for distributed ML and Python workloads.

Overview

Ray is an open-source distributed computing framework, originally developed at UC Berkeley's RISELab, that turns single-machine Python code into distributed applications without significant rewrites. The ray-head component serves as the cluster coordinator: it manages resource allocation and task scheduling, runs the global control store (GCS) that holds cluster-wide state, and hosts the web-based dashboard for monitoring cluster health and job execution. Ray workers connect to the head node to form a compute cluster that scales Python workloads across multiple machines, supporting everything from simple parallel processing to complex machine learning pipelines built on Ray's ecosystem libraries such as Ray Tune for hyperparameter optimization and Ray Serve for model deployment.

This containerized architecture lets teams build scalable distributed systems with familiar Python syntax: functions decorated with @ray.remote automatically become distributable across the cluster. The head node is the single point of coordination, maintaining global state and scheduling tasks onto available workers, while workers execute the actual computation and report results back. This setup reduces the operational complexity often associated with distributed frameworks such as Spark, making parallel processing accessible to Python developers without deep distributed-systems expertise.

Data scientists and ML engineers with computationally intensive workloads will find this stack particularly valuable: Ray excels at hyperparameter tuning, distributed training, and reinforcement learning scenarios that require dynamic task graphs. The containerized deployment provides reproducible distributed computing environments across development and production, while the shared-memory configuration optimizes performance for the data-heavy operations common in machine learning workflows.

Key Features

  • Built-in global control store (GCS) on the head node for distributed state management and coordination
  • Interactive Ray Dashboard with real-time cluster metrics, job monitoring, and resource utilization graphs
  • Automatic task scheduling with work-stealing algorithms for optimal resource utilization across workers
  • Native support for nested parallelism allowing @ray.remote functions to call other remote functions
  • Object store with automatic serialization handling for NumPy arrays, Pandas DataFrames, and custom Python objects
  • Dynamic actor model supporting stateful distributed classes that persist across multiple function calls
  • Built-in support for GPU scheduling and allocation when CUDA-enabled Ray images are used
  • Ray Client interface enabling remote cluster access from Jupyter notebooks and local Python scripts

Common Use Cases

  • Hyperparameter optimization for machine learning models using Ray Tune across multiple parameter combinations
  • Distributed training of deep learning models that don't fit on a single GPU or require data parallelism
  • Monte Carlo simulations and financial modeling requiring thousands of parallel scenario calculations
  • Reinforcement learning environments where multiple agents train simultaneously across distributed workers
  • Large-scale data preprocessing and ETL pipelines that benefit from parallel processing of data chunks
  • Computer vision workloads like batch image processing or video analysis across distributed compute resources
  • Scientific computing applications requiring parallel execution of CPU-intensive numerical computations

Prerequisites

  • Minimum 4GB RAM per container with additional 2GB shared memory allocation for efficient object sharing
  • Docker host with sufficient CPU cores to benefit from distributed processing (recommended 4+ cores)
  • Available ports 8265 for Ray Dashboard access and 10001 for Ray Client connections
  • Basic understanding of Python decorators and async programming concepts for effective Ray usage
  • Familiarity with distributed computing concepts like task scheduling and shared state management
  • Network connectivity between containers for head-node (GCS) communication on port 6379 and worker registration

For development & testing only. Review security settings, change any default credentials, and test thoroughly before production use.

docker-compose.yml

docker-compose.yml
services:
  ray-head:
    image: rayproject/ray:latest
    container_name: ray-head
    restart: unless-stopped
    ports:
      - "${RAY_DASHBOARD_PORT:-8265}:8265"
      - "${RAY_CLIENT_PORT:-10001}:10001"
    command: ray start --head --dashboard-host=0.0.0.0 --block
    shm_size: ${SHM_SIZE:-2gb}
    networks:
      - ray-network

  ray-worker-1:
    image: rayproject/ray:latest
    container_name: ray-worker-1
    restart: unless-stopped
    command: ray start --address=ray-head:6379 --block
    shm_size: ${SHM_SIZE:-2gb}
    depends_on:
      - ray-head
    networks:
      - ray-network

  ray-worker-2:
    image: rayproject/ray:latest
    container_name: ray-worker-2
    restart: unless-stopped
    command: ray start --address=ray-head:6379 --block
    shm_size: ${SHM_SIZE:-2gb}
    depends_on:
      - ray-head
    networks:
      - ray-network

networks:
  ray-network:
    driver: bridge

.env Template

.env
# Ray Cluster
RAY_DASHBOARD_PORT=8265
RAY_CLIENT_PORT=10001
SHM_SIZE=2gb

Usage Notes

  1. Docs: https://docs.ray.io/
  2. Ray Dashboard at http://localhost:8265 - jobs, actors, metrics
  3. Connect from Python: ray.init('ray://localhost:10001')
  4. Scale by adding more ray-worker-N services in the compose file
  5. Use the @ray.remote decorator for distributed functions
  6. Ray Tune, Ray Serve, and Ray Data for ML workflows

Individual Services (3 services)

Copy individual services to mix and match with your existing compose files.

ray-head
ray-head:
  image: rayproject/ray:latest
  container_name: ray-head
  restart: unless-stopped
  ports:
    - "${RAY_DASHBOARD_PORT:-8265}:8265"
    - "${RAY_CLIENT_PORT:-10001}:10001"
  command: ray start --head --dashboard-host=0.0.0.0 --block
  shm_size: ${SHM_SIZE:-2gb}
  networks:
    - ray-network
ray-worker-1
ray-worker-1:
  image: rayproject/ray:latest
  container_name: ray-worker-1
  restart: unless-stopped
  command: ray start --address=ray-head:6379 --block
  shm_size: ${SHM_SIZE:-2gb}
  depends_on:
    - ray-head
  networks:
    - ray-network
ray-worker-2
ray-worker-2:
  image: rayproject/ray:latest
  container_name: ray-worker-2
  restart: unless-stopped
  command: ray start --address=ray-head:6379 --block
  shm_size: ${SHM_SIZE:-2gb}
  depends_on:
    - ray-head
  networks:
    - ray-network

Quick Start

terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  ray-head:
    image: rayproject/ray:latest
    container_name: ray-head
    restart: unless-stopped
    ports:
      - "${RAY_DASHBOARD_PORT:-8265}:8265"
      - "${RAY_CLIENT_PORT:-10001}:10001"
    command: ray start --head --dashboard-host=0.0.0.0 --block
    shm_size: ${SHM_SIZE:-2gb}
    networks:
      - ray-network

  ray-worker-1:
    image: rayproject/ray:latest
    container_name: ray-worker-1
    restart: unless-stopped
    command: ray start --address=ray-head:6379 --block
    shm_size: ${SHM_SIZE:-2gb}
    depends_on:
      - ray-head
    networks:
      - ray-network

  ray-worker-2:
    image: rayproject/ray:latest
    container_name: ray-worker-2
    restart: unless-stopped
    command: ray start --address=ray-head:6379 --block
    shm_size: ${SHM_SIZE:-2gb}
    depends_on:
      - ray-head
    networks:
      - ray-network

networks:
  ray-network:
    driver: bridge
EOF

# 2. Create the .env file
cat > .env << 'EOF'
# Ray Cluster
RAY_DASHBOARD_PORT=8265
RAY_CLIENT_PORT=10001
SHM_SIZE=2gb
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f

One-Liner

Run this command to download and set up the recipe in one step:

terminal
curl -fsSL https://docker.recipes/api/recipes/ray-cluster/run | bash

Troubleshooting

  • Workers failing to connect with connection errors (e.g. 'redis.exceptions.ConnectionError' on older images) at startup: depends_on only orders container start, not readiness, so the head may still be initializing when workers first try to register; add a healthcheck to ray-head and gate workers with condition: service_healthy, or rely on restart: unless-stopped to retry until the head is ready
  • Out of shared memory errors during large object operations: Increase SHM_SIZE environment variable beyond default 2GB based on your data sizes
  • Ray Dashboard showing 'Cluster not found' or connection timeouts: Verify RAY_DASHBOARD_PORT is properly mapped and dashboard-host=0.0.0.0 is set in head node command
  • Tasks hanging indefinitely without completion: Check worker logs for memory exhaustion or increase worker memory limits to handle task requirements
  • Ray Client connection refused from external Python scripts: Confirm RAY_CLIENT_PORT mapping and use ray.init('ray://docker-host-ip:10001') with actual host IP
  • Object serialization failures with custom classes: Ensure all custom modules are available on worker nodes or use Ray's runtime_env feature to distribute dependencies



Components

ray-head, ray-worker

Tags

#ray #distributed #ml #python #cluster

Category

AI & Machine Learning