docker.recipes

Ray Cluster

advanced

Distributed computing framework for ML.

Overview

Ray is an open-source distributed computing framework developed by UC Berkeley's RISELab, designed to scale Python applications from single machines to large clusters. Ray provides a unified API for distributed computing that handles parallel processing, distributed training, hyperparameter tuning, and reinforcement learning workloads. The framework abstracts away the complexity of distributed systems while maintaining high performance through its actor-based model and efficient shared memory architecture.

This Ray cluster configuration establishes a distributed computing environment with a head node managing cluster coordination and worker nodes executing distributed tasks. The head node runs Ray's Global Control Store (GCS) and dashboard, while worker nodes connect automatically to form a cohesive computing cluster. Ray's architecture enables efficient task scheduling, automatic fault tolerance, and dynamic resource allocation across all nodes in the cluster.

Data scientists, ML engineers, and researchers working with computationally intensive Python workloads benefit most from Ray clusters. Organizations processing large datasets, training complex machine learning models, or running hyperparameter optimization experiments will find Ray's distributed capabilities essential. The framework particularly excels in environments requiring both high-throughput batch processing and low-latency model serving, making it valuable for production ML pipelines and research workflows.
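As a concrete illustration of the task model described above, the minimal sketch below turns an ordinary Python function into a remote task and fans it out across whatever cluster ray.init() connects to. It assumes Ray is installed locally (pip install ray) and is an example, not a production pattern.

python
import ray

# Connect to an existing Ray cluster if one is reachable (e.g. via RAY_ADDRESS),
# otherwise start a local single-node instance.
ray.init()

# Decorating a plain function makes it a remote task Ray can schedule on any node.
@ray.remote
def square(x):
    return x * x

# Launch eight tasks in parallel; futures are resolved with ray.get().
futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]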

Key Features

  • Distributed actor-based computing model with automatic task scheduling across cluster nodes (see the actor sketch after this list)
  • Ray Dashboard for real-time cluster monitoring, task visualization, and resource utilization tracking
  • Auto-scaling worker nodes with dynamic resource allocation and fault tolerance
  • Ray Serve integration for distributed model serving with automatic load balancing
  • Ray Tune support for distributed hyperparameter optimization and neural architecture search
  • Shared object store using Apache Arrow for zero-copy data sharing between processes
  • Ray Data for distributed data processing with support for ML preprocessing pipelines
  • Built-in support for reinforcement learning workloads through Ray RLlib integration
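The actor-based model named in the first feature above can be sketched in a few lines: decorating a class with ray.remote turns each instance into a stateful worker process that the cluster can place on any node. This is a minimal illustration assuming a reachable Ray runtime, not a tuned configuration.

python
import ray

ray.init()

# An actor is a stateful worker process; each method call executes on that process.
@ray.remote
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

counter = Counter.remote()  # create the actor somewhere in the cluster
results = ray.get([counter.increment.remote() for _ in range(5)])
print(results)  # [1, 2, 3, 4, 5]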

Common Use Cases

  • Distributed machine learning model training across multiple GPUs and nodes
  • Large-scale hyperparameter tuning for deep learning experiments with Ray Tune
  • High-throughput batch inference for computer vision and NLP models
  • Distributed data preprocessing and ETL pipelines for ML workflows
  • Reinforcement learning training environments with parallel episode collection
  • Real-time model serving with automatic scaling based on request volume
  • Monte Carlo simulations and financial risk modeling across distributed compute resources (see the sketch after this list)
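As an example of the last use case, a Monte Carlo estimate of pi parallelizes naturally with Ray tasks. The snippet below is a rough sketch with arbitrary sample counts, intended only to show how independent simulations fan out across workers.

python
import random
import ray

ray.init()

@ray.remote
def count_inside_circle(num_samples):
    # Count random points that land inside the unit quarter circle.
    inside = 0
    for _ in range(num_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside

num_tasks, samples_per_task = 16, 100_000  # example sizes, not a recommendation
counts = ray.get([count_inside_circle.remote(samples_per_task) for _ in range(num_tasks)])
print("pi is roughly", 4 * sum(counts) / (num_tasks * samples_per_task))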

Prerequisites

  • Minimum 8GB RAM per node (16GB+ recommended for ML workloads)
  • Docker Engine 20.10+ with Docker Compose v2 support
  • Network connectivity between nodes for cluster communication on port 6379
  • Python programming knowledge for Ray application development
  • Understanding of distributed computing concepts and parallel processing
  • For GPU workloads: NVIDIA Docker runtime and compatible GPU drivers (see the GPU task sketch after this list)
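If the GPU prerequisite above is met (NVIDIA runtime plus device reservations on the worker service), Ray schedules GPU work through resource requests on tasks and actors. The sketch below only demonstrates the num_gpus request; it assumes at least one worker actually advertises a GPU, otherwise the task will stay pending.

python
import os
import ray

ray.init()

# num_gpus=1 tells the scheduler to place this task only on a node with a free GPU.
@ray.remote(num_gpus=1)
def which_gpu():
    # Ray sets CUDA_VISIBLE_DEVICES so the task sees only its assigned GPU.
    return os.environ.get("CUDA_VISIBLE_DEVICES")

print(ray.get(which_gpu.remote()))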

Intended for development and testing. Review security settings, change default credentials, and test thoroughly before production use.

docker-compose.yml

docker-compose.yml
services:
  ray-head:
    image: rayproject/ray:latest
    container_name: ray-head
    command: ray start --head --dashboard-host=0.0.0.0 --block
    ports:
      - "8265:8265"
      - "6379:6379"
    networks:
      - ray

  ray-worker:
    image: rayproject/ray:latest
    command: ray start --address=ray-head:6379 --block
    deploy:
      replicas: 2
    depends_on:
      - ray-head
    networks:
      - ray

networks:
  ray:
    driver: bridge

.env Template

.env
# No additional config needed

Usage Notes

  1. Docs: https://docs.ray.io/
  2. Dashboard at http://localhost:8265 - cluster status and logs
  3. Connect from Python: ray.init('ray://localhost:10001') - note that the compose file above does not publish port 10001; add it to the ray-head ports if connecting from outside the Docker network (see the sketch after this list)
  4. Scale workers via deploy.replicas in the compose file
  5. Ray Serve for model serving, Ray Tune for hyperparameter tuning
  6. GPU support: add device reservations to the worker service
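Expanding on note 3, the sketch below connects from outside the cluster with Ray Client and prints what the head and workers registered. It assumes port 10001 has been added to the ray-head ports mapping (for example "10001:10001"), since the compose file above does not publish it.

python
import ray

# Ray Client: the head node's client server listens on port 10001 by default.
ray.init(address="ray://localhost:10001")

# Aggregate CPU/GPU/memory registered by the head node and both workers.
print(ray.cluster_resources())
print(len(ray.nodes()), "nodes have joined the cluster")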

Individual Services (2 services)

Copy individual services to mix and match with your existing compose files.

ray-head
ray-head:
  image: rayproject/ray:latest
  container_name: ray-head
  command: ray start --head --dashboard-host=0.0.0.0 --block
  ports:
    - "8265:8265"
    - "6379:6379"
  networks:
    - ray
ray-worker
ray-worker:
  image: rayproject/ray:latest
  command: ray start --address=ray-head:6379 --block
  deploy:
    replicas: 2
  depends_on:
    - ray-head
  networks:
    - ray

Quick Start

terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  ray-head:
    image: rayproject/ray:latest
    container_name: ray-head
    command: ray start --head --dashboard-host=0.0.0.0 --block
    ports:
      - "8265:8265"
      - "6379:6379"
    networks:
      - ray

  ray-worker:
    image: rayproject/ray:latest
    command: ray start --address=ray-head:6379 --block
    deploy:
      replicas: 2
    depends_on:
      - ray-head
    networks:
      - ray

networks:
  ray:
    driver: bridge
EOF

# 2. Create the .env file
cat > .env << 'EOF'
# No additional config needed
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f

One-Liner

Run this command to download and set up the recipe in one step:

terminal
curl -fsSL https://docker.recipes/api/recipes/ray/run | bash

Troubleshooting

  • ray-worker containers failing to start: Ensure ray-head container is fully initialized before workers attempt connection
  • Dashboard not accessible on port 8265: Check firewall settings and verify ray-head container has --dashboard-host=0.0.0.0 parameter
  • Workers not appearing in cluster: Verify network connectivity between containers and that the head node's GCS port 6379 is reachable (see the check after this list)
  • Out of memory errors during large computations: Increase Docker container memory limits or reduce Ray worker memory allocation
  • Task scheduling delays or failures: Monitor Ray Dashboard for resource bottlenecks and consider scaling worker replicas
  • Connection refused errors from Python clients: Ensure Ray client connects to correct head node address and GCS port is available
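For the "workers not appearing" case above, listing the nodes Ray knows about is a quick check. The sketch below can be run through Ray Client as shown earlier (assuming port 10001 is published on ray-head), or with a plain ray.init() from inside the ray-head container.

python
import ray

ray.init(address="ray://localhost:10001")  # assumes port 10001 is published on ray-head

# Each entry describes one node; a worker stuck at "dead" usually points to a networking issue.
for node in ray.nodes():
    print(node["NodeManagerAddress"], "alive" if node["Alive"] else "dead")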
