Ray Distributed Computing
Ray cluster for distributed ML and Python workloads.
Overview
Ray is an open-source distributed computing framework, originally developed at UC Berkeley's RISELab, that turns single-machine Python code into distributed applications without significant rewrites. The ray-head component serves as the cluster coordinator: it manages resource allocation and task scheduling, hosts the Global Control Store (GCS) that tracks cluster-wide state, and serves the web-based dashboard for monitoring cluster health and job execution. Ray workers connect to the head node to form a compute cluster that scales Python workloads across multiple machines, supporting everything from simple parallel processing to complex machine learning pipelines via ecosystem libraries such as Ray Tune for hyperparameter optimization and Ray Serve for model deployment.
This containerized Ray cluster architecture enables teams to build scalable distributed systems using familiar Python syntax, where functions decorated with @ray.remote automatically become distributable across the cluster. The head node acts as the single point of coordination, maintaining the global state and scheduling tasks to available workers, while workers execute the actual computation and report back results. This setup avoids much of the operational complexity typically associated with distributed computing frameworks like Spark or Dask, making parallel processing accessible to Python developers without deep distributed systems expertise.
Data scientists and ML engineers working with computationally intensive workloads will find this stack particularly valuable, as Ray excels at hyperparameter tuning, distributed training, and reinforcement learning scenarios that require dynamic task graphs. The containerized deployment makes it ideal for teams needing reproducible distributed computing environments across development and production, while the shared memory configuration optimizes performance for data-heavy operations common in machine learning workflows.
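As a minimal sketch of the programming model described above (the `square` helper is illustrative; assumes `pip install "ray[default]"` on the client machine and the compose stack below running locally):

```python
def square(x: int) -> int:
    # plain Python function; Ray can run it unchanged on any worker
    return x * x

def main() -> None:
    # ray is only needed on the client side, so the import stays local here
    import ray

    # connect through the Ray Client port published by the compose file
    ray.init("ray://localhost:10001")
    remote_square = ray.remote(square)  # same effect as the @ray.remote decorator
    futures = [remote_square.remote(i) for i in range(8)]
    print(ray.get(futures))  # squares of 0..7, computed across the workers
    ray.shutdown()
```

Calling `main()` against the running stack fans the eight calls out over ray-worker-1 and ray-worker-2 and gathers the results on the client.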
Key Features
- Global Control Store (GCS) on the head node for distributed state management and coordination
- Interactive Ray Dashboard with real-time cluster metrics, job monitoring, and resource utilization graphs
- Automatic task scheduling with work-stealing algorithms for optimal resource utilization across workers
- Native support for nested parallelism allowing @ray.remote functions to call other remote functions
- Object store with automatic serialization handling for NumPy arrays, Pandas DataFrames, and custom Python objects
- Dynamic actor model supporting stateful distributed classes that persist across multiple function calls
- Built-in support for GPU scheduling and allocation when CUDA-enabled Ray images are used
- Ray Client interface enabling remote cluster access from Jupyter notebooks and local Python scripts
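The stateful actor model from the feature list can be sketched like this (the `Counter` class is a hypothetical example; assumes a reachable cluster at the Ray Client port):

```python
class Counter:
    # ordinary Python class; Ray turns it into a long-lived distributed actor
    def __init__(self) -> None:
        self.value = 0

    def increment(self) -> int:
        self.value += 1
        return self.value

def run_actor_demo() -> None:
    import ray  # client-side dependency: pip install "ray[default]"

    ray.init("ray://localhost:10001")
    RemoteCounter = ray.remote(Counter)  # same effect as decorating with @ray.remote
    counter = RemoteCounter.remote()     # actor process starts on some worker
    refs = [counter.increment.remote() for _ in range(3)]
    print(ray.get(refs))  # state persists across calls: [1, 2, 3]
    ray.shutdown()
```

Unlike stateless tasks, the actor keeps `self.value` between calls, which is what makes patterns like parameter servers or shared caches possible.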
Common Use Cases
- Hyperparameter optimization for machine learning models using Ray Tune across multiple parameter combinations
- Distributed training of deep learning models that don't fit on a single GPU or require data parallelism
- Monte Carlo simulations and financial modeling requiring thousands of parallel scenario calculations
- Reinforcement learning environments where multiple agents train simultaneously across distributed workers
- Large-scale data preprocessing and ETL pipelines that benefit from parallel processing of data chunks
- Computer vision workloads like batch image processing or video analysis across distributed compute resources
- Scientific computing applications requiring parallel execution of CPU-intensive numerical computations
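The Monte Carlo use case maps naturally onto Ray tasks: each worker counts hits from an independent batch of random samples, and the client aggregates them into a π estimate. A sketch (batch sizes and helper names are illustrative):

```python
import random

def pi_batch_hits(n_samples: int, seed: int) -> int:
    # count random points that land inside the unit quarter-circle
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def distributed_pi(n_batches: int = 8, n_samples: int = 100_000) -> float:
    import ray  # client-side dependency

    ray.init("ray://localhost:10001")
    batch = ray.remote(pi_batch_hits)  # one independent task per batch
    refs = [batch.remote(n_samples, seed) for seed in range(n_batches)]
    total_hits = sum(ray.get(refs))
    ray.shutdown()
    # area of quarter-circle / area of unit square = pi/4
    return 4.0 * total_hits / (n_batches * n_samples)
```

Because the batches share no state, adding more ray-worker-N services shortens wall-clock time roughly linearly.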
Prerequisites
- Minimum 4GB RAM per container with additional 2GB shared memory allocation for efficient object sharing
- Docker host with sufficient CPU cores to benefit from distributed processing (recommended 4+ cores)
- Available ports 8265 for Ray Dashboard access and 10001 for Ray Client connections
- Basic understanding of Python decorators and async programming concepts for effective Ray usage
- Familiarity with distributed computing concepts like task scheduling and shared state management
- Network connectivity between containers for GCS communication on port 6379 and worker registration
For development & testing. Review security settings, change default credentials, and test thoroughly before production use.
docker-compose.yml
services:
  ray-head:
    image: rayproject/ray:latest
    container_name: ray-head
    restart: unless-stopped
    ports:
      - "${RAY_DASHBOARD_PORT:-8265}:8265"
      - "${RAY_CLIENT_PORT:-10001}:10001"
    command: ray start --head --dashboard-host=0.0.0.0 --block
    shm_size: ${SHM_SIZE:-2gb}
    networks:
      - ray-network

  ray-worker-1:
    image: rayproject/ray:latest
    container_name: ray-worker-1
    restart: unless-stopped
    command: ray start --address=ray-head:6379 --block
    shm_size: ${SHM_SIZE:-2gb}
    depends_on:
      - ray-head
    networks:
      - ray-network

  ray-worker-2:
    image: rayproject/ray:latest
    container_name: ray-worker-2
    restart: unless-stopped
    command: ray start --address=ray-head:6379 --block
    shm_size: ${SHM_SIZE:-2gb}
    depends_on:
      - ray-head
    networks:
      - ray-network

networks:
  ray-network:
    driver: bridge

.env Template
.env
# Ray Cluster
RAY_DASHBOARD_PORT=8265
RAY_CLIENT_PORT=10001
SHM_SIZE=2gb

Usage Notes
- Docs: https://docs.ray.io/
- Ray Dashboard at http://localhost:8265 - jobs, actors, metrics
- Connect from Python: ray.init('ray://localhost:10001')
- Scale by adding more ray-worker-N services in the compose file
- Use the @ray.remote decorator for distributed functions
- Ray Tune, Ray Serve, and Ray Data for ML workflows
Individual Services (3 services)
Copy individual services to mix and match with your existing compose files.
ray-head
ray-head:
image: rayproject/ray:latest
container_name: ray-head
restart: unless-stopped
ports:
- "${RAY_DASHBOARD_PORT:-8265}:8265"
- "${RAY_CLIENT_PORT:-10001}:10001"
command: ray start --head --dashboard-host=0.0.0.0 --block
shm_size: ${SHM_SIZE:-2gb}
networks:
- ray-network
ray-worker-1
ray-worker-1:
image: rayproject/ray:latest
container_name: ray-worker-1
restart: unless-stopped
command: ray start --address=ray-head:6379 --block
shm_size: ${SHM_SIZE:-2gb}
depends_on:
- ray-head
networks:
- ray-network
ray-worker-2
ray-worker-2:
image: rayproject/ray:latest
container_name: ray-worker-2
restart: unless-stopped
command: ray start --address=ray-head:6379 --block
shm_size: ${SHM_SIZE:-2gb}
depends_on:
- ray-head
networks:
- ray-network
Quick Start
terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  ray-head:
    image: rayproject/ray:latest
    container_name: ray-head
    restart: unless-stopped
    ports:
      - "${RAY_DASHBOARD_PORT:-8265}:8265"
      - "${RAY_CLIENT_PORT:-10001}:10001"
    command: ray start --head --dashboard-host=0.0.0.0 --block
    shm_size: ${SHM_SIZE:-2gb}
    networks:
      - ray-network

  ray-worker-1:
    image: rayproject/ray:latest
    container_name: ray-worker-1
    restart: unless-stopped
    command: ray start --address=ray-head:6379 --block
    shm_size: ${SHM_SIZE:-2gb}
    depends_on:
      - ray-head
    networks:
      - ray-network

  ray-worker-2:
    image: rayproject/ray:latest
    container_name: ray-worker-2
    restart: unless-stopped
    command: ray start --address=ray-head:6379 --block
    shm_size: ${SHM_SIZE:-2gb}
    depends_on:
      - ray-head
    networks:
      - ray-network

networks:
  ray-network:
    driver: bridge
EOF

# 2. Create the .env file
cat > .env << 'EOF'
# Ray Cluster
RAY_DASHBOARD_PORT=8265
RAY_CLIENT_PORT=10001
SHM_SIZE=2gb
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f

One-Liner
Run this command to download and set up the recipe in one step:
terminal
curl -fsSL https://docker.recipes/api/recipes/ray-cluster/run | bash

Troubleshooting
- Workers failing to connect with 'redis.exceptions.ConnectionError' (or GCS connection errors on newer Ray versions): Ensure the ray-head container is fully started before workers attempt to connect; note that depends_on only orders container startup, so add a healthcheck to ray-head and use depends_on with condition: service_healthy, or rely on restart: unless-stopped to retry
- Out of shared memory errors during large object operations: Increase SHM_SIZE environment variable beyond default 2GB based on your data sizes
- Ray Dashboard showing 'Cluster not found' or connection timeouts: Verify RAY_DASHBOARD_PORT is properly mapped and dashboard-host=0.0.0.0 is set in head node command
- Tasks hanging indefinitely without completion: Check worker logs for memory exhaustion or increase worker memory limits to handle task requirements
- Ray Client connection refused from external Python scripts: Confirm RAY_CLIENT_PORT mapping and use ray.init('ray://docker-host-ip:10001') with actual host IP
- Object serialization failures with custom classes: Ensure all custom modules are available on worker nodes or use Ray's runtime_env feature to distribute dependencies
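For the last item, a runtime_env sketch looks roughly like this (the package list and working directory are placeholders; adjust to your project):

```python
def build_runtime_env(working_dir: str, pip_packages: list) -> dict:
    # runtime_env tells Ray to ship local code and install pip deps on each worker
    return {"working_dir": working_dir, "pip": pip_packages}

def connect_with_deps() -> None:
    import ray  # client-side dependency

    env = build_runtime_env(".", ["pandas", "scikit-learn"])  # example deps
    # workers receive a copy of the working dir plus the pip packages,
    # so custom classes defined locally deserialize cleanly on remote nodes
    ray.init("ray://localhost:10001", runtime_env=env)
```

This avoids baking project code into the worker images, at the cost of a per-job environment setup step on each node.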