docker.recipes

Dask Distributed

intermediate

Parallel computing library for analytics.

Overview

Dask is a flexible parallel computing library for Python that scales analytics workloads from single machines to large clusters. Originally developed by Matthew Rocklin in 2014, Dask provides familiar APIs that mirror NumPy, Pandas, and scikit-learn while enabling distributed computing across multiple cores or machines. It represents computations as task graphs and can handle datasets that don't fit in memory by processing them in chunks.

This recipe creates a scheduler-worker architecture in which the Dask scheduler coordinates task execution across multiple worker containers. The scheduler maintains a real-time view of cluster resources, assigns tasks to available workers, and handles fault tolerance automatically. Workers execute the actual computations, exchange intermediate data directly with one another, and report task state back to the scheduler. The default three-worker configuration provides parallel processing capability while keeping resource usage manageable.

Data scientists and ML engineers benefit from this setup because it allows scaling existing Pandas and NumPy code without rewriting algorithms. The distributed architecture handles memory-intensive operations such as large DataFrame joins, complex aggregations, and machine learning model training that would otherwise cause out-of-memory errors on a single machine.
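The snippet below is a minimal sketch of how a client drives this scheduler-worker setup, assuming the stack from this recipe is running and port 8786 is mapped to localhost. The synthetic timeseries dataset is generated lazily on the workers, so no input files are needed.

python
from dask.distributed import Client
import dask.datasets

# Connect to the scheduler started by this recipe (port 8786 on the host)
client = Client("localhost:8786")

# Build a lazy Dask DataFrame of synthetic time-series data; nothing runs yet
df = dask.datasets.timeseries()

# .compute() sends the task graph to the scheduler, which farms it out to the workers
result = df.groupby("name").x.mean().compute()
print(result)

# The same computation can be watched live at http://localhost:8787
print(client.dashboard_link)
client.close()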

Key Features

  • Drop-in replacement for Pandas and NumPy with a near-identical API
  • Lazy evaluation system that builds computation graphs before execution
  • Real-time web dashboard showing task progress, memory usage, and worker status
  • Automatic data partitioning and chunk management for large datasets
  • Fault-tolerant task execution with automatic worker failure recovery
  • Dynamic task scheduling that adapts to varying computational loads
  • Memory-efficient streaming computations for datasets larger than RAM
  • Support for custom functions through delayed decorators and futures API (see the sketch after this list)
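As a rough illustration of the lazy-evaluation and delayed-decorator features above, the following sketch builds a small task graph from ordinary Python functions. By default it runs on the local scheduler; pointing a Client at this recipe's scheduler would send the same graph to the cluster.

python
from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(a, b):
    return a + b

# No work happens here; Dask only records the task graph
total = add(inc(1), inc(2))

# Execution is deferred until .compute() is called
print(total.compute())  # 5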

Common Use Cases

  • Processing large CSV files or time series data that exceed single-machine memory (see the sketch after this list)
  • Parallel hyperparameter tuning for machine learning models across multiple parameter combinations
  • ETL workflows that require transforming millions of records with complex business logic
  • Financial risk calculations requiring Monte Carlo simulations across large portfolios
  • Geospatial analysis processing satellite imagery or GPS tracking data
  • Real-time analytics dashboards aggregating streaming data from multiple sources
  • Scientific computing workloads like climate modeling or genomics analysis
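As a sketch of the first use case, the example below reads a set of CSVs as a partitioned Dask DataFrame and aggregates it. The glob pattern and the date/amount columns are hypothetical placeholders, and with the containerized workers from this recipe the files must live somewhere the workers can also reach (a shared volume or object storage), not only on your laptop.

python
import dask.dataframe as dd
from dask.distributed import Client

client = Client("localhost:8786")  # scheduler from this recipe

# 'data/events-*.csv' is a placeholder glob; each ~64MB block becomes one partition
df = dd.read_csv("data/events-*.csv", blocksize="64MB")

# The aggregation runs partition-by-partition on the workers, then combines the results
daily_totals = df.groupby("date")["amount"].sum().compute()
print(daily_totals.head())
client.close()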

Prerequisites

  • Minimum 4GB RAM per worker container plus 2GB for scheduler
  • Python development experience with Pandas, NumPy, or scikit-learn
  • Understanding of parallel computing concepts and data partitioning strategies
  • Network ports 8786 and 8787 available for scheduler communication and dashboard (a quick reachability check is sketched after this list)
  • Docker Compose v2 (the docker compose plugin), which honors deploy.replicas for worker scaling
  • Basic knowledge of distributed systems concepts for troubleshooting cluster issues
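A small sanity check for the port prerequisite above: this sketch simply probes localhost on the two default ports this recipe maps, which is enough to tell whether something else already occupies them or whether the stack is reachable once started.

python
import socket

# Probe the scheduler (8786) and dashboard (8787) ports on localhost
for port in (8786, 8787):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(2)
        state = "reachable" if sock.connect_ex(("localhost", port)) == 0 else "not reachable"
        print(f"port {port}: {state}")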

For development & testing. Review security settings, change default credentials, and test thoroughly before production use. See Terms

docker-compose.yml

docker-compose.yml
services:
  dask-scheduler:
    image: ghcr.io/dask/dask:latest
    container_name: dask-scheduler
    command: dask scheduler
    ports:
      - "8786:8786"
      - "8787:8787"
    networks:
      - dask

  dask-worker:
    image: ghcr.io/dask/dask:latest
    command: dask worker tcp://dask-scheduler:8786
    deploy:
      replicas: 3
    depends_on:
      - dask-scheduler
    networks:
      - dask

networks:
  dask:
    driver: bridge

.env Template

.env
# Scale workers with replicas

Usage Notes

  1. Docs: https://docs.dask.org/
  2. Dashboard at http://localhost:8787 - real-time task progress
  3. Connect: from dask.distributed import Client; client = Client('localhost:8786') (expanded in the sketch below)
  4. Scale workers via deploy.replicas in the compose file
  5. Drop-in replacement for pandas/numpy; call .compute() to execute
  6. Supports arrays, dataframes, bags, and delayed computations
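Expanding on notes 3 and 6, here is a sketch of the futures side of the API against this recipe's scheduler. It assumes your local dask/distributed version is reasonably close to the one in the ghcr.io/dask/dask:latest image, since large mismatches between client and workers can cause warnings or errors.

python
from dask.distributed import Client

client = Client("localhost:8786")  # the 'tcp://' prefix is optional

# Futures API: fan a plain Python function out across the workers
def square(x):
    return x ** 2

futures = client.map(square, range(10))
print(client.gather(futures))  # [0, 1, 4, ..., 81]

client.close()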

Individual Services (2 services)

Copy individual services to mix and match with your existing compose files.

dask-scheduler
dask-scheduler:
  image: ghcr.io/dask/dask:latest
  container_name: dask-scheduler
  command: dask scheduler
  ports:
    - "8786:8786"
    - "8787:8787"
  networks:
    - dask
dask-worker
dask-worker:
  image: ghcr.io/dask/dask:latest
  command: dask worker tcp://dask-scheduler:8786
  deploy:
    replicas: 3
  depends_on:
    - dask-scheduler
  networks:
    - dask

Quick Start

terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  dask-scheduler:
    image: ghcr.io/dask/dask:latest
    container_name: dask-scheduler
    command: dask scheduler
    ports:
      - "8786:8786"
      - "8787:8787"
    networks:
      - dask

  dask-worker:
    image: ghcr.io/dask/dask:latest
    command: dask worker tcp://dask-scheduler:8786
    deploy:
      replicas: 3
    depends_on:
      - dask-scheduler
    networks:
      - dask

networks:
  dask:
    driver: bridge
EOF

# 2. Create the .env file
cat > .env << 'EOF'
# Scale workers with replicas
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f
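After step 4, a quick way to confirm the cluster came up as expected is to ask the scheduler which workers it sees. This sketch assumes you run it from the host with dask and distributed installed locally and the default ports from this recipe.

python
from dask.distributed import Client

client = Client("localhost:8786")
info = client.scheduler_info()

# With deploy.replicas: 3 this should report three workers
print(f"{len(info['workers'])} workers connected")
for address, worker in info["workers"].items():
    print(address, worker.get("nthreads"))

client.close()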

One-Liner

Run this command to download and set up the recipe in one step:

terminal
curl -fsSL https://docker.recipes/api/recipes/dask/run | bash

Troubleshooting

  • Workers disconnecting frequently: Increase worker memory limits or reduce chunk sizes in your computations
  • KilledWorker errors during computation: Add memory monitoring and, when worker memory keeps growing, recycle the workers with client.restart() (this clears any data they currently hold)
  • Scheduler dashboard showing 'no workers available': Check network connectivity between scheduler and workers using docker network inspect
  • Tasks hanging in 'processing' state: If your tasks spawn their own subprocesses, set distributed.worker.daemon to false in the Dask configuration so worker processes are not daemonic
  • Memory usage growing continuously: Use context managers for client connections and call client.close() explicitly (see the sketch after this list)
  • Slow task execution compared to single-threaded: Profile your code for serialization bottlenecks and reduce data transfer between workers
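One way to apply the last two items, sketched under the assumption that the recipe's scheduler is reachable on localhost:8786: scope the client with a context manager so the connection is always closed, and recycle workers between heavy jobs when memory will not come back down.

python
from dask.distributed import Client

# The context manager guarantees client.close() even if the computation fails
with Client("localhost:8786") as client:
    # ... run a heavy computation here ...

    # Recycle all worker processes to reclaim leaked memory.
    # Note: this drops any results still held on the workers.
    client.restart()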

