
Apache Spark Standalone Cluster

advanced

Apache Spark cluster with master and worker nodes for distributed data processing.

Overview

Apache Spark is an open-source unified analytics engine developed at UC Berkeley's AMPLab, designed for large-scale data processing with built-in modules for streaming, SQL, machine learning, and graph processing. Unlike traditional MapReduce frameworks, Spark performs in-memory computing, making it up to 100 times faster for certain applications by keeping data in RAM between operations rather than writing to disk after each step.

This standalone cluster deployment combines a Spark master node with multiple worker nodes to create a distributed computing environment capable of processing massive datasets across multiple machines. The master node coordinates job scheduling and resource allocation while worker nodes execute the actual data processing tasks, with the cluster automatically handling data partitioning, fault tolerance, and load balancing across available resources.

Data engineers, data scientists, and organizations processing large volumes of structured or unstructured data will benefit from this setup, particularly those running ETL pipelines, machine learning workflows, or real-time analytics. This configuration is ideal for companies transitioning from single-machine data processing to distributed computing, or those needing a development environment that mirrors production Spark clusters without the complexity of full cluster management platforms like YARN or Kubernetes.

Key Features

  • Resilient Distributed Datasets (RDDs) with automatic fault recovery and data lineage tracking
  • In-memory computing that minimizes disk I/O, enabling speedups of up to 100x over MapReduce for certain workloads
  • Built-in Catalyst SQL optimizer for automatic query optimization and code generation
  • Dynamic resource allocation that scales executors up and down based on workload demand
  • Spark Streaming for processing live data streams with micro-batch processing architecture
  • MLlib machine learning library with distributed algorithms for classification, regression, and clustering
  • GraphX graph processing framework for social network analysis and graph algorithms
  • Multi-language support with native APIs for Scala, Java, Python (PySpark), and R (SparkR)

Common Use Cases

  • ETL pipeline processing for transforming terabytes of log data from multiple sources into analytical formats
  • Real-time fraud detection systems analyzing credit card transactions and user behavior patterns
  • Machine learning model training on large datasets for recommendation engines and predictive analytics
  • Financial risk modeling and Monte Carlo simulations requiring distributed parallel processing
  • Genomics research processing DNA sequencing data and performing large-scale bioinformatics analysis
  • IoT sensor data aggregation and analysis for smart city infrastructure monitoring
  • E-commerce clickstream analysis for customer journey mapping and conversion optimization

Prerequisites

  • Minimum 8GB RAM allocated to Docker with at least 4GB available for Spark worker processes
  • Docker Engine 20.10+ and Docker Compose 2.0+ for proper container networking and resource management
  • Understanding of distributed computing concepts including data partitioning and parallel processing
  • Basic knowledge of Spark DataFrame/Dataset APIs and SQL for job submission and data manipulation
  • Familiarity with JVM memory management and garbage collection tuning for optimal performance
  • Network ports 7077 and 8080 available on host system for Spark master and web UI access
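
Before starting the stack, it can be worth confirming the two host ports are actually free. A quick check on a Linux host (assumes ss is available; lsof -i :7077 -i :8080 is a rough equivalent on macOS):

terminal
# No output means ports 7077 and 8080 are free
ss -ltn | grep -E ':(7077|8080)'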

For development and testing only. Review security settings and test thoroughly before production use.

docker-compose.yml

docker-compose.yml
services:
  spark-master:
    image: bitnami/spark:latest
    container_name: spark-master
    restart: unless-stopped
    environment:
      - SPARK_MODE=master
      - SPARK_MASTER_HOST=spark-master
      - SPARK_MASTER_PORT=${SPARK_MASTER_PORT:-7077}
      - SPARK_MASTER_WEBUI_PORT=${SPARK_WEBUI_PORT:-8080}
    ports:
      - "${SPARK_MASTER_PORT:-7077}:7077"
      - "${SPARK_WEBUI_PORT:-8080}:8080"
    volumes:
      - spark_data:/opt/bitnami/spark
    networks:
      - spark-network

  spark-worker-1:
    image: bitnami/spark:latest
    container_name: spark-worker-1
    restart: unless-stopped
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=${SPARK_WORKER_MEMORY:-2g}
      - SPARK_WORKER_CORES=${SPARK_WORKER_CORES:-2}
    depends_on:
      - spark-master
    networks:
      - spark-network

  spark-worker-2:
    image: bitnami/spark:latest
    container_name: spark-worker-2
    restart: unless-stopped
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=${SPARK_WORKER_MEMORY:-2g}
      - SPARK_WORKER_CORES=${SPARK_WORKER_CORES:-2}
    depends_on:
      - spark-master
    networks:
      - spark-network

volumes:
  spark_data:

networks:
  spark-network:
    driver: bridge

.env Template

.env
# Apache Spark Cluster
SPARK_MASTER_PORT=7077
SPARK_WEBUI_PORT=8080
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_CORES=2
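
The ${VAR:-default} references in the compose file fall back to these values when a variable is unset. Because variables exported in the shell take precedence over the .env file, a one-off resource bump does not require editing either file; a small sketch:

terminal
# Give each worker 4 GB and 4 cores for this run only (shell overrides .env)
SPARK_WORKER_MEMORY=4g SPARK_WORKER_CORES=4 docker compose up -d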

Usage Notes

  1. Spark Master web UI is available at http://localhost:8080
  2. Submit jobs to the master at spark://localhost:7077 (see the spark-submit sketch after this list)
  3. Scale out by adding more worker services to the compose file
  4. Suited to batch ETL, analytics, and other large-scale data processing workloads
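
A minimal job-submission sketch, run from inside the master container so no local Spark installation is needed. The paths assume the bitnami/spark image layout (SPARK_HOME at /opt/bitnami/spark); the examples JAR version varies by image tag, hence the wildcard:

terminal
# Run the bundled SparkPi example against the cluster
docker compose exec spark-master spark-submit \
  --master spark://spark-master:7077 \
  --class org.apache.spark.examples.SparkPi \
  /opt/bitnami/spark/examples/jars/spark-examples_*.jar 100

The job should then appear under Running or Completed Applications in the master UI at http://localhost:8080.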

Individual Services (3 services)

Copy individual services to mix and match with your existing compose files.

spark-master
spark-master:
  image: bitnami/spark:latest
  container_name: spark-master
  restart: unless-stopped
  environment:
    - SPARK_MODE=master
    - SPARK_MASTER_HOST=spark-master
    - SPARK_MASTER_PORT=${SPARK_MASTER_PORT:-7077}
    - SPARK_MASTER_WEBUI_PORT=${SPARK_WEBUI_PORT:-8080}
  ports:
    - "${SPARK_MASTER_PORT:-7077}:7077"
    - "${SPARK_WEBUI_PORT:-8080}:8080"
  volumes:
    - spark_data:/opt/bitnami/spark
  networks:
    - spark-network
spark-worker-1
spark-worker-1:
  image: bitnami/spark:latest
  container_name: spark-worker-1
  restart: unless-stopped
  environment:
    - SPARK_MODE=worker
    - SPARK_MASTER_URL=spark://spark-master:7077
    - SPARK_WORKER_MEMORY=${SPARK_WORKER_MEMORY:-2g}
    - SPARK_WORKER_CORES=${SPARK_WORKER_CORES:-2}
  depends_on:
    - spark-master
  networks:
    - spark-network
spark-worker-2
spark-worker-2:
  image: bitnami/spark:latest
  container_name: spark-worker-2
  restart: unless-stopped
  environment:
    - SPARK_MODE=worker
    - SPARK_MASTER_URL=spark://spark-master:7077
    - SPARK_WORKER_MEMORY=${SPARK_WORKER_MEMORY:-2g}
    - SPARK_WORKER_CORES=${SPARK_WORKER_CORES:-2}
  depends_on:
    - spark-master
  networks:
    - spark-network
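
Additional workers follow the same pattern; a sketch of a hypothetical third worker (copy the block, bump the name, and tune memory/cores to the host's capacity):

spark-worker-3
spark-worker-3:
  image: bitnami/spark:latest
  container_name: spark-worker-3
  restart: unless-stopped
  environment:
    - SPARK_MODE=worker
    - SPARK_MASTER_URL=spark://spark-master:7077
    - SPARK_WORKER_MEMORY=${SPARK_WORKER_MEMORY:-2g}
    - SPARK_WORKER_CORES=${SPARK_WORKER_CORES:-2}
  depends_on:
    - spark-master
  networks:
    - spark-network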

Quick Start

terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  spark-master:
    image: bitnami/spark:latest
    container_name: spark-master
    restart: unless-stopped
    environment:
      - SPARK_MODE=master
      - SPARK_MASTER_HOST=spark-master
      - SPARK_MASTER_PORT=${SPARK_MASTER_PORT:-7077}
      - SPARK_MASTER_WEBUI_PORT=${SPARK_WEBUI_PORT:-8080}
    ports:
      - "${SPARK_MASTER_PORT:-7077}:7077"
      - "${SPARK_WEBUI_PORT:-8080}:8080"
    volumes:
      - spark_data:/opt/bitnami/spark
    networks:
      - spark-network

  spark-worker-1:
    image: bitnami/spark:latest
    container_name: spark-worker-1
    restart: unless-stopped
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=${SPARK_WORKER_MEMORY:-2g}
      - SPARK_WORKER_CORES=${SPARK_WORKER_CORES:-2}
    depends_on:
      - spark-master
    networks:
      - spark-network

  spark-worker-2:
    image: bitnami/spark:latest
    container_name: spark-worker-2
    restart: unless-stopped
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=${SPARK_WORKER_MEMORY:-2g}
      - SPARK_WORKER_CORES=${SPARK_WORKER_CORES:-2}
    depends_on:
      - spark-master
    networks:
      - spark-network

volumes:
  spark_data:

networks:
  spark-network:
    driver: bridge
EOF

# 2. Create the .env file
cat > .env << 'EOF'
# Apache Spark Cluster
SPARK_MASTER_PORT=7077
SPARK_WEBUI_PORT=8080
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_CORES=2
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f
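
A couple of optional checks after startup confirm that both workers registered with the master (the exact log wording can vary between Spark versions):

terminal
# All three containers should show as Up
docker compose ps

# The master logs list each worker as it registers
docker compose logs spark-master | grep -i "registering worker"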

One-Liner

Run this command to download and set up the recipe in one step:

terminal
curl -fsSL https://docker.recipes/api/recipes/apache-spark-standalone/run | bash

Troubleshooting

  • java.lang.OutOfMemoryError in worker containers: Increase SPARK_WORKER_MEMORY environment variable or reduce SPARK_WORKER_CORES to balance memory per core allocation
  • Workers failing to connect to master with 'Connection refused' errors: Verify the spark-master container is fully started before workers attempt to connect; add a health check or startup delay (see the compose sketch after this list)
  • Spark jobs hanging in RUNNING state indefinitely: Check executor memory settings and ensure sufficient resources are available, may need to increase driver memory or reduce executor instances
  • ClassNotFoundException when submitting custom applications: Mount application JAR files to /opt/bitnami/spark/jars directory in all containers or use spark-submit with --jars parameter
  • Web UI showing 'Application Not Found' errors: Ensure Spark History Server is configured with shared storage, or check that application completed successfully without exceptions
  • Serialization errors with custom objects: Verify all custom classes implement Serializable interface and are available on classpath of all worker nodes
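
For the 'Connection refused' case above, one option is to gate worker startup on a master health check. A sketch, assuming curl is present in the image (substitute another probe if it is not):

docker-compose.yml
spark-master:
  # ...existing settings...
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8080"]
    interval: 10s
    timeout: 5s
    retries: 5

spark-worker-1:
  # ...existing settings...
  depends_on:
    spark-master:
      condition: service_healthy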



Components

spark-master, spark-worker, spark-history

Tags

#spark #big-data #distributed #analytics #etl

Category

Database Stacks