Apache Spark Standalone Cluster
Apache Spark cluster with master and worker nodes for distributed data processing.
Overview
Apache Spark is an open-source unified analytics engine developed at UC Berkeley's AMPLab, designed for large-scale data processing with built-in modules for streaming, SQL, machine learning, and graph processing. Unlike traditional MapReduce frameworks, Spark performs in-memory computing, making it up to 100 times faster for certain applications by keeping data in RAM between operations rather than writing to disk after each step.
This standalone cluster deployment combines a Spark master node with multiple worker nodes to create a distributed computing environment capable of processing massive datasets across multiple machines. The master node coordinates job scheduling and resource allocation while worker nodes execute the actual data processing tasks, with the cluster automatically handling data partitioning, fault tolerance, and load balancing across available resources.
Data engineers, data scientists, and organizations processing large volumes of structured or unstructured data will benefit from this setup, particularly those running ETL pipelines, machine learning workflows, or real-time analytics. This configuration is ideal for companies transitioning from single-machine data processing to distributed computing, or those needing a development environment that mirrors production Spark clusters without the complexity of full cluster management platforms like YARN or Kubernetes.
Key Features
- Resilient Distributed Datasets (RDDs) with automatic fault recovery and data lineage tracking
- In-memory computing that minimizes disk I/O between operations, delivering speedups of up to 100x over MapReduce for some workloads
- Built-in Catalyst SQL optimizer for automatic query optimization and code generation
- Dynamic resource allocation allowing workers to scale compute resources based on workload demands
- Spark Streaming for processing live data streams with micro-batch processing architecture
- MLlib machine learning library with distributed algorithms for classification, regression, and clustering
- GraphX graph processing framework for social network analysis and graph algorithms
- Multi-language support with native APIs for Scala, Java, Python (PySpark), and R (SparkR); a minimal PySpark submission is sketched after this list
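As a small illustration of the PySpark API mentioned above, the sketch below creates a tiny DataFrame job and submits it to the standalone master. It assumes the cluster from this recipe is already running and that the bitnami/spark image ships Python and spark-submit under /opt/bitnami/spark; the file name example_job.py is purely illustrative.
terminal
# Minimal PySpark sketch: a DataFrame plus a SQL query (file name is illustrative)
cat > example_job.py << 'EOF'
from pyspark.sql import SparkSession

# The master URL is supplied by spark-submit below, so none is hard-coded here
spark = SparkSession.builder.appName("recipe-smoke-test").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()

spark.stop()
EOF

# Copy the script into the master container and submit it to the cluster
docker cp example_job.py spark-master:/tmp/example_job.py
docker exec spark-master /opt/bitnami/spark/bin/spark-submit \
  --master spark://spark-master:7077 /tmp/example_job.py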
Common Use Cases
- ETL pipeline processing for transforming terabytes of log data from multiple sources into analytical formats
- Real-time fraud detection systems analyzing credit card transactions and user behavior patterns
- Machine learning model training on large datasets for recommendation engines and predictive analytics
- Financial risk modeling and Monte Carlo simulations requiring distributed parallel processing
- Genomics research processing DNA sequencing data and performing large-scale bioinformatics analysis
- IoT sensor data aggregation and analysis for smart city infrastructure monitoring
- E-commerce clickstream analysis for customer journey mapping and conversion optimization
Prerequisites
- Minimum 8GB RAM allocated to Docker with at least 4GB available for Spark worker processes
- Docker Engine 20.10+ and Docker Compose 2.0+ for proper container networking and resource management
- Understanding of distributed computing concepts including data partitioning and parallel processing
- Basic knowledge of Spark DataFrame/Dataset APIs and SQL for job submission and data manipulation
- Familiarity with JVM memory management and garbage collection tuning for optimal performance
- Network ports 7077 and 8080 available on the host system for Spark master and web UI access (a quick check is sketched after this list)
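A quick way to verify these prerequisites before starting the stack is sketched below; it assumes a Linux host where the ss utility is available and Docker is already installed.
terminal
# Check Docker Engine and Compose versions (20.10+ / 2.0+ expected)
docker --version
docker compose version

# Show how much memory the Docker Engine can see
docker info --format 'Total memory: {{.MemTotal}} bytes'

# Confirm ports 7077 and 8080 are not already bound on the host
ss -ltn | grep -E ':(7077|8080)\b' || echo "Ports 7077 and 8080 are free"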
For development and testing only. Review security settings, change default credentials, and test thoroughly before production use.
docker-compose.yml
services:
  spark-master:
    image: bitnami/spark:latest
    container_name: spark-master
    restart: unless-stopped
    environment:
      - SPARK_MODE=master
      - SPARK_MASTER_HOST=spark-master
      - SPARK_MASTER_PORT=${SPARK_MASTER_PORT:-7077}
      - SPARK_MASTER_WEBUI_PORT=${SPARK_WEBUI_PORT:-8080}
    ports:
      - "${SPARK_MASTER_PORT:-7077}:7077"
      - "${SPARK_WEBUI_PORT:-8080}:8080"
    volumes:
      - spark_data:/opt/bitnami/spark
    networks:
      - spark-network

  spark-worker-1:
    image: bitnami/spark:latest
    container_name: spark-worker-1
    restart: unless-stopped
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=${SPARK_WORKER_MEMORY:-2g}
      - SPARK_WORKER_CORES=${SPARK_WORKER_CORES:-2}
    depends_on:
      - spark-master
    networks:
      - spark-network

  spark-worker-2:
    image: bitnami/spark:latest
    container_name: spark-worker-2
    restart: unless-stopped
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=${SPARK_WORKER_MEMORY:-2g}
      - SPARK_WORKER_CORES=${SPARK_WORKER_CORES:-2}
    depends_on:
      - spark-master
    networks:
      - spark-network

volumes:
  spark_data:

networks:
  spark-network:
    driver: bridge

.env Template
.env
# Apache Spark Cluster
SPARK_MASTER_PORT=7077
SPARK_WEBUI_PORT=8080
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_CORES=2

Usage Notes
- Spark Master UI at http://localhost:8080
- Submit jobs to spark://localhost:7077 (a quick example follows this list)
- Scale workers by adding more services (a sketch appears after the Individual Services section)
- Great for big data processing
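To confirm the cluster is wired together, the commands below query the JSON status endpoint exposed by the standalone master's web UI for registered workers, then submit the SparkPi example bundled with Spark. The examples jar path assumes the standard Spark layout under /opt/bitnami/spark; adjust it if your image version differs.
terminal
# List registered workers and their state via the master web UI's JSON endpoint
curl -s http://localhost:8080/json/

# Submit the bundled SparkPi example; bash -c lets the jar glob expand inside the container
docker exec spark-master bash -c "/opt/bitnami/spark/bin/spark-submit --master spark://spark-master:7077 --class org.apache.spark.examples.SparkPi /opt/bitnami/spark/examples/jars/spark-examples_*.jar 100"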
Individual Services (3 services)
Copy individual services to mix and match with your existing compose files.
spark-master
spark-master:
  image: bitnami/spark:latest
  container_name: spark-master
  restart: unless-stopped
  environment:
    - SPARK_MODE=master
    - SPARK_MASTER_HOST=spark-master
    - SPARK_MASTER_PORT=${SPARK_MASTER_PORT:-7077}
    - SPARK_MASTER_WEBUI_PORT=${SPARK_WEBUI_PORT:-8080}
  ports:
    - "${SPARK_MASTER_PORT:-7077}:7077"
    - "${SPARK_WEBUI_PORT:-8080}:8080"
  volumes:
    - spark_data:/opt/bitnami/spark
  networks:
    - spark-network
spark-worker-1
spark-worker-1:
  image: bitnami/spark:latest
  container_name: spark-worker-1
  restart: unless-stopped
  environment:
    - SPARK_MODE=worker
    - SPARK_MASTER_URL=spark://spark-master:7077
    - SPARK_WORKER_MEMORY=${SPARK_WORKER_MEMORY:-2g}
    - SPARK_WORKER_CORES=${SPARK_WORKER_CORES:-2}
  depends_on:
    - spark-master
  networks:
    - spark-network
spark-worker-2
spark-worker-2:
  image: bitnami/spark:latest
  container_name: spark-worker-2
  restart: unless-stopped
  environment:
    - SPARK_MODE=worker
    - SPARK_MASTER_URL=spark://spark-master:7077
    - SPARK_WORKER_MEMORY=${SPARK_WORKER_MEMORY:-2g}
    - SPARK_WORKER_CORES=${SPARK_WORKER_CORES:-2}
  depends_on:
    - spark-master
  networks:
    - spark-network
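To act on the "Scale workers by adding more services" usage note, one option is a Compose override file, which docker compose merges with docker-compose.yml automatically. The hypothetical spark-worker-3 below simply reuses the worker settings from this recipe.
terminal
# Sketch: add a third worker without editing the main compose file
cat > docker-compose.override.yml << 'EOF'
services:
  spark-worker-3:
    image: bitnami/spark:latest
    container_name: spark-worker-3
    restart: unless-stopped
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=${SPARK_WORKER_MEMORY:-2g}
      - SPARK_WORKER_CORES=${SPARK_WORKER_CORES:-2}
    depends_on:
      - spark-master
    networks:
      - spark-network
EOF

# Recreate the stack; the new worker should appear in the master UI at :8080
docker compose up -d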
Quick Start
terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  spark-master:
    image: bitnami/spark:latest
    container_name: spark-master
    restart: unless-stopped
    environment:
      - SPARK_MODE=master
      - SPARK_MASTER_HOST=spark-master
      - SPARK_MASTER_PORT=${SPARK_MASTER_PORT:-7077}
      - SPARK_MASTER_WEBUI_PORT=${SPARK_WEBUI_PORT:-8080}
    ports:
      - "${SPARK_MASTER_PORT:-7077}:7077"
      - "${SPARK_WEBUI_PORT:-8080}:8080"
    volumes:
      - spark_data:/opt/bitnami/spark
    networks:
      - spark-network

  spark-worker-1:
    image: bitnami/spark:latest
    container_name: spark-worker-1
    restart: unless-stopped
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=${SPARK_WORKER_MEMORY:-2g}
      - SPARK_WORKER_CORES=${SPARK_WORKER_CORES:-2}
    depends_on:
      - spark-master
    networks:
      - spark-network

  spark-worker-2:
    image: bitnami/spark:latest
    container_name: spark-worker-2
    restart: unless-stopped
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=${SPARK_WORKER_MEMORY:-2g}
      - SPARK_WORKER_CORES=${SPARK_WORKER_CORES:-2}
    depends_on:
      - spark-master
    networks:
      - spark-network

volumes:
  spark_data:

networks:
  spark-network:
    driver: bridge
EOF

# 2. Create the .env file
cat > .env << 'EOF'
# Apache Spark Cluster
SPARK_MASTER_PORT=7077
SPARK_WEBUI_PORT=8080
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_CORES=2
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f

One-Liner
Run this command to download and set up the recipe in one step:
terminal
curl -fsSL https://docker.recipes/api/recipes/apache-spark-standalone/run | bash

Troubleshooting
- java.lang.OutOfMemoryError in worker containers: Increase SPARK_WORKER_MEMORY environment variable or reduce SPARK_WORKER_CORES to balance memory per core allocation
- Workers failing to connect to master with 'Connection refused' errors: Verify the spark-master container is fully started before workers attempt to connect; add health checks or startup delays (a sketch follows this list)
- Spark jobs hanging in RUNNING state indefinitely: Check executor memory settings and ensure sufficient resources are available; you may need to increase driver memory or reduce the number of executor instances
- ClassNotFoundException when submitting custom applications: Mount application JAR files to /opt/bitnami/spark/jars directory in all containers or use spark-submit with --jars parameter
- Web UI showing 'Application Not Found' errors: Ensure the Spark History Server is configured with shared storage, or check that the application completed successfully without exceptions
- Serialization errors with custom objects: Verify all custom classes implement Serializable interface and are available on classpath of all worker nodes
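For the 'Connection refused' item above, one option is to give spark-master a health check and then switch the workers' depends_on entries in docker-compose.yml to the long form with condition: service_healthy. The probe below is only a sketch: it uses bash's /dev/tcp feature to test the master RPC port from inside the container, and the timings are arbitrary; if you also created the extra-worker override earlier, merge both into a single override file.
terminal
# Sketch: add a health check to the master so its readiness can gate the workers
cat > docker-compose.override.yml << 'EOF'
services:
  spark-master:
    healthcheck:
      # Succeeds once the master is listening on its RPC port (7077)
      test: ["CMD", "bash", "-c", "exec 3<>/dev/tcp/localhost/7077"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 20s
EOF

docker compose up -d

With the health check in place, each worker's depends_on in docker-compose.yml can be changed from the short list form to the map form with condition: service_healthy, so Compose waits for the master to report healthy before starting the workers.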
Components
spark-master, spark-worker, spark-history
Tags
#spark #big-data #distributed #analytics #etl
Category
Database Stacks