
Apache Spark

advanced

Distributed computing engine for big data processing.

Overview

Apache Spark is an open-source unified analytics engine designed for large-scale data processing and distributed computing. Originally developed at UC Berkeley's AMPLab in 2009, Spark has become the de facto standard for big data processing, offering in-memory computing that can be up to 100x faster than traditional Hadoop MapReduce. Spark provides high-level APIs in Java, Scala, Python, and R, along with an optimized engine that supports general computation graphs for data analysis.

This configuration deploys a complete Spark cluster using the master-worker architecture with Bitnami's optimized Spark containers. The setup includes one Spark master node responsible for cluster management, resource allocation, and job scheduling, paired with multiple worker nodes that execute the actual data processing tasks. The master node exposes both the cluster management interface and the Spark context endpoint, while workers automatically register with the master and contribute their configured CPU cores and memory to the cluster. This distributed setup enables horizontal scaling and fault tolerance for processing large datasets across multiple nodes.

Data engineers, machine learning practitioners, and analytics teams working with big data workloads can use this stack for batch processing, real-time streaming analytics, and distributed machine learning. Combining containerized deployment with Spark's unified analytics engine makes it well suited for organizations that need to process terabytes of data, perform complex ETL operations, or train machine learning models at scale without managing bare-metal Spark installations.

Key Features

  • Distributed in-memory computing with automatic data caching across worker nodes
  • Unified analytics engine supporting SQL queries, streaming data, machine learning (MLlib), and graph processing (GraphX)
  • Fault-tolerant RDD (Resilient Distributed Dataset) computation with automatic recovery from node failures
  • Configurable per-worker resources via environment variables (2 GB memory and 2 CPU cores per worker in this recipe)
  • Built-in Catalyst SQL optimizer for query performance optimization and code generation
  • Support for multiple data sources including HDFS, S3, Cassandra, and HBase, plus Structured Streaming for continuous data
  • Web-based Spark UI for real-time monitoring of jobs, stages, tasks, and cluster resources
  • Horizontal scaling through Docker Compose replicas with automatic worker registration to master

Common Use Cases

  • Large-scale ETL pipelines processing terabytes of log data, customer records, or financial transactions
  • Real-time streaming analytics for IoT sensor data, clickstream analysis, or fraud detection systems
  • Distributed machine learning model training using MLlib for recommendation engines or predictive analytics
  • Interactive data exploration and analytics using Spark SQL for business intelligence and reporting
  • Graph processing and social network analysis using GraphX for relationship mapping and community detection
  • Data lake processing for transforming raw data into structured formats for downstream analytics
  • High-performance computing clusters for research institutions processing scientific datasets

Prerequisites

  • Minimum 8GB RAM (4GB+ for Spark master and workers, plus host OS overhead)
  • Multi-core CPU (4+ cores recommended for meaningful parallel processing)
  • Available ports 8080 (Spark Master UI) and 7077 (Spark cluster communication); a quick port check appears after this list
  • Understanding of distributed computing concepts and Spark programming model (RDDs, DataFrames, Datasets)
  • Familiarity with at least one Spark-supported language (Python/PySpark, Scala, Java, or R)
  • Knowledge of data formats commonly used with Spark (Parquet, JSON, CSV, Avro)
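
Before starting the stack, you can confirm that nothing else is listening on the required ports (the check referenced in the ports prerequisite above). A minimal sketch, assuming lsof is available on the host:

terminal
# ports 8080 and 7077 must be free for the Master UI and cluster endpoint
lsof -nP -iTCP:8080 -sTCP:LISTEN
lsof -nP -iTCP:7077 -sTCP:LISTEN
# no output means the port is free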

For development & testing. Review security settings, change default credentials, and test thoroughly before production use.

docker-compose.yml

docker-compose.yml
services:
  spark-master:
    image: bitnami/spark:latest
    container_name: spark-master
    environment:
      SPARK_MODE: master
      SPARK_MASTER_HOST: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
    networks:
      - spark

  spark-worker:
    image: bitnami/spark:latest
    environment:
      SPARK_MODE: worker
      SPARK_MASTER_URL: spark://spark-master:7077
      SPARK_WORKER_MEMORY: 2G
      SPARK_WORKER_CORES: 2
    deploy:
      replicas: 2
    depends_on:
      - spark-master
    networks:
      - spark

networks:
  spark:
    driver: bridge
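
The spark-worker service has no fixed container_name, so it can also be scaled at runtime instead of editing deploy.replicas. A hedged example, assuming Docker Compose v2 (the --scale flag overrides the replica count for that run):

terminal
# run four workers instead of two; they register with the master automatically
docker compose up -d --scale spark-worker=4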

.env Template

.env
# Adjust SPARK_WORKER_MEMORY and SPARK_WORKER_CORES
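
The template above is only a comment: the compose file hard-codes the worker resources. If you prefer to drive them from .env, one possible sketch is shown below; the variable names are illustrative and only take effect if docker-compose.yml references them (e.g. SPARK_WORKER_MEMORY: ${SPARK_WORKER_MEMORY}).

.env
# Hypothetical overrides - require matching ${...} references in docker-compose.yml
SPARK_WORKER_MEMORY=4G
SPARK_WORKER_CORES=4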

Usage Notes

  1. Docs: https://spark.apache.org/docs/latest/
  2. Master UI at http://localhost:8080 - cluster and job monitoring
  3. Submit jobs: spark-submit --master spark://localhost:7077 app.py (see the containerized example after this list)
  4. Scale workers via deploy.replicas, adjust memory/cores per worker
  5. PySpark: SparkSession.builder.master('spark://localhost:7077').getOrCreate()
  6. Supports SQL, streaming, ML (MLlib), and graph processing
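
As referenced in note 3, the simplest way to test a submission is from inside the master container, which avoids installing Spark on the host. A sketch, assuming the Bitnami image layout under /opt/bitnami/spark:

terminal
# submit the bundled Pi example against the cluster
docker exec -it spark-master spark-submit \
  --master spark://spark-master:7077 \
  /opt/bitnami/spark/examples/src/main/python/pi.py 10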

Individual Services (2 services)

Copy individual services to mix and match with your existing compose files.

spark-master
spark-master:
  image: bitnami/spark:latest
  container_name: spark-master
  environment:
    SPARK_MODE: master
    SPARK_MASTER_HOST: spark-master
  ports:
    - "8080:8080"
    - "7077:7077"
  networks:
    - spark
spark-worker
spark-worker:
  image: bitnami/spark:latest
  environment:
    SPARK_MODE: worker
    SPARK_MASTER_URL: spark://spark-master:7077
    SPARK_WORKER_MEMORY: 2G
    SPARK_WORKER_CORES: 2
  deploy:
    replicas: 2
  depends_on:
    - spark-master
  networks:
    - spark

Quick Start

terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  spark-master:
    image: bitnami/spark:latest
    container_name: spark-master
    environment:
      SPARK_MODE: master
      SPARK_MASTER_HOST: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
    networks:
      - spark

  spark-worker:
    image: bitnami/spark:latest
    environment:
      SPARK_MODE: worker
      SPARK_MASTER_URL: spark://spark-master:7077
      SPARK_WORKER_MEMORY: 2G
      SPARK_WORKER_CORES: 2
    deploy:
      replicas: 2
    depends_on:
      - spark-master
    networks:
      - spark

networks:
  spark:
    driver: bridge
EOF

# 2. Create the .env file
cat > .env << 'EOF'
# Adjust SPARK_WORKER_MEMORY and SPARK_WORKER_CORES
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f
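
After step 4, it is worth confirming that both workers registered with the master. A hedged check: the standalone master's web UI serves cluster state as JSON, so the second command below should list two workers.

terminal
# containers should all be running
docker compose ps

# inspect cluster state; look for two entries under "workers"
curl -s http://localhost:8080/json/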

One-Liner

Run this command to download and set up the recipe in one step:

terminal
curl -fsSL https://docker.recipes/api/recipes/spark/run | bash

Troubleshooting

  • Workers not appearing in Master UI: Check network connectivity and ensure SPARK_MASTER_URL points to correct master hostname
  • OutOfMemoryError during job execution: Increase SPARK_WORKER_MEMORY environment variable or reduce data partition sizes
  • Jobs failing with 'Task not serializable' error: Ensure all functions and variables used in Spark transformations are serializable
  • Slow performance on small datasets: Spark overhead makes it inefficient for small data; consider increasing data size or reducing parallelism
  • Connection refused on port 7077: Verify spark-master container is running and port 7077 is properly exposed
  • Executor lost errors during long-running jobs: Increase spark.network.timeout and enable spark.sql.adaptive.coalescePartitions.enabled for stability (see the example below)
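
Both settings from the last item can be supplied per job at submit time rather than baked into the cluster configuration. A sketch with illustrative values (app.py stands in for your application, as in the usage notes):

terminal
spark-submit --master spark://localhost:7077 \
  --conf spark.network.timeout=600s \
  --conf spark.sql.adaptive.coalescePartitions.enabled=true \
  app.py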


