Apache Spark
Distributed computing engine for big data processing.
Overview
Apache Spark is an open-source unified analytics engine designed for large-scale data processing and distributed computing. Originally developed at UC Berkeley's AMPLab in 2009, Spark has become a de facto standard for big data processing, offering in-memory computing that can be up to 100x faster than traditional Hadoop MapReduce for suitable workloads. Spark provides high-level APIs in Java, Scala, Python, and R, along with an optimized engine that supports general computation graphs for data analysis.

This configuration deploys a complete Spark cluster using the master-worker architecture with Bitnami's optimized Spark containers. The setup includes one Spark master node responsible for cluster management, resource allocation, and job scheduling, paired with multiple worker nodes that execute the actual data processing tasks. The master node exposes both the cluster management interface and the Spark context endpoint, while workers automatically register with the master and contribute their configured CPU cores and memory to the cluster. This distributed setup enables horizontal scaling and fault tolerance when processing large datasets across multiple nodes.

Data engineers, machine learning practitioners, and analytics teams working with big data workloads will find this stack useful for batch processing, real-time streaming analytics, and distributed machine learning. Containerized deployment combined with Spark's unified analytics engine makes it well suited to organizations that need to process terabytes of data, perform complex ETL operations, or train machine learning models at scale without managing bare-metal Spark installations.
Key Features
- Distributed in-memory computing with automatic data caching across worker nodes
- Unified analytics engine supporting SQL queries, streaming data, machine learning (MLlib), and graph processing (GraphX); a minimal streaming sketch follows this list
- Fault-tolerant RDD (Resilient Distributed Dataset) computation with automatic recovery from node failures
- Dynamic resource allocation with configurable worker memory (2GB) and CPU cores (2 cores) per worker
- Built-in Catalyst SQL optimizer for query performance optimization and code generation
- Support for multiple data sources including HDFS, S3, Cassandra, HBase, and structured streaming
- Web-based Spark UI for real-time monitoring of jobs, stages, tasks, and cluster resources
- Horizontal scaling through Docker Compose replicas with automatic worker registration to master
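The streaming support mentioned above can be exercised without any external data source. The sketch below is a hypothetical example, not part of the original recipe: it assumes PySpark is installed on the machine running the driver, uses Spark's built-in "rate" source to generate synthetic rows, and connects to the master URL exposed by this stack. When the driver runs on the host, the workers must be able to open connections back to it, which may require additional network configuration.
streaming_sketch.py
# streaming_sketch.py - hypothetical example; assumes PySpark is installed locally
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count

spark = (
    SparkSession.builder
    .appName("streaming-sketch")
    .master("spark://localhost:7077")  # cluster endpoint from this recipe
    .getOrCreate()
)

# The built-in "rate" source emits synthetic (timestamp, value) rows,
# handy for testing the streaming path without Kafka or another broker.
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Count events per 10-second window and print the running totals to the console.
query = (
    events.groupBy(window("timestamp", "10 seconds"))
    .agg(count("*").alias("events"))
    .writeStream.outputMode("complete")
    .format("console")
    .start()
)

query.awaitTermination()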
Common Use Cases
- Large-scale ETL pipelines processing terabytes of log data, customer records, or financial transactions (a minimal PySpark sketch follows this list)
- Real-time streaming analytics for IoT sensor data, clickstream analysis, or fraud detection systems
- Distributed machine learning model training using MLlib for recommendation engines or predictive analytics
- Interactive data exploration and analytics using Spark SQL for business intelligence and reporting
- Graph processing and social network analysis using GraphX for relationship mapping and community detection
- Data lake processing for transforming raw data into structured formats for downstream analytics
- High-performance computing clusters for research institutions processing scientific datasets
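As referenced in the first use case, a batch ETL job against this stack is just a PySpark script submitted to the master. The sketch below is hypothetical: the input and output paths, column names, and application name are placeholders, and in a real cluster the paths would need to live on storage every worker can reach (for example S3 or HDFS) rather than on one machine's local filesystem.
etl_sketch.py
# etl_sketch.py - hypothetical example; paths and column names are placeholders
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("etl-sketch")
    .master("spark://localhost:7077")  # cluster endpoint from this recipe
    .getOrCreate()
)

# Extract: read raw CSV logs (schema inferred here for brevity).
raw = (
    spark.read.option("header", True)
    .option("inferSchema", True)
    .csv("/data/raw/logs.csv")
)

# Transform: drop malformed rows and aggregate unique users per day.
daily = (
    raw.dropna(subset=["timestamp", "user_id"])
    .withColumn("day", F.to_date("timestamp"))
    .groupBy("day")
    .agg(F.countDistinct("user_id").alias("unique_users"))
)

# Load: write the result as partitioned Parquet for downstream analytics.
daily.write.mode("overwrite").partitionBy("day").parquet("/data/curated/daily_users")

spark.stop()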
Prerequisites
- Minimum 8GB RAM (4GB+ for Spark master and workers, plus host OS overhead)
- Multi-core CPU (4+ cores recommended for meaningful parallel processing)
- Available ports 8080 (Spark Master UI) and 7077 (Spark cluster communication)
- Understanding of distributed computing concepts and Spark programming model (RDDs, DataFrames, Datasets)
- Familiarity with at least one Spark-supported language (Python/PySpark, Scala, Java, or R)
- Knowledge of data formats commonly used with Spark (Parquet, JSON, CSV, Avro)
This configuration is intended for development and testing. Review security settings, change default credentials, and test thoroughly before production use.
docker-compose.yml
services:
  spark-master:
    image: bitnami/spark:latest
    container_name: spark-master
    environment:
      SPARK_MODE: master
      SPARK_MASTER_HOST: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
    networks:
      - spark

  spark-worker:
    image: bitnami/spark:latest
    environment:
      SPARK_MODE: worker
      SPARK_MASTER_URL: spark://spark-master:7077
      SPARK_WORKER_MEMORY: 2G
      SPARK_WORKER_CORES: 2
    deploy:
      replicas: 2
    depends_on:
      - spark-master
    networks:
      - spark

networks:
  spark:
    driver: bridge
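The recipe above hard-codes 2G of memory and 2 cores per worker, while the .env template below only notes that those values can be adjusted. One way to actually wire the two together, shown here as a sketch rather than part of the original recipe, is Docker Compose variable substitution with defaults: with the excerpt below in place, adding SPARK_WORKER_MEMORY=4G and SPARK_WORKER_CORES=4 to .env would override the defaults, and leaving .env unchanged keeps the original behavior.
docker-compose.yml (spark-worker excerpt, sketch)
  spark-worker:
    image: bitnami/spark:latest
    environment:
      SPARK_MODE: worker
      SPARK_MASTER_URL: spark://spark-master:7077
      # Read from .env when set; otherwise fall back to the recipe defaults.
      SPARK_WORKER_MEMORY: ${SPARK_WORKER_MEMORY:-2G}
      SPARK_WORKER_CORES: ${SPARK_WORKER_CORES:-2}
    deploy:
      replicas: 2
    depends_on:
      - spark-master
    networks:
      - spark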
.env Template
.env
# Adjust SPARK_WORKER_MEMORY and SPARK_WORKER_CORES
Usage Notes
- Docs: https://spark.apache.org/docs/latest/
- Master UI at http://localhost:8080 for cluster and job monitoring
- Submit jobs: spark-submit --master spark://localhost:7077 app.py
- Scale workers via deploy.replicas and adjust memory/cores per worker (see the terminal sketch after these notes)
- PySpark: SparkSession.builder.master('spark://localhost:7077').getOrCreate()
- Supports SQL, streaming, ML (MLlib), and graph processing
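The scaling note above can be illustrated with a couple of commands. The sketch below assumes the stack from this recipe is already running; --scale is a standard Docker Compose option, and the grep pattern matches the wording standalone workers typically log when they join the master (the exact phrasing may vary between Spark versions).
terminal
# Scale out to four workers; each new container registers with the master automatically.
docker compose up -d --scale spark-worker=4

# Confirm registration in the worker logs (or check the Master UI at http://localhost:8080).
docker compose logs spark-worker | grep -i "registered with master"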
Individual Services (2 services)
Copy individual services to mix and match with your existing compose files.
spark-master
spark-master:
  image: bitnami/spark:latest
  container_name: spark-master
  environment:
    SPARK_MODE: master
    SPARK_MASTER_HOST: spark-master
  ports:
    - "8080:8080"
    - "7077:7077"
  networks:
    - spark
spark-worker
spark-worker:
  image: bitnami/spark:latest
  environment:
    SPARK_MODE: worker
    SPARK_MASTER_URL: spark://spark-master:7077
    SPARK_WORKER_MEMORY: 2G
    SPARK_WORKER_CORES: 2
  deploy:
    replicas: 2
  depends_on:
    - spark-master
  networks:
    - spark
Quick Start
terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  spark-master:
    image: bitnami/spark:latest
    container_name: spark-master
    environment:
      SPARK_MODE: master
      SPARK_MASTER_HOST: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
    networks:
      - spark

  spark-worker:
    image: bitnami/spark:latest
    environment:
      SPARK_MODE: worker
      SPARK_MASTER_URL: spark://spark-master:7077
      SPARK_WORKER_MEMORY: 2G
      SPARK_WORKER_CORES: 2
    deploy:
      replicas: 2
    depends_on:
      - spark-master
    networks:
      - spark

networks:
  spark:
    driver: bridge
EOF

# 2. Create the .env file
cat > .env << 'EOF'
# Adjust SPARK_WORKER_MEMORY and SPARK_WORKER_CORES
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f
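Once the services are up, one way to confirm that the cluster can actually execute work is to submit the SparkPi example bundled with the image from inside the master container. This is an optional sketch: the jar path and glob assume the Bitnami image layout (/opt/bitnami/spark/examples/jars/) and may differ between image versions.
terminal
# Optional: run the bundled SparkPi example against the cluster.
docker exec spark-master bash -c 'spark-submit --master spark://spark-master:7077 \
  --class org.apache.spark.examples.SparkPi \
  /opt/bitnami/spark/examples/jars/spark-examples_*.jar 100'
# Expect a line like "Pi is roughly 3.14..." in the output and a completed
# application in the Master UI at http://localhost:8080.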
One-Liner
Run this command to download and set up the recipe in one step:
terminal
curl -fsSL https://docker.recipes/api/recipes/spark/run | bash
Troubleshooting
- Workers not appearing in Master UI: Check network connectivity and ensure SPARK_MASTER_URL points to correct master hostname
- OutOfMemoryError during job execution: Increase SPARK_WORKER_MEMORY environment variable or reduce data partition sizes
- Jobs failing with 'Task not serializable' error: Ensure all functions and variables used in Spark transformations are serializable
- Slow performance on small datasets: Spark overhead makes it inefficient for small data; consider increasing data size or reducing parallelism
- Connection refused on port 7077: Verify spark-master container is running and port 7077 is properly exposed
- Executor lost errors during long-running jobs: Increase spark.network.timeout and enable spark.sql.adaptive.coalescePartitions.enabled for stability (see the sketch below)
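For the last item, both settings can be supplied per job at submit time without editing any cluster files. The command below is a sketch: app.py stands in for your own application, and 600s is an arbitrary example value for the timeout.
terminal
# Pass stability-related settings per job (app.py is a placeholder):
spark-submit \
  --master spark://localhost:7077 \
  --conf spark.network.timeout=600s \
  --conf spark.sql.adaptive.coalescePartitions.enabled=true \
  app.py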