Apache Spark
Distributed computing engine for big data processing.
Overview
Apache Spark is an open-source unified analytics engine designed for large-scale data processing and distributed computing. Originally developed at UC Berkeley's AMPLab in 2009, Spark has become a de facto standard for big data processing, offering in-memory computing that can be up to 100x faster than traditional Hadoop MapReduce for suitable workloads. Spark provides high-level APIs in Java, Scala, Python, and R, along with an optimized engine that supports general computation graphs for data analysis.

This configuration deploys a complete Spark cluster using the master-worker architecture with Bitnami's optimized Spark containers. The setup includes one Spark master node responsible for cluster management, resource allocation, and job scheduling, paired with multiple worker nodes that execute the actual data processing tasks. The master node exposes both the cluster management interface and the Spark context endpoint, while workers automatically register with the master and contribute their configured CPU cores and memory to the cluster. This distributed setup enables horizontal scaling and fault tolerance when processing large datasets across multiple nodes.

Data engineers, machine learning practitioners, and analytics teams working with big data workloads will find this stack useful for batch processing, real-time streaming analytics, and distributed machine learning. Containerized deployment combined with Spark's unified analytics engine makes it well suited to organizations that need to process terabytes of data, perform complex ETL operations, or train machine learning models at scale without managing bare-metal Spark installations.
Key Features
- Distributed in-memory computing with automatic data caching across worker nodes
- Unified analytics engine supporting SQL queries, streaming data, machine learning (MLlib), and graph processing (GraphX); a minimal streaming sketch follows this list
- Fault-tolerant RDD (Resilient Distributed Dataset) computation with automatic recovery from node failures
- Dynamic resource allocation with configurable worker memory (2GB) and CPU cores (2 cores) per worker
- Built-in Catalyst SQL optimizer for query performance optimization and code generation
- Support for multiple data sources including HDFS, S3, Cassandra, HBase, and structured streaming
- Web-based Spark UI for real-time monitoring of jobs, stages, tasks, and cluster resources
- Horizontal scaling through Docker Compose replicas with automatic worker registration to master
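The streaming support mentioned above can be exercised without any external data source. The sketch below is a hypothetical example, not part of the original recipe: it assumes PySpark is installed on the machine running the driver, uses Spark's built-in "rate" source to generate synthetic rows, and connects to the master URL exposed by this stack. When the driver runs on the host, the workers must be able to open connections back to it, which may require additional network configuration.
streaming_sketch.py
# streaming_sketch.py - hypothetical example; assumes PySpark is installed locally
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count

spark = (
    SparkSession.builder
    .appName("streaming-sketch")
    .master("spark://localhost:7077")  # cluster endpoint from this recipe
    .getOrCreate()
)

# The built-in "rate" source emits synthetic (timestamp, value) rows,
# handy for testing the streaming path without Kafka or another broker.
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Count events per 10-second window and print the running totals to the console.
query = (
    events.groupBy(window("timestamp", "10 seconds"))
    .agg(count("*").alias("events"))
    .writeStream.outputMode("complete")
    .format("console")
    .start()
)

query.awaitTermination()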
Common Use Cases
- Large-scale ETL pipelines processing terabytes of log data, customer records, or financial transactions (a minimal PySpark sketch follows this list)
- Real-time streaming analytics for IoT sensor data, clickstream analysis, or fraud detection systems
- Distributed machine learning model training using MLlib for recommendation engines or predictive analytics
- Interactive data exploration and analytics using Spark SQL for business intelligence and reporting
- Graph processing and social network analysis using GraphX for relationship mapping and community detection
- Data lake processing for transforming raw data into structured formats for downstream analytics
- High-performance computing clusters for research institutions processing scientific datasets
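As referenced in the first use case, a batch ETL job against this stack is just a PySpark script submitted to the master. The sketch below is hypothetical: the input and output paths, column names, and application name are placeholders, and in a real cluster the paths would need to live on storage every worker can reach (for example S3 or HDFS) rather than on one machine's local filesystem.
etl_sketch.py
# etl_sketch.py - hypothetical example; paths and column names are placeholders
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("etl-sketch")
    .master("spark://localhost:7077")  # cluster endpoint from this recipe
    .getOrCreate()
)

# Extract: read raw CSV logs (schema inferred here for brevity).
raw = (
    spark.read.option("header", True)
    .option("inferSchema", True)
    .csv("/data/raw/logs.csv")
)

# Transform: drop malformed rows and aggregate unique users per day.
daily = (
    raw.dropna(subset=["timestamp", "user_id"])
    .withColumn("day", F.to_date("timestamp"))
    .groupBy("day")
    .agg(F.countDistinct("user_id").alias("unique_users"))
)

# Load: write the result as partitioned Parquet for downstream analytics.
daily.write.mode("overwrite").partitionBy("day").parquet("/data/curated/daily_users")

spark.stop()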
Prerequisites
- Minimum 8GB RAM (4GB+ for Spark master and workers, plus host OS overhead)
- Multi-core CPU (4+ cores recommended for meaningful parallel processing)
- Available ports 8080 (Spark Master UI) and 7077 (Spark cluster communication)
- Understanding of distributed computing concepts and Spark programming model (RDDs, DataFrames, Datasets)
- Familiarity with at least one Spark-supported language (Python/PySpark, Scala, Java, or R)
- Knowledge of data formats commonly used with Spark (Parquet, JSON, CSV, Avro)
This configuration is intended for development and testing. Review security settings, change default credentials, and test thoroughly before production use.
docker-compose.yml
services:
  spark-master:
    image: bitnami/spark:latest
    container_name: spark-master
    environment:
      SPARK_MODE: master
      SPARK_MASTER_HOST: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
    networks:
      - spark

  spark-worker:
    image: bitnami/spark:latest
    environment:
      SPARK_MODE: worker
      SPARK_MASTER_URL: spark://spark-master:7077
      SPARK_WORKER_MEMORY: 2G
      SPARK_WORKER_CORES: 2
    deploy:
      replicas: 2
    depends_on:
      - spark-master
    networks:
      - spark

networks:
  spark:
    driver: bridge
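The recipe above hard-codes 2G of memory and 2 cores per worker, while the .env template below only notes that those values can be adjusted. One way to actually wire the two together, shown here as a sketch rather than part of the original recipe, is Docker Compose variable substitution with defaults: with the excerpt below in place, adding SPARK_WORKER_MEMORY=4G and SPARK_WORKER_CORES=4 to .env would override the defaults, and leaving .env unchanged keeps the original behavior.
docker-compose.yml (spark-worker excerpt, sketch)
  spark-worker:
    image: bitnami/spark:latest
    environment:
      SPARK_MODE: worker
      SPARK_MASTER_URL: spark://spark-master:7077
      # Read from .env when set; otherwise fall back to the recipe defaults.
      SPARK_WORKER_MEMORY: ${SPARK_WORKER_MEMORY:-2G}
      SPARK_WORKER_CORES: ${SPARK_WORKER_CORES:-2}
    deploy:
      replicas: 2
    depends_on:
      - spark-master
    networks:
      - spark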
.env Template
.env
# Adjust SPARK_WORKER_MEMORY and SPARK_WORKER_CORES
Usage Notes
- Docs: https://spark.apache.org/docs/latest/
- Master UI at http://localhost:8080 for cluster and job monitoring
- Submit jobs: spark-submit --master spark://localhost:7077 app.py
- Scale workers via deploy.replicas and adjust memory/cores per worker (see the terminal sketch after these notes)
- PySpark: SparkSession.builder.master('spark://localhost:7077').getOrCreate()
- Supports SQL, streaming, ML (MLlib), and graph processing
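The scaling note above can be illustrated with a couple of commands. The sketch below assumes the stack from this recipe is already running; --scale is a standard Docker Compose option, and the grep pattern matches the wording standalone workers typically log when they join the master (the exact phrasing may vary between Spark versions).
terminal
# Scale out to four workers; each new container registers with the master automatically.
docker compose up -d --scale spark-worker=4

# Confirm registration in the worker logs (or check the Master UI at http://localhost:8080).
docker compose logs spark-worker | grep -i "registered with master"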
Individual Services (2 services)
Copy individual services to mix and match with your existing compose files.
spark-master
spark-master:
  image: bitnami/spark:latest
  container_name: spark-master
  environment:
    SPARK_MODE: master
    SPARK_MASTER_HOST: spark-master
  ports:
    - "8080:8080"
    - "7077:7077"
  networks:
    - spark
spark-worker
spark-worker:
  image: bitnami/spark:latest
  environment:
    SPARK_MODE: worker
    SPARK_MASTER_URL: spark://spark-master:7077
    SPARK_WORKER_MEMORY: 2G
    SPARK_WORKER_CORES: 2
  deploy:
    replicas: 2
  depends_on:
    - spark-master
  networks:
    - spark
Quick Start
terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  spark-master:
    image: bitnami/spark:latest
    container_name: spark-master
    environment:
      SPARK_MODE: master
      SPARK_MASTER_HOST: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
    networks:
      - spark

  spark-worker:
    image: bitnami/spark:latest
    environment:
      SPARK_MODE: worker
      SPARK_MASTER_URL: spark://spark-master:7077
      SPARK_WORKER_MEMORY: 2G
      SPARK_WORKER_CORES: 2
    deploy:
      replicas: 2
    depends_on:
      - spark-master
    networks:
      - spark

networks:
  spark:
    driver: bridge
EOF

# 2. Create the .env file
cat > .env << 'EOF'
# Adjust SPARK_WORKER_MEMORY and SPARK_WORKER_CORES
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f
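Once the services are up, one way to confirm that the cluster can actually execute work is to submit the SparkPi example bundled with the image from inside the master container. This is an optional sketch: the jar path and glob assume the Bitnami image layout (/opt/bitnami/spark/examples/jars/) and may differ between image versions.
terminal
# Optional: run the bundled SparkPi example against the cluster.
docker exec spark-master bash -c 'spark-submit --master spark://spark-master:7077 \
  --class org.apache.spark.examples.SparkPi \
  /opt/bitnami/spark/examples/jars/spark-examples_*.jar 100'
# Expect a line like "Pi is roughly 3.14..." in the output and a completed
# application in the Master UI at http://localhost:8080.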
One-Liner
Run this command to download and set up the recipe in one step:
terminal
curl -fsSL https://docker.recipes/api/recipes/spark/run | bash
Troubleshooting
- Workers not appearing in Master UI: Check network connectivity and ensure SPARK_MASTER_URL points to correct master hostname
- OutOfMemoryError during job execution: Increase SPARK_WORKER_MEMORY environment variable or reduce data partition sizes
- Jobs failing with 'Task not serializable' error: Ensure all functions and variables used in Spark transformations are serializable
- Slow performance on small datasets: Spark overhead makes it inefficient for small data; consider increasing data size or reducing parallelism
- Connection refused on port 7077: Verify spark-master container is running and port 7077 is properly exposed
- Executor lost errors during long-running jobs: Increase spark.network.timeout and enable spark.sql.adaptive.coalescePartitions.enabled for stability (see the sketch below)
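For the last item, both settings can be supplied per job at submit time without editing any cluster files. The command below is a sketch: app.py stands in for your own application, and 600s is an arbitrary example value for the timeout.
terminal
# Pass stability-related settings per job (app.py is a placeholder):
spark-submit \
  --master spark://localhost:7077 \
  --conf spark.network.timeout=600s \
  --conf spark.sql.adaptive.coalescePartitions.enabled=true \
  app.py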