Apache Spark Standalone Cluster
Apache Spark cluster with master and worker nodes for distributed data processing.
Overview
Apache Spark is an open-source unified analytics engine developed at UC Berkeley's AMPLab, designed for large-scale data processing with built-in modules for streaming, SQL, machine learning, and graph processing. Unlike traditional MapReduce frameworks, Spark performs in-memory computing, making it up to 100 times faster for certain applications by keeping data in RAM between operations rather than writing to disk after each step.
This standalone cluster deployment combines a Spark master node with multiple worker nodes to create a distributed computing environment capable of processing massive datasets across multiple machines. The master node coordinates job scheduling and resource allocation while worker nodes execute the actual data processing tasks, with the cluster automatically handling data partitioning, fault tolerance, and load balancing across available resources.
Data engineers, data scientists, and organizations processing large volumes of structured or unstructured data will benefit from this setup, particularly those running ETL pipelines, machine learning workflows, or real-time analytics. This configuration is ideal for companies transitioning from single-machine data processing to distributed computing, or those needing a development environment that mirrors production Spark clusters without the complexity of full cluster management platforms like YARN or Kubernetes.
Key Features
- Resilient Distributed Datasets (RDDs) with automatic fault recovery and data lineage tracking
- In-memory computing that minimizes disk I/O between operations, delivering speedups of up to 100x over MapReduce for some workloads
- Built-in Catalyst SQL optimizer for automatic query optimization and code generation
- Dynamic resource allocation allowing workers to scale compute resources based on workload demands
- Spark Streaming for processing live data streams with micro-batch processing architecture
- MLlib machine learning library with distributed algorithms for classification, regression, and clustering
- GraphX graph processing framework for social network analysis and graph algorithms
- Multi-language support with native APIs for Scala, Java, Python (PySpark), and R (SparkR); a minimal PySpark submission is sketched after this list
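As a small illustration of the PySpark API mentioned above, the sketch below creates a tiny DataFrame job and submits it to the standalone master. It assumes the cluster from this recipe is already running and that the bitnami/spark image ships Python and spark-submit under /opt/bitnami/spark; the file name example_job.py is purely illustrative.
terminal
# Minimal PySpark sketch: a DataFrame plus a SQL query (file name is illustrative)
cat > example_job.py << 'EOF'
from pyspark.sql import SparkSession

# The master URL is supplied by spark-submit below, so none is hard-coded here
spark = SparkSession.builder.appName("recipe-smoke-test").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()

spark.stop()
EOF

# Copy the script into the master container and submit it to the cluster
docker cp example_job.py spark-master:/tmp/example_job.py
docker exec spark-master /opt/bitnami/spark/bin/spark-submit \
  --master spark://spark-master:7077 /tmp/example_job.py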
Common Use Cases
- ETL pipeline processing for transforming terabytes of log data from multiple sources into analytical formats
- Real-time fraud detection systems analyzing credit card transactions and user behavior patterns
- Machine learning model training on large datasets for recommendation engines and predictive analytics
- Financial risk modeling and Monte Carlo simulations requiring distributed parallel processing
- Genomics research processing DNA sequencing data and performing large-scale bioinformatics analysis
- IoT sensor data aggregation and analysis for smart city infrastructure monitoring
- E-commerce clickstream analysis for customer journey mapping and conversion optimization
Prerequisites
- Minimum 8GB RAM allocated to Docker with at least 4GB available for Spark worker processes
- Docker Engine 20.10+ and Docker Compose 2.0+ for proper container networking and resource management
- Understanding of distributed computing concepts including data partitioning and parallel processing
- Basic knowledge of Spark DataFrame/Dataset APIs and SQL for job submission and data manipulation
- Familiarity with JVM memory management and garbage collection tuning for optimal performance
- Network ports 7077 and 8080 available on the host system for Spark master and web UI access (a quick check is sketched after this list)
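A quick way to verify these prerequisites before starting the stack is sketched below; it assumes a Linux host where the ss utility is available and Docker is already installed.
terminal
# Check Docker Engine and Compose versions (20.10+ / 2.0+ expected)
docker --version
docker compose version

# Show how much memory the Docker Engine can see
docker info --format 'Total memory: {{.MemTotal}} bytes'

# Confirm ports 7077 and 8080 are not already bound on the host
ss -ltn | grep -E ':(7077|8080)\b' || echo "Ports 7077 and 8080 are free"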
For development and testing only. Review security settings, change default credentials, and test thoroughly before production use.
docker-compose.yml
services:
  spark-master:
    image: bitnami/spark:latest
    container_name: spark-master
    restart: unless-stopped
    environment:
      - SPARK_MODE=master
      - SPARK_MASTER_HOST=spark-master
      - SPARK_MASTER_PORT=${SPARK_MASTER_PORT:-7077}
      - SPARK_MASTER_WEBUI_PORT=${SPARK_WEBUI_PORT:-8080}
    ports:
      - "${SPARK_MASTER_PORT:-7077}:7077"
      - "${SPARK_WEBUI_PORT:-8080}:8080"
    volumes:
      - spark_data:/opt/bitnami/spark
    networks:
      - spark-network

  spark-worker-1:
    image: bitnami/spark:latest
    container_name: spark-worker-1
    restart: unless-stopped
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=${SPARK_WORKER_MEMORY:-2g}
      - SPARK_WORKER_CORES=${SPARK_WORKER_CORES:-2}
    depends_on:
      - spark-master
    networks:
      - spark-network

  spark-worker-2:
    image: bitnami/spark:latest
    container_name: spark-worker-2
    restart: unless-stopped
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=${SPARK_WORKER_MEMORY:-2g}
      - SPARK_WORKER_CORES=${SPARK_WORKER_CORES:-2}
    depends_on:
      - spark-master
    networks:
      - spark-network

volumes:
  spark_data:

networks:
  spark-network:
    driver: bridge

.env Template
.env
# Apache Spark Cluster
SPARK_MASTER_PORT=7077
SPARK_WEBUI_PORT=8080
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_CORES=2

Usage Notes
- Spark Master UI at http://localhost:8080
- Submit jobs to spark://localhost:7077 (a quick example follows this list)
- Scale workers by adding more services (a sketch appears after the Individual Services section)
- Great for big data processing
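To confirm the cluster is wired together, the commands below query the JSON status endpoint exposed by the standalone master's web UI for registered workers, then submit the SparkPi example bundled with Spark. The examples jar path assumes the standard Spark layout under /opt/bitnami/spark; adjust it if your image version differs.
terminal
# List registered workers and their state via the master web UI's JSON endpoint
curl -s http://localhost:8080/json/

# Submit the bundled SparkPi example; bash -c lets the jar glob expand inside the container
docker exec spark-master bash -c "/opt/bitnami/spark/bin/spark-submit --master spark://spark-master:7077 --class org.apache.spark.examples.SparkPi /opt/bitnami/spark/examples/jars/spark-examples_*.jar 100"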
Individual Services (3 services)
Copy individual services to mix and match with your existing compose files.
spark-master
spark-master:
  image: bitnami/spark:latest
  container_name: spark-master
  restart: unless-stopped
  environment:
    - SPARK_MODE=master
    - SPARK_MASTER_HOST=spark-master
    - SPARK_MASTER_PORT=${SPARK_MASTER_PORT:-7077}
    - SPARK_MASTER_WEBUI_PORT=${SPARK_WEBUI_PORT:-8080}
  ports:
    - "${SPARK_MASTER_PORT:-7077}:7077"
    - "${SPARK_WEBUI_PORT:-8080}:8080"
  volumes:
    - spark_data:/opt/bitnami/spark
  networks:
    - spark-network
spark-worker-1
spark-worker-1:
  image: bitnami/spark:latest
  container_name: spark-worker-1
  restart: unless-stopped
  environment:
    - SPARK_MODE=worker
    - SPARK_MASTER_URL=spark://spark-master:7077
    - SPARK_WORKER_MEMORY=${SPARK_WORKER_MEMORY:-2g}
    - SPARK_WORKER_CORES=${SPARK_WORKER_CORES:-2}
  depends_on:
    - spark-master
  networks:
    - spark-network
spark-worker-2
spark-worker-2:
  image: bitnami/spark:latest
  container_name: spark-worker-2
  restart: unless-stopped
  environment:
    - SPARK_MODE=worker
    - SPARK_MASTER_URL=spark://spark-master:7077
    - SPARK_WORKER_MEMORY=${SPARK_WORKER_MEMORY:-2g}
    - SPARK_WORKER_CORES=${SPARK_WORKER_CORES:-2}
  depends_on:
    - spark-master
  networks:
    - spark-network
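To act on the "Scale workers by adding more services" usage note, one option is a Compose override file, which docker compose merges with docker-compose.yml automatically. The hypothetical spark-worker-3 below simply reuses the worker settings from this recipe.
terminal
# Sketch: add a third worker without editing the main compose file
cat > docker-compose.override.yml << 'EOF'
services:
  spark-worker-3:
    image: bitnami/spark:latest
    container_name: spark-worker-3
    restart: unless-stopped
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=${SPARK_WORKER_MEMORY:-2g}
      - SPARK_WORKER_CORES=${SPARK_WORKER_CORES:-2}
    depends_on:
      - spark-master
    networks:
      - spark-network
EOF

# Recreate the stack; the new worker should appear in the master UI at :8080
docker compose up -d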
Quick Start
terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  spark-master:
    image: bitnami/spark:latest
    container_name: spark-master
    restart: unless-stopped
    environment:
      - SPARK_MODE=master
      - SPARK_MASTER_HOST=spark-master
      - SPARK_MASTER_PORT=${SPARK_MASTER_PORT:-7077}
      - SPARK_MASTER_WEBUI_PORT=${SPARK_WEBUI_PORT:-8080}
    ports:
      - "${SPARK_MASTER_PORT:-7077}:7077"
      - "${SPARK_WEBUI_PORT:-8080}:8080"
    volumes:
      - spark_data:/opt/bitnami/spark
    networks:
      - spark-network

  spark-worker-1:
    image: bitnami/spark:latest
    container_name: spark-worker-1
    restart: unless-stopped
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=${SPARK_WORKER_MEMORY:-2g}
      - SPARK_WORKER_CORES=${SPARK_WORKER_CORES:-2}
    depends_on:
      - spark-master
    networks:
      - spark-network

  spark-worker-2:
    image: bitnami/spark:latest
    container_name: spark-worker-2
    restart: unless-stopped
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=${SPARK_WORKER_MEMORY:-2g}
      - SPARK_WORKER_CORES=${SPARK_WORKER_CORES:-2}
    depends_on:
      - spark-master
    networks:
      - spark-network

volumes:
  spark_data:

networks:
  spark-network:
    driver: bridge
EOF

# 2. Create the .env file
cat > .env << 'EOF'
# Apache Spark Cluster
SPARK_MASTER_PORT=7077
SPARK_WEBUI_PORT=8080
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_CORES=2
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f

One-Liner
Run this command to download and set up the recipe in one step:
terminal
curl -fsSL https://docker.recipes/api/recipes/apache-spark-standalone/run | bash

Troubleshooting
- java.lang.OutOfMemoryError in worker containers: Increase SPARK_WORKER_MEMORY environment variable or reduce SPARK_WORKER_CORES to balance memory per core allocation
- Workers failing to connect to master with 'Connection refused' errors: Verify the spark-master container is fully started before workers attempt to connect; add health checks or startup delays (a sketch follows this list)
- Spark jobs hanging in RUNNING state indefinitely: Check executor memory settings and ensure sufficient resources are available; you may need to increase driver memory or reduce the number of executor instances
- ClassNotFoundException when submitting custom applications: Mount application JAR files to /opt/bitnami/spark/jars directory in all containers or use spark-submit with --jars parameter
- Web UI showing 'Application Not Found' errors: Ensure the Spark History Server is configured with shared storage, or check that the application completed successfully without exceptions
- Serialization errors with custom objects: Verify all custom classes implement Serializable interface and are available on classpath of all worker nodes
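For the 'Connection refused' item above, one option is to give spark-master a health check and then switch the workers' depends_on entries in docker-compose.yml to the long form with condition: service_healthy. The probe below is only a sketch: it uses bash's /dev/tcp feature to test the master RPC port from inside the container, and the timings are arbitrary; if you also created the extra-worker override earlier, merge both into a single override file.
terminal
# Sketch: add a health check to the master so its readiness can gate the workers
cat > docker-compose.override.yml << 'EOF'
services:
  spark-master:
    healthcheck:
      # Succeeds once the master is listening on its RPC port (7077)
      test: ["CMD", "bash", "-c", "exec 3<>/dev/tcp/localhost/7077"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 20s
EOF

docker compose up -d

With the health check in place, each worker's depends_on in docker-compose.yml can be changed from the short list form to the map form with condition: service_healthy, so Compose waits for the master to report healthy before starting the workers.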
Components
spark-master, spark-worker, spark-history
Tags
#spark #big-data #distributed #analytics #etl
Category
Database Stacks