Jupyter Data Science Lab
Complete data science environment with JupyterLab, Spark, and databases.
Overview
JupyterLab is an interactive computing platform that revolutionized data science workflows through web-based notebooks combining live code, equations, visualizations, and narrative text. Born from Project Jupyter (which spun off from IPython in 2014), it has become the de facto standard for data exploration, machine learning prototyping, and research reproducibility, supporting multiple programming languages through kernels while providing rich output for plots, tables, and HTML content.
This stack integrates JupyterLab with Apache Spark for distributed computing, PostgreSQL for robust data warehousing, MinIO for S3-compatible object storage, and MLflow for machine learning lifecycle management. Together they form a complete analytics pipeline: data scientists ingest raw data into MinIO, process it with Spark's distributed engine, store structured results in PostgreSQL, experiment in JupyterLab notebooks, and track ML runs through MLflow's versioning system.
It serves data science teams, ML engineers, and research organizations that need a self-hosted alternative to cloud platforms such as Databricks or AWS SageMaker, providing experiment tracking, model versioning, and large-scale data processing while keeping complete control over sensitive datasets and proprietary algorithms.
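In practice the pipeline is driven from a notebook. Below is a minimal sketch of wiring PySpark to the cluster and reading from MinIO; the hadoop-aws coordinates and the bucket/object names are assumptions, and the client and cluster Spark versions must match.
python
# A minimal sketch, run from a JupyterLab notebook on this stack. The
# hadoop-aws version and bucket/object names are assumptions; the client
# and cluster Spark versions must also match.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pipeline-sketch")
    .master("spark://spark-master:7077")  # Spark master's service name on datascience_net
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")  # MinIO inside the network
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")             # MINIO_USER from .env
    .config("spark.hadoop.fs.s3a.secret.key", "secure_minio_password")  # MINIO_PASSWORD from .env
    .config("spark.hadoop.fs.s3a.path.style.access", "true")  # MinIO needs path-style URLs
    .getOrCreate()
)

# Pull raw data from the MinIO data lake and run a distributed aggregation.
df = spark.read.csv("s3a://raw-data/events.csv", header=True, inferSchema=True)
df.groupBy("event_type").count().show()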
Key Features
- Interactive JupyterLab notebooks with Python, R, and Scala kernels pre-configured for Spark integration
- Distributed Apache Spark processing with master-worker architecture for big data analytics
- PostgreSQL data warehouse with ACID compliance and JSON support for structured analytics
- S3-compatible MinIO object storage with erasure coding and versioning for data lake operations
- MLflow experiment tracking with automatic logging of parameters, metrics, and model artifacts (a tracking sketch follows this list)
- Pre-built Spark connectivity in notebooks using the all-spark-notebook Docker image
- Integrated storage backend linking MLflow artifacts to MinIO and metadata to PostgreSQL
- Spark Web UI monitoring and cluster management through dedicated master node interface
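As a sketch of what that tracking looks like from a notebook (the experiment name and toy model are illustrative; install mlflow into the notebook environment first if it is not present):
python
# A hedged sketch, run from a JupyterLab notebook on this stack. The
# experiment name and toy model are illustrative; pip install mlflow
# first if it is missing from the notebook environment.
import os
import mlflow
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# Artifact credentials, used when the experiment's artifact root points
# directly at MinIO (values mirror the .env template).
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://minio:9000"
os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
os.environ["AWS_SECRET_ACCESS_KEY"] = "secure_minio_password"

mlflow.set_tracking_uri("http://mlflow:5000")  # MLflow's service name on datascience_net
mlflow.set_experiment("demo")
mlflow.sklearn.autolog()  # logs params, metrics, and the fitted model automatically

X, y = load_diabetes(return_X_y=True)
with mlflow.start_run():
    RandomForestRegressor(n_estimators=50).fit(X, y)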
Common Use Cases
- Machine learning model development with automated experiment tracking and version control
- Big data analytics processing terabyte-scale datasets across distributed Spark clusters
- Financial modeling and risk analysis requiring ACID-compliant transaction processing
- Research data pipelines combining structured PostgreSQL data with unstructured MinIO objects
- Data science education environments providing students with industry-standard toolchains
- Regulatory compliance scenarios requiring on-premises ML model governance and auditing
- Multi-team data science organizations needing centralized experiment management and collaboration
Prerequisites
- Minimum 6GB RAM for full stack operation (2GB JupyterLab, 2GB Spark, 1GB PostgreSQL, 1GB MinIO/MLflow)
- Docker Engine 20.10+ and Docker Compose V2 for modern compose file syntax support
- Available ports 5000, 7077, 8081, 8888, 9000, and 9001 (7077 is the Spark master's RPC endpoint; the rest serve web UIs and APIs)
- Basic understanding of Spark DataFrames and SQL for effective data processing workflows
- Familiarity with Python data science libraries (pandas, scikit-learn, matplotlib) for notebook development
- Knowledge of S3 API concepts for MinIO bucket and object management operations
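For that last point, a short boto3 sketch of typical MinIO bucket and object operations; the bucket and file names are placeholders, and boto3 may need installing first.
python
# A short sketch of the S3 operations MinIO expects. Bucket and file names
# are placeholders; pip install boto3 first if it is missing.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",  # use http://localhost:9000 from the host
    aws_access_key_id="minioadmin",                 # MINIO_USER from .env
    aws_secret_access_key="secure_minio_password",  # MINIO_PASSWORD from .env
)

s3.create_bucket(Bucket="raw-data")                      # bucket for the data lake
s3.upload_file("events.csv", "raw-data", "events.csv")   # local file -> object
print([b["Name"] for b in s3.list_buckets()["Buckets"]])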
For development & testing. Review security settings, change default credentials, and test thoroughly before production use.
docker-compose.yml
services:
  jupyterlab:
    image: jupyter/all-spark-notebook:latest
    ports:
      - "8888:8888"
    environment:
      - JUPYTER_TOKEN=${JUPYTER_TOKEN}
    volumes:
      - ./notebooks:/home/jovyan/work
      - jupyter_data:/home/jovyan/.local
    networks:
      - datascience_net

  spark-master:
    image: bitnami/spark:latest
    ports:
      - "7077:7077"
      - "8081:8080"
    environment:
      - SPARK_MODE=master
    networks:
      - datascience_net

  spark-worker:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
    depends_on:
      - spark-master
    networks:
      - datascience_net

  postgres:
    image: postgres:15-alpine
    environment:
      - POSTGRES_USER=datascience
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=datawarehouse
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - datascience_net

  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      - MINIO_ROOT_USER=${MINIO_USER}
      - MINIO_ROOT_PASSWORD=${MINIO_PASSWORD}
    volumes:
      - minio_data:/data
    networks:
      - datascience_net

  mlflow:
    image: ghcr.io/mlflow/mlflow:latest
    ports:
      - "5000:5000"
    environment:
      - MLFLOW_S3_ENDPOINT_URL=http://minio:9000
      - AWS_ACCESS_KEY_ID=${MINIO_USER}
      - AWS_SECRET_ACCESS_KEY=${MINIO_PASSWORD}
    command: mlflow server --host 0.0.0.0 --backend-store-uri postgresql://datascience:${POSTGRES_PASSWORD}@postgres:5432/datawarehouse
    depends_on:
      - postgres
      - minio
    networks:
      - datascience_net

volumes:
  jupyter_data:
  postgres_data:
  minio_data:

networks:
  datascience_net:
.env Template
.env
# Data Science Lab
JUPYTER_TOKEN=secure_jupyter_token
POSTGRES_PASSWORD=secure_postgres_password
MINIO_USER=minioadmin
MINIO_PASSWORD=secure_minio_password

# JupyterLab at http://localhost:8888
# Spark UI at http://localhost:8081
# MLflow at http://localhost:5000
Usage Notes
- JupyterLab at http://localhost:8888
- Spark Master UI at http://localhost:8081
- MLflow tracking at http://localhost:5000
- MinIO S3 API at http://localhost:9000, web console at http://localhost:9001
- Spark-ready notebooks (Python, R, and Scala kernels) included via the jupyter/all-spark-notebook image
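Inside the notebooks, reach services by their compose hostnames (postgres, minio, mlflow, spark-master) rather than localhost. A minimal sketch of querying the PostgreSQL warehouse with pandas, assuming sqlalchemy and psycopg2-binary are installed in the notebook environment and using the template credentials from .env:
python
# A hedged sketch of reading/writing the warehouse with pandas. Assumes
# sqlalchemy and psycopg2-binary are installed in the notebook environment;
# credentials mirror the .env template.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://datascience:secure_postgres_password@postgres:5432/datawarehouse"
)

# Write a small frame, then read it back.
pd.DataFrame({"metric": ["auc", "f1"], "value": [0.91, 0.84]}).to_sql(
    "model_scores", engine, if_exists="replace", index=False
)
print(pd.read_sql("SELECT * FROM model_scores", engine))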
Individual Services (6 services)
Copy individual services to mix and match with your existing compose files.
jupyterlab
jupyterlab:
image: jupyter/all-spark-notebook:latest
ports:
- "8888:8888"
environment:
- JUPYTER_TOKEN=${JUPYTER_TOKEN}
volumes:
- ./notebooks:/home/jovyan/work
- jupyter_data:/home/jovyan/.local
networks:
- datascience_net
spark-master
spark-master:
image: bitnami/spark:latest
ports:
- "7077:7077"
- "8081:8080"
environment:
- SPARK_MODE=master
networks:
- datascience_net
spark-worker
spark-worker:
image: bitnami/spark:latest
environment:
- SPARK_MODE=worker
- SPARK_MASTER_URL=spark://spark-master:7077
depends_on:
- spark-master
networks:
- datascience_net
postgres
postgres:
image: postgres:15-alpine
environment:
- POSTGRES_USER=datascience
- POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
- POSTGRES_DB=datawarehouse
volumes:
- postgres_data:/var/lib/postgresql/data
networks:
- datascience_net
minio
minio:
image: minio/minio:latest
command: server /data --console-address ":9001"
ports:
- "9000:9000"
- "9001:9001"
environment:
- MINIO_ROOT_USER=${MINIO_USER}
- MINIO_ROOT_PASSWORD=${MINIO_PASSWORD}
volumes:
- minio_data:/data
networks:
- datascience_net
mlflow
mlflow:
image: ghcr.io/mlflow/mlflow:latest
ports:
- "5000:5000"
environment:
- MLFLOW_S3_ENDPOINT_URL=http://minio:9000
- AWS_ACCESS_KEY_ID=${MINIO_USER}
- AWS_SECRET_ACCESS_KEY=${MINIO_PASSWORD}
command: mlflow server --host 0.0.0.0 --backend-store-uri postgresql://datascience:${POSTGRES_PASSWORD}@postgres:5432/datawarehouse
depends_on:
- postgres
- minio
networks:
- datascience_net
Quick Start
terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  jupyterlab:
    image: jupyter/all-spark-notebook:latest
    ports:
      - "8888:8888"
    environment:
      - JUPYTER_TOKEN=${JUPYTER_TOKEN}
    volumes:
      - ./notebooks:/home/jovyan/work
      - jupyter_data:/home/jovyan/.local
    networks:
      - datascience_net

  spark-master:
    image: bitnami/spark:latest
    ports:
      - "7077:7077"
      - "8081:8080"
    environment:
      - SPARK_MODE=master
    networks:
      - datascience_net

  spark-worker:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
    depends_on:
      - spark-master
    networks:
      - datascience_net

  postgres:
    image: postgres:15-alpine
    environment:
      - POSTGRES_USER=datascience
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=datawarehouse
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - datascience_net

  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      - MINIO_ROOT_USER=${MINIO_USER}
      - MINIO_ROOT_PASSWORD=${MINIO_PASSWORD}
    volumes:
      - minio_data:/data
    networks:
      - datascience_net

  mlflow:
    image: ghcr.io/mlflow/mlflow:latest
    ports:
      - "5000:5000"
    environment:
      - MLFLOW_S3_ENDPOINT_URL=http://minio:9000
      - AWS_ACCESS_KEY_ID=${MINIO_USER}
      - AWS_SECRET_ACCESS_KEY=${MINIO_PASSWORD}
    command: mlflow server --host 0.0.0.0 --backend-store-uri postgresql://datascience:${POSTGRES_PASSWORD}@postgres:5432/datawarehouse
    depends_on:
      - postgres
      - minio
    networks:
      - datascience_net

volumes:
  jupyter_data:
  postgres_data:
  minio_data:

networks:
  datascience_net:
EOF

# 2. Create the .env file
cat > .env << 'EOF'
# Data Science Lab
JUPYTER_TOKEN=secure_jupyter_token
POSTGRES_PASSWORD=secure_postgres_password
MINIO_USER=minioadmin
MINIO_PASSWORD=secure_minio_password

# JupyterLab at http://localhost:8888
# Spark UI at http://localhost:8081
# MLflow at http://localhost:5000
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f
One-Liner
Run this command to download and set up the recipe in one step:
terminal
curl -fsSL https://docker.recipes/api/recipes/jupyter-datascience/run | bash
Troubleshooting
- JupyterLab shows 'Kernel not ready' for Spark: Increase Docker memory allocation and verify spark-master container is running
- MLflow experiments not saving: Check the PostgreSQL connection string format and ensure the 'datawarehouse' database exists; note that the stock MLflow image may not bundle a PostgreSQL driver, so installing psycopg2-binary into the image may be needed
- Spark jobs failing with memory errors: Reduce spark.executor.memory in notebook configuration or add more worker containers
- MinIO console inaccessible: Verify MINIO_ROOT_USER and MINIO_ROOT_PASSWORD environment variables are set correctly
- PostgreSQL connection refused: Wait 30-60 seconds after startup for database initialization to complete
- MLflow artifacts not uploading to MinIO: Confirm AWS_ACCESS_KEY_ID matches MINIO_ROOT_USER in environment variables, and check that the experiment's artifact location actually points at a MinIO bucket
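If several of these symptoms appear at once, a quick connectivity probe from inside the jupyterlab container narrows things down; this helper is illustrative, not part of the recipe.
python
# Illustrative helper, not part of the recipe: probe each backing service
# from inside the jupyterlab container to see which ones are reachable.
import socket

for host, port in [("spark-master", 7077), ("postgres", 5432),
                   ("minio", 9000), ("mlflow", 5000)]:
    try:
        socket.create_connection((host, port), timeout=3).close()
        print(f"{host}:{port} reachable")
    except OSError as exc:
        print(f"{host}:{port} NOT reachable ({exc})")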
Components
jupyterlab, spark, postgresql, minio, mlflow
Tags
#jupyter #datascience #spark #python #ml
Category
Development Tools