
Jupyter Data Science Lab

intermediate

Complete data science environment with JupyterLab, Spark, and databases.

Overview

JupyterLab is an interactive computing platform that revolutionized data science workflows through web-based notebooks combining live code, equations, visualizations, and narrative text. An outgrowth of Project Jupyter (founded in 2014), it has become the de facto standard for data exploration, machine learning prototyping, and reproducible research, supporting multiple programming languages through kernels while providing rich output for plots, tables, and HTML content.

This stack integrates JupyterLab with Apache Spark for distributed computing, PostgreSQL for data warehousing, MinIO for S3-compatible object storage, and MLflow for machine learning lifecycle management. Together they form a complete analytics pipeline: ingest raw data into MinIO, process it with Spark's distributed engine, store structured results in PostgreSQL, experiment in JupyterLab notebooks, and track ML runs through MLflow's versioning system.

The stack serves data science teams, ML engineers, and research organizations that need a self-hosted alternative to cloud platforms such as Databricks or AWS SageMaker, providing experiment tracking, model versioning, and large-scale data processing while keeping complete control over sensitive datasets and proprietary algorithms.
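
As a first taste of that pipeline, the sketch below opens a Spark session from a notebook running inside this stack. It is a minimal sketch, assuming the service names defined in the compose file further down; the app name is illustrative, and version skew between the notebook's PySpark and the cluster can break the connection.

python
# Run inside a JupyterLab notebook; the all-spark-notebook image ships PySpark.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")  # Spark master by its compose service name
    .appName("lab-smoke-test")            # illustrative app name
    .getOrCreate()
)

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
print(df.count())  # prints 2 if the cluster is reachable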

Key Features

  • Interactive JupyterLab notebooks with Python, R, and Scala kernels pre-configured for Spark integration
  • Distributed Apache Spark processing with master-worker architecture for big data analytics
  • PostgreSQL data warehouse with ACID compliance and JSON support for structured analytics
  • S3-compatible MinIO object storage with erasure coding and versioning for data lake operations
  • MLflow experiment tracking with automatic logging of parameters, metrics, and model artifacts (a tracking sketch follows this list)
  • Pre-built Spark connectivity in notebooks using the all-spark-notebook Docker image
  • Integrated storage backend linking MLflow artifacts to MinIO and metadata to PostgreSQL
  • Spark Web UI monitoring and cluster management through dedicated master node interface
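
To illustrate the tracking feature called out above, here is a hedged sketch of logging one run from a notebook. The experiment name is hypothetical, mlflow may first need a pip install in the notebook image, and for artifacts to reach MinIO see the client-side note under the mlflow service below.

python
# A hedged sketch, not the recipe's own code; the experiment name is hypothetical.
import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

mlflow.set_tracking_uri("http://mlflow:5000")  # service name on the compose network
mlflow.set_experiment("lab-demo")              # hypothetical experiment name

X, y = load_iris(return_X_y=True)
with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_param("max_iter", 200)
    mlflow.log_metric("train_accuracy", model.score(X, y))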

Common Use Cases

  • Machine learning model development with automated experiment tracking and version control
  • Big data analytics processing terabyte-scale datasets across distributed Spark clusters
  • Financial modeling and risk analysis requiring ACID-compliant transaction processing
  • Research data pipelines combining structured PostgreSQL data with unstructured MinIO objects
  • Data science education environments providing students with industry-standard toolchains
  • Regulatory compliance scenarios requiring on-premises ML model governance and auditing
  • Multi-team data science organizations needing centralized experiment management and collaboration

Prerequisites

  • Minimum 6GB RAM for full stack operation (2GB JupyterLab, 2GB Spark, 1GB PostgreSQL, 1GB MinIO/MLflow)
  • Docker Engine 20.10+ and Docker Compose V2 for modern compose file syntax support
  • Available host ports 5000, 7077, 8081, 8888, 9000, and 9001 for service endpoints and web interfaces
  • Basic understanding of Spark DataFrames and SQL for effective data processing workflows
  • Familiarity with Python data science libraries (pandas, scikit-learn, matplotlib) for notebook development
  • Knowledge of S3 API concepts for MinIO bucket and object management operations (a short client sketch follows this list)
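
For the S3 prerequisite above, the sketch below creates a bucket and uploads one object with boto3. The bucket name is hypothetical, the credentials are the .env defaults, and boto3 may need a pip install in the notebook first.

python
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",               # MinIO service on the compose network
    aws_access_key_id="minioadmin",                 # ${MINIO_USER}
    aws_secret_access_key="secure_minio_password",  # ${MINIO_PASSWORD}
)
s3.create_bucket(Bucket="raw-data")                 # hypothetical bucket
s3.put_object(Bucket="raw-data", Key="hello.txt", Body=b"hello lab")
print(s3.list_objects_v2(Bucket="raw-data")["KeyCount"])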

This recipe is intended for development and testing. Review security settings, change default credentials, and test thoroughly before production use.

docker-compose.yml

docker-compose.yml
services:
  jupyterlab:
    image: jupyter/all-spark-notebook:latest
    ports:
      - "8888:8888"
    environment:
      - JUPYTER_TOKEN=${JUPYTER_TOKEN}
    volumes:
      - ./notebooks:/home/jovyan/work
      - jupyter_data:/home/jovyan/.local
    networks:
      - datascience_net

  spark-master:
    image: bitnami/spark:latest
    ports:
      - "7077:7077"
      - "8081:8080"
    environment:
      - SPARK_MODE=master
    networks:
      - datascience_net

  spark-worker:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
    depends_on:
      - spark-master
    networks:
      - datascience_net

  postgres:
    image: postgres:15-alpine
    environment:
      - POSTGRES_USER=datascience
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=datawarehouse
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - datascience_net

  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      - MINIO_ROOT_USER=${MINIO_USER}
      - MINIO_ROOT_PASSWORD=${MINIO_PASSWORD}
    volumes:
      - minio_data:/data
    networks:
      - datascience_net

  mlflow:
    image: ghcr.io/mlflow/mlflow:latest
    ports:
      - "5000:5000"
    environment:
      - MLFLOW_S3_ENDPOINT_URL=http://minio:9000
      - AWS_ACCESS_KEY_ID=${MINIO_USER}
      - AWS_SECRET_ACCESS_KEY=${MINIO_PASSWORD}
    command: mlflow server --host 0.0.0.0 --backend-store-uri postgresql://datascience:${POSTGRES_PASSWORD}@postgres:5432/datawarehouse
    depends_on:
      - postgres
      - minio
    networks:
      - datascience_net

volumes:
  jupyter_data:
  postgres_data:
  minio_data:

networks:
  datascience_net:
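
With the stack running, notebooks can push Spark results into the PostgreSQL warehouse over JDBC. This is a sketch under stated assumptions: the table name is hypothetical, and spark.jars.packages pulls the PostgreSQL driver at session start, which requires outbound network access from the containers.

python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
    .getOrCreate()
)

(
    spark.range(10).withColumnRenamed("id", "value")
    .write.format("jdbc")
    .option("url", "jdbc:postgresql://postgres:5432/datawarehouse")
    .option("dbtable", "demo_results")               # hypothetical table name
    .option("user", "datascience")
    .option("password", "secure_postgres_password")  # ${POSTGRES_PASSWORD}
    .mode("overwrite")
    .save()
)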

.env Template

.env
# Data Science Lab
JUPYTER_TOKEN=secure_jupyter_token
POSTGRES_PASSWORD=secure_postgres_password
MINIO_USER=minioadmin
MINIO_PASSWORD=secure_minio_password

# JupyterLab at http://localhost:8888
# Spark UI at http://localhost:8081
# MLflow at http://localhost:5000
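
Replace the placeholder values before first start. One quick way to generate strong ones (a workflow suggestion, not part of the recipe):

python
# Run once on the host and paste the output into .env.
import secrets

print("JUPYTER_TOKEN=" + secrets.token_hex(32))
print("POSTGRES_PASSWORD=" + secrets.token_urlsafe(24))
print("MINIO_PASSWORD=" + secrets.token_urlsafe(24))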

Usage Notes

  1. JupyterLab at http://localhost:8888
  2. Spark Master UI at http://localhost:8081
  3. MLflow tracking at http://localhost:5000
  4. MinIO for S3-compatible storage (a Spark-to-MinIO read sketch follows this list)
  5. Spark-ready notebooks via the jupyter/all-spark-notebook image
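
As referenced in note 4, Spark can read straight from MinIO through the s3a connector. A sketch assuming a hypothetical bucket, the .env default credentials, and a hadoop-aws version matching the cluster's Hadoop build:

python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    # hadoop-aws must match the images' Hadoop line; 3.3.4 is an assumption
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")             # ${MINIO_USER}
    .config("spark.hadoop.fs.s3a.secret.key", "secure_minio_password")  # ${MINIO_PASSWORD}
    .config("spark.hadoop.fs.s3a.path.style.access", "true")            # required for MinIO
    .getOrCreate()
)

df = spark.read.csv("s3a://raw-data/hello.csv", header=True)  # hypothetical object
df.show()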

Individual Services (6 services)

Copy individual services to mix and match with your existing compose files.

jupyterlab
jupyterlab:
  image: jupyter/all-spark-notebook:latest
  ports:
    - "8888:8888"
  environment:
    - JUPYTER_TOKEN=${JUPYTER_TOKEN}
  volumes:
    - ./notebooks:/home/jovyan/work
    - jupyter_data:/home/jovyan/.local
  networks:
    - datascience_net
spark-master
spark-master:
  image: bitnami/spark:latest
  ports:
    - "7077:7077"
    - "8081:8080"
  environment:
    - SPARK_MODE=master
  networks:
    - datascience_net
spark-worker
spark-worker:
  image: bitnami/spark:latest
  environment:
    - SPARK_MODE=worker
    - SPARK_MASTER_URL=spark://spark-master:7077
  depends_on:
    - spark-master
  networks:
    - datascience_net
postgres
postgres:
  image: postgres:15-alpine
  environment:
    - POSTGRES_USER=datascience
    - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    - POSTGRES_DB=datawarehouse
  volumes:
    - postgres_data:/var/lib/postgresql/data
  networks:
    - datascience_net
minio
minio:
  image: minio/minio:latest
  command: server /data --console-address ":9001"
  ports:
    - "9000:9000"
    - "9001:9001"
  environment:
    - MINIO_ROOT_USER=${MINIO_USER}
    - MINIO_ROOT_PASSWORD=${MINIO_PASSWORD}
  volumes:
    - minio_data:/data
  networks:
    - datascience_net
mlflow
mlflow:
  image: ghcr.io/mlflow/mlflow:latest
  ports:
    - "5000:5000"
  environment:
    - MLFLOW_S3_ENDPOINT_URL=http://minio:9000
    - AWS_ACCESS_KEY_ID=${MINIO_USER}
    - AWS_SECRET_ACCESS_KEY=${MINIO_PASSWORD}
  command: mlflow server --host 0.0.0.0 --backend-store-uri postgresql://datascience:${POSTGRES_PASSWORD}@postgres:5432/datawarehouse
  depends_on:
    - postgres
    - minio
  networks:
    - datascience_net
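
One subtlety with this service: depending on the MLflow version and server flags, artifact uploads may be written by the client directly to the store rather than proxied through the server. In that case the notebook needs the same S3 settings the server has; a hedged in-session sketch:

python
# Values are the .env defaults and must match the server's environment.
import os

os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://minio:9000"
os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"                 # ${MINIO_USER}
os.environ["AWS_SECRET_ACCESS_KEY"] = "secure_minio_password"  # ${MINIO_PASSWORD}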

Quick Start

terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  jupyterlab:
    image: jupyter/all-spark-notebook:latest
    ports:
      - "8888:8888"
    environment:
      - JUPYTER_TOKEN=${JUPYTER_TOKEN}
    volumes:
      - ./notebooks:/home/jovyan/work
      - jupyter_data:/home/jovyan/.local
    networks:
      - datascience_net

  spark-master:
    image: bitnami/spark:latest
    ports:
      - "7077:7077"
      - "8081:8080"
    environment:
      - SPARK_MODE=master
    networks:
      - datascience_net

  spark-worker:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
    depends_on:
      - spark-master
    networks:
      - datascience_net

  postgres:
    image: postgres:15-alpine
    environment:
      - POSTGRES_USER=datascience
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=datawarehouse
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - datascience_net

  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      - MINIO_ROOT_USER=${MINIO_USER}
      - MINIO_ROOT_PASSWORD=${MINIO_PASSWORD}
    volumes:
      - minio_data:/data
    networks:
      - datascience_net

  mlflow:
    image: ghcr.io/mlflow/mlflow:latest
    ports:
      - "5000:5000"
    environment:
      - MLFLOW_S3_ENDPOINT_URL=http://minio:9000
      - AWS_ACCESS_KEY_ID=${MINIO_USER}
      - AWS_SECRET_ACCESS_KEY=${MINIO_PASSWORD}
    command: mlflow server --host 0.0.0.0 --backend-store-uri postgresql://datascience:${POSTGRES_PASSWORD}@postgres:5432/datawarehouse
    depends_on:
      - postgres
      - minio
    networks:
      - datascience_net

volumes:
  jupyter_data:
  postgres_data:
  minio_data:

networks:
  datascience_net:
EOF

# 2. Create the .env file
cat > .env << 'EOF'
# Data Science Lab
JUPYTER_TOKEN=secure_jupyter_token
POSTGRES_PASSWORD=secure_postgres_password
MINIO_USER=minioadmin
MINIO_PASSWORD=secure_minio_password

# JupyterLab at http://localhost:8888
# Spark UI at http://localhost:8081
# MLflow at http://localhost:5000
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f
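
The services take a little while after step 3 to come up. An optional host-side readiness check, assuming the default port mappings:

python
# Polls each web UI until it answers; an HTTP error page still counts as "up".
import time
import urllib.error
import urllib.request

SERVICES = [
    ("JupyterLab", "http://localhost:8888"),
    ("Spark UI", "http://localhost:8081"),
    ("MLflow", "http://localhost:5000"),
    ("MinIO console", "http://localhost:9001"),
]

for name, url in SERVICES:
    for _ in range(30):
        try:
            urllib.request.urlopen(url, timeout=2)
            print(f"{name}: up")
            break
        except urllib.error.HTTPError:
            print(f"{name}: up")  # server answered, just not with 2xx
            break
        except OSError:
            time.sleep(2)
    else:
        print(f"{name}: not responding yet")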

One-Liner

Run this command to download and set up the recipe in one step:

terminal
curl -fsSL https://docker.recipes/api/recipes/jupyter-datascience/run | bash

Troubleshooting

  • JupyterLab shows 'Kernel not ready' for Spark: Increase Docker memory allocation and verify spark-master container is running
  • MLflow experiments not saving: Check PostgreSQL connection string format and ensure database 'datawarehouse' exists
  • Spark jobs failing with memory errors: Reduce spark.executor.memory in the notebook's session config (see the snippet after this list) or scale out workers with docker compose up -d --scale spark-worker=3
  • MinIO console inaccessible: Verify MINIO_ROOT_USER and MINIO_ROOT_PASSWORD environment variables are set correctly
  • PostgreSQL connection refused: Wait 30-60 seconds after startup for database initialization to complete
  • MLflow artifacts not uploading to MinIO: Confirm AWS_ACCESS_KEY_ID matches MINIO_ROOT_USER in environment variables
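
For the memory item above, one way to cap executor memory from the notebook session; the values are illustrative and should stay under each worker container's limit:

python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .config("spark.executor.memory", "1g")  # illustrative; size to your workers
    .config("spark.driver.memory", "1g")
    .getOrCreate()
)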

