
Jupyter Data Science Lab

intermediate

Complete data science environment with JupyterLab, Spark, and databases.

Overview

JupyterLab is an interactive computing platform that revolutionized data science workflows through web-based notebooks combining live code, equations, visualizations, and narrative text. An outgrowth of Project Jupyter (founded in 2014), it has become the de facto standard for data exploration, machine learning prototyping, and reproducible research, supporting multiple programming languages through kernels while providing rich output for plots, tables, and HTML content.

This stack integrates JupyterLab with Apache Spark for distributed computing, PostgreSQL for data warehousing, MinIO for S3-compatible object storage, and MLflow for machine learning lifecycle management. Together they form a complete analytics pipeline: ingest raw data into MinIO, process it with Spark's distributed engine, store structured results in PostgreSQL, experiment in JupyterLab notebooks, and track ML runs through MLflow's versioning system.

The stack serves data science teams, ML engineers, and research organizations that need a self-hosted alternative to cloud platforms such as Databricks or AWS SageMaker, providing experiment tracking, model versioning, and large-scale data processing while keeping complete control over sensitive datasets and proprietary algorithms.
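
As a first taste of that pipeline, the sketch below opens a Spark session from a notebook running inside this stack. It is a minimal sketch, assuming the service names defined in the compose file further down; the app name is illustrative, and version skew between the notebook's PySpark and the cluster can break the connection.

python
# Run inside a JupyterLab notebook; the all-spark-notebook image ships PySpark.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")  # Spark master by its compose service name
    .appName("lab-smoke-test")            # illustrative app name
    .getOrCreate()
)

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
print(df.count())  # prints 2 if the cluster is reachable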

Key Features

  • Interactive JupyterLab notebooks with Python, R, and Scala kernels pre-configured for Spark integration
  • Distributed Apache Spark processing with master-worker architecture for big data analytics
  • PostgreSQL data warehouse with ACID compliance and JSON support for structured analytics
  • S3-compatible MinIO object storage with erasure coding and versioning for data lake operations
  • MLflow experiment tracking with automatic logging of parameters, metrics, and model artifacts (a tracking sketch follows this list)
  • Pre-built Spark connectivity in notebooks using the all-spark-notebook Docker image
  • Integrated storage backend linking MLflow artifacts to MinIO and metadata to PostgreSQL
  • Spark Web UI monitoring and cluster management through dedicated master node interface
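
To illustrate the tracking feature called out above, here is a hedged sketch of logging one run from a notebook. The experiment name is hypothetical, mlflow may first need a pip install in the notebook image, and for artifacts to reach MinIO see the client-side note under the mlflow service below.

python
# A hedged sketch, not the recipe's own code; the experiment name is hypothetical.
import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

mlflow.set_tracking_uri("http://mlflow:5000")  # service name on the compose network
mlflow.set_experiment("lab-demo")              # hypothetical experiment name

X, y = load_iris(return_X_y=True)
with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_param("max_iter", 200)
    mlflow.log_metric("train_accuracy", model.score(X, y))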

Common Use Cases

  • Machine learning model development with automated experiment tracking and version control
  • Big data analytics processing terabyte-scale datasets across distributed Spark clusters
  • Financial modeling and risk analysis requiring ACID-compliant transaction processing
  • Research data pipelines combining structured PostgreSQL data with unstructured MinIO objects
  • Data science education environments providing students with industry-standard toolchains
  • Regulatory compliance scenarios requiring on-premises ML model governance and auditing
  • Multi-team data science organizations needing centralized experiment management and collaboration

Prerequisites

  • Minimum 6GB RAM for full stack operation (2GB JupyterLab, 2GB Spark, 1GB PostgreSQL, 1GB MinIO/MLflow)
  • Docker Engine 20.10+ and Docker Compose V2 for modern compose file syntax support
  • Available host ports 5000, 7077, 8081, 8888, 9000, and 9001 for service endpoints and web interfaces
  • Basic understanding of Spark DataFrames and SQL for effective data processing workflows
  • Familiarity with Python data science libraries (pandas, scikit-learn, matplotlib) for notebook development
  • Knowledge of S3 API concepts for MinIO bucket and object management operations (a short client sketch follows this list)
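
For the S3 prerequisite above, the sketch below creates a bucket and uploads one object with boto3. The bucket name is hypothetical, the credentials are the .env defaults, and boto3 may need a pip install in the notebook first.

python
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",               # MinIO service on the compose network
    aws_access_key_id="minioadmin",                 # ${MINIO_USER}
    aws_secret_access_key="secure_minio_password",  # ${MINIO_PASSWORD}
)
s3.create_bucket(Bucket="raw-data")                 # hypothetical bucket
s3.put_object(Bucket="raw-data", Key="hello.txt", Body=b"hello lab")
print(s3.list_objects_v2(Bucket="raw-data")["KeyCount"])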

This recipe is intended for development and testing. Review security settings, change default credentials, and test thoroughly before production use.

docker-compose.yml

docker-compose.yml
services:
  jupyterlab:
    image: jupyter/all-spark-notebook:latest
    ports:
      - "8888:8888"
    environment:
      - JUPYTER_TOKEN=${JUPYTER_TOKEN}
    volumes:
      - ./notebooks:/home/jovyan/work
      - jupyter_data:/home/jovyan/.local
    networks:
      - datascience_net

  spark-master:
    image: bitnami/spark:latest
    ports:
      - "7077:7077"
      - "8081:8080"
    environment:
      - SPARK_MODE=master
    networks:
      - datascience_net

  spark-worker:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
    depends_on:
      - spark-master
    networks:
      - datascience_net

  postgres:
    image: postgres:15-alpine
    environment:
      - POSTGRES_USER=datascience
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=datawarehouse
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - datascience_net

  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      - MINIO_ROOT_USER=${MINIO_USER}
      - MINIO_ROOT_PASSWORD=${MINIO_PASSWORD}
    volumes:
      - minio_data:/data
    networks:
      - datascience_net

  mlflow:
    image: ghcr.io/mlflow/mlflow:latest
    ports:
      - "5000:5000"
    environment:
      - MLFLOW_S3_ENDPOINT_URL=http://minio:9000
      - AWS_ACCESS_KEY_ID=${MINIO_USER}
      - AWS_SECRET_ACCESS_KEY=${MINIO_PASSWORD}
    command: mlflow server --host 0.0.0.0 --backend-store-uri postgresql://datascience:${POSTGRES_PASSWORD}@postgres:5432/datawarehouse
    depends_on:
      - postgres
      - minio
    networks:
      - datascience_net

volumes:
  jupyter_data:
  postgres_data:
  minio_data:

networks:
  datascience_net:
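
With the stack running, notebooks can push Spark results into the PostgreSQL warehouse over JDBC. This is a sketch under stated assumptions: the table name is hypothetical, and spark.jars.packages pulls the PostgreSQL driver at session start, which requires outbound network access from the containers.

python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
    .getOrCreate()
)

(
    spark.range(10).withColumnRenamed("id", "value")
    .write.format("jdbc")
    .option("url", "jdbc:postgresql://postgres:5432/datawarehouse")
    .option("dbtable", "demo_results")               # hypothetical table name
    .option("user", "datascience")
    .option("password", "secure_postgres_password")  # ${POSTGRES_PASSWORD}
    .mode("overwrite")
    .save()
)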

.env Template

.env
# Data Science Lab
JUPYTER_TOKEN=secure_jupyter_token
POSTGRES_PASSWORD=secure_postgres_password
MINIO_USER=minioadmin
MINIO_PASSWORD=secure_minio_password

# JupyterLab at http://localhost:8888
# Spark UI at http://localhost:8081
# MLflow at http://localhost:5000
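
Replace the placeholder values before first start. One quick way to generate strong ones (a workflow suggestion, not part of the recipe):

python
# Run once on the host and paste the output into .env.
import secrets

print("JUPYTER_TOKEN=" + secrets.token_hex(32))
print("POSTGRES_PASSWORD=" + secrets.token_urlsafe(24))
print("MINIO_PASSWORD=" + secrets.token_urlsafe(24))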

Usage Notes

  1. JupyterLab at http://localhost:8888
  2. Spark Master UI at http://localhost:8081
  3. MLflow tracking at http://localhost:5000
  4. MinIO for S3-compatible storage (a Spark-to-MinIO read sketch follows this list)
  5. Spark-ready notebooks via the jupyter/all-spark-notebook image
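
As referenced in note 4, Spark can read straight from MinIO through the s3a connector. A sketch assuming a hypothetical bucket, the .env default credentials, and a hadoop-aws version matching the cluster's Hadoop build:

python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    # hadoop-aws must match the images' Hadoop line; 3.3.4 is an assumption
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")             # ${MINIO_USER}
    .config("spark.hadoop.fs.s3a.secret.key", "secure_minio_password")  # ${MINIO_PASSWORD}
    .config("spark.hadoop.fs.s3a.path.style.access", "true")            # required for MinIO
    .getOrCreate()
)

df = spark.read.csv("s3a://raw-data/hello.csv", header=True)  # hypothetical object
df.show()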

Individual Services (6 services)

Copy individual services to mix and match with your existing compose files.

jupyterlab
jupyterlab:
  image: jupyter/all-spark-notebook:latest
  ports:
    - "8888:8888"
  environment:
    - JUPYTER_TOKEN=${JUPYTER_TOKEN}
  volumes:
    - ./notebooks:/home/jovyan/work
    - jupyter_data:/home/jovyan/.local
  networks:
    - datascience_net
spark-master
spark-master:
  image: bitnami/spark:latest
  ports:
    - "7077:7077"
    - "8081:8080"
  environment:
    - SPARK_MODE=master
  networks:
    - datascience_net
spark-worker
spark-worker:
  image: bitnami/spark:latest
  environment:
    - SPARK_MODE=worker
    - SPARK_MASTER_URL=spark://spark-master:7077
  depends_on:
    - spark-master
  networks:
    - datascience_net
postgres
postgres:
  image: postgres:15-alpine
  environment:
    - POSTGRES_USER=datascience
    - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
    - POSTGRES_DB=datawarehouse
  volumes:
    - postgres_data:/var/lib/postgresql/data
  networks:
    - datascience_net
minio
minio:
  image: minio/minio:latest
  command: server /data --console-address ":9001"
  ports:
    - "9000:9000"
    - "9001:9001"
  environment:
    - MINIO_ROOT_USER=${MINIO_USER}
    - MINIO_ROOT_PASSWORD=${MINIO_PASSWORD}
  volumes:
    - minio_data:/data
  networks:
    - datascience_net
mlflow
mlflow:
  image: ghcr.io/mlflow/mlflow:latest
  ports:
    - "5000:5000"
  environment:
    - MLFLOW_S3_ENDPOINT_URL=http://minio:9000
    - AWS_ACCESS_KEY_ID=${MINIO_USER}
    - AWS_SECRET_ACCESS_KEY=${MINIO_PASSWORD}
  command: mlflow server --host 0.0.0.0 --backend-store-uri postgresql://datascience:${POSTGRES_PASSWORD}@postgres:5432/datawarehouse
  depends_on:
    - postgres
    - minio
  networks:
    - datascience_net
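
One subtlety with this service: depending on the MLflow version and server flags, artifact uploads may be written by the client directly to the store rather than proxied through the server. In that case the notebook needs the same S3 settings the server has; a hedged in-session sketch:

python
# Values are the .env defaults and must match the server's environment.
import os

os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://minio:9000"
os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"                 # ${MINIO_USER}
os.environ["AWS_SECRET_ACCESS_KEY"] = "secure_minio_password"  # ${MINIO_PASSWORD}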

Quick Start

terminal
# 1. Create the compose file
cat > docker-compose.yml << 'EOF'
services:
  jupyterlab:
    image: jupyter/all-spark-notebook:latest
    ports:
      - "8888:8888"
    environment:
      - JUPYTER_TOKEN=${JUPYTER_TOKEN}
    volumes:
      - ./notebooks:/home/jovyan/work
      - jupyter_data:/home/jovyan/.local
    networks:
      - datascience_net

  spark-master:
    image: bitnami/spark:latest
    ports:
      - "7077:7077"
      - "8081:8080"
    environment:
      - SPARK_MODE=master
    networks:
      - datascience_net

  spark-worker:
    image: bitnami/spark:latest
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
    depends_on:
      - spark-master
    networks:
      - datascience_net

  postgres:
    image: postgres:15-alpine
    environment:
      - POSTGRES_USER=datascience
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=datawarehouse
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - datascience_net

  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      - MINIO_ROOT_USER=${MINIO_USER}
      - MINIO_ROOT_PASSWORD=${MINIO_PASSWORD}
    volumes:
      - minio_data:/data
    networks:
      - datascience_net

  mlflow:
    image: ghcr.io/mlflow/mlflow:latest
    ports:
      - "5000:5000"
    environment:
      - MLFLOW_S3_ENDPOINT_URL=http://minio:9000
      - AWS_ACCESS_KEY_ID=${MINIO_USER}
      - AWS_SECRET_ACCESS_KEY=${MINIO_PASSWORD}
    command: mlflow server --host 0.0.0.0 --backend-store-uri postgresql://datascience:${POSTGRES_PASSWORD}@postgres:5432/datawarehouse
    depends_on:
      - postgres
      - minio
    networks:
      - datascience_net

volumes:
  jupyter_data:
  postgres_data:
  minio_data:

networks:
  datascience_net:
EOF

# 2. Create the .env file
cat > .env << 'EOF'
# Data Science Lab
JUPYTER_TOKEN=secure_jupyter_token
POSTGRES_PASSWORD=secure_postgres_password
MINIO_USER=minioadmin
MINIO_PASSWORD=secure_minio_password

# JupyterLab at http://localhost:8888
# Spark UI at http://localhost:8081
# MLflow at http://localhost:5000
EOF

# 3. Start the services
docker compose up -d

# 4. View logs
docker compose logs -f
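
The services take a little while after step 3 to come up. An optional host-side readiness check, assuming the default port mappings:

python
# Polls each web UI until it answers; an HTTP error page still counts as "up".
import time
import urllib.error
import urllib.request

SERVICES = [
    ("JupyterLab", "http://localhost:8888"),
    ("Spark UI", "http://localhost:8081"),
    ("MLflow", "http://localhost:5000"),
    ("MinIO console", "http://localhost:9001"),
]

for name, url in SERVICES:
    for _ in range(30):
        try:
            urllib.request.urlopen(url, timeout=2)
            print(f"{name}: up")
            break
        except urllib.error.HTTPError:
            print(f"{name}: up")  # server answered, just not with 2xx
            break
        except OSError:
            time.sleep(2)
    else:
        print(f"{name}: not responding yet")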

One-Liner

Run this command to download and set up the recipe in one step:

terminal
curl -fsSL https://docker.recipes/api/recipes/jupyter-datascience/run | bash

Troubleshooting

  • JupyterLab shows 'Kernel not ready' for Spark: Increase Docker memory allocation and verify spark-master container is running
  • MLflow experiments not saving: Check PostgreSQL connection string format and ensure database 'datawarehouse' exists
  • Spark jobs failing with memory errors: Reduce spark.executor.memory in the notebook's session config (see the snippet after this list) or scale out workers with docker compose up -d --scale spark-worker=3
  • MinIO console inaccessible: Verify MINIO_ROOT_USER and MINIO_ROOT_PASSWORD environment variables are set correctly
  • PostgreSQL connection refused: Wait 30-60 seconds after startup for database initialization to complete
  • MLflow artifacts not uploading to MinIO: Confirm AWS_ACCESS_KEY_ID matches MINIO_ROOT_USER in environment variables
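
For the memory item above, one way to cap executor memory from the notebook session; the values are illustrative and should stay under each worker container's limit:

python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .config("spark.executor.memory", "1g")  # illustrative; size to your workers
    .config("spark.driver.memory", "1g")
    .getOrCreate()
)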

