$docker.recipes
11 min read · Updated February 2026

Going Paperless: Document Management with Paperless-ngx and Docker Compose

Digitize and organize all your documents with Paperless-ngx. OCR, auto-tagging, full-text search — running on Docker Compose with PostgreSQL, Redis, and Tika.

paperless-ngx · document-management · self-hosting · docker-compose

01 Why I Digitized Everything

Three years ago, I had a filing cabinet full of paper: tax returns, insurance policies, medical records, appliance manuals, warranties, receipts. Finding anything meant 15 minutes of digging through folders, and I was always worried about losing something important in a fire or flood.

Today, everything is in Paperless-ngx. I scan a document, drop it in a folder, and Paperless OCRs it, auto-tags it based on content, and makes it searchable. Finding my 2023 tax return takes 5 seconds; finding the warranty for my dishwasher takes 3. I've processed over 2,400 documents and the system has been completely reliable.

The initial scanning took a few weekends (I did it in batches while watching TV), but the ongoing effort is minimal: maybe 5 minutes per week to scan new mail and receipts. The stack is heavier than most self-hosted services because OCR is CPU-intensive, but for what it provides, it's absolutely worth it.

02 Docker Compose Setup

Paperless-ngx requires several supporting services: PostgreSQL for the database, Redis for the task queue, Gotenberg for PDF processing, and Tika for content extraction from Office documents. It looks like a lot of containers, but they're all lightweight except during active OCR processing.
[docker-compose.yml]
services:
  paperless:
    image: ghcr.io/paperless-ngx/paperless-ngx:2.15
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_started
      gotenberg:
        condition: service_started
      tika:
        condition: service_started
    ports:
      - "8000:8000"
    volumes:
      - paperless-data:/usr/src/paperless/data
      - paperless-media:/usr/src/paperless/media
      - ./consume:/usr/src/paperless/consume
      - ./export:/usr/src/paperless/export
    env_file: .env
    restart: unless-stopped

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: paperless
      POSTGRES_USER: paperless
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - postgres-data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "paperless"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data
    restart: unless-stopped

  gotenberg:
    image: gotenberg/gotenberg:8
    command:
      - "gotenberg"
      - "--chromium-disable-javascript=true"
      - "--chromium-allow-list=file:///tmp/.*"
    restart: unless-stopped

  tika:
    image: apache/tika:3.1
    restart: unless-stopped

volumes:
  paperless-data:
  paperless-media:
  postgres-data:
  redis-data:

OCR processing is CPU-intensive. When you first import a large batch of documents, expect high CPU usage for hours. On a 4-core VPS, processing 500 pages takes about 2-3 hours. Set PAPERLESS_TASK_WORKERS=2 to limit concurrent OCR tasks and prevent the server from becoming unresponsive during bulk imports.
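Alongside PAPERLESS_TASK_WORKERS, you can also throttle intake at the filesystem level during a bulk import. A minimal sketch of the idea — the staging directory, batch size, and throwaway demo paths are my assumptions, not a Paperless feature:

```shell
#!/bin/sh
# Sketch of a drip-feed bulk import: move files from a staging
# directory into the consume folder at most BATCH at a time, so the
# OCR queue never grows unbounded.
feed_batches() {
  staging=$1; consume=$2; batch=$3
  for f in "$staging"/*; do
    [ -e "$f" ] || continue          # staging dir is empty
    # Wait until Paperless has drained the previous batch.
    while [ "$(find "$consume" -maxdepth 1 -type f | wc -l)" -ge "$batch" ]; do
      sleep 30
    done
    mv "$f" "$consume"/
  done
}

# Demo on throwaway directories; point these at ./staging and ./consume.
rm -rf /tmp/pl-staging /tmp/pl-consume
mkdir -p /tmp/pl-staging /tmp/pl-consume
touch /tmp/pl-staging/doc1.pdf /tmp/pl-staging/doc2.pdf
feed_batches /tmp/pl-staging /tmp/pl-consume 50
```

Run it overnight against the staging directory and the import paces itself to whatever the workers can chew through.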

03 Configuration and First Run

Paperless has dozens of configuration options, but only a handful matter for initial setup. The consume folder is the most important concept — any file dropped in this folder is automatically imported, OCR'd, and filed.
[.env]
# Database
DB_PASSWORD=use-a-strong-random-password
PAPERLESS_DBHOST=db
PAPERLESS_DBNAME=paperless
PAPERLESS_DBUSER=paperless
# If your Compose version does not interpolate ${...} inside env_file,
# repeat the literal password here instead.
PAPERLESS_DBPASS=${DB_PASSWORD}

# Redis
PAPERLESS_REDIS=redis://redis:6379

# Core settings
PAPERLESS_SECRET_KEY=change-this-to-a-long-random-string
PAPERLESS_URL=https://paperless.example.com
PAPERLESS_TIME_ZONE=America/New_York
PAPERLESS_OCR_LANGUAGE=eng
PAPERLESS_ADMIN_USER=admin
PAPERLESS_ADMIN_PASSWORD=initial-admin-password

# Performance tuning
PAPERLESS_TASK_WORKERS=2
PAPERLESS_THREADS_PER_WORKER=2
PAPERLESS_WEBSERVER_WORKERS=2

# Tika and Gotenberg
PAPERLESS_TIKA_ENABLED=1
PAPERLESS_TIKA_GOTENBERG_ENDPOINT=http://gotenberg:3000
PAPERLESS_TIKA_ENDPOINT=http://tika:9998

Point a network scanner's 'Scan to Folder' feature at the consume directory (shared via SMB/NFS). Every scanned document automatically appears in Paperless within seconds. This is the workflow that makes going paperless actually sustainable — scan and forget.
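For reference, a Samba share over the consume directory can be as small as this fragment (share name, path, and user are illustrative; add it to smb.conf on the Docker host and point the scanner at it):

```ini
[paperless-consume]
   path = /opt/paperless/consume
   writable = yes
   valid users = scanner
   create mask = 0664
   directory mask = 0775
```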

04 Automated Tagging and Sorting

Paperless-ngx's killer feature is its auto-matching system. You define rules based on document content, and Paperless automatically applies tags, document types, and correspondents. After a few weeks of training, most documents are correctly categorized without any manual intervention.

Here's how I set it up: I created document types for invoices, tax documents, medical records, insurance, and receipts, plus correspondents for recurring senders (my bank, insurance company, utility providers). Then I created matching rules: any document containing "invoice" or "amount due" gets tagged as an invoice, documents with my bank's name automatically get the bank correspondent, and tax forms with "W-2" or "1099" get tagged as tax documents with the appropriate year.

The ML-powered auto-matching is surprisingly good. After I manually classified about 100 documents, Paperless started suggesting correct classifications for new documents with about 85% accuracy; after 500 documents, it's at about 95%. The remaining 5% are unusual documents that don't match existing patterns, which I handle during a weekly 5-minute review where I check recent imports and correct any misclassifications.
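Matching rules are normally created in the web UI, but they can also be scripted against the REST API. A hedged sketch: the host, token, and field values are placeholders, and matching_algorithm 1 is the "any word" mode in the Paperless-ngx API:

```shell
#!/bin/sh
# Build the payload for a tag with an "any word" matching rule.
PAYLOAD='{"name": "invoice", "match": "invoice amount due", "matching_algorithm": 1}'
echo "$PAYLOAD"

# The actual call, commented out so the sketch runs offline:
# curl -s -X POST "https://paperless.example.com/api/tags/" \
#   -H "Authorization: Token $PAPERLESS_TOKEN" \
#   -H "Content-Type: application/json" \
#   -d "$PAYLOAD"
```

The same pattern works for /api/correspondents/ and /api/document_types/, which makes it easy to seed a fresh instance with your whole rule set.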

05 My Scanning Workflow

The key to going paperless is making scanning effortless. If it takes more than 30 seconds per document, you'll stop doing it. Here's what I use:

Physical mail: I have a Fujitsu ScanSnap ix1600 sheet-fed scanner on my desk. I open mail, put it through the scanner, and drop the originals in a shred box. The scanner deposits PDFs directly into the consume folder via network share. Total time: about 10 seconds per document.

Mobile receipts: I use a Paperless-ngx mobile app to snap photos of receipts, which uploads them directly to my server. I use this for restaurant receipts, store purchases, and anything I want to track while I'm out.

Email: Paperless-ngx can monitor an email inbox and import attachments. I forward bills and statements to a dedicated email address. This handles about 60% of my incoming documents automatically.

Existing digital files: For PDFs that are already on my computer, I just drop them in the consume folder. For the initial bulk import, I copied my entire Documents folder into consume and let Paperless process everything over a weekend.
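The initial bulk import can be narrowed to just PDFs with a find one-liner that flattens the directory tree (Paperless files by metadata, not by folder). A sketch on throwaway demo paths; swap in your real Documents and consume directories:

```shell
#!/bin/sh
# Demo directories; replace with ~/Documents and ./consume for real use.
SRC=/tmp/pl-docs
DEST=/tmp/pl-import
rm -rf "$SRC" "$DEST"
mkdir -p "$SRC/taxes" "$DEST"
touch "$SRC/manual.pdf" "$SRC/taxes/w2.PDF" "$SRC/notes.txt"

# Copy only PDFs (case-insensitive), flattening the tree; cp -n skips
# files whose names collide instead of overwriting them.
find "$SRC" -type f -iname '*.pdf' -exec cp -n {} "$DEST"/ \;
```

Note that cp -n silently skips name collisions, so if your tree has many files called scan.pdf, rename them first or copy in smaller passes.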

06 Tips from 3 Years of Use

Backups are critical. Paperless stores originals and processed versions separately. Back up the media volume (which contains all documents) and the database. I use restic to back up to Backblaze B2 nightly; the cost is about $1.50/month for 2,400 documents.

Export regularly. Run the built-in document exporter monthly to create a portable archive: docker compose exec paperless document_exporter ../export. This creates a folder of PDFs and a manifest file that can be imported into any future Paperless instance.

Storage estimates: my 2,400 documents (mostly single-page) use about 8 GB of storage including originals, thumbnails, and the search index. Plan for roughly 3-5 MB per document on average.

OCR language: if you handle documents in multiple languages, set PAPERLESS_OCR_LANGUAGE=eng+deu+fra (English, German, French). OCR quality improves when you specify the expected languages, but adding unnecessary languages slows processing without improving accuracy.

Document retention: I keep everything forever; storage is cheap and you never know when you'll need an old receipt or warranty. If you want automatic cleanup, note that Paperless has no built-in retention policies; you'd need to script deletion of documents older than a certain date via the API.

The one thing I'd do differently: start with a consistent tagging system from day one. I reorganized my tags twice in the first year because my initial system was too granular. Keep it simple; 10-15 tags covering broad categories work better than 50 specific ones.
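The export-plus-restic routine above can be wrapped in one script. A sketch under stated assumptions: RESTIC_REPOSITORY and RESTIC_PASSWORD are exported in the environment for real runs, and the DRY_RUN guard (on by default here) only prints what it would execute:

```shell
#!/bin/sh
# Nightly backup sketch: run the document exporter, then ship the
# export directory to restic. DRY_RUN=1 prints instead of executing.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# 1. Portable export: original PDFs plus a manifest for re-import.
run docker compose exec -T paperless document_exporter ../export

# 2. Push the export directory to the restic repository configured
#    via environment variables.
run restic backup ./export
```

Hook it into cron with DRY_RUN=0 once you've verified the printed commands look right for your paths.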

About the Author

Frank Pegasus

DevOps engineer and self-hosting enthusiast with over a decade of experience running containerized workloads in production. Creator of docker.recipes.