pdfx - PDF to Markdown Converter

A Python CLI tool for converting PDF documents to Markdown, optimized for Apple Silicon with MLX acceleration.

Note: This is a user-friendly wrapper around Docling, providing a command-line interface, YAML configuration, and batch processing capabilities.

✨ Features

🚀 Fast PDF Conversion - Vision Language Model (VLM) based processing with MLX acceleration
🍎 Apple Silicon Optimized - Native MPS (Metal Performance Shaders) support
📦 Batch Processing - Convert entire directories while preserving structure
🎨 Multiple Output Formats - Markdown, JSON, HTML, and DocTags
🔍 OCR Support - Extract text from scanned documents
📊 Table Extraction - Intelligent table structure recognition
🧮 Formula Support - Extract mathematical formulas as LaTeX
🖼️ Image Handling - Embed images as base64 or save separately
⚙️ YAML Configuration - Easy customization with config files
🌐 URL Support - Convert PDFs directly from URLs
🎯 Interactive Mode - Choose output location with a visual menu

📋 Requirements

macOS with Apple Silicon (M1/M2/M3/M4) - recommended for best performance
Python 3.10 - 3.12 (Python 3.13+ not yet supported by Docling)
uv package manager (recommended) or pip

🚀 Quick Start

Installation

# 1. Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone or navigate to project directory
cd pdfx

# 3. Create virtual environment
uv venv --python 3.12
source .venv/bin/activate

# 4. Install package in development mode
uv pip install -e .

# 5. Download required models (one-time setup, ~500MB-1GB)
mkdir -p ~/.cache/docling/models
docling-tools models download

# 6. Verify installation
python verify_install.py
pdfx --help

Basic Usage

# Convert a single PDF
pdfx input.pdf

# Interactive mode (choose output location)
pdfx input.pdf -i

# Convert from URL
pdfx https://arxiv.org/pdf/2408.09869

# Convert a directory
pdfx ~/Documents/pdfs/

# List available models
pdfx --list-models

# Verbose output for debugging
pdfx input.pdf --verbose
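
The inputs shown above fall into three kinds: a URL, a single file, or a directory. As a rough sketch only (this is not pdfx's actual code, and `classify_input` is a hypothetical helper), a wrapper script could tell them apart like this before calling the CLI:

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_input(source: str) -> str:
    """Return "url", "directory", or "file" for a pdfx-style input argument."""
    parsed = urlparse(source)
    # Treat only http(s) sources that include a host as URLs.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    path = Path(source).expanduser()
    if path.is_dir():
        return "directory"
    # Anything else is treated as a single file path (it may not exist yet).
    return "file"
```

For example, `classify_input("https://arxiv.org/pdf/2408.09869")` returns `"url"`, while a path to an existing folder returns `"directory"`.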

📖 Usage

Interactive Mode

Use the -i flag to select output location from a menu:

pdfx document.pdf -i

You'll see:

============================================================
📁 Select output location:
============================================================
1. Default (~/Downloads/)
2. Current directory (./output)
3. Same directory as source
4. Desktop (~/Desktop/)
5. Custom path...
============================================================
Enter your choice (1-5) or press Enter for default:
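
The menu above maps each choice to a directory. A minimal sketch of that mapping, assuming the paths shown in the menu text (`resolve_output_dir` is a hypothetical name, not part of pdfx's documented API):

```python
from pathlib import Path
from typing import Optional


def resolve_output_dir(choice: str, source: Path, custom: Optional[str] = None) -> Path:
    """Map a menu choice ("1"-"5", or "" for the default) to an output directory."""
    choice = choice.strip() or "1"  # pressing Enter selects the default
    if choice == "1":
        return Path.home() / "Downloads"
    if choice == "2":
        return Path.cwd() / "output"
    if choice == "3":
        # A directory input is its own source directory; a file uses its parent.
        return source if source.is_dir() else source.parent
    if choice == "4":
        return Path.home() / "Desktop"
    if choice == "5":
        if not custom:
            raise ValueError("choice 5 requires a custom path")
        return Path(custom).expanduser()
    raise ValueError(f"invalid choice: {choice!r}")
```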

Command-Line Options

pdfx [OPTIONS] input

Options:
  -h, --help                Show help message
  -c, --config CONFIG       Path to config file (default: config.yaml)
  -o, --output OUTPUT       Output directory (overrides config)
  -f, --format FORMAT       Output format: markdown, json, html, doctags
  -v, --verbose             Enable verbose logging
  -i, --interactive         Prompt for output location
  --list-models             List available VLM models
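
When calling pdfx from a script, the options above can be assembled into an argument list. The sketch below uses only the flags documented in this section; `build_pdfx_command` and `run_pdfx` are hypothetical helper names, and `run_pdfx` assumes the `pdfx` executable is on your PATH:

```python
import subprocess
from typing import List, Optional


def build_pdfx_command(
    source: str,
    output: Optional[str] = None,
    fmt: Optional[str] = None,
    config: Optional[str] = None,
    verbose: bool = False,
) -> List[str]:
    """Assemble a pdfx argument list from the documented options."""
    cmd = ["pdfx"]
    if config:
        cmd += ["--config", config]
    if output:
        cmd += ["--output", output]
    if fmt:
        cmd += ["--format", fmt]
    if verbose:
        cmd.append("--verbose")
    cmd.append(source)  # the positional input goes last
    return cmd


def run_pdfx(source: str, **options) -> int:
    """Run pdfx with the given options and return the process exit code."""
    return subprocess.run(build_pdfx_command(source, **options)).returncode
```

Passing a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.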

Configuration

Create or edit config.yaml to customize behavior:

# Model Configuration
model:
  # Pipeline type: vlm (fast) or standard (full features)
  pipeline_type: "standard"

  # VLM model (Apple Silicon optimized)
  vlm_model: "SMOLDOCLING_MLX"

# Output Configuration
output:
  # Format: markdown, json, html, doctags
  # Can be single format or list for multiple outputs
  format: ["markdown"]  # or ["markdown", "json"]

  # Image handling
  include_images: true
  image_mode: "embedded"  # or "referenced"

# Processing Options
processing:
  # OCR for scanned documents
  enable_ocr: false
  ocr_engine: "auto"

  # Performance tuning
  page_batch_size: 8  # Higher = faster but more memory

# Feature Toggles (standard pipeline only)
features:
  # Table extraction
  table_structure: true
  table_mode: "ACCURATE"  # or "FAST"

  # Content enrichment
  formula_enrichment: true
  code_enrichment: true
  picture_classification: true
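
A typo in config.yaml is easy to miss, so it can help to check the values before a long run. The following is a sketch under stated assumptions: it takes a dictionary such as the one `yaml.safe_load` would return for the file above, treats the sample values as defaults, and checks only the options listed in this README. `validate_config` is a hypothetical helper, not something pdfx is known to provide.

```python
from typing import Any, Dict, List

ALLOWED_FORMATS = {"markdown", "json", "html", "doctags"}


def validate_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []

    pipeline = cfg.get("model", {}).get("pipeline_type", "standard")
    if pipeline not in ("vlm", "standard"):
        problems.append(f"unknown pipeline_type: {pipeline!r}")

    output = cfg.get("output", {})
    formats = output.get("format", ["markdown"])
    if isinstance(formats, str):  # a single format is allowed as well as a list
        formats = [formats]
    for fmt in formats:
        if fmt not in ALLOWED_FORMATS:
            problems.append(f"unknown output format: {fmt!r}")
    if output.get("image_mode", "embedded") not in ("embedded", "referenced"):
        problems.append("image_mode must be 'embedded' or 'referenced'")

    batch = cfg.get("processing", {}).get("page_batch_size", 8)
    if not isinstance(batch, int) or isinstance(batch, bool) or batch < 1:
        problems.append("page_batch_size must be a positive integer")

    features = cfg.get("features", {})
    if features.get("table_mode", "ACCURATE") not in ("ACCURATE", "FAST"):
        problems.append("table_mode must be 'ACCURATE' or 'FAST'")
    if pipeline == "vlm" and features:
        problems.append("features apply to the standard pipeline only")

    return problems
```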

🎯 Common Use Cases

Academic Papers

Extract formulas and tables with high accuracy:

# config.yaml
model:
  pipeline_type: "standard"
features:
  formula_enrichment: true
  table_structure: true
  table_mode: "ACCURATE"

pdfx research_paper.pdf

Scanned Documents

Enable OCR for image-based PDFs:

# config.yaml
processing:
  enable_ocr: true
  ocr_engine: "auto"
  page_batch_size: 2

pdfx scanned_document.pdf

Batch Processing

Convert entire directories:

# Convert all PDFs in a directory
pdfx ~/Documents/reports/

# With interactive output selection
pdfx ~/Documents/reports/ -i
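
Batch conversion preserves the directory structure of the input. One way to picture this, as a sketch that assumes a recursive search and Markdown output (`plan_outputs` is a hypothetical helper, not pdfx's implementation):

```python
from pathlib import Path
from typing import Iterator, Tuple


def plan_outputs(
    input_dir: Path, output_dir: Path, suffix: str = ".md"
) -> Iterator[Tuple[Path, Path]]:
    """Yield (pdf, target) pairs, keeping each PDF's path relative to input_dir."""
    # rglob matches "*.pdf" case-sensitively on case-sensitive file systems.
    for pdf in sorted(input_dir.rglob("*.pdf")):
        relative = pdf.relative_to(input_dir)
        yield pdf, (output_dir / relative).with_suffix(suffix)
```

With this mapping, `reports/2024/q1.pdf` would be written to `<output>/2024/q1.md`, so files with the same name in different subfolders do not collide.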

Multiple Output Formats

Export to both Markdown and JSON:

# config.yaml
output:
  format: ["markdown", "json"]

This creates both .md and .json files for each PDF.

📊 Pipeline Comparison

VLM Pipeline (Fast)

Best for: Simple documents, speed priority

model:
  pipeline_type: "vlm"

⚡ Fastest processing (~1 second/page)
🍎 Apple Silicon optimized
⚠️ Limited features (no OCR, table extraction, or enrichments)

Standard Pipeline (Full Features)

Best for: Complex documents, tables, formulas

model:
  pipeline_type: "standard"

✅ Full feature support
📊 Table structure recognition
🧮 Formula extraction
🔍 OCR support
⏱️ Slower but more accurate
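
The comparison above reduces to a simple rule: use the standard pipeline whenever you need a feature the VLM pipeline lacks. A small sketch of that rule (`choose_pipeline` is a hypothetical helper; the returned strings are the `pipeline_type` values from the configuration section):

```python
def choose_pipeline(
    needs_ocr: bool = False,
    needs_tables: bool = False,
    needs_enrichment: bool = False,
) -> str:
    """Pick "standard" when any standard-only feature is needed, else "vlm"."""
    if needs_ocr or needs_tables or needs_enrichment:
        return "standard"
    return "vlm"
```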

🛠️ Troubleshooting

Models Not Found

Download models manually:

mkdir -p ~/.cache/docling/models
docling-tools models download

Python Version Issues

Ensure you're using Python 3.10-3.12:

python --version
# If wrong version:
uv venv --python 3.12
source .venv/bin/activate
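
The same check can be done from inside Python, for example at the top of a setup script. This is a minimal sketch based on the 3.10-3.12 range stated above; `is_supported_python` is a hypothetical helper and is not necessarily what verify_install.py does:

```python
import sys
from typing import Tuple

MIN_VERSION = (3, 10)  # lowest supported version, inclusive
MAX_VERSION = (3, 13)  # first unsupported version, exclusive


def is_supported_python(version: Tuple[int, ...] = tuple(sys.version_info[:2])) -> bool:
    """Return True if the (major, minor) version is in the 3.10-3.12 range."""
    return MIN_VERSION <= tuple(version[:2]) < MAX_VERSION
```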

Out of Memory

Reduce batch size in config:

processing:
  page_batch_size: 2  # or 1 for very large files
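
If you are unsure which value fits your machine, one approach is to halve the batch size after each failed attempt. The sketch below only generates the sequence of values to try; `batch_size_fallbacks` is a hypothetical helper, and pdfx is not known to retry automatically:

```python
from typing import Iterator


def batch_size_fallbacks(start: int = 8) -> Iterator[int]:
    """Yield start, then successive halvings down to 1 (for 8: 8, 4, 2, 1)."""
    if start < 1:
        raise ValueError("start must be at least 1")
    size = start
    while True:
        yield size
        if size == 1:
            break
        size //= 2
```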

Images Not Embedding

Ensure correct configuration:

output:
  include_images: true
  image_mode: "embedded"
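
To check whether an image was embedded, look for a base64 data URI in the generated Markdown. The sketch below shows what such a tag generally looks like; it illustrates the data URI technique and does not claim to reproduce the exact markup pdfx or Docling emits (`embedded_image_markdown` is a hypothetical helper):

```python
import base64


def embedded_image_markdown(data: bytes, mime: str = "image/png", alt: str = "Image") -> str:
    """Return a Markdown image tag whose source is a base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"![{alt}](data:{mime};base64,{encoded})"
```

Because the image bytes live inside the Markdown file, embedded output is self-contained but larger than output that references separate image files.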

Empty Table Columns

This may occur if:

The table contains images or icons instead of text
The table has a complex structure

Try JSON export to inspect the raw extracted data:

output:
  format: ["markdown", "json"]

OCR Not Working

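Enable OCR in config (processing.enable_ocr: true)
Use the standard pipeline (pipeline_type: "standard"); the VLM pipeline does not support OCR
Install OCR dependencies: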

uv pip install easyocr
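
To confirm that an OCR package is visible to the active environment, you can test whether it is importable. A minimal sketch, assuming the import names `easyocr` and `pytesseract` (`installed_modules` is a hypothetical helper):

```python
from importlib.util import find_spec
from typing import Iterable, List


def installed_modules(candidates: Iterable[str] = ("easyocr", "pytesseract")) -> List[str]:
    """Return the candidate top-level module names that can be imported here."""
    return [name for name in candidates if find_spec(name) is not None]
```

An empty result means none of the candidates is installed in the environment that is running the script, which is often a sign that the wrong virtual environment is active.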

For additional help:

Run with --verbose flag for detailed logging
Check Docling documentation
Review Docling GitHub issues

🏗️ Project Structure

pdfx/
├── src/
│   └── pdfx/
│       ├── __init__.py
│       ├── cli.py           # CLI entry point
│       ├── config.py        # Configuration management
│       └── converter.py     # Core conversion logic
├── tests/                   # Test suite
├── examples/
│   └── config.yaml         # Example configuration
├── config.yaml             # Default configuration
├── pyproject.toml          # Package configuration
├── requirements.txt        # Dependencies
├── verify_install.py       # Installation verification
└── README.md               # This file

🔧 Development

Running Tests

pytest

# With coverage
pytest --cov=pdfx --cov-report=html

# Verify installation
python verify_install.py

Package Installation

# Development mode (editable)
uv pip install -e .

# Or install dependencies manually
uv pip install -r requirements.txt

📦 Dependencies

Core:

docling>=2.0.0 - PDF processing engine
mlx-vlm>=0.1.0 - Apple Silicon acceleration
pyyaml>=6.0 - Configuration parsing
docling-core - Core document types

Optional:

easyocr - OCR engine for scanned documents
rapidocr - Alternative lightweight OCR
pytesseract - Tesseract OCR wrapper

Model Downloads:

First run downloads ~500MB-1GB to ~/.cache/docling/models
SmolDocling MLX: ~250MB
Supporting models: ~200-400MB

⚡ Performance

Apple Silicon (MLX):

Simple PDFs: ~1 second/page
Complex PDFs with tables: ~2 seconds/page
With OCR: ~3-5 seconds/page
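
These per-page figures make it easy to estimate how long a job will take. The sketch below multiplies them by the page count; the rates are the approximate numbers listed above, and `estimate_seconds` is a hypothetical helper, so treat the result as a rough guide only:

```python
from typing import Tuple

# Seconds per page as (low, high), taken from the approximate figures above.
SECONDS_PER_PAGE = {
    "simple": (1.0, 1.0),
    "tables": (2.0, 2.0),
    "ocr": (3.0, 5.0),
}


def estimate_seconds(pages: int, kind: str = "simple") -> Tuple[float, float]:
    """Return a (low, high) estimate of conversion time in seconds."""
    low, high = SECONDS_PER_PAGE[kind]
    return pages * low, pages * high
```

For a 20-page scanned document, `estimate_seconds(20, "ocr")` gives a range of 60 to 100 seconds.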

Memory Usage:

Base (models loaded): ~500MB
Per page batch (4 pages): ~200-400MB
Peak (page_batch_size=8): ~1.5GB

🙏 Credits

This project is built on top of Docling by IBM Research.

Docling provides:

Advanced PDF understanding using Vision Language Models
Layout analysis and table structure recognition
Formula extraction and code block detection
Multiple export formats

This wrapper adds:

User-friendly CLI
YAML-based configuration
Batch processing capabilities
Interactive output selection
Apple Silicon optimization out of the box

Related Projects

Docling - Core PDF processing library
MLX - Apple's ML framework for Apple Silicon
SmolDocling - Lightweight VLM model

📄 License

MIT License - see LICENSE file for details.

This project uses Docling (MIT License). See individual package licenses for dependencies.

🤝 Contributing

Contributions welcome! Please ensure:

Code follows existing style
Config changes are documented
Tests pass with various PDF types
CHANGELOG.md is updated with your changes

📮 Support

For issues:

This tool: Open an issue on GitHub
Docling: Docling GitHub Issues
uv package manager: uv GitHub Issues

Made with ❤️ using Docling
