A Python CLI tool for converting PDF documents to Markdown, optimized for Apple Silicon with MLX acceleration.
Note: This is a user-friendly wrapper around Docling, providing a command-line interface, YAML configuration, and batch processing capabilities.
- Fast PDF Conversion - Vision Language Model (VLM) based processing with MLX acceleration
- Apple Silicon Optimized - Native MPS (Metal Performance Shaders) support
- Batch Processing - Convert entire directories while preserving structure
- Multiple Output Formats - Markdown, JSON, HTML, and DocTags
- OCR Support - Extract text from scanned documents
- Table Extraction - Intelligent table structure recognition
- Formula Support - Extract mathematical formulas as LaTeX
- Image Handling - Embed images as base64 or save separately
- YAML Configuration - Easy customization with config files
- URL Support - Convert PDFs directly from URLs
- Interactive Mode - Choose output location with a visual menu
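To illustrate what "preserving structure" means for batch processing, here is a minimal sketch of how input paths could be mirrored under an output directory. The helpers `mirrored_output_path` and `plan_batch` are hypothetical names used only for illustration; pdfx's own implementation may differ.

```python
from pathlib import Path


def mirrored_output_path(pdf_path, input_root, output_root, suffix=".md"):
    """Map input_root/sub/doc.pdf to output_root/sub/doc.md (hypothetical helper)."""
    relative = Path(pdf_path).relative_to(input_root)
    return Path(output_root) / relative.with_suffix(suffix)


def plan_batch(input_root, output_root):
    """Pair every PDF found recursively under input_root with its mirrored output path."""
    input_root = Path(input_root)
    return [
        (pdf, mirrored_output_path(pdf, input_root, output_root))
        for pdf in sorted(input_root.rglob("*.pdf"))
    ]
```

Under these assumptions, `reports/2024/q1.pdf` converted into `out/` would land at `out/2024/q1.md`.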
- macOS with Apple Silicon (M1/M2/M3/M4) - recommended for best performance
- Python 3.10 - 3.12 (Python 3.13+ not yet supported by Docling)
- uv package manager (recommended) or pip
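The Python version requirement above can be checked programmatically before installing. This is a minimal sketch; `is_supported_python` is a hypothetical helper and is not part of pdfx or Docling.

```python
import sys

# Supported range taken from the requirements above: Python 3.10 through 3.12.
MIN_VERSION = (3, 10)
MAX_VERSION = (3, 12)


def is_supported_python(version=None):
    """Return True if the (major, minor) version is inside the supported range.

    Hypothetical helper for illustration only.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if is_supported_python() else "not supported"
    print(f"Python {current} is {status}")
```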
# 1. Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Clone or navigate to project directory
cd pdfx
# 3. Create virtual environment
uv venv --python 3.12
source .venv/bin/activate
# 4. Install package in development mode
uv pip install -e .
# 5. Download required models (one-time setup, ~500MB-1GB)
mkdir -p ~/.cache/docling/models
docling-tools models download
# 6. Verify installation
python verify_install.py
pdfx --help
# Convert a single PDF
pdfx input.pdf
# Interactive mode (choose output location)
pdfx input.pdf -i
# Convert from URL
pdfx https://arxiv.org/pdf/2408.09869
# Convert a directory
pdfx ~/Documents/pdfs/
# List available models
pdfx --list-models
# Verbose output for debugging
pdfx input.pdf --verbose
Use the -i flag to select the output location from a menu:
pdfx document.pdf -i
You'll see:
============================================================
Select output location:
============================================================
1. Default (~/Downloads/)
2. Current directory (./output)
3. Same directory as source
4. Desktop (~/Desktop/)
5. Custom path...
============================================================
Enter your choice (1-5) or press Enter for default:
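The menu above maps each choice to a directory. The following sketch shows one way that mapping could work, assuming that option 3 means the parent directory of a source file, and that pressing Enter selects option 1. `resolve_output_dir` is a hypothetical helper, not pdfx's actual code.

```python
from pathlib import Path


def resolve_output_dir(choice, source, custom_path=None):
    """Map a menu choice ("1"-"5", or "" for the default) to an output directory.

    Hypothetical helper for illustration; the real implementation may differ.
    """
    choice = choice.strip() or "1"  # pressing Enter selects the default
    source = Path(source)
    if choice == "1":
        return Path.home() / "Downloads"
    if choice == "2":
        return Path.cwd() / "output"
    if choice == "3":
        # For a file, use its parent; for a directory, use the directory itself.
        return source if source.is_dir() else source.parent
    if choice == "4":
        return Path.home() / "Desktop"
    if choice == "5":
        if not custom_path:
            raise ValueError("choice 5 requires a custom path")
        return Path(custom_path).expanduser()
    raise ValueError(f"invalid choice: {choice!r}")
```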
pdfx [OPTIONS] input
Options:
-h, --help Show help message
-c, --config CONFIG Path to config file (default: config.yaml)
-o, --output OUTPUT Output directory (overrides config)
-f, --format FORMAT Output format: markdown, json, html, doctags
-v, --verbose Enable verbose logging
-i, --interactive Prompt for output location
--list-models List available VLM models
Create or edit config.yaml to customize behavior:
# Model Configuration
model:
  # Pipeline type: vlm (fast) or standard (full features)
  pipeline_type: "standard"
  # VLM model (Apple Silicon optimized)
  vlm_model: "SMOLDOCLING_MLX"
# Output Configuration
output:
  # Format: markdown, json, html, doctags
  # Can be single format or list for multiple outputs
  format: ["markdown"] # or ["markdown", "json"]
  # Image handling
  include_images: true
  image_mode: "embedded" # or "referenced"
# Processing Options
processing:
  # OCR for scanned documents
  enable_ocr: false
  ocr_engine: "auto"
  # Performance tuning
  page_batch_size: 8 # Higher = faster but more memory
# Feature Toggles (standard pipeline only)
features:
  # Table extraction
  table_structure: true
  table_mode: "ACCURATE" # or "FAST"
  # Content enrichment
  formula_enrichment: true
  code_enrichment: true
  picture_classification: true
Extract formulas and tables with high accuracy:
# config.yaml
model:
  pipeline_type: "standard"
features:
  formula_enrichment: true
  table_structure: true
  table_mode: "ACCURATE"
pdfx research_paper.pdf
Enable OCR for image-based PDFs:
# config.yaml
processing:
  enable_ocr: true
  ocr_engine: "auto"
  page_batch_size: 2
pdfx scanned_document.pdf
Convert entire directories:
# Convert all PDFs in a directory
pdfx ~/Documents/reports/
# With interactive output selection
pdfx ~/Documents/reports/ -i
Export to both Markdown and JSON:
# config.yaml
output:
  format: ["markdown", "json"]
This creates both .md and .json files for each PDF.
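As a rough illustration of how a format list translates into output files, the sketch below computes one path per requested format. The `.md` and `.json` suffixes come from the text above; the suffixes for `html` and `doctags`, and the helper name `planned_outputs`, are assumptions made for this example.

```python
from pathlib import Path

# .md and .json are stated above; the other two suffixes are assumptions.
EXTENSIONS = {
    "markdown": ".md",
    "json": ".json",
    "html": ".html",        # assumed
    "doctags": ".doctags",  # assumed
}


def planned_outputs(pdf_path, output_dir, formats):
    """Return one output path per requested format for a single PDF."""
    if isinstance(formats, str):  # the config allows a single format or a list
        formats = [formats]
    stem = Path(pdf_path).stem
    return [Path(output_dir) / f"{stem}{EXTENSIONS[fmt]}" for fmt in formats]
```

With `format: ["markdown", "json"]`, a file named `report.pdf` would yield `report.md` and `report.json` in the output directory.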
Best for: Simple documents, speed priority
model:
  pipeline_type: "vlm"
- Fastest processing (~1 second/page)
- Apple Silicon optimized
- Limited features (no OCR, table extraction, or enrichments)
Best for: Complex documents, tables, formulas
model:
  pipeline_type: "standard"
- Full feature support
- Table structure recognition
- Formula extraction
- OCR support
- Slower but more accurate
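The trade-off between the two pipelines can be summarized as a simple rule: use the VLM pipeline unless the document needs a feature it lacks. A minimal sketch, where `choose_pipeline` is a hypothetical helper and not part of pdfx:

```python
def choose_pipeline(needs_ocr=False, needs_tables=False, needs_enrichment=False):
    """Return "standard" when any feature the VLM pipeline lacks is required.

    Per the lists above, the VLM pipeline has no OCR, table extraction,
    or enrichments (formulas, code, picture classification).
    """
    if needs_ocr or needs_tables or needs_enrichment:
        return "standard"
    return "vlm"
```

The returned string matches the values accepted by `pipeline_type` in config.yaml.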
Download models manually:
mkdir -p ~/.cache/docling/models
docling-tools models download
Ensure you're using Python 3.10-3.12:
python --version
# If wrong version:
uv venv --python 3.12
source .venv/bin/activate
Reduce batch size in config:
processing:
  page_batch_size: 2 # or 1 for very large files
Ensure correct configuration:
output:
  include_images: true
  image_mode: "embedded"
This may occur if:
- Table contains images/icons instead of text
- Complex table structure
- Try JSON export to see raw extracted data:
output:
  format: ["markdown", "json"]
- Enable OCR in config
- Install OCR dependencies:
uv pip install easyocr
For additional help:
- Run with the --verbose flag for detailed logging
- Check the Docling documentation
- Review Docling GitHub issues
pdfx/
├── src/
│   └── pdfx/
│       ├── __init__.py
│       ├── cli.py            # CLI entry point
│       ├── config.py         # Configuration management
│       └── converter.py      # Core conversion logic
├── tests/                    # Test suite
├── examples/
│   └── config.yaml           # Example configuration
├── config.yaml               # Default configuration
├── pyproject.toml            # Package configuration
├── requirements.txt          # Dependencies
├── verify_install.py         # Installation verification
└── README.md                 # This file
pytest
# With coverage
pytest --cov=pdfx --cov-report=html
# Verify installation
python verify_install.py
# Development mode (editable)
uv pip install -e .
# Or install dependencies manually
uv pip install -r requirements.txt
Core:
- docling>=2.0.0 - PDF processing engine
- mlx-vlm>=0.1.0 - Apple Silicon acceleration
- pyyaml>=6.0 - Configuration parsing
- docling-core - Core document types
Optional:
- easyocr - OCR engine for scanned documents
- rapidocr - Alternative lightweight OCR
- pytesseract - Tesseract OCR wrapper
Model Downloads:
- First run downloads ~500MB-1GB to ~/.cache/docling/models
- SmolDocling MLX: ~250MB
- Supporting models: ~200-400MB
Apple Silicon (MLX):
- Simple PDFs: ~1 second/page
- Complex PDFs with tables: ~2 seconds/page
- With OCR: ~3-5 seconds/page
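The per-page figures above can be turned into a rough time estimate for a whole document. This sketch multiplies the page count by the quoted rates. The function name and the `kind` labels are illustrative assumptions, and real timings will vary with hardware and content.

```python
# Approximate seconds per page on Apple Silicon, taken from the figures above.
# Each entry is a (low, high) range.
SECONDS_PER_PAGE = {
    "simple": (1.0, 1.0),
    "complex": (2.0, 2.0),
    "ocr": (3.0, 5.0),
}


def estimate_seconds(pages, kind="simple"):
    """Return a (low, high) estimate of conversion time in seconds."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    low, high = SECONDS_PER_PAGE[kind]
    return pages * low, pages * high
```

For example, a 40-page scanned document processed with OCR would take roughly 120 to 200 seconds by this estimate.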
Memory Usage:
- Base (models loaded): ~500MB
- Per page batch (4 pages): ~200-400MB
- Peak (batch_size=8): ~1.5GB
This project is built on top of Docling by IBM Research.
Docling provides:
- Advanced PDF understanding using Vision Language Models
- Layout analysis and table structure recognition
- Formula extraction and code block detection
- Multiple export formats
This wrapper adds:
- User-friendly CLI
- YAML-based configuration
- Batch processing capabilities
- Interactive output selection
- Apple Silicon optimization out of the box
- Docling - Core PDF processing library
- MLX - Apple's ML framework for Apple Silicon
- SmolDocling - Lightweight VLM model
MIT License - see LICENSE file for details.
This project uses Docling (MIT License). See individual package licenses for dependencies.
Contributions welcome! Please ensure:
- Code follows existing style
- Config changes are documented
- Tests pass with various PDF types
- CHANGELOG.md is updated with your changes
For issues:
- This tool: Open an issue on GitHub
- Docling: Docling GitHub Issues
- uv package manager: uv GitHub Issues
Made with ❤️ using Docling