🚀 Fast, intelligent PDF converter powered by Vision Language Models. Convert PDFs to Markdown, HTML, JSON, and more with Apple Silicon optimization. Powered by Docling.

jaydotsee/pdfx

pdfx - PDF to Markdown Converter

A Python CLI tool for converting PDF documents to Markdown, optimized for Apple Silicon with MLX acceleration.

Note: This is a user-friendly wrapper around Docling, providing a command-line interface, YAML configuration, and batch processing capabilities.

Python 3.10-3.12 | License: MIT

✨ Features

  • 🚀 Fast PDF Conversion - Vision Language Model (VLM) based processing with MLX acceleration
  • 🍎 Apple Silicon Optimized - Native MPS (Metal Performance Shaders) support
  • 📦 Batch Processing - Convert entire directories while preserving structure
  • 🎨 Multiple Output Formats - Markdown, JSON, HTML, and DocTags
  • 🔍 OCR Support - Extract text from scanned documents
  • 📊 Table Extraction - Intelligent table structure recognition
  • 🧮 Formula Support - Extract mathematical formulas as LaTeX
  • 🖼️ Image Handling - Embed images as base64 or save separately
  • ⚙️ YAML Configuration - Easy customization with config files
  • 🌐 URL Support - Convert PDFs directly from URLs
  • 🎯 Interactive Mode - Choose output location with a visual menu

📋 Requirements

  • macOS with Apple Silicon (M1/M2/M3/M4) - recommended for best performance
  • Python 3.10 - 3.12 (Python 3.13+ not yet supported by Docling)
  • uv package manager (recommended) or pip

🚀 Quick Start

Installation

# 1. Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone or navigate to project directory
cd pdfx

# 3. Create virtual environment
uv venv --python 3.12
source .venv/bin/activate

# 4. Install package in development mode
uv pip install -e .

# 5. Download required models (one-time setup, ~500MB-1GB)
mkdir -p ~/.cache/docling/models
docling-tools models download

# 6. Verify installation
python verify_install.py
pdfx --help

Basic Usage

# Convert a single PDF
pdfx input.pdf

# Interactive mode (choose output location)
pdfx input.pdf -i

# Convert from URL
pdfx https://arxiv.org/pdf/2408.09869

# Convert a directory
pdfx ~/Documents/pdfs/

# List available models
pdfx --list-models

# Verbose output for debugging
pdfx input.pdf --verbose
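
The commands above pass three kinds of input: a single file, a directory, and a URL. How pdfx tells them apart is not documented here, so the following is only a minimal sketch of one way to do it. The function name `classify_input` is hypothetical and not part of pdfx.

```python
from pathlib import Path
from urllib.parse import urlparse


def classify_input(source: str) -> str:
    """Return "url", "directory", or "file" for a command-line input.

    Hypothetical helper: pdfx may classify its input differently.
    """
    parsed = urlparse(source)
    # Treat only http(s) sources with a host as URLs.
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    # Expand "~" so that paths like ~/Documents/pdfs/ resolve correctly.
    if Path(source).expanduser().is_dir():
        return "directory"
    # Anything else is assumed to be a path to a single PDF.
    return "file"


print(classify_input("https://arxiv.org/pdf/2408.09869"))
print(classify_input("input.pdf"))
```

Checking the URL case first avoids touching the filesystem for remote sources.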

📖 Usage

Interactive Mode

Use the -i flag to select output location from a menu:

pdfx document.pdf -i

You'll see:

============================================================
๐Ÿ“ Select output location:
============================================================
1. Default (~/Downloads/)
2. Current directory (./output)
3. Same directory as source
4. Desktop (~/Desktop/)
5. Custom path...
============================================================
Enter your choice (1-5) or press Enter for default:
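
As an illustration of how the menu choices could map to directories, here is a minimal sketch. The function `resolve_output_dir` is hypothetical, and it assumes the source is a single file; the actual pdfx implementation may behave differently, for example when the source is a directory or a URL.

```python
from pathlib import Path
from typing import Optional


def resolve_output_dir(choice: str, source: Path, custom: Optional[str] = None) -> Path:
    """Map a menu choice ("1" to "5", or "" for the default) to a directory.

    Hypothetical helper mirroring the menu shown above.
    """
    choice = choice.strip() or "1"  # pressing Enter selects the default
    if choice == "1":
        return Path.home() / "Downloads"
    if choice == "2":
        return Path.cwd() / "output"
    if choice == "3":
        return source.parent  # assumes source is a file path
    if choice == "4":
        return Path.home() / "Desktop"
    if choice == "5":
        if not custom:
            raise ValueError("a custom path is required for choice 5")
        return Path(custom).expanduser()
    raise ValueError(f"invalid choice: {choice!r}")


print(resolve_output_dir("3", Path("/tmp/document.pdf")))
```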

Command-Line Options

pdfx [OPTIONS] input

Options:
  -h, --help                Show help message
  -c, --config CONFIG       Path to config file (default: config.yaml)
  -o, --output OUTPUT       Output directory (overrides config)
  -f, --format FORMAT       Output format: markdown, json, html, doctags
  -v, --verbose             Enable verbose logging
  -i, --interactive         Prompt for output location
  --list-models             List available VLM models
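
For readers who want to see how such an interface fits together, the options above can be expressed with Python's standard `argparse` module. This is a sketch based only on the help text, not the contents of `cli.py`. Making `input` optional (so that `--list-models` works on its own) is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented pdfx options (sketch only)."""
    parser = argparse.ArgumentParser(prog="pdfx")
    # Assumption: input is optional so that `pdfx --list-models` needs no file.
    parser.add_argument("input", nargs="?", help="PDF file, directory, or URL")
    parser.add_argument("-c", "--config", default="config.yaml",
                        help="Path to config file")
    parser.add_argument("-o", "--output", help="Output directory (overrides config)")
    parser.add_argument("-f", "--format",
                        choices=["markdown", "json", "html", "doctags"],
                        help="Output format")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable verbose logging")
    parser.add_argument("-i", "--interactive", action="store_true",
                        help="Prompt for output location")
    parser.add_argument("--list-models", action="store_true",
                        help="List available VLM models")
    return parser


args = build_parser().parse_args(["input.pdf", "-f", "json", "-i"])
print(args.input, args.format, args.interactive)
```

The `-h, --help` option is added by `argparse` automatically.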

Configuration

Create or edit config.yaml to customize behavior:

# Model Configuration
model:
  # Pipeline type: vlm (fast) or standard (full features)
  pipeline_type: "standard"

  # VLM model (Apple Silicon optimized)
  vlm_model: "SMOLDOCLING_MLX"

# Output Configuration
output:
  # Format: markdown, json, html, doctags
  # Can be single format or list for multiple outputs
  format: ["markdown"]  # or ["markdown", "json"]

  # Image handling
  include_images: true
  image_mode: "embedded"  # or "referenced"

# Processing Options
processing:
  # OCR for scanned documents
  enable_ocr: false
  ocr_engine: "auto"

  # Performance tuning
  page_batch_size: 8  # Higher = faster but more memory

# Feature Toggles (standard pipeline only)
features:
  # Table extraction
  table_structure: true
  table_mode: "ACCURATE"  # or "FAST"

  # Content enrichment
  formula_enrichment: true
  code_enrichment: true
  picture_classification: true
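
A config file usually only needs the keys you want to change. The sketch below shows one way a partial config could be layered over defaults with a recursive merge. It assumes the values in the example above are the defaults, and `merge_config` is a hypothetical name; how `config.py` actually handles this is not documented here. In practice the `user` dict would come from parsing config.yaml (for example with PyYAML's `yaml.safe_load`).

```python
import copy

# Assumed defaults, copied from the example config above.
DEFAULTS = {
    "model": {"pipeline_type": "standard", "vlm_model": "SMOLDOCLING_MLX"},
    "output": {"format": ["markdown"], "include_images": True, "image_mode": "embedded"},
    "processing": {"enable_ocr": False, "ocr_engine": "auto", "page_batch_size": 8},
    "features": {
        "table_structure": True,
        "table_mode": "ACCURATE",
        "formula_enrichment": True,
        "code_enrichment": True,
        "picture_classification": True,
    },
}


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return a copy of defaults with overrides applied section by section."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so that unspecified keys in a section keep their defaults.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Equivalent to a config.yaml containing only a "processing" section.
user = {"processing": {"enable_ocr": True, "page_batch_size": 2}}
config = merge_config(DEFAULTS, user)
print(config["processing"])
```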

🎯 Common Use Cases

Academic Papers

Extract formulas and tables with high accuracy:

# config.yaml
model:
  pipeline_type: "standard"
features:
  formula_enrichment: true
  table_structure: true
  table_mode: "ACCURATE"

pdfx research_paper.pdf

Scanned Documents

Enable OCR for image-based PDFs:

# config.yaml
processing:
  enable_ocr: true
  ocr_engine: "auto"
  page_batch_size: 2

pdfx scanned_document.pdf

Batch Processing

Convert entire directories:

# Convert all PDFs in a directory
pdfx ~/Documents/reports/

# With interactive output selection
pdfx ~/Documents/reports/ -i
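
To illustrate what "preserving structure" can mean in batch mode, here is a minimal sketch that pairs each PDF under an input directory with an output path at the same relative location. `plan_batch` is a hypothetical name, and the real traversal in pdfx may differ (for instance in how it treats file-name case or hidden files).

```python
from pathlib import Path


def plan_batch(input_dir: Path, output_dir: Path, suffix: str = ".md") -> list:
    """Pair every PDF under input_dir with a mirrored path under output_dir."""
    jobs = []
    # rglob walks subdirectories; the pattern is case-sensitive on most systems.
    for pdf in sorted(input_dir.rglob("*.pdf")):
        relative = pdf.relative_to(input_dir)
        jobs.append((pdf, output_dir / relative.with_suffix(suffix)))
    return jobs


for src, dst in plan_batch(Path("."), Path("output")):
    print(src, "->", dst)
```

With this mapping, `reports/2023/q1.pdf` would be written to `output/2023/q1.md` when `reports/` is the input directory.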

Multiple Output Formats

Export to both Markdown and JSON:

# config.yaml
output:
  format: ["markdown", "json"]

This creates both .md and .json files for each PDF.
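
The sketch below shows how a format setting, given either as a single string or as a list, could be turned into one output path per format. The `.md` and `.json` extensions come from the text above; the `.html` and `.doctags` extensions and the name `output_paths` are assumptions for illustration.

```python
from pathlib import Path

# .md and .json are documented above; .html and .doctags are assumed.
EXTENSIONS = {
    "markdown": ".md",
    "json": ".json",
    "html": ".html",
    "doctags": ".doctags",
}


def output_paths(source: Path, formats, output_dir: Path) -> list:
    """Return one output path per requested format for a source PDF."""
    if isinstance(formats, str):
        formats = [formats]  # accept a single format as well as a list
    paths = []
    for fmt in formats:
        if fmt not in EXTENSIONS:
            raise ValueError(f"unsupported format: {fmt!r}")
        paths.append(output_dir / (source.stem + EXTENSIONS[fmt]))
    return paths


print(output_paths(Path("report.pdf"), ["markdown", "json"], Path("out")))
```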

📊 Pipeline Comparison

VLM Pipeline (Fast)

Best for: Simple documents, speed priority

model:
  pipeline_type: "vlm"

  • ⚡ Fastest processing (~1 second/page)
  • 🍎 Apple Silicon optimized
  • ⚠️ Limited features (no OCR, table extraction, or enrichments)

Standard Pipeline (Full Features)

Best for: Complex documents, tables, formulas

model:
  pipeline_type: "standard"

  • ✅ Full feature support
  • 📊 Table structure recognition
  • 🧮 Formula extraction
  • 🔍 OCR support
  • ⏱️ Slower but more accurate

๐Ÿ› ๏ธ Troubleshooting

Models Not Found

Download models manually:

mkdir -p ~/.cache/docling/models
docling-tools models download

Python Version Issues

Ensure you're using Python 3.10-3.12:

python --version
# If wrong version:
uv venv --python 3.12
source .venv/bin/activate
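
If you prefer to check the interpreter from Python itself, a small sketch like the following tests the running version against the supported range. The function name `is_supported` is hypothetical; `verify_install.py` may or may not perform a similar check.

```python
import sys


def is_supported(version=sys.version_info[:2]) -> bool:
    """Return True if the (major, minor) version is within 3.10 to 3.12."""
    return (3, 10) <= tuple(version[:2]) <= (3, 12)


if not is_supported():
    print("Unsupported Python version; recreate the venv with Python 3.12")
```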

Out of Memory

Reduce batch size in config:

processing:
  page_batch_size: 2  # or 1 for very large files
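
To see why a smaller batch size lowers peak memory, it helps to picture pages being processed in groups of at most `page_batch_size`, so only one group is held in memory at a time. The sketch below is purely illustrative: `page_batches` is a hypothetical helper, and how Docling batches pages internally is not described in this README.

```python
from typing import Iterator


def page_batches(num_pages: int, batch_size: int) -> Iterator[range]:
    """Yield ranges of zero-based page indices, at most batch_size pages each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, num_pages, batch_size):
        yield range(start, min(start + batch_size, num_pages))


# A 5-page document with page_batch_size: 2 is handled in three groups.
for batch in page_batches(5, 2):
    print(list(batch))
```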

Images Not Embedding

Ensure correct configuration:

output:
  include_images: true
  image_mode: "embedded"

Empty Table Columns

This may occur if:

  • The table contains images or icons instead of text
  • The table structure is complex

Try JSON export to see the raw extracted data:

output:
  format: ["markdown", "json"]

OCR Not Working

  1. Enable OCR in config (set enable_ocr: true under processing)
  2. Install OCR dependencies:

uv pip install easyocr

For additional help:

๐Ÿ—๏ธ Project Structure

pdfx/
โ”œโ”€โ”€ src/
โ”‚   โ””โ”€โ”€ pdfx/
โ”‚       โ”œโ”€โ”€ __init__.py
โ”‚       โ”œโ”€โ”€ cli.py           # CLI entry point
โ”‚       โ”œโ”€โ”€ config.py        # Configuration management
โ”‚       โ””โ”€โ”€ converter.py     # Core conversion logic
โ”œโ”€โ”€ tests/                   # Test suite
โ”œโ”€โ”€ examples/
โ”‚   โ””โ”€โ”€ config.yaml         # Example configuration
โ”œโ”€โ”€ config.yaml             # Default configuration
โ”œโ”€โ”€ pyproject.toml          # Package configuration
โ”œโ”€โ”€ requirements.txt        # Dependencies
โ”œโ”€โ”€ verify_install.py       # Installation verification
โ””โ”€โ”€ README.md               # This file

🔧 Development

Running Tests

pytest

# With coverage
pytest --cov=pdfx --cov-report=html

# Verify installation
python verify_install.py

Package Installation

# Development mode (editable)
uv pip install -e .

# Or install dependencies manually
uv pip install -r requirements.txt

📦 Dependencies

Core:

  • docling>=2.0.0 - PDF processing engine
  • mlx-vlm>=0.1.0 - Apple Silicon acceleration
  • pyyaml>=6.0 - Configuration parsing
  • docling-core - Core document types

Optional:

  • easyocr - OCR engine for scanned documents
  • rapidocr - Alternative lightweight OCR
  • pytesseract - Tesseract OCR wrapper

Model Downloads:

  • First run downloads ~500MB-1GB to ~/.cache/docling/models
  • SmolDocling MLX: ~250MB
  • Supporting models: ~200-400MB

⚡ Performance

Apple Silicon (MLX):

  • Simple PDFs: ~1 second/page
  • Complex PDFs with tables: ~2 seconds/page
  • With OCR: ~3-5 seconds/page

Memory Usage:

  • Base (models loaded): ~500MB
  • Per page batch (4 pages): ~200-400MB
  • Peak (batch_size=8): ~1.5GB

๐Ÿ™ Credits

This project is built on top of Docling by IBM Research.

Docling provides:

  • Advanced PDF understanding using Vision Language Models
  • Layout analysis and table structure recognition
  • Formula extraction and code block detection
  • Multiple export formats

This wrapper adds:

  • User-friendly CLI
  • YAML-based configuration
  • Batch processing capabilities
  • Interactive output selection
  • Apple Silicon optimization out of the box

Related Projects

  • Docling - Core PDF processing library
  • MLX - Apple's ML framework for Apple Silicon
  • SmolDocling - Lightweight VLM model

📚 Resources

📄 License

MIT License - see LICENSE file for details.

This project uses Docling (MIT License). See individual package licenses for dependencies.

๐Ÿค Contributing

Contributions welcome! Please ensure:

  • Code follows existing style
  • Config changes are documented
  • Tests pass with various PDF types
  • CHANGELOG.md is updated with your changes

📮 Support

For issues:


Made with ❤️ using Docling
