DocProcesser
Document processing API built with FastAPI. Converts images to LaTeX/Markdown/MathML and Markdown to DOCX.
Features
- Image OCR API (POST /doc_process/v1/image/ocr)
  - Accept images via URL or base64
  - Automatic layout detection using DocLayout-YOLO
  - Text and formula recognition via PaddleOCR-VL
  - Output in LaTeX, Markdown, and MathML formats
- Markdown to DOCX API (POST /doc_process/v1/convert/docx)
  - Convert markdown content to Word documents
  - Preserve formatting, tables, and code blocks
Prerequisites
- Python 3.11+
- NVIDIA GPU with CUDA support (RTX 5080 recommended)
- PaddleOCR-VL service running via vLLM (default: http://localhost:8080/v1)
- Pre-downloaded models:
  - DocLayout-YOLO
  - PP-DocLayoutV2
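Before starting the API, it can help to confirm that the PaddleOCR-VL service is reachable. The snippet below is a minimal sketch, assuming the vLLM server exposes its OpenAI-compatible model listing at `<base URL>/models`; the helper names `models_endpoint` and `vl_service_is_up` are hypothetical and not part of this project.

```python
import json
import urllib.error
import urllib.request

# Default base URL taken from the prerequisites above.
DEFAULT_VL_URL = "http://localhost:8080/v1"


def models_endpoint(base_url: str = DEFAULT_VL_URL) -> str:
    """Build the model-listing URL (assumes an OpenAI-compatible server)."""
    return base_url.rstrip("/") + "/models"


def vl_service_is_up(base_url: str = DEFAULT_VL_URL, timeout: float = 3.0) -> bool:
    """Return True if the server answers the model-listing request with JSON."""
    try:
        with urllib.request.urlopen(models_endpoint(base_url), timeout=timeout) as resp:
            json.load(resp)  # raises ValueError if the body is not JSON
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body.
        return False


if __name__ == "__main__":
    print("PaddleOCR-VL reachable:", vl_service_is_up())
```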
Quick Start
1. Install Dependencies
Using uv:
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
uv pip install -e .
2. Download Models
Download the required models and place them in the models/ directory:
mkdir -p models/DocLayout models/PP-DocLayout
# DocLayout-YOLO (from HuggingFace)
# https://huggingface.co/juliozhao/DocLayout-YOLO-DocStructBench
# Place the .pt file in models/DocLayout/
# PP-DocLayoutV2 (from PaddlePaddle)
# Place the model files in models/PP-DocLayout/
3. Configure Environment
Create a .env file:
# PaddleOCR-VL vLLM server URL
PADDLEOCR_VL_URL=http://localhost:8080/v1
# Model paths
DOCLAYOUT_MODEL_PATH=models/DocLayout/doclayout_yolo_docstructbench_imgsz1024.pt
PP_DOCLAYOUT_MODEL_DIR=models/PP-DocLayout
# Server settings
HOST=0.0.0.0
PORT=8053
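For illustration, the variables above could be read in Python roughly as follows. This is a sketch only: the `Settings` dataclass and `load_settings` helper are hypothetical, the real `app/core/config.py` may work differently, and the code assumes the variables are already exported into the process environment (it does not parse the .env file itself).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container mirroring the variables in the .env example."""
    paddleocr_vl_url: str
    doclayout_model_path: str
    pp_doclayout_model_dir: str
    host: str
    port: int


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (defaults to os.environ), using the README defaults."""
    env = os.environ if env is None else env
    return Settings(
        paddleocr_vl_url=env.get("PADDLEOCR_VL_URL", "http://localhost:8080/v1"),
        doclayout_model_path=env.get(
            "DOCLAYOUT_MODEL_PATH",
            "models/DocLayout/doclayout_yolo_docstructbench_imgsz1024.pt",
        ),
        pp_doclayout_model_dir=env.get("PP_DOCLAYOUT_MODEL_DIR", "models/PP-DocLayout"),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8053")),  # environment values are strings
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing a plain dict instead of `os.environ` keeps the helper easy to test without touching the real environment.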
4. Run the Server
uvicorn app.main:app --host 0.0.0.0 --port 8053
Docker Deployment
Build and Run with GPU
# Build the image
docker build -t doc-processer .
# Run with GPU support
docker run --gpus all -p 8053:8053 \
-v ./models/DocLayout:/app/models/DocLayout:ro \
-v ./models/PP-DocLayout:/app/models/PP-DocLayout:ro \
-e PADDLEOCR_VL_URL=http://host.docker.internal:8080/v1 \
doc-processer
Using Docker Compose
# Start the service with GPU
docker-compose up -d doc-processer
# Or without GPU (CPU mode)
docker-compose --profile cpu up -d doc-processer-cpu
API Usage
Image OCR
# Using image URL
curl -X POST http://localhost:8053/doc_process/v1/image/ocr \
-H "Content-Type: application/json" \
-d '{"image_url": "https://example.com/document.png"}'
# Using base64 image
curl -X POST http://localhost:8053/doc_process/v1/image/ocr \
-H "Content-Type: application/json" \
-d '{"image_base64": "iVBORw0KGgo..."}'
Response:
{
  "latex": "\\section{Title}...",
  "markdown": "# Title\n...",
  "mathml": "<math>...</math>",
  "layout_info": {
    "regions": [
      {"type": "text", "bbox": [10, 20, 100, 50], "confidence": 0.95}
    ],
    "has_plain_text": true,
    "has_formula": false
  },
  "recognition_mode": "mixed_recognition"
}
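The same calls can be made from Python. The following is a minimal client sketch using only the standard library; the helper names are hypothetical, the field names come from the request and response examples above, and the rule that exactly one of `image_url` or `image_base64` is sent is an assumption about the API rather than documented behavior.

```python
import base64
import json
import urllib.request

OCR_URL = "http://localhost:8053/doc_process/v1/image/ocr"


def build_ocr_payload(image_url=None, image_bytes=None) -> dict:
    """Build the JSON body from either a URL or raw image bytes."""
    # Assumption: the endpoint expects exactly one of the two fields.
    if (image_url is None) == (image_bytes is None):
        raise ValueError("pass exactly one of image_url or image_bytes")
    if image_url is not None:
        return {"image_url": image_url}
    return {"image_base64": base64.b64encode(image_bytes).decode("ascii")}


def summarize_regions(response: dict) -> dict:
    """Count detected layout regions by their "type" field."""
    counts = {}
    for region in response.get("layout_info", {}).get("regions", []):
        counts[region["type"]] = counts.get(region["type"], 0) + 1
    return counts


def run_ocr(payload: dict, url: str = OCR_URL, timeout: float = 60.0) -> dict:
    """POST the payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    result = run_ocr(build_ocr_payload(image_url="https://example.com/document.png"))
    print(result["recognition_mode"], summarize_regions(result))
```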
Markdown to DOCX
curl -X POST http://localhost:8053/doc_process/v1/convert/docx \
-H "Content-Type: application/json" \
-d '{"markdown": "# Hello World\n\nThis is a test.", "filename": "output"}' \
--output output.docx
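A Python equivalent of the curl call might look like the sketch below. The helper names `build_docx_request` and `convert_to_docx` are hypothetical, and the code assumes the endpoint returns the DOCX file as the raw response body, as the `--output` flag in the curl example suggests.

```python
import json
import urllib.request
from pathlib import Path

DOCX_URL = "http://localhost:8053/doc_process/v1/convert/docx"


def build_docx_request(markdown: str, filename: str = "output",
                       url: str = DOCX_URL) -> urllib.request.Request:
    """Prepare the POST request with the same JSON fields as the curl example."""
    body = json.dumps({"markdown": markdown, "filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def convert_to_docx(markdown: str, out_path: str, filename: str = "output",
                    timeout: float = 60.0) -> Path:
    """Send the request and write the returned bytes to out_path."""
    request = build_docx_request(markdown, filename)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        data = resp.read()  # assumed to be the binary DOCX content
    out = Path(out_path)
    out.write_bytes(data)
    return out


if __name__ == "__main__":
    print(convert_to_docx("# Hello World\n\nThis is a test.", "output.docx"))
```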
Project Structure
doc_processer/
├── app/
│ ├── api/v1/
│ │ ├── endpoints/
│ │ │ ├── image.py # Image OCR endpoint
│ │ │ └── convert.py # Markdown to DOCX endpoint
│ │ └── router.py
│ ├── core/
│ │ ├── config.py # Settings
│ │ └── dependencies.py # DI providers
│ ├── services/
│ │ ├── image_processor.py # OpenCV preprocessing
│ │ ├── layout_detector.py # DocLayout-YOLO
│ │ ├── ocr_service.py # PaddleOCR-VL client
│ │ └── docx_converter.py # Markdown to DOCX
│ ├── schemas/
│ │ ├── image.py
│ │ └── convert.py
│ └── main.py
├── models/ # Pre-downloaded models (git-ignored)
├── Dockerfile
├── docker-compose.yml
├── pyproject.toml
└── README.md
Processing Pipeline
Image OCR Flow
- Input: Accept image_url or image_base64
- Preprocessing: Add 30% whitespace padding using OpenCV
- Layout Detection: DocLayout-YOLO detects regions (text, formula, table, figure)
- Recognition:
  - If plain text detected → PP-DocLayoutV2 for mixed content recognition
  - Otherwise → PaddleOCR-VL with formula prompt
- Output Conversion: Generate LaTeX, Markdown, and MathML
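The preprocessing and routing steps of this flow can be sketched in plain Python. This is an illustration under stated assumptions, not the project's implementation: it assumes the 30% padding is measured against each image dimension and added on every side, that region types use the labels "text" and "formula", and that the non-mixed mode is named "formula_recognition" (a hypothetical name; only "mixed_recognition" appears in the response example).

```python
def padding_for(width: int, height: int, ratio: float = 0.30) -> tuple[int, int]:
    """Return (pad_x, pad_y), the border in pixels added on each side.

    Assumption: the padding is `ratio` times the corresponding dimension.
    """
    return int(round(width * ratio)), int(round(height * ratio))


def choose_recognition_mode(regions: list[dict]) -> dict:
    """Pick the recognition branch from the detected layout regions."""
    has_plain_text = any(r["type"] == "text" for r in regions)
    has_formula = any(r["type"] == "formula" for r in regions)
    # Plain text present -> mixed content branch; otherwise the formula branch.
    # "formula_recognition" is a hypothetical mode name used for illustration.
    mode = "mixed_recognition" if has_plain_text else "formula_recognition"
    return {
        "has_plain_text": has_plain_text,
        "has_formula": has_formula,
        "recognition_mode": mode,
    }


if __name__ == "__main__":
    print(padding_for(1000, 500))
    print(choose_recognition_mode([{"type": "formula"}]))
```

With OpenCV, padding values like these would typically be passed to `cv2.copyMakeBorder` with a constant white border.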
Hardware Requirements
- Minimum: 8GB GPU VRAM
- Recommended: RTX 5080 16GB or equivalent
- CPU: 4+ cores
- RAM: 16GB+
License
MIT