
DocProcesser

Document processing API built with FastAPI. Converts images to LaTeX/Markdown/MathML and Markdown to DOCX.

Features

  • Image OCR API (POST /doc_process/v1/image/ocr)

    • Accept images via URL or base64
    • Automatic layout detection using DocLayout-YOLO
    • Text and formula recognition via PaddleOCR-VL
    • Output in LaTeX, Markdown, and MathML formats
  • Markdown to DOCX API (POST /doc_process/v1/convert/docx)

    • Convert markdown content to Word documents
    • Preserve formatting, tables, and code blocks

Prerequisites

  • Python 3.11+
  • NVIDIA GPU with CUDA support (RTX 5080 recommended)
  • PaddleOCR-VL service running via vLLM (default: http://localhost:8080/v1)
  • Pre-downloaded models:
    • DocLayout-YOLO
    • PP-DocLayoutV2

Quick Start

1. Install Dependencies

Using uv:

# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e .

2. Download Models

Download the required models and place them in the models/ directory:

mkdir -p models/DocLayout models/PP-DocLayout

# DocLayout-YOLO (from HuggingFace)
# https://huggingface.co/juliozhao/DocLayout-YOLO-DocStructBench
# Place the .pt file in models/DocLayout/

# PP-DocLayoutV2 (from PaddlePaddle)
# Place the model files in models/PP-DocLayout/
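Before starting the server, it can help to confirm that the models landed where the steps above put them. The following is a minimal sketch, not part of the project: `missing_models` is a hypothetical helper, and it only checks that a `.pt` file exists in models/DocLayout/ and that models/PP-DocLayout/ is non-empty, without validating the files themselves.

```python
from pathlib import Path


def missing_models(doclayout_dir="models/DocLayout", pp_dir="models/PP-DocLayout"):
    """Return a list of human-readable problems; an empty list means both look present.

    Hypothetical helper: it checks file presence only, not model integrity.
    """
    problems = []

    doclayout = Path(doclayout_dir)
    # DocLayout-YOLO ships as a single .pt checkpoint.
    if not doclayout.is_dir() or not any(doclayout.glob("*.pt")):
        problems.append(f"no .pt file found in {doclayout}")

    pp = Path(pp_dir)
    # PP-DocLayoutV2 is a directory of model files; only check it is non-empty.
    if not pp.is_dir() or not any(pp.iterdir()):
        problems.append(f"no model files found in {pp}")

    return problems


if __name__ == "__main__":
    for problem in missing_models():
        print("missing:", problem)
```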

3. Configure Environment

Create a .env file:

# PaddleOCR-VL vLLM server URL
PADDLEOCR_VL_URL=http://localhost:8080/v1

# Model paths
DOCLAYOUT_MODEL_PATH=models/DocLayout/doclayout_yolo_docstructbench_imgsz1024.pt
PP_DOCLAYOUT_MODEL_DIR=models/PP-DocLayout

# Server settings
HOST=0.0.0.0
PORT=8053
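As an illustration of how these variables might be consumed, here is a minimal sketch using only the standard library. The `Settings` class and `load_settings` function are hypothetical and are not taken from app/core/config.py. The sketch reads variables that are already exported in the environment and does not parse the .env file itself. The defaults mirror the values shown above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings object; field names mirror the .env keys above."""

    paddleocr_vl_url: str
    doclayout_model_path: str
    pp_doclayout_model_dir: str
    host: str
    port: int


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (os.environ by default), with documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        paddleocr_vl_url=env.get("PADDLEOCR_VL_URL", "http://localhost:8080/v1"),
        doclayout_model_path=env.get(
            "DOCLAYOUT_MODEL_PATH",
            "models/DocLayout/doclayout_yolo_docstructbench_imgsz1024.pt",
        ),
        pp_doclayout_model_dir=env.get("PP_DOCLAYOUT_MODEL_DIR", "models/PP-DocLayout"),
        host=env.get("HOST", "0.0.0.0"),
        # PORT arrives as a string and is converted to an integer here.
        port=int(env.get("PORT", "8053")),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests.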

4. Run the Server

uvicorn app.main:app --host 0.0.0.0 --port 8053

Docker Deployment

Build and Run with GPU

# Build the image
docker build -t doc-processer .

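# Run with GPU support
# Absolute host paths work across Docker versions; on Linux, --add-host lets
# the container resolve host.docker.internal
docker run --gpus all -p 8053:8053 \
  --add-host=host.docker.internal:host-gateway \
  -v "$(pwd)/models/DocLayout:/app/models/DocLayout:ro" \
  -v "$(pwd)/models/PP-DocLayout:/app/models/PP-DocLayout:ro" \
  -e PADDLEOCR_VL_URL=http://host.docker.internal:8080/v1 \
  doc-processer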

Using Docker Compose

# Start the service with GPU
docker-compose up -d doc-processer

# Or without GPU (CPU mode)
docker-compose --profile cpu up -d doc-processer-cpu

API Usage

Image OCR

# Using image URL
curl -X POST http://localhost:8053/doc_process/v1/image/ocr \
  -H "Content-Type: application/json" \
  -d '{"image_url": "https://example.com/document.png"}'

# Using base64 image
curl -X POST http://localhost:8053/doc_process/v1/image/ocr \
  -H "Content-Type: application/json" \
  -d '{"image_base64": "iVBORw0KGgo..."}'

Response:

{
  "latex": "\\section{Title}...",
  "markdown": "# Title\n...",
  "mathml": "<math>...</math>",
  "layout_info": {
    "regions": [
      {"type": "text", "bbox": [10, 20, 100, 50], "confidence": 0.95}
    ],
    "has_plain_text": true,
    "has_formula": false
  },
  "recognition_mode": "mixed_recognition"
}
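The same call can be made from Python. The sketch below uses only the standard library and assumes the request and response shapes shown above. The function names and the `min_confidence` threshold are illustrative, not part of the API.

```python
import base64
import json
import urllib.request

OCR_URL = "http://localhost:8053/doc_process/v1/image/ocr"


def build_ocr_request(image_bytes: bytes, url: str = OCR_URL) -> urllib.request.Request:
    """Wrap raw image bytes in the JSON body used by the base64 curl example."""
    payload = {"image_base64": base64.b64encode(image_bytes).decode("ascii")}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_ocr(image_bytes: bytes, url: str = OCR_URL) -> dict:
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(build_ocr_request(image_bytes, url)) as resp:
        return json.load(resp)


def confident_regions(response: dict, min_confidence: float = 0.5) -> list:
    """Keep layout regions at or above a confidence threshold (threshold is an assumption)."""
    regions = response.get("layout_info", {}).get("regions", [])
    return [r for r in regions if r["confidence"] >= min_confidence]
```

With the server running, `run_ocr(open("document.png", "rb").read())["markdown"]` would return the Markdown output for a local file.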

Markdown to DOCX

curl -X POST http://localhost:8053/doc_process/v1/convert/docx \
  -H "Content-Type: application/json" \
  -d '{"markdown": "# Hello World\n\nThis is a test.", "filename": "output"}' \
  --output output.docx
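A Python equivalent of the curl call above, again as a standard-library sketch. It assumes the endpoint returns the DOCX bytes in the response body, as the `--output` flag suggests. `build_docx_request` and `save_docx` are hypothetical names.

```python
import json
import urllib.request
from pathlib import Path

DOCX_URL = "http://localhost:8053/doc_process/v1/convert/docx"


def build_docx_request(markdown: str, filename: str = "output",
                       url: str = DOCX_URL) -> urllib.request.Request:
    """Build the POST request with the same JSON fields as the curl example."""
    payload = {"markdown": markdown, "filename": filename}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_docx(markdown: str, out_path: str, url: str = DOCX_URL) -> Path:
    """Convert Markdown and write the returned bytes to out_path (requires a running server)."""
    target = Path(out_path)
    # The filename field is sent without an extension, as in the curl example.
    request = build_docx_request(markdown, target.stem, url)
    with urllib.request.urlopen(request) as resp:
        target.write_bytes(resp.read())
    return target
```

For example, `save_docx("# Hello World\n\nThis is a test.", "output.docx")` mirrors the curl invocation.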

Project Structure

doc_processer/
├── app/
│   ├── api/v1/
│   │   ├── endpoints/
│   │   │   ├── image.py      # Image OCR endpoint
│   │   │   └── convert.py    # Markdown to DOCX endpoint
│   │   └── router.py
│   ├── core/
│   │   ├── config.py         # Settings
│   │   └── dependencies.py   # DI providers
│   ├── services/
│   │   ├── image_processor.py    # OpenCV preprocessing
│   │   ├── layout_detector.py    # DocLayout-YOLO
│   │   ├── ocr_service.py        # PaddleOCR-VL client
│   │   └── docx_converter.py     # Markdown to DOCX
│   ├── schemas/
│   │   ├── image.py
│   │   └── convert.py
│   └── main.py
├── models/                   # Pre-downloaded models (git-ignored)
├── Dockerfile
├── docker-compose.yml
├── pyproject.toml
└── README.md

Processing Pipeline

Image OCR Flow

  1. Input: Accept image_url or image_base64
  2. Preprocessing: Add 30% whitespace padding using OpenCV
  3. Layout Detection: DocLayout-YOLO detects regions (text, formula, table, figure)
  4. Recognition:
    • If plain text detected → PP-DocLayoutV2 for mixed content recognition
    • Otherwise → PaddleOCR-VL with formula prompt
  5. Output Conversion: Generate LaTeX, Markdown, and MathML
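The branching in step 4 can be sketched in a few lines. This is an illustration, not the project's code: it assumes regions are dictionaries with a `type` key as in the sample response. `"mixed_recognition"` comes from that response, while `"formula_recognition"` is a placeholder label for the other branch.

```python
def summarize_layout(regions: list) -> dict:
    """Derive the has_plain_text / has_formula flags from detected region types."""
    types = {region["type"] for region in regions}
    return {
        "regions": regions,
        "has_plain_text": "text" in types,
        "has_formula": "formula" in types,
    }


def choose_recognition_mode(layout_info: dict) -> str:
    """Pick the recognition branch described in step 4.

    "formula_recognition" is a hypothetical name; only "mixed_recognition"
    appears in the documented response.
    """
    if layout_info["has_plain_text"]:
        return "mixed_recognition"
    return "formula_recognition"


if __name__ == "__main__":
    layout = summarize_layout(
        [{"type": "text", "bbox": [10, 20, 100, 50], "confidence": 0.95}]
    )
    print(choose_recognition_mode(layout))
```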

Hardware Requirements

  • Minimum: 8GB GPU VRAM
  • Recommended: RTX 5080 16GB or equivalent
  • CPU: 4+ cores
  • RAM: 16GB+

License

MIT
