# DocProcesser

Document processing API built with FastAPI. Converts images to LaTeX/Markdown/MathML and Markdown to DOCX.

## Features

- **Image OCR API** (`POST /doc_process/v1/image/ocr`)
  - Accept images via URL or base64
  - Automatic layout detection using DocLayout-YOLO
  - Text and formula recognition via PaddleOCR-VL
  - Output in LaTeX, Markdown, and MathML formats

- **Markdown to DOCX API** (`POST /doc_process/v1/convert/docx`)
  - Convert markdown content to Word documents
  - Preserve formatting, tables, and code blocks

## Prerequisites

- Python 3.11+
- NVIDIA GPU with CUDA support (RTX 5080 recommended)
- PaddleOCR-VL service running via vLLM (default: `http://localhost:8080/v1`)
- Pre-downloaded models:
  - DocLayout-YOLO
  - PP-DocLayoutV2

## Quick Start

### 1. Install Dependencies

Using [uv](https://github.com/astral-sh/uv):

```bash
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e .
```

### 2. Download Models

Download the required models and place them in the `models/` directory:

```bash
mkdir -p models/DocLayout models/PP-DocLayout

# DocLayout-YOLO (from HuggingFace)
# https://huggingface.co/juliozhao/DocLayout-YOLO-DocStructBench
# Place the .pt file in models/DocLayout/

# PP-DocLayoutV2 (from PaddlePaddle)
# Place the model files in models/PP-DocLayout/
```
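
Before starting the server, it can help to confirm the files landed where the configuration expects them. The following is a minimal sketch, not part of the project: `missing_model_paths` is a hypothetical helper, and the expected paths are copied from the `.env` example in this README.

```python
from pathlib import Path

# Paths mirror the .env example in this README; adjust them if yours differ.
EXPECTED_MODEL_PATHS = [
    Path("models/DocLayout/doclayout_yolo_docstructbench_imgsz1024.pt"),
    Path("models/PP-DocLayout"),
]


def missing_model_paths(root: Path = Path("."), expected=EXPECTED_MODEL_PATHS) -> list[Path]:
    """Return the expected model paths that are absent under `root`.

    A directory counts as missing when it does not exist or is empty.
    """
    missing = []
    for rel in expected:
        path = root / rel
        if path.is_dir():
            if not any(path.iterdir()):
                missing.append(rel)
        elif not path.is_file():
            missing.append(rel)
    return missing


if __name__ == "__main__":
    for rel in missing_model_paths():
        print(f"missing: {rel}")
```

Run it from the repository root; no output means both model locations are populated.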

### 3. Configure Environment

Create a `.env` file:

```bash
# PaddleOCR-VL vLLM server URL
PADDLEOCR_VL_URL=http://localhost:8080/v1

# Model paths
DOCLAYOUT_MODEL_PATH=models/DocLayout/doclayout_yolo_docstructbench_imgsz1024.pt
PP_DOCLAYOUT_MODEL_DIR=models/PP-DocLayout

# Server settings
HOST=0.0.0.0
PORT=8053
```
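
To illustrate how these variables might be consumed, here is a standard-library sketch. It is an assumption, not the contents of `app/core/config.py`: the `Settings` class and `load_settings` function are hypothetical, the fallback values are simply the example values shown above, and the sketch reads the process environment only (it does not parse the `.env` file itself).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings holder; field names mirror the .env keys."""
    paddleocr_vl_url: str
    doclayout_model_path: str
    pp_doclayout_model_dir: str
    host: str
    port: int


def load_settings(env=None) -> Settings:
    """Build Settings from an environment mapping, using the README's example values as fallbacks."""
    env = os.environ if env is None else env
    return Settings(
        paddleocr_vl_url=env.get("PADDLEOCR_VL_URL", "http://localhost:8080/v1"),
        doclayout_model_path=env.get(
            "DOCLAYOUT_MODEL_PATH",
            "models/DocLayout/doclayout_yolo_docstructbench_imgsz1024.pt",
        ),
        pp_doclayout_model_dir=env.get("PP_DOCLAYOUT_MODEL_DIR", "models/PP-DocLayout"),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8053")),  # PORT arrives as a string
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to exercise in tests.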

### 4. Run the Server

```bash
uvicorn app.main:app --host 0.0.0.0 --port 8053
```

## Docker Deployment

### Build and Run with GPU

```bash
# Build the image
docker build -t doc-processer .

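# Run with GPU support
# --add-host lets the container resolve host.docker.internal on a Linux host;
# Docker Desktop (macOS/Windows) provides that name without it.
# Absolute paths via $(pwd) keep the bind mounts working on older Docker versions.
docker run --gpus all -p 8053:8053 \
  --add-host=host.docker.internal:host-gateway \
  -v "$(pwd)/models/DocLayout:/app/models/DocLayout:ro" \
  -v "$(pwd)/models/PP-DocLayout:/app/models/PP-DocLayout:ro" \
  -e PADDLEOCR_VL_URL=http://host.docker.internal:8080/v1 \
  doc-processer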
```

### Using Docker Compose

```bash
# Start the service with GPU
docker-compose up -d doc-processer

# Or without GPU (CPU mode)
docker-compose --profile cpu up -d doc-processer-cpu
```

## API Usage

### Image OCR

```bash
# Using image URL
curl -X POST http://localhost:8053/doc_process/v1/image/ocr \
  -H "Content-Type: application/json" \
  -d '{"image_url": "https://example.com/document.png"}'

# Using base64 image
curl -X POST http://localhost:8053/doc_process/v1/image/ocr \
  -H "Content-Type: application/json" \
  -d '{"image_base64": "iVBORw0KGgo..."}'
```

Example response:

```json
{
  "latex": "\\section{Title}...",
  "markdown": "# Title\n...",
  "mathml": "<math>...</math>",
  "layout_info": {
    "regions": [
      {"type": "text", "bbox": [10, 20, 100, 50], "confidence": 0.95}
    ],
    "has_plain_text": true,
    "has_formula": false
  },
  "recognition_mode": "mixed_recognition"
}
```
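
The base64 variant of the request can also be sent from Python. This is a sketch using only the standard library; `build_ocr_request` and `ocr_image` are hypothetical helper names, and the payload is assumed to be raw base64 without a `data:` URI prefix, as in the curl example above.

```python
import base64
import json
import urllib.request

OCR_URL = "http://localhost:8053/doc_process/v1/image/ocr"


def build_ocr_request(image_bytes: bytes, url: str = OCR_URL) -> urllib.request.Request:
    """Wrap raw image bytes in the JSON body shown in the curl example."""
    payload = {"image_base64": base64.b64encode(image_bytes).decode("ascii")}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ocr_image(image_bytes: bytes, url: str = OCR_URL, timeout: float = 60.0) -> dict:
    """POST the image and return the decoded JSON response."""
    request = build_ocr_request(image_bytes, url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

With the server running, `ocr_image(open("page.png", "rb").read())["markdown"]` would return the Markdown field of the response; `page.png` is a placeholder file name.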

### Markdown to DOCX

```bash
curl -X POST http://localhost:8053/doc_process/v1/convert/docx \
  -H "Content-Type: application/json" \
  -d '{"markdown": "# Hello World\n\nThis is a test.", "filename": "output"}' \
  --output output.docx
```
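
The same conversion can be requested from Python. This is a standard-library sketch; `build_docx_request` and `markdown_to_docx` are hypothetical helper names, and the response body is assumed to be the raw DOCX bytes, as the `--output` flag in the curl example suggests.

```python
import json
import urllib.request

DOCX_URL = "http://localhost:8053/doc_process/v1/convert/docx"


def build_docx_request(markdown: str, filename: str = "output",
                       url: str = DOCX_URL) -> urllib.request.Request:
    """Build the POST request with the same JSON fields as the curl example."""
    payload = {"markdown": markdown, "filename": filename}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def markdown_to_docx(markdown: str, out_path: str, filename: str = "output",
                     url: str = DOCX_URL, timeout: float = 60.0) -> None:
    """Send the Markdown and write the returned bytes to `out_path`."""
    request = build_docx_request(markdown, filename, url)
    with urllib.request.urlopen(request, timeout=timeout) as resp, open(out_path, "wb") as fh:
        fh.write(resp.read())
```

For example, `markdown_to_docx("# Hello World\n\nThis is a test.", "output.docx")` mirrors the curl call.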

## Project Structure

```
doc_processer/
├── app/
│   ├── api/v1/
│   │   ├── endpoints/
│   │   │   ├── image.py      # Image OCR endpoint
│   │   │   └── convert.py    # Markdown to DOCX endpoint
│   │   └── router.py
│   ├── core/
│   │   ├── config.py         # Settings
│   │   └── dependencies.py   # DI providers
│   ├── services/
│   │   ├── image_processor.py    # OpenCV preprocessing
│   │   ├── layout_detector.py    # DocLayout-YOLO
│   │   ├── ocr_service.py        # PaddleOCR-VL client
│   │   └── docx_converter.py     # Markdown to DOCX
│   ├── schemas/
│   │   ├── image.py
│   │   └── convert.py
│   └── main.py
├── models/                   # Pre-downloaded models (git-ignored)
├── Dockerfile
├── docker-compose.yml
├── pyproject.toml
└── README.md
```

## Processing Pipeline

### Image OCR Flow

1. **Input**: Accept `image_url` or `image_base64`
2. **Preprocessing**: Add 30% whitespace padding using OpenCV
3. **Layout Detection**: DocLayout-YOLO detects regions (text, formula, table, figure)
4. **Recognition**:
   - If plain text is detected → PP-DocLayoutV2 for mixed-content recognition
   - Otherwise → PaddleOCR-VL with a formula prompt
5. **Output Conversion**: Generate LaTeX, Markdown, and MathML
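
Steps 2 and 4 can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `app/services/`: the names `Region`, `padded_size`, `summarize_layout`, and `choose_recognition_mode` are hypothetical; "30% padding" is interpreted as a border on each side equal to 30% of that dimension; the region labels `text` and `formula` are assumed; and `formula_recognition` is an invented mode name (only `mixed_recognition` appears in the example response).

```python
from dataclasses import dataclass

# Assumed region labels; the labels DocLayout-YOLO actually emits may differ.
TEXT_TYPES = {"text"}
FORMULA_TYPES = {"formula"}


@dataclass
class Region:
    type: str
    bbox: tuple[int, int, int, int]
    confidence: float


def padded_size(width: int, height: int, ratio: float = 0.30) -> tuple[int, int]:
    """Step 2 (assumed reading): add a border of `ratio` x dimension on every side."""
    pad_x = round(width * ratio)
    pad_y = round(height * ratio)
    return width + 2 * pad_x, height + 2 * pad_y


def summarize_layout(regions: list[Region]) -> dict:
    """Derive the two flags reported under `layout_info` in the response."""
    return {
        "has_plain_text": any(r.type in TEXT_TYPES for r in regions),
        "has_formula": any(r.type in FORMULA_TYPES for r in regions),
    }


def choose_recognition_mode(regions: list[Region]) -> str:
    """Step 4: mixed recognition when plain text is present, formula-only otherwise."""
    info = summarize_layout(regions)
    # "formula_recognition" is a hypothetical name for the fallback branch.
    return "mixed_recognition" if info["has_plain_text"] else "formula_recognition"
```

Keeping the routing decision in a small pure function makes it testable without loading any model.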

## Hardware Requirements

- **Minimum**: 8GB GPU VRAM
- **Recommended**: RTX 5080 16GB or equivalent
- **CPU**: 4+ cores
- **RAM**: 16GB+

## License

MIT