# DocProcesser

Document processing API built with FastAPI. Converts images to LaTeX/Markdown/MathML and Markdown to DOCX.

## Features

- **Image OCR API** (`POST /doc_process/v1/image/ocr`)
  - Accept images via URL or base64
  - Automatic layout detection using DocLayout-YOLO
  - Text and formula recognition via PaddleOCR-VL
  - Output in LaTeX, Markdown, and MathML formats

- **Markdown to DOCX API** (`POST /doc_process/v1/convert/docx`)
  - Convert Markdown content to Word documents
  - Preserve formatting, tables, and code blocks

## Prerequisites

- Python 3.11+
- NVIDIA GPU with CUDA support (RTX 5080 recommended)
- PaddleOCR-VL service running via vLLM (default: `http://localhost:8080/v1`)
- Pre-downloaded models:
  - DocLayout-YOLO
  - PP-DocLayoutV2

## Quick Start

### 1. Install Dependencies

Using [uv](https://github.com/astral-sh/uv):

```bash
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e .
```

### 2. Download Models

Download the required models and place them in the `models/` directory:

```bash
mkdir -p models/DocLayout models/PP-DocLayout

# DocLayout-YOLO (from HuggingFace)
# https://huggingface.co/juliozhao/DocLayout-YOLO-DocStructBench
# Place the .pt file in models/DocLayout/

# PP-DocLayoutV2 (from PaddlePaddle)
# Place the model files in models/PP-DocLayout/
```

### 3. Configure Environment

Create a `.env` file:

```bash
# PaddleOCR-VL vLLM server URL
PADDLEOCR_VL_URL=http://localhost:8080/v1

# Model paths
DOCLAYOUT_MODEL_PATH=models/DocLayout/doclayout_yolo_docstructbench_imgsz1024.pt
PP_DOCLAYOUT_MODEL_DIR=models/PP-DocLayout

# Server settings
HOST=0.0.0.0
PORT=8053
```
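
For illustration, the variables above could be read at startup roughly as follows. This is a minimal standard-library sketch, not the project's actual `app/core/config.py`: the `Settings` dataclass and `load_settings` helper are hypothetical names, and the code assumes the values are already present in the process environment (it does not parse the `.env` file itself).

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container mirroring the variables in the .env example."""

    paddleocr_vl_url: str
    doclayout_model_path: str
    pp_doclayout_model_dir: str
    host: str
    port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping, falling back to the defaults shown above."""
    return Settings(
        paddleocr_vl_url=env.get("PADDLEOCR_VL_URL", "http://localhost:8080/v1"),
        doclayout_model_path=env.get(
            "DOCLAYOUT_MODEL_PATH",
            "models/DocLayout/doclayout_yolo_docstructbench_imgsz1024.pt",
        ),
        pp_doclayout_model_dir=env.get("PP_DOCLAYOUT_MODEL_DIR", "models/PP-DocLayout"),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8053")),  # PORT arrives as a string
    )
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to exercise in tests, for example `load_settings({"PORT": "9000"})`.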

### 4. Run the Server

```bash
uvicorn app.main:app --host 0.0.0.0 --port 8053
```

## Docker Deployment

### Build and Run with GPU

```bash
# Build the image
docker build -t doc-processer .

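# Run with GPU support (bind mounts use absolute host paths)
# On Linux, also pass --add-host=host.docker.internal:host-gateway so the
# container can reach a vLLM server running on the host.
docker run --gpus all -p 8053:8053 \
  -v "$(pwd)/models/DocLayout:/app/models/DocLayout:ro" \
  -v "$(pwd)/models/PP-DocLayout:/app/models/PP-DocLayout:ro" \
  -e PADDLEOCR_VL_URL=http://host.docker.internal:8080/v1 \
  doc-processer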
```

### Using Docker Compose

```bash
# Start the service with GPU
docker-compose up -d doc-processer

# Or without GPU (CPU mode)
docker-compose --profile cpu up -d doc-processer-cpu
```

## API Usage

### Image OCR

```bash
# Using image URL
curl -X POST http://localhost:8053/doc_process/v1/image/ocr \
  -H "Content-Type: application/json" \
  -d '{"image_url": "https://example.com/document.png"}'

# Using base64 image
curl -X POST http://localhost:8053/doc_process/v1/image/ocr \
  -H "Content-Type: application/json" \
  -d '{"image_base64": "iVBORw0KGgo..."}'
```

Example response:

```json
{
  "latex": "\\section{Title}...",
  "markdown": "# Title\n...",
  "mathml": "<math>...</math>",
  "layout_info": {
    "regions": [
      {"type": "text", "bbox": [10, 20, 100, 50], "confidence": 0.95}
    ],
    "has_plain_text": true,
    "has_formula": false
  },
  "recognition_mode": "mixed_recognition"
}
```
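
The same request can be made from Python. The sketch below uses only the standard library and assumes the endpoint, request field, and response fields shown in the examples above; the helper names (`build_ocr_payload`, `ocr_image`, `summarize`) are hypothetical and not part of the project.

```python
import base64
import json
import urllib.request

OCR_URL = "http://localhost:8053/doc_process/v1/image/ocr"


def build_ocr_payload(image_bytes: bytes) -> bytes:
    """Encode raw image bytes as a JSON body with an `image_base64` field."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"image_base64": encoded}).encode("utf-8")


def ocr_image(image_bytes: bytes, url: str = OCR_URL, timeout: float = 60.0) -> dict:
    """POST the image to the OCR endpoint and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=build_ocr_payload(image_bytes),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize(result: dict) -> str:
    """Summarize the recognition mode and region count from a response dict."""
    regions = result.get("layout_info", {}).get("regions", [])
    return f"{result.get('recognition_mode')}: {len(regions)} region(s)"
```

With the server running, `summarize(ocr_image(open("document.png", "rb").read()))` would print a one-line overview of the result; `document.png` is a placeholder file name.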

### Markdown to DOCX

```bash
curl -X POST http://localhost:8053/doc_process/v1/convert/docx \
  -H "Content-Type: application/json" \
  -d '{"markdown": "# Hello World\n\nThis is a test.", "filename": "output"}' \
  --output output.docx
```
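
A Python equivalent of the `curl` call might look like the following. It is a standard-library sketch that assumes the endpoint returns the DOCX file as the raw response body, as the `--output` flag above suggests; `build_docx_request` and `convert_to_docx` are hypothetical helper names.

```python
import json
import urllib.request
from pathlib import Path

DOCX_URL = "http://localhost:8053/doc_process/v1/convert/docx"


def build_docx_request(
    markdown: str, filename: str, url: str = DOCX_URL
) -> urllib.request.Request:
    """Create the POST request carrying the Markdown source and target filename."""
    body = json.dumps({"markdown": markdown, "filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def convert_to_docx(
    markdown: str, filename: str, out_path: Path, timeout: float = 60.0
) -> Path:
    """Send the request and write the returned bytes to `out_path`."""
    request = build_docx_request(markdown, filename)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        out_path.write_bytes(response.read())
    return out_path
```

For example, `convert_to_docx("# Hello World\n\nThis is a test.", "output", Path("output.docx"))` mirrors the shell command above.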

## Project Structure

```text
doc_processer/
├── app/
│   ├── api/v1/
│   │   ├── endpoints/
│   │   │   ├── image.py      # Image OCR endpoint
│   │   │   └── convert.py    # Markdown to DOCX endpoint
│   │   └── router.py
│   ├── core/
│   │   ├── config.py         # Settings
│   │   └── dependencies.py   # DI providers
│   ├── services/
│   │   ├── image_processor.py    # OpenCV preprocessing
│   │   ├── layout_detector.py    # DocLayout-YOLO
│   │   ├── ocr_service.py        # PaddleOCR-VL client
│   │   └── docx_converter.py     # Markdown to DOCX
│   ├── schemas/
│   │   ├── image.py
│   │   └── convert.py
│   └── main.py
├── models/                   # Pre-downloaded models (git-ignored)
├── Dockerfile
├── docker-compose.yml
├── pyproject.toml
└── README.md
```

## Processing Pipeline

### Image OCR Flow

1. **Input**: Accept `image_url` or `image_base64`
2. **Preprocessing**: Add 30% whitespace padding using OpenCV
3. **Layout Detection**: DocLayout-YOLO detects regions (text, formula, table, figure)
4. **Recognition**:
   - If plain text is detected → PP-DocLayoutV2 for mixed-content recognition
   - Otherwise → PaddleOCR-VL with a formula prompt
5. **Output Conversion**: Generate LaTeX, Markdown, and MathML
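
Steps 2 and 4 of the flow above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's implementation: it assumes the 30% padding is added on every side relative to each dimension, that a region of type `"text"` is what counts as plain text, and that the non-mixed mode is called `"formula_recognition"` (a hypothetical name; only `"mixed_recognition"` appears in the example response).

```python
from typing import Iterable, Mapping, Tuple


def padded_size(width: int, height: int, ratio: float = 0.30) -> Tuple[int, int]:
    """Return the canvas size after padding each side by `ratio` of that dimension.

    Assumption: the README does not say what the 30% is relative to.
    """
    pad_x = round(width * ratio)
    pad_y = round(height * ratio)
    return width + 2 * pad_x, height + 2 * pad_y


def choose_recognition_mode(regions: Iterable[Mapping[str, object]]) -> str:
    """Pick the recognition branch from the detected layout regions."""
    has_plain_text = any(region.get("type") == "text" for region in regions)
    # "formula_recognition" is a hypothetical label for the formula-prompt branch.
    return "mixed_recognition" if has_plain_text else "formula_recognition"
```

Under these assumptions a 100×50 image becomes a 160×80 canvas, and any layout containing at least one text region takes the mixed-content branch.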

## Hardware Requirements

- **Minimum**: 8 GB GPU VRAM
- **Recommended**: RTX 5080 (16 GB) or equivalent
- **CPU**: 4+ cores
- **RAM**: 16 GB+

## License

MIT