200 lines
4.9 KiB
Markdown
200 lines
4.9 KiB
Markdown
|
|
# DocProcesser
|
||
|
|
|
||
|
|
Document processing API built with FastAPI. Converts images to LaTeX/Markdown/MathML and Markdown to DOCX.
|
||
|
|
|
||
|
|
## Features
|
||
|
|
|
||
|
|
- **Image OCR API** (`POST /doc_process/v1/image/ocr`)
|
||
|
|
- Accept images via URL or base64
|
||
|
|
- Automatic layout detection using DocLayout-YOLO
|
||
|
|
- Text and formula recognition via PaddleOCR-VL
|
||
|
|
- Output in LaTeX, Markdown, and MathML formats
|
||
|
|
|
||
|
|
- **Markdown to DOCX API** (`POST /doc_process/v1/convert/docx`)
|
||
|
|
- Convert markdown content to Word documents
|
||
|
|
- Preserve formatting, tables, and code blocks
|
||
|
|
|
||
|
|
## Prerequisites
|
||
|
|
|
||
|
|
- Python 3.11+
|
||
|
|
- NVIDIA GPU with CUDA support (RTX 5080 recommended)
|
||
|
|
- PaddleOCR-VL service running via vLLM (default: `http://localhost:8080/v1`)
|
||
|
|
- Pre-downloaded models:
|
||
|
|
- DocLayout-YOLO
|
||
|
|
- PP-DocLayoutV2
|
||
|
|
|
||
|
|
## Quick Start
|
||
|
|
|
||
|
|
### 1. Install Dependencies
|
||
|
|
|
||
|
|
Using [uv](https://github.com/astral-sh/uv):
|
||
|
|
|
||
|
|
```bash
|
||
|
|
# Install uv if not already installed
|
||
|
|
curl -LsSf https://astral.sh/uv/install.sh | sh
|
||
|
|
|
||
|
|
# Create virtual environment and install dependencies
|
||
|
|
uv venv
|
||
|
|
source .venv/bin/activate # On Windows: .venv\Scripts\activate
|
||
|
|
uv pip install -e .
|
||
|
|
```
|
||
|
|
|
||
|
|
### 2. Download Models
|
||
|
|
|
||
|
|
Download the required models and place them in the `models/` directory:
|
||
|
|
|
||
|
|
```bash
|
||
|
|
mkdir -p models/DocLayout models/PP-DocLayout
|
||
|
|
|
||
|
|
# DocLayout-YOLO (from HuggingFace)
|
||
|
|
# https://huggingface.co/juliozhao/DocLayout-YOLO-DocStructBench
|
||
|
|
# Place the .pt file in models/DocLayout/
|
||
|
|
|
||
|
|
# PP-DocLayoutV2 (from PaddlePaddle)
|
||
|
|
# Place the model files in models/PP-DocLayout/
|
||
|
|
```
|
||
|
|
|
||
|
|
### 3. Configure Environment
|
||
|
|
|
||
|
|
Create a `.env` file:
|
||
|
|
|
||
|
|
```bash
|
||
|
|
# PaddleOCR-VL vLLM server URL
|
||
|
|
PADDLEOCR_VL_URL=http://localhost:8080/v1
|
||
|
|
|
||
|
|
# Model paths
|
||
|
|
DOCLAYOUT_MODEL_PATH=models/DocLayout/doclayout_yolo_docstructbench_imgsz1024.pt
|
||
|
|
PP_DOCLAYOUT_MODEL_DIR=models/PP-DocLayout
|
||
|
|
|
||
|
|
# Server settings
|
||
|
|
HOST=0.0.0.0
|
||
|
|
PORT=8053
|
||
|
|
```
|
||
|
|
|
||
|
|
### 4. Run the Server
|
||
|
|
|
||
|
|
```bash
|
||
|
|
uvicorn app.main:app --host 0.0.0.0 --port 8053
|
||
|
|
```
|
||
|
|
|
||
|
|
## Docker Deployment
|
||
|
|
|
||
|
|
### Build and Run with GPU
|
||
|
|
|
||
|
|
```bash
|
||
|
|
# Build the image
|
||
|
|
docker build -t doc-processer .
|
||
|
|
|
||
|
|
# Run with GPU support
|
||
|
|
docker run --gpus all -p 8053:8053 \
|
||
|
|
-v ./models/DocLayout:/app/models/DocLayout:ro \
|
||
|
|
-v ./models/PP-DocLayout:/app/models/PP-DocLayout:ro \
|
||
|
|
-e PADDLEOCR_VL_URL=http://host.docker.internal:8080/v1 \
|
||
|
|
doc-processer
|
||
|
|
```
|
||
|
|
|
||
|
|
### Using Docker Compose
|
||
|
|
|
||
|
|
```bash
|
||
|
|
# Start the service with GPU
|
||
|
|
docker-compose up -d doc-processer
|
||
|
|
|
||
|
|
# Or without GPU (CPU mode)
|
||
|
|
docker-compose --profile cpu up -d doc-processer-cpu
|
||
|
|
```
|
||
|
|
|
||
|
|
## API Usage
|
||
|
|
|
||
|
|
### Image OCR
|
||
|
|
|
||
|
|
```bash
|
||
|
|
# Using image URL
|
||
|
|
curl -X POST http://localhost:8053/doc_process/v1/image/ocr \
|
||
|
|
-H "Content-Type: application/json" \
|
||
|
|
-d '{"image_url": "https://example.com/document.png"}'
|
||
|
|
|
||
|
|
# Using base64 image
|
||
|
|
curl -X POST http://localhost:8053/doc_process/v1/image/ocr \
|
||
|
|
-H "Content-Type: application/json" \
|
||
|
|
-d '{"image_base64": "iVBORw0KGgo..."}'
|
||
|
|
```
|
||
|
|
|
||
|
|
Response:
|
||
|
|
```json
|
||
|
|
{
|
||
|
|
"latex": "\\section{Title}...",
|
||
|
|
"markdown": "# Title\n...",
|
||
|
|
"mathml": "<math>...</math>",
|
||
|
|
"layout_info": {
|
||
|
|
"regions": [
|
||
|
|
{"type": "text", "bbox": [10, 20, 100, 50], "confidence": 0.95}
|
||
|
|
],
|
||
|
|
"has_plain_text": true,
|
||
|
|
"has_formula": false
|
||
|
|
},
|
||
|
|
"recognition_mode": "mixed_recognition"
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
### Markdown to DOCX
|
||
|
|
|
||
|
|
```bash
|
||
|
|
curl -X POST http://localhost:8053/doc_process/v1/convert/docx \
|
||
|
|
-H "Content-Type: application/json" \
|
||
|
|
-d '{"markdown": "# Hello World\n\nThis is a test.", "filename": "output"}' \
|
||
|
|
--output output.docx
|
||
|
|
```
|
||
|
|
|
||
|
|
## Project Structure
|
||
|
|
|
||
|
|
```
|
||
|
|
doc_processer/
|
||
|
|
├── app/
|
||
|
|
│ ├── api/v1/
|
||
|
|
│ │ ├── endpoints/
|
||
|
|
│ │ │ ├── image.py # Image OCR endpoint
|
||
|
|
│ │ │ └── convert.py # Markdown to DOCX endpoint
|
||
|
|
│ │ └── router.py
|
||
|
|
│ ├── core/
|
||
|
|
│ │ ├── config.py # Settings
|
||
|
|
│ │ └── dependencies.py # DI providers
|
||
|
|
│ ├── services/
|
||
|
|
│ │ ├── image_processor.py # OpenCV preprocessing
|
||
|
|
│ │ ├── layout_detector.py # DocLayout-YOLO
|
||
|
|
│ │ ├── ocr_service.py # PaddleOCR-VL client
|
||
|
|
│ │ └── docx_converter.py # Markdown to DOCX
|
||
|
|
│ ├── schemas/
|
||
|
|
│ │ ├── image.py
|
||
|
|
│ │ └── convert.py
|
||
|
|
│ └── main.py
|
||
|
|
├── models/ # Pre-downloaded models (git-ignored)
|
||
|
|
├── Dockerfile
|
||
|
|
├── docker-compose.yml
|
||
|
|
├── pyproject.toml
|
||
|
|
└── README.md
|
||
|
|
```
|
||
|
|
|
||
|
|
## Processing Pipeline
|
||
|
|
|
||
|
|
### Image OCR Flow
|
||
|
|
|
||
|
|
1. **Input**: Accept `image_url` or `image_base64`
|
||
|
|
2. **Preprocessing**: Add 30% whitespace padding using OpenCV
|
||
|
|
3. **Layout Detection**: DocLayout-YOLO detects regions (text, formula, table, figure)
|
||
|
|
4. **Recognition**:
|
||
|
|
- If plain text detected → PP-DocLayoutV2 for mixed content recognition
|
||
|
|
- Otherwise → PaddleOCR-VL with formula prompt
|
||
|
|
5. **Output Conversion**: Generate LaTeX, Markdown, and MathML
|
||
|
|
|
||
|
|
## Hardware Requirements
|
||
|
|
|
||
|
|
- **Minimum**: 8GB GPU VRAM
|
||
|
|
- **Recommended**: RTX 5080 16GB or equivalent
|
||
|
|
- **CPU**: 4+ cores
|
||
|
|
- **RAM**: 16GB+
|
||
|
|
|
||
|
|
## License
|
||
|
|
|
||
|
|
MIT
|
||
|
|
|