Files
doc_processer/README.md

200 lines
4.9 KiB
Markdown
Raw Permalink Normal View History

2025-12-29 17:34:58 +08:00
# DocProcesser
Document processing API built with FastAPI. Converts images to LaTeX/Markdown/MathML and Markdown to DOCX.
## Features
- **Image OCR API** (`POST /doc_process/v1/image/ocr`)
- Accept images via URL or base64
- Automatic layout detection using DocLayout-YOLO
- Text and formula recognition via PaddleOCR-VL
- Output in LaTeX, Markdown, and MathML formats
- **Markdown to DOCX API** (`POST /doc_process/v1/convert/docx`)
- Convert markdown content to Word documents
- Preserve formatting, tables, and code blocks
## Prerequisites
- Python 3.11+
- NVIDIA GPU with CUDA support (RTX 5080 recommended)
- PaddleOCR-VL service running via vLLM (default: `http://localhost:8080/v1`)
- Pre-downloaded models:
- DocLayout-YOLO
- PP-DocLayoutV2
## Quick Start
### 1. Install Dependencies
Using [uv](https://github.com/astral-sh/uv):
```bash
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
uv pip install -e .
```
### 2. Download Models
Download the required models and place them in the `models/` directory:
```bash
mkdir -p models/DocLayout models/PP-DocLayout
# DocLayout-YOLO (from HuggingFace)
# https://huggingface.co/juliozhao/DocLayout-YOLO-DocStructBench
# Place the .pt file in models/DocLayout/
# PP-DocLayoutV2 (from PaddlePaddle)
# Place the model files in models/PP-DocLayout/
```
### 3. Configure Environment
Create a `.env` file:
```bash
# PaddleOCR-VL vLLM server URL
PADDLEOCR_VL_URL=http://localhost:8080/v1
# Model paths
DOCLAYOUT_MODEL_PATH=models/DocLayout/doclayout_yolo_docstructbench_imgsz1024.pt
PP_DOCLAYOUT_MODEL_DIR=models/PP-DocLayout
# Server settings
HOST=0.0.0.0
PORT=8053
```
### 4. Run the Server
```bash
uvicorn app.main:app --host 0.0.0.0 --port 8053
```
## Docker Deployment
### Build and Run with GPU
```bash
# Build the image
docker build -t doc-processer .
# Run with GPU support
docker run --gpus all -p 8053:8053 \
-v ./models/DocLayout:/app/models/DocLayout:ro \
-v ./models/PP-DocLayout:/app/models/PP-DocLayout:ro \
-e PADDLEOCR_VL_URL=http://host.docker.internal:8080/v1 \
doc-processer
```
### Using Docker Compose
```bash
# Start the service with GPU
docker-compose up -d doc-processer
# Or without GPU (CPU mode)
docker-compose --profile cpu up -d doc-processer-cpu
```
## API Usage
### Image OCR
```bash
# Using image URL
curl -X POST http://localhost:8053/doc_process/v1/image/ocr \
-H "Content-Type: application/json" \
-d '{"image_url": "https://example.com/document.png"}'
# Using base64 image
curl -X POST http://localhost:8053/doc_process/v1/image/ocr \
-H "Content-Type: application/json" \
-d '{"image_base64": "iVBORw0KGgo..."}'
```
Response:
```json
{
"latex": "\\section{Title}...",
"markdown": "# Title\n...",
"mathml": "<math>...</math>",
"layout_info": {
"regions": [
{"type": "text", "bbox": [10, 20, 100, 50], "confidence": 0.95}
],
"has_plain_text": true,
"has_formula": false
},
"recognition_mode": "mixed_recognition"
}
```
### Markdown to DOCX
```bash
curl -X POST http://localhost:8053/doc_process/v1/convert/docx \
-H "Content-Type: application/json" \
-d '{"markdown": "# Hello World\n\nThis is a test.", "filename": "output"}' \
--output output.docx
```
## Project Structure
```
doc_processer/
├── app/
│ ├── api/v1/
│ │ ├── endpoints/
│ │ │ ├── image.py # Image OCR endpoint
│ │ │ └── convert.py # Markdown to DOCX endpoint
│ │ └── router.py
│ ├── core/
│ │ ├── config.py # Settings
│ │ └── dependencies.py # DI providers
│ ├── services/
│ │ ├── image_processor.py # OpenCV preprocessing
│ │ ├── layout_detector.py # DocLayout-YOLO
│ │ ├── ocr_service.py # PaddleOCR-VL client
│ │ └── docx_converter.py # Markdown to DOCX
│ ├── schemas/
│ │ ├── image.py
│ │ └── convert.py
│ └── main.py
├── models/ # Pre-downloaded models (git-ignored)
├── Dockerfile
├── docker-compose.yml
├── pyproject.toml
└── README.md
```
## Processing Pipeline
### Image OCR Flow
1. **Input**: Accept `image_url` or `image_base64`
2. **Preprocessing**: Add 30% whitespace padding using OpenCV
3. **Layout Detection**: DocLayout-YOLO detects regions (text, formula, table, figure)
4. **Recognition**:
- If plain text detected → PP-DocLayoutV2 for mixed content recognition
- Otherwise → PaddleOCR-VL with formula prompt
5. **Output Conversion**: Generate LaTeX, Markdown, and MathML
## Hardware Requirements
- **Minimum**: 8GB GPU VRAM
- **Recommended**: RTX 5080 16GB or equivalent
- **CPU**: 4+ cores
- **RAM**: 16GB+
## License
MIT