init repo

2025-12-29 17:34:58 +08:00
commit 874fd383cc
36 changed files with 2641 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,199 @@
+# DocProcesser
+
+Document processing API built with FastAPI. Converts images to LaTeX/Markdown/MathML and Markdown to DOCX.
+
+## Features
+
+- **Image OCR API** (`POST /doc_process/v1/image/ocr`)
+  - Accept images via URL or base64
+  - Automatic layout detection using DocLayout-YOLO
+  - Text and formula recognition via PaddleOCR-VL
+  - Output in LaTeX, Markdown, and MathML formats
+
+- **Markdown to DOCX API** (`POST /doc_process/v1/convert/docx`)
+  - Convert Markdown content to Word documents
+  - Preserve formatting, tables, and code blocks
+
+## Prerequisites
+
+- Python 3.11+
+- NVIDIA GPU with CUDA support (RTX 5080 recommended)
+- PaddleOCR-VL service running via vLLM (default: `http://localhost:8080/v1`)
+- Pre-downloaded models:
+  - DocLayout-YOLO
+  - PP-DocLayoutV2
+
+## Quick Start
+
+### 1. Install Dependencies
+
+Using [uv](https://github.com/astral-sh/uv):
+
+```bash
+# Install uv if not already installed
+curl -LsSf https://astral.sh/uv/install.sh | sh
+
+# Create virtual environment and install dependencies
+uv venv
+source .venv/bin/activate  # On Windows: .venv\Scripts\activate
+uv pip install -e .
+```
+
+### 2. Download Models
+
+Download the required models and place them in the `models/` directory:
+
+```bash
+mkdir -p models/DocLayout models/PP-DocLayout
+
+# DocLayout-YOLO (from HuggingFace)
+# https://huggingface.co/juliozhao/DocLayout-YOLO-DocStructBench
+# Place the .pt file in models/DocLayout/
+
+# PP-DocLayoutV2 (from PaddlePaddle)
+# Place the model files in models/PP-DocLayout/
+```
+
+### 3. Configure Environment
+
+Create a `.env` file:
+
+```bash
+# PaddleOCR-VL vLLM server URL
+PADDLEOCR_VL_URL=http://localhost:8080/v1
+
+# Model paths
+DOCLAYOUT_MODEL_PATH=models/DocLayout/doclayout_yolo_docstructbench_imgsz1024.pt
+PP_DOCLAYOUT_MODEL_DIR=models/PP-DocLayout
+
+# Server settings
+HOST=0.0.0.0
+PORT=8053
+```
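+
+For reference, here is a minimal sketch of how these variables could be read at startup. It uses only the standard library and is not the contents of `app/core/config.py`; the `Settings` class and `load_settings` helper are hypothetical names, and every default except the PaddleOCR-VL URL is simply copied from the example `.env` above.
+
+```python
+import os
+from dataclasses import dataclass
+from typing import Mapping, Optional
+
+
+@dataclass(frozen=True)
+class Settings:
+    """Hypothetical settings container mirroring the .env keys above."""
+
+    paddleocr_vl_url: str
+    doclayout_model_path: str
+    pp_doclayout_model_dir: str
+    host: str
+    port: int
+
+
+def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
+    """Build Settings from a mapping (defaults to the process environment)."""
+    env = os.environ if env is None else env
+    return Settings(
+        paddleocr_vl_url=env.get("PADDLEOCR_VL_URL", "http://localhost:8080/v1"),
+        # The fallbacks below are assumptions taken from the example .env.
+        doclayout_model_path=env.get(
+            "DOCLAYOUT_MODEL_PATH",
+            "models/DocLayout/doclayout_yolo_docstructbench_imgsz1024.pt",
+        ),
+        pp_doclayout_model_dir=env.get("PP_DOCLAYOUT_MODEL_DIR", "models/PP-DocLayout"),
+        host=env.get("HOST", "0.0.0.0"),
+        port=int(env.get("PORT", "8053")),  # env values are strings; cast to int
+    )
+```
+
+Passing a plain dict instead of `os.environ` makes the helper easy to exercise without touching the real environment.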
+
+### 4. Run the Server
+
+```bash
+uvicorn app.main:app --host 0.0.0.0 --port 8053
+```
+
+## Docker Deployment
+
+### Build and Run with GPU
+
+```bash
+# Build the image
+docker build -t doc-processer .
+
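+# Run with GPU support.
+# --add-host lets the container resolve host.docker.internal on Linux (Docker 20.10+).
+docker run --gpus all -p 8053:8053 \
+  --add-host=host.docker.internal:host-gateway \
+  -v "$(pwd)/models/DocLayout:/app/models/DocLayout:ro" \
+  -v "$(pwd)/models/PP-DocLayout:/app/models/PP-DocLayout:ro" \
+  -e PADDLEOCR_VL_URL=http://host.docker.internal:8080/v1 \
+  doc-processer
+```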
+
+### Using Docker Compose
+
+```bash
+# Start the service with GPU
+docker-compose up -d doc-processer
+
+# Or without GPU (CPU mode)
+docker-compose --profile cpu up -d doc-processer-cpu
+```
+
+## API Usage
+
+### Image OCR
+
+```bash
+# Using image URL
+curl -X POST http://localhost:8053/doc_process/v1/image/ocr \
+  -H "Content-Type: application/json" \
+  -d '{"image_url": "https://example.com/document.png"}'
+
+# Using base64 image
+curl -X POST http://localhost:8053/doc_process/v1/image/ocr \
+  -H "Content-Type: application/json" \
+  -d '{"image_base64": "iVBORw0KGgo..."}'
+```
+
+Response:
+
+```json
+{
+  "latex": "\\section{Title}...",
+  "markdown": "# Title\n...",
+  "mathml": "<math>...</math>",
+  "layout_info": {
+    "regions": [
+      {"type": "text", "bbox": [10, 20, 100, 50], "confidence": 0.95}
+    ],
+    "has_plain_text": true,
+    "has_formula": false
+  },
+  "recognition_mode": "mixed_recognition"
+}
+```
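+
+The base64 variant can also be driven from Python. The sketch below uses only the standard library and assumes the endpoint accepts plain base64 (no `data:` URI prefix), as the `iVBORw0KGgo...` placeholder in the curl example suggests. `build_ocr_payload` and `request_ocr` are hypothetical helper names, not part of this project.
+
+```python
+import base64
+import json
+import urllib.request
+
+# Endpoint taken from the curl examples above; adjust host/port for your deployment.
+OCR_URL = "http://localhost:8053/doc_process/v1/image/ocr"
+
+
+def build_ocr_payload(image_bytes: bytes) -> dict:
+    """Wrap raw image bytes in the JSON body used by the base64 curl example."""
+    return {"image_base64": base64.b64encode(image_bytes).decode("ascii")}
+
+
+def request_ocr(image_bytes: bytes, url: str = OCR_URL, timeout: float = 60.0) -> dict:
+    """POST the image and return the decoded JSON response."""
+    body = json.dumps(build_ocr_payload(image_bytes)).encode("utf-8")
+    req = urllib.request.Request(
+        url,
+        data=body,
+        headers={"Content-Type": "application/json"},
+        method="POST",
+    )
+    with urllib.request.urlopen(req, timeout=timeout) as resp:
+        return json.load(resp)
+```
+
+With the server running, `request_ocr(open("page.png", "rb").read())["markdown"]` would return the Markdown field of the response shown above (`page.png` is a placeholder file name).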
+
+### Markdown to DOCX
+
+```bash
+curl -X POST http://localhost:8053/doc_process/v1/convert/docx \
+  -H "Content-Type: application/json" \
+  -d '{"markdown": "# Hello World\n\nThis is a test.", "filename": "output"}' \
+  --output output.docx
+```
+
+## Project Structure
+
+```
+doc_processer/
+├── app/
+│   ├── api/v1/
+│   │   ├── endpoints/
+│   │   │   ├── image.py      # Image OCR endpoint
+│   │   │   └── convert.py    # Markdown to DOCX endpoint
+│   │   └── router.py
+│   ├── core/
+│   │   ├── config.py         # Settings
+│   │   └── dependencies.py   # DI providers
+│   ├── services/
+│   │   ├── image_processor.py    # OpenCV preprocessing
+│   │   ├── layout_detector.py    # DocLayout-YOLO
+│   │   ├── ocr_service.py        # PaddleOCR-VL client
+│   │   └── docx_converter.py     # Markdown to DOCX
+│   ├── schemas/
+│   │   ├── image.py
+│   │   └── convert.py
+│   └── main.py
+├── models/                   # Pre-downloaded models (git-ignored)
+├── Dockerfile
+├── docker-compose.yml
+├── pyproject.toml
+└── README.md
+```
+
+## Processing Pipeline
+
+### Image OCR Flow
+
+1. **Input**: Accept `image_url` or `image_base64`
+2. **Preprocessing**: Add 30% whitespace padding using OpenCV
+3. **Layout Detection**: DocLayout-YOLO detects regions (text, formula, table, figure)
+4. **Recognition**:
+   - If plain text detected → PP-DocLayoutV2 for mixed content recognition
+   - Otherwise → PaddleOCR-VL with formula prompt
+5. **Output Conversion**: Generate LaTeX, Markdown, and MathML
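+
+The branch in step 4 can be sketched as follows. This is an illustration of the rule described above, not the code in `app/services/`. The `Region` class, the label sets, and the `formula_recognition` mode name are assumptions; only `mixed_recognition`, `has_plain_text`, and `has_formula` appear in the sample response.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class Region:
+    """One detected layout region, shaped like an entry in layout_info.regions."""
+
+    type: str
+    bbox: tuple[int, int, int, int]
+    confidence: float
+
+
+# Label names are assumptions; the real detector labels may differ.
+PLAIN_TEXT_TYPES = {"text"}
+FORMULA_TYPES = {"formula"}
+
+
+def summarize_layout(regions: list[Region]) -> dict:
+    """Compute the two flags reported under layout_info."""
+    return {
+        "has_plain_text": any(r.type in PLAIN_TEXT_TYPES for r in regions),
+        "has_formula": any(r.type in FORMULA_TYPES for r in regions),
+    }
+
+
+def choose_recognition_mode(regions: list[Region]) -> str:
+    """Plain text present -> mixed recognition; otherwise the formula path."""
+    info = summarize_layout(regions)
+    # "formula_recognition" is a hypothetical name for the non-mixed branch.
+    return "mixed_recognition" if info["has_plain_text"] else "formula_recognition"
+```
+
+Note that an image with no detected regions falls through to the formula path under this rule, since it only checks for the presence of plain text.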
+
+## Hardware Requirements
+
+- **Minimum**: 8GB GPU VRAM
+- **Recommended**: RTX 5080 16GB or equivalent
+- **CPU**: 4+ cores
+- **RAM**: 16GB+
+
+## License
+
+MIT
+