init repo
199
README.md
Normal file
@@ -0,0 +1,199 @@
# DocProcesser

Document processing API built with FastAPI. Converts images to LaTeX/Markdown/MathML and Markdown to DOCX.

## Features

- **Image OCR API** (`POST /doc_process/v1/image/ocr`)
  - Accept images via URL or base64
  - Automatic layout detection using DocLayout-YOLO
  - Text and formula recognition via PaddleOCR-VL
  - Output in LaTeX, Markdown, and MathML formats

- **Markdown to DOCX API** (`POST /doc_process/v1/convert/docx`)
  - Convert Markdown content to Word documents
  - Preserve formatting, tables, and code blocks

## Prerequisites

- Python 3.11+
- NVIDIA GPU with CUDA support (RTX 5080 recommended)
- PaddleOCR-VL service running via vLLM (default: `http://localhost:8080/v1`)
- Pre-downloaded models:
  - DocLayout-YOLO
  - PP-DocLayoutV2

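Before starting the API it can help to confirm that the PaddleOCR-VL service is reachable. The sketch below assumes the vLLM server exposes its usual OpenAI-compatible model listing at `<base URL>/models`; the helper names `models_endpoint` and `list_served_models` are hypothetical and not part of this project.

```python
import json
import urllib.request


def models_endpoint(base_url: str) -> str:
    """Join the base URL (e.g. http://localhost:8080/v1) with the model-listing path."""
    return base_url.rstrip("/") + "/models"


def list_served_models(base_url: str = "http://localhost:8080/v1",
                       timeout: float = 5.0) -> list[str]:
    """Return the model IDs reported by the server; raises URLError if it is unreachable."""
    with urllib.request.urlopen(models_endpoint(base_url), timeout=timeout) as resp:
        body = json.load(resp)
    # OpenAI-compatible servers answer with {"object": "list", "data": [{"id": ...}, ...]}
    return [model["id"] for model in body.get("data", [])]
```

Calling `list_served_models()` on a machine where the service is running should return a non-empty list; a connection error suggests the service is down or the URL is wrong.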
## Quick Start

### 1. Install Dependencies

Using [uv](https://github.com/astral-sh/uv):

```bash
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e .
```

### 2. Download Models

Download the required models and place them in the `models/` directory:

```bash
mkdir -p models/DocLayout models/PP-DocLayout

# DocLayout-YOLO (from HuggingFace)
# https://huggingface.co/juliozhao/DocLayout-YOLO-DocStructBench
# Place the .pt file in models/DocLayout/

# PP-DocLayoutV2 (from PaddlePaddle)
# Place the model files in models/PP-DocLayout/
```

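A quick way to confirm the models landed in the right place is a small check script. This is a minimal sketch: `check_models` is a hypothetical helper, and it only assumes what the comments above state, namely a `.pt` file in `models/DocLayout/` and at least one file in `models/PP-DocLayout/`.

```python
from pathlib import Path


def check_models(root: Path) -> list[str]:
    """Return a list of problems; an empty list means the expected layout is present."""
    problems = []
    doclayout = root / "DocLayout"
    pp_doclayout = root / "PP-DocLayout"

    # DocLayout-YOLO is distributed as a PyTorch checkpoint (.pt).
    if not doclayout.is_dir() or not any(doclayout.glob("*.pt")):
        problems.append(f"no .pt checkpoint found in {doclayout}")

    # PP-DocLayoutV2 is a directory of model files; only check that it is non-empty.
    if not pp_doclayout.is_dir() or not any(pp_doclayout.iterdir()):
        problems.append(f"no model files found in {pp_doclayout}")

    return problems
```

Running `check_models(Path("models"))` from the repository root returns an empty list once both models are in place.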
### 3. Configure Environment

Create a `.env` file:

```bash
# PaddleOCR-VL vLLM server URL
PADDLEOCR_VL_URL=http://localhost:8080/v1

# Model paths
DOCLAYOUT_MODEL_PATH=models/DocLayout/doclayout_yolo_docstructbench_imgsz1024.pt
PP_DOCLAYOUT_MODEL_DIR=models/PP-DocLayout

# Server settings
HOST=0.0.0.0
PORT=8053
```

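To illustrate how these variables might be consumed, here is a standard-library sketch. It is not the project's actual `config.py`: the `Settings` class and `load_settings` function are hypothetical, and the fallback values are assumptions copied from the sample `.env` above. Note that it reads the process environment; the standard library does not load a `.env` file on its own.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    paddleocr_vl_url: str
    doclayout_model_path: str
    pp_doclayout_model_dir: str
    host: str
    port: int


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping of variables (defaults to os.environ)."""
    env = os.environ if env is None else env
    return Settings(
        paddleocr_vl_url=env.get("PADDLEOCR_VL_URL", "http://localhost:8080/v1"),
        doclayout_model_path=env.get(
            "DOCLAYOUT_MODEL_PATH",
            "models/DocLayout/doclayout_yolo_docstructbench_imgsz1024.pt",
        ),
        pp_doclayout_model_dir=env.get("PP_DOCLAYOUT_MODEL_DIR", "models/PP-DocLayout"),
        host=env.get("HOST", "0.0.0.0"),
        # Environment values are strings, so the port is converted explicitly.
        port=int(env.get("PORT", "8053")),
    )
```

Passing a plain dictionary, as in `load_settings({"PORT": "9000"})`, makes the loader easy to exercise without touching the real environment.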
### 4. Run the Server

```bash
uvicorn app.main:app --host 0.0.0.0 --port 8053
```

## Docker Deployment

### Build and Run with GPU

```bash
# Build the image
docker build -t doc-processer .

# Run with GPU support
# Absolute host paths are used for the mounts; older Docker versions reject relative ones.
# On Linux, host.docker.internal may also need: --add-host=host.docker.internal:host-gateway
docker run --gpus all -p 8053:8053 \
  -v "$(pwd)/models/DocLayout:/app/models/DocLayout:ro" \
  -v "$(pwd)/models/PP-DocLayout:/app/models/PP-DocLayout:ro" \
  -e PADDLEOCR_VL_URL=http://host.docker.internal:8080/v1 \
  doc-processer
```

### Using Docker Compose

```bash
# Start the service with GPU
docker-compose up -d doc-processer

# Or without GPU (CPU mode)
docker-compose --profile cpu up -d doc-processer-cpu
```

## API Usage

### Image OCR

```bash
# Using image URL
curl -X POST http://localhost:8053/doc_process/v1/image/ocr \
  -H "Content-Type: application/json" \
  -d '{"image_url": "https://example.com/document.png"}'

# Using base64 image
curl -X POST http://localhost:8053/doc_process/v1/image/ocr \
  -H "Content-Type: application/json" \
  -d '{"image_base64": "iVBORw0KGgo..."}'
```

Response:
```json
{
  "latex": "\\section{Title}...",
  "markdown": "# Title\n...",
  "mathml": "<math>...</math>",
  "layout_info": {
    "regions": [
      {"type": "text", "bbox": [10, 20, 100, 50], "confidence": 0.95}
    ],
    "has_plain_text": true,
    "has_formula": false
  },
  "recognition_mode": "mixed_recognition"
}
```

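The same call can be made from Python. The following is a sketch using only the standard library; `build_ocr_request`, `summarize`, and `ocr_image` are hypothetical helper names, and the code assumes the endpoint accepts plain base64 (no `data:` URI prefix), as the curl example suggests.

```python
import base64
import json
import urllib.request

OCR_URL = "http://localhost:8053/doc_process/v1/image/ocr"


def build_ocr_request(image_bytes: bytes, url: str = OCR_URL) -> urllib.request.Request:
    """Wrap raw image bytes in the JSON body used by the base64 curl example."""
    payload = {"image_base64": base64.b64encode(image_bytes).decode("ascii")}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(response: dict) -> str:
    """One-line summary built from fields shown in the sample response."""
    regions = response["layout_info"]["regions"]
    return f"{response['recognition_mode']}: {len(regions)} region(s)"


def ocr_image(path: str, timeout: float = 120.0) -> dict:
    """Read an image from disk, send it to the API, and return the parsed JSON."""
    with open(path, "rb") as f:
        request = build_ocr_request(f.read())
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

With the server running, `summarize(ocr_image("document.png"))` gives a compact view of what was detected; the timeout is generous because recognition runs on the GPU and may take a while.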
### Markdown to DOCX

```bash
curl -X POST http://localhost:8053/doc_process/v1/convert/docx \
  -H "Content-Type: application/json" \
  -d '{"markdown": "# Hello World\n\nThis is a test.", "filename": "output"}' \
  --output output.docx
```

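A Python equivalent of the curl call might look like the sketch below. The helper names `build_docx_request` and `convert_to_docx` are hypothetical, and the code assumes the response body is the DOCX file itself, which is what `--output output.docx` implies.

```python
import json
import urllib.request

DOCX_URL = "http://localhost:8053/doc_process/v1/convert/docx"


def build_docx_request(markdown: str, filename: str = "output",
                       url: str = DOCX_URL) -> urllib.request.Request:
    """Build the POST request; json.dumps takes care of escaping newlines."""
    payload = {"markdown": markdown, "filename": filename}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def convert_to_docx(markdown: str, out_path: str, timeout: float = 60.0) -> None:
    """Send the Markdown to the API and write the returned bytes to out_path."""
    with urllib.request.urlopen(build_docx_request(markdown), timeout=timeout) as resp:
        data = resp.read()
    with open(out_path, "wb") as f:
        f.write(data)
```

Unlike the shell example, where `\n` has to be written literally inside the JSON string, a Python caller can pass real newlines, for example `convert_to_docx("# Hello World\n\nThis is a test.", "output.docx")`.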
## Project Structure

```
doc_processer/
├── app/
│   ├── api/v1/
│   │   ├── endpoints/
│   │   │   ├── image.py          # Image OCR endpoint
│   │   │   └── convert.py        # Markdown to DOCX endpoint
│   │   └── router.py
│   ├── core/
│   │   ├── config.py             # Settings
│   │   └── dependencies.py       # DI providers
│   ├── services/
│   │   ├── image_processor.py    # OpenCV preprocessing
│   │   ├── layout_detector.py    # DocLayout-YOLO
│   │   ├── ocr_service.py        # PaddleOCR-VL client
│   │   └── docx_converter.py     # Markdown to DOCX
│   ├── schemas/
│   │   ├── image.py
│   │   └── convert.py
│   └── main.py
├── models/                       # Pre-downloaded models (git-ignored)
├── Dockerfile
├── docker-compose.yml
├── pyproject.toml
└── README.md
```

## Processing Pipeline

### Image OCR Flow

1. **Input**: Accept `image_url` or `image_base64`
2. **Preprocessing**: Add 30% whitespace padding using OpenCV
3. **Layout Detection**: DocLayout-YOLO detects regions (text, formula, table, figure)
4. **Recognition**:
   - If plain text is detected → PP-DocLayoutV2 for mixed content recognition
   - Otherwise → PaddleOCR-VL with formula prompt
5. **Output Conversion**: Generate LaTeX, Markdown, and MathML

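Steps 2 and 4 of the flow can be sketched in plain Python. This is an illustration under stated assumptions, not the service's code: "30% padding" is read here as 30% of each dimension added on every side, the region `type` values follow the sample API response, and `formula_recognition` is a hypothetical name for the non-mixed mode (only `mixed_recognition` appears in this README).

```python
def padding_for(width: int, height: int, ratio: float = 0.30) -> tuple[int, int]:
    """Per-side padding in pixels as (horizontal, vertical).

    Assumption: the ratio applies to each dimension and is added on both sides.
    """
    return round(width * ratio), round(height * ratio)


def choose_recognition_mode(regions: list[dict]) -> str:
    """Step 4: route on whether any detected region is plain text."""
    has_plain_text = any(region["type"] == "text" for region in regions)
    # "formula_recognition" is a hypothetical label for the formula-only branch.
    return "mixed_recognition" if has_plain_text else "formula_recognition"
```

In an OpenCV-based implementation, the values from `padding_for` would typically be passed to `cv2.copyMakeBorder` with a constant white border.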
## Hardware Requirements

- **Minimum**: 8 GB GPU VRAM
- **Recommended**: RTX 5080 (16 GB) or equivalent
- **CPU**: 4+ cores
- **RAM**: 16 GB+

## License

MIT