doc_processer/openspec/changes/add-doc-processing-api/design.md at 874fd383ccdafbcdeb205c774b30a19fca242eda

Files

liuyuanchuang 874fd383cc init repo

2025-12-29 17:34:58 +08:00

4.0 KiB

Raw Blame History

Context

This is the initial implementation of the DocProcesser service. The system integrates multiple external models and services:

DocLayout-YOLO for document layout analysis
PaddleOCR-VL with PP-DocLayoutV2 for text and formula recognition (deployed via vLLM)
markdown_2_docx for document conversion

Target deployment: Ubuntu machine with RTX 5080 GPU (16GB VRAM), Python 3.11.0.

Goals / Non-Goals

Goals:

Clean FastAPI project structure following best practices
Image preprocessing with OpenCV (30% padding)
Layout-aware OCR routing using DocLayout-YOLO
Text and formula recognition via PaddleOCR-VL
Markdown to DOCX conversion
GPU-enabled Docker deployment

Non-Goals:

Authentication/authorization (can be added later)
Rate limiting
Persistent storage
Training or fine-tuning models

Decisions

Project Structure

Follow FastAPI best practices with modular organization:

app/
├── api/
│   └── v1/
│       ├── endpoints/
│       │   ├── image.py      # Image OCR endpoint
│       │   └── convert.py    # Markdown to DOCX endpoint
│       └── router.py
├── core/
│   └── config.py             # Settings and environment config
|—— model/
|   |—— DocLayout
|   |—— PP-DocLayout
├── services/
│   ├── image_processor.py    # OpenCV preprocessing
│   ├── layout_detector.py    # DocLayout-YOLO wrapper
│   ├── ocr_service.py        # PaddleOCR-VL client
│   └── docx_converter.py     # markdown_2_docx wrapper
├── schemas/
│   ├── image.py              # Request/response models for image OCR
│   └── convert.py            # Request/response models for conversion
└── main.py                   # FastAPI app initialization

Rationale: Separation of concerns between API layer, business logic (services), and data models (schemas).

Image Preprocessing

Use OpenCV cv2.copyMakeBorder() to add 30% whitespace padding
Padding color: white [255, 255, 255]
This matches DocLayout-YOLO's demo.py pattern

Layout Detection Flow

DocLayout-YOLO detects layout regions (plain text, formulas, tables, figures)
Exsit plain text, routes to PaddleOCR-VL with PP-DocLayoutV2, othewise routes to PaddleOCR-VL with prompt
PaddleOCR-VL combined PP-DocLayoutV2 handles mixed content recognition internally, PaddleOCR-VL combined prompt handles formula

External Service Integration

PaddleOCR-VL: Connect to vLLM server at configurable URL (default: http://localhost:8080/v1)
DocLayout-YOLO: Load model from pre-downloaded path (not downloaded in container)

Docker Strategy

Base image: NVIDIA CUDA with Python 3.11
Pre-install OpenCV dependencies (libgl1-mesa-glx, libglib2.0-0)
Mount model directory for DocLayout-YOLO weights
Expose port 8053
Use Uvicorn with multiple workers

Risks / Trade-offs

Risk	Mitigation
PaddleOCR-VL service unavailable	Health check endpoint, retry logic with exponential backoff
Large image memory consumption	Configure max image size, resize before processing
DocLayout-YOLO model loading time	Load model once at startup, keep in memory
GPU memory contention	DocLayout-YOLO uses GPU; PaddleOCR-VL runs on separate vLLM server

Configuration

Environment variables:

PADDLEOCR_VL_URL: vLLM server URL (default: http://localhost:8000/v1)
DOCLAYOUT_MODEL_PATH: Path to DocLayout-YOLO weights
PP_DOCLAYOUT_MODEL_DIR: Path to PP-DocLayoutV3 model directory
MAX_IMAGE_SIZE_MB: Maximum upload size (default: 10)

Open Questions

Should we add async queue for large batch processing? (Defer to future change)
Do we need WebSocket for progress updates? (Defer to future change)

4.0 KiB Raw Blame History