Files
liuyuanchuang 874fd383cc init repo
2025-12-29 17:34:58 +08:00

4.0 KiB

Context

This is the initial implementation of the DocProcesser service. The system integrates multiple external models and services:

  • DocLayout-YOLO for document layout analysis
  • PaddleOCR-VL with PP-DocLayoutV2 for text and formula recognition (deployed via vLLM)
  • markdown_2_docx for document conversion

Target deployment: Ubuntu machine with RTX 5080 GPU (16GB VRAM), Python 3.11.0.

Goals / Non-Goals

Goals:

  • Clean FastAPI project structure following best practices
  • Image preprocessing with OpenCV (30% padding)
  • Layout-aware OCR routing using DocLayout-YOLO
  • Text and formula recognition via PaddleOCR-VL
  • Markdown to DOCX conversion
  • GPU-enabled Docker deployment

Non-Goals:

  • Authentication/authorization (can be added later)
  • Rate limiting
  • Persistent storage
  • Training or fine-tuning models

Decisions

Project Structure

Follow FastAPI best practices with modular organization:

app/
├── api/
│   └── v1/
│       ├── endpoints/
│       │   ├── image.py      # Image OCR endpoint
│       │   └── convert.py    # Markdown to DOCX endpoint
│       └── router.py
├── core/
│   └── config.py             # Settings and environment config
|—— model/
|   |—— DocLayout
|   |—— PP-DocLayout
├── services/
│   ├── image_processor.py    # OpenCV preprocessing
│   ├── layout_detector.py    # DocLayout-YOLO wrapper
│   ├── ocr_service.py        # PaddleOCR-VL client
│   └── docx_converter.py     # markdown_2_docx wrapper
├── schemas/
│   ├── image.py              # Request/response models for image OCR
│   └── convert.py            # Request/response models for conversion
└── main.py                   # FastAPI app initialization

Rationale: Separation of concerns between API layer, business logic (services), and data models (schemas).

Image Preprocessing

  • Use OpenCV cv2.copyMakeBorder() to add 30% whitespace padding
  • Padding color: white [255, 255, 255]
  • This matches DocLayout-YOLO's demo.py pattern

Layout Detection Flow

  1. DocLayout-YOLO detects layout regions (plain text, formulas, tables, figures)
  2. Exsit plain text, routes to PaddleOCR-VL with PP-DocLayoutV2, othewise routes to PaddleOCR-VL with prompt
  3. PaddleOCR-VL combined PP-DocLayoutV2 handles mixed content recognition internally, PaddleOCR-VL combined prompt handles formula

External Service Integration

  • PaddleOCR-VL: Connect to vLLM server at configurable URL (default: http://localhost:8080/v1)
  • DocLayout-YOLO: Load model from pre-downloaded path (not downloaded in container)

Docker Strategy

  • Base image: NVIDIA CUDA with Python 3.11
  • Pre-install OpenCV dependencies (libgl1-mesa-glx, libglib2.0-0)
  • Mount model directory for DocLayout-YOLO weights
  • Expose port 8053
  • Use Uvicorn with multiple workers

Risks / Trade-offs

Risk Mitigation
PaddleOCR-VL service unavailable Health check endpoint, retry logic with exponential backoff
Large image memory consumption Configure max image size, resize before processing
DocLayout-YOLO model loading time Load model once at startup, keep in memory
GPU memory contention DocLayout-YOLO uses GPU; PaddleOCR-VL runs on separate vLLM server

Configuration

Environment variables:

  • PADDLEOCR_VL_URL: vLLM server URL (default: http://localhost:8000/v1)
  • DOCLAYOUT_MODEL_PATH: Path to DocLayout-YOLO weights
  • PP_DOCLAYOUT_MODEL_DIR: Path to PP-DocLayoutV3 model directory
  • MAX_IMAGE_SIZE_MB: Maximum upload size (default: 10)

Open Questions

  • Should we add async queue for large batch processing? (Defer to future change)
  • Do we need WebSocket for progress updates? (Defer to future change)