init repo

2025-12-29 17:34:58 +08:00
commit 874fd383cc
36 changed files with 2641 additions and 0 deletions
--- a/openspec/changes/add-doc-processing-api/design.md
+++ b/openspec/changes/add-doc-processing-api/design.md
@@ -0,0 +1,107 @@
+## Context
+
+This is the initial implementation of the DocProcesser service. The system integrates multiple external models and services:
+
+- DocLayout-YOLO for document layout analysis
+- PaddleOCR-VL with PP-DocLayoutV2 for text and formula recognition (deployed via vLLM)
+- markdown_2_docx for document conversion
+
+Target deployment: Ubuntu machine with RTX 5080 GPU (16GB VRAM), Python 3.11.0.
+
+## Goals / Non-Goals
+
+**Goals:**
+
+- Clean FastAPI project structure following best practices
+- Image preprocessing with OpenCV (30% padding)
+- Layout-aware OCR routing using DocLayout-YOLO
+- Text and formula recognition via PaddleOCR-VL
+- Markdown to DOCX conversion
+- GPU-enabled Docker deployment
+
+**Non-Goals:**
+
+- Authentication/authorization (can be added later)
+- Rate limiting
+- Persistent storage
+- Training or fine-tuning models
+
+## Decisions
+
+### Project Structure
+
+Follow FastAPI best practices with modular organization:
+
+```
+app/
+├── api/
+│   └── v1/
+│       ├── endpoints/
+│       │   ├── image.py      # Image OCR endpoint
+│       │   └── convert.py    # Markdown to DOCX endpoint
+│       └── router.py
+├── core/
+│   └── config.py             # Settings and environment config
+├── model/
+│   ├── DocLayout
+│   └── PP-DocLayout
+├── services/
+│   ├── image_processor.py    # OpenCV preprocessing
+│   ├── layout_detector.py    # DocLayout-YOLO wrapper
+│   ├── ocr_service.py        # PaddleOCR-VL client
+│   └── docx_converter.py     # markdown_2_docx wrapper
+├── schemas/
+│   ├── image.py              # Request/response models for image OCR
+│   └── convert.py            # Request/response models for conversion
+└── main.py                   # FastAPI app initialization
+```
+
+**Rationale:** Separation of concerns between API layer, business logic (services), and data models (schemas).
+
+### Image Preprocessing
+
+- Use OpenCV `cv2.copyMakeBorder()` to add 30% whitespace padding
+- Padding color: white `[255, 255, 255]`
+- This matches DocLayout-YOLO's demo.py pattern
+
+### Layout Detection Flow
+
+1. DocLayout-YOLO detects layout regions (plain text, formulas, tables, figures)
+2. If plain text exists, route to PaddleOCR-VL with PP-DocLayoutV2; otherwise, route to PaddleOCR-VL with a prompt
+3. PaddleOCR-VL combined with PP-DocLayoutV2 handles mixed-content recognition internally; PaddleOCR-VL with a prompt handles formula recognition
+
+### External Service Integration
+
+- PaddleOCR-VL: Connect to vLLM server at configurable URL (default: `http://localhost:8000/v1`)
+- DocLayout-YOLO: Load model from pre-downloaded path (not downloaded in container)
+
+### Docker Strategy
+
+- Base image: NVIDIA CUDA with Python 3.11
+- Pre-install OpenCV dependencies (`libgl1-mesa-glx`, `libglib2.0-0`)
+- Mount model directory for DocLayout-YOLO weights
+- Expose port 8053
+- Use Uvicorn with multiple workers
+
+## Risks / Trade-offs
+
+| Risk                              | Mitigation                                                         |
+| --------------------------------- | ------------------------------------------------------------------ |
+| PaddleOCR-VL service unavailable  | Health check endpoint, retry logic with exponential backoff        |
+| Large image memory consumption    | Configure max image size, resize before processing                 |
+| DocLayout-YOLO model loading time | Load model once at startup, keep in memory                         |
+| GPU memory contention             | DocLayout-YOLO uses GPU; PaddleOCR-VL runs on separate vLLM server |
+
+## Configuration
+
+Environment variables:
+
+- `PADDLEOCR_VL_URL`: vLLM server URL (default: `http://localhost:8000/v1`)
+- `DOCLAYOUT_MODEL_PATH`: Path to DocLayout-YOLO weights
+- `PP_DOCLAYOUT_MODEL_DIR`: Path to PP-DocLayoutV2 model directory
+- `MAX_IMAGE_SIZE_MB`: Maximum upload size (default: 10)
+
+## Open Questions
+
+- Should we add an async queue for large batch processing? (Defer to future change)
+- Do we need WebSocket for progress updates? (Defer to future change)
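
As a rough illustration of the Configuration section in the design.md diff above, the environment variables could be loaded along the following lines. This is a minimal sketch using only the standard library, not the code from this commit: the `Settings` class, its field names, and the `load_settings` helper are hypothetical, and the real `app/core/config.py` may well use a different mechanism. Only the variable names and the two stated defaults come from the design document.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container; field names are illustrative only."""

    paddleocr_vl_url: str
    doclayout_model_path: Optional[str]
    pp_doclayout_model_dir: Optional[str]
    max_image_size_mb: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping of environment variables.

    Defaults follow the Configuration section; the two model paths have
    no documented default, so they stay None when unset.
    """
    raw_size = env.get("MAX_IMAGE_SIZE_MB", "10")
    try:
        max_size = int(raw_size)
    except ValueError:
        raise ValueError(
            f"MAX_IMAGE_SIZE_MB must be an integer, got {raw_size!r}"
        ) from None
    if max_size <= 0:
        raise ValueError("MAX_IMAGE_SIZE_MB must be positive")

    return Settings(
        paddleocr_vl_url=env.get("PADDLEOCR_VL_URL", "http://localhost:8000/v1"),
        doclayout_model_path=env.get("DOCLAYOUT_MODEL_PATH"),
        pp_doclayout_model_dir=env.get("PP_DOCLAYOUT_MODEL_DIR"),
        max_image_size_mb=max_size,
    )
```

Passing the mapping as an argument, rather than reading `os.environ` directly inside the function, keeps the loader easy to exercise with a plain dictionary. Rejecting a non-numeric or non-positive `MAX_IMAGE_SIZE_MB` at startup is an added assumption, not something the design document specifies.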