## Context

This is the initial implementation of the DocProcesser service. The system integrates multiple external models and services:

- DocLayout-YOLO for document layout analysis
- PaddleOCR-VL with PP-DocLayoutV2 for text and formula recognition (deployed via vLLM)
- markdown_2_docx for document conversion

Target deployment: Ubuntu machine with RTX 5080 GPU (16GB VRAM), Python 3.11.0.

## Goals / Non-Goals

**Goals:**

- Clean FastAPI project structure following best practices
- Image preprocessing with OpenCV (30% padding)
- Layout-aware OCR routing using DocLayout-YOLO
- Text and formula recognition via PaddleOCR-VL
- Markdown to DOCX conversion
- GPU-enabled Docker deployment

**Non-Goals:**

- Authentication/authorization (can be added later)
- Rate limiting
- Persistent storage
- Training or fine-tuning models

## Decisions

### Project Structure

Follow FastAPI best practices with modular organization:

```
app/
├── api/
│   └── v1/
│       ├── endpoints/
│       │   ├── image.py       # Image OCR endpoint
│       │   └── convert.py     # Markdown to DOCX endpoint
│       └── router.py
├── core/
│   └── config.py              # Settings and environment config
├── model/
│   ├── DocLayout
│   └── PP-DocLayout
├── services/
│   ├── image_processor.py     # OpenCV preprocessing
│   ├── layout_detector.py     # DocLayout-YOLO wrapper
│   ├── ocr_service.py         # PaddleOCR-VL client
│   └── docx_converter.py      # markdown_2_docx wrapper
├── schemas/
│   ├── image.py               # Request/response models for image OCR
│   └── convert.py             # Request/response models for conversion
└── main.py                    # FastAPI app initialization
```

**Rationale:** Separation of concerns between API layer, business logic (services), and data models (schemas).

### Image Preprocessing

- Use OpenCV `cv2.copyMakeBorder()` to add 30% whitespace padding
- Padding color: white `[255, 255, 255]`
- This matches DocLayout-YOLO's demo.py pattern

### Layout Detection Flow

1. DocLayout-YOLO detects layout regions (plain text, formulas, tables, figures).
2. If any plain text region exists, the image is routed to PaddleOCR-VL with PP-DocLayoutV2; otherwise it is routed to PaddleOCR-VL with a prompt.
3. PaddleOCR-VL combined with PP-DocLayoutV2 handles mixed-content recognition internally; PaddleOCR-VL combined with a prompt handles formulas.

### External Service Integration

- PaddleOCR-VL: Connect to the vLLM server at a configurable URL (`PADDLEOCR_VL_URL`, default: `http://localhost:8000/v1`)
- DocLayout-YOLO: Load model from a pre-downloaded path (not downloaded in container)

### Docker Strategy

- Base image: NVIDIA CUDA with Python 3.11
- Pre-install OpenCV dependencies (`libgl1-mesa-glx`, `libglib2.0-0`)
- Mount model directory for DocLayout-YOLO weights
- Expose port 8053
- Use Uvicorn with multiple workers

## Risks / Trade-offs

| Risk                              | Mitigation                                                         |
| --------------------------------- | ------------------------------------------------------------------ |
| PaddleOCR-VL service unavailable  | Health check endpoint, retry logic with exponential backoff        |
| Large image memory consumption    | Configure max image size, resize before processing                 |
| DocLayout-YOLO model loading time | Load model once at startup, keep in memory                         |
| GPU memory contention             | DocLayout-YOLO uses GPU; PaddleOCR-VL runs on separate vLLM server |

## Configuration

Environment variables:

- `PADDLEOCR_VL_URL`: vLLM server URL (default: `http://localhost:8000/v1`)
- `DOCLAYOUT_MODEL_PATH`: Path to DocLayout-YOLO weights
- `PP_DOCLAYOUT_MODEL_DIR`: Path to PP-DocLayoutV2 model directory
- `MAX_IMAGE_SIZE_MB`: Maximum upload size (default: 10)

## Open Questions

- Should we add an async queue for large batch processing? (Defer to future change)
- Do we need WebSocket for progress updates? (Defer to future change)
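
To illustrate the Configuration and Layout Detection Flow sections, here is a minimal sketch of how the environment variables might be loaded and how the routing decision could be expressed. It uses only the standard library so that it is self-contained; the real service may well prefer a settings library. The names `Settings`, `load_settings`, `choose_ocr_route`, the route identifiers `"pp_doclayout"` and `"prompt"`, and the `"plain text"` label string are all hypothetical and not taken from the project or from DocLayout-YOLO.

```python
import os
from dataclasses import dataclass
from typing import Iterable, Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Runtime settings mirroring the environment variables in Configuration."""
    paddleocr_vl_url: str
    doclayout_model_path: Optional[str]
    pp_doclayout_model_dir: Optional[str]
    max_image_size_mb: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping; falls back to os.environ when none is given."""
    env = os.environ if env is None else env
    return Settings(
        paddleocr_vl_url=env.get("PADDLEOCR_VL_URL", "http://localhost:8000/v1"),
        doclayout_model_path=env.get("DOCLAYOUT_MODEL_PATH"),
        pp_doclayout_model_dir=env.get("PP_DOCLAYOUT_MODEL_DIR"),
        max_image_size_mb=int(env.get("MAX_IMAGE_SIZE_MB", "10")),
    )


# Assumed label for plain-text regions; the actual class names
# emitted by the layout detector may differ.
PLAIN_TEXT_LABEL = "plain text"


def choose_ocr_route(region_labels: Iterable[str]) -> str:
    """Pick the OCR path from the detected layout labels.

    Returns "pp_doclayout" when at least one plain-text region exists
    (PaddleOCR-VL with PP-DocLayoutV2), otherwise "prompt"
    (PaddleOCR-VL driven by a prompt, e.g. for formula-only images).
    """
    if any(label == PLAIN_TEXT_LABEL for label in region_labels):
        return "pp_doclayout"
    return "prompt"
```

Passing the environment as a mapping keeps the loader easy to test, and keeping the routing rule in one small pure function means the rule can change without touching the detector or OCR client wrappers.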