4.0 KiB
4.0 KiB
Context
This is the initial implementation of the DocProcesser service. The system integrates multiple external models and services:
- DocLayout-YOLO for document layout analysis
- PaddleOCR-VL with PP-DocLayoutV2 for text and formula recognition (deployed via vLLM)
- markdown_2_docx for document conversion
Target deployment: Ubuntu machine with RTX 5080 GPU (16GB VRAM), Python 3.11.0.
Goals / Non-Goals
Goals:
- Clean FastAPI project structure following best practices
- Image preprocessing with OpenCV (30% padding)
- Layout-aware OCR routing using DocLayout-YOLO
- Text and formula recognition via PaddleOCR-VL
- Markdown to DOCX conversion
- GPU-enabled Docker deployment
Non-Goals:
- Authentication/authorization (can be added later)
- Rate limiting
- Persistent storage
- Training or fine-tuning models
Decisions
Project Structure
Follow FastAPI best practices with modular organization:
app/
├── api/
│ └── v1/
│ ├── endpoints/
│ │ ├── image.py # Image OCR endpoint
│ │ └── convert.py # Markdown to DOCX endpoint
│ └── router.py
├── core/
│ └── config.py # Settings and environment config
|—— model/
| |—— DocLayout
| |—— PP-DocLayout
├── services/
│ ├── image_processor.py # OpenCV preprocessing
│ ├── layout_detector.py # DocLayout-YOLO wrapper
│ ├── ocr_service.py # PaddleOCR-VL client
│ └── docx_converter.py # markdown_2_docx wrapper
├── schemas/
│ ├── image.py # Request/response models for image OCR
│ └── convert.py # Request/response models for conversion
└── main.py # FastAPI app initialization
Rationale: Separation of concerns between API layer, business logic (services), and data models (schemas).
Image Preprocessing
- Use OpenCV
cv2.copyMakeBorder()to add 30% whitespace padding - Padding color: white
[255, 255, 255] - This matches DocLayout-YOLO's demo.py pattern
Layout Detection Flow
- DocLayout-YOLO detects layout regions (plain text, formulas, tables, figures)
- Exsit plain text, routes to PaddleOCR-VL with PP-DocLayoutV2, othewise routes to PaddleOCR-VL with prompt
- PaddleOCR-VL combined PP-DocLayoutV2 handles mixed content recognition internally, PaddleOCR-VL combined prompt handles formula
External Service Integration
- PaddleOCR-VL: Connect to vLLM server at configurable URL (default:
http://localhost:8080/v1) - DocLayout-YOLO: Load model from pre-downloaded path (not downloaded in container)
Docker Strategy
- Base image: NVIDIA CUDA with Python 3.11
- Pre-install OpenCV dependencies (
libgl1-mesa-glx,libglib2.0-0) - Mount model directory for DocLayout-YOLO weights
- Expose port 8053
- Use Uvicorn with multiple workers
Risks / Trade-offs
| Risk | Mitigation |
|---|---|
| PaddleOCR-VL service unavailable | Health check endpoint, retry logic with exponential backoff |
| Large image memory consumption | Configure max image size, resize before processing |
| DocLayout-YOLO model loading time | Load model once at startup, keep in memory |
| GPU memory contention | DocLayout-YOLO uses GPU; PaddleOCR-VL runs on separate vLLM server |
Configuration
Environment variables:
PADDLEOCR_VL_URL: vLLM server URL (default:http://localhost:8000/v1)DOCLAYOUT_MODEL_PATH: Path to DocLayout-YOLO weightsPP_DOCLAYOUT_MODEL_DIR: Path to PP-DocLayoutV3 model directoryMAX_IMAGE_SIZE_MB: Maximum upload size (default: 10)
Open Questions
- Should we add async queue for large batch processing? (Defer to future change)
- Do we need WebSocket for progress updates? (Defer to future change)