108 lines
4.0 KiB
Markdown
108 lines
4.0 KiB
Markdown
|
|
## Context
|
||
|
|
|
||
|
|
This is the initial implementation of the DocProcesser service. The system integrates multiple external models and services:
|
||
|
|
|
||
|
|
- DocLayout-YOLO for document layout analysis
|
||
|
|
- PaddleOCR-VL with PP-DocLayoutV2 for text and formula recognition (deployed via vLLM)
|
||
|
|
- markdown_2_docx for document conversion
|
||
|
|
|
||
|
|
Target deployment: Ubuntu machine with RTX 5080 GPU (16GB VRAM), Python 3.11.0.
|
||
|
|
|
||
|
|
## Goals / Non-Goals
|
||
|
|
|
||
|
|
**Goals:**
|
||
|
|
|
||
|
|
- Clean FastAPI project structure following best practices
|
||
|
|
- Image preprocessing with OpenCV (30% padding)
|
||
|
|
- Layout-aware OCR routing using DocLayout-YOLO
|
||
|
|
- Text and formula recognition via PaddleOCR-VL
|
||
|
|
- Markdown to DOCX conversion
|
||
|
|
- GPU-enabled Docker deployment
|
||
|
|
|
||
|
|
**Non-Goals:**
|
||
|
|
|
||
|
|
- Authentication/authorization (can be added later)
|
||
|
|
- Rate limiting
|
||
|
|
- Persistent storage
|
||
|
|
- Training or fine-tuning models
|
||
|
|
|
||
|
|
## Decisions
|
||
|
|
|
||
|
|
### Project Structure
|
||
|
|
|
||
|
|
Follow FastAPI best practices with modular organization:
|
||
|
|
|
||
|
|
```
|
||
|
|
app/
|
||
|
|
├── api/
|
||
|
|
│ └── v1/
|
||
|
|
│ ├── endpoints/
|
||
|
|
│ │ ├── image.py # Image OCR endpoint
|
||
|
|
│ │ └── convert.py # Markdown to DOCX endpoint
|
||
|
|
│ └── router.py
|
||
|
|
├── core/
|
||
|
|
│ └── config.py # Settings and environment config
|
||
|
|
|—— model/
|
||
|
|
| |—— DocLayout
|
||
|
|
| |—— PP-DocLayout
|
||
|
|
├── services/
|
||
|
|
│ ├── image_processor.py # OpenCV preprocessing
|
||
|
|
│ ├── layout_detector.py # DocLayout-YOLO wrapper
|
||
|
|
│ ├── ocr_service.py # PaddleOCR-VL client
|
||
|
|
│ └── docx_converter.py # markdown_2_docx wrapper
|
||
|
|
├── schemas/
|
||
|
|
│ ├── image.py # Request/response models for image OCR
|
||
|
|
│ └── convert.py # Request/response models for conversion
|
||
|
|
└── main.py # FastAPI app initialization
|
||
|
|
```
|
||
|
|
|
||
|
|
**Rationale:** Separation of concerns between API layer, business logic (services), and data models (schemas).
|
||
|
|
|
||
|
|
### Image Preprocessing
|
||
|
|
|
||
|
|
- Use OpenCV `cv2.copyMakeBorder()` to add 30% whitespace padding
|
||
|
|
- Padding color: white `[255, 255, 255]`
|
||
|
|
- This matches DocLayout-YOLO's demo.py pattern
|
||
|
|
|
||
|
|
### Layout Detection Flow
|
||
|
|
|
||
|
|
1. DocLayout-YOLO detects layout regions (plain text, formulas, tables, figures)
|
||
|
|
2. Exsit plain text, routes to PaddleOCR-VL with PP-DocLayoutV2, othewise routes to PaddleOCR-VL with prompt
|
||
|
|
3. PaddleOCR-VL combined PP-DocLayoutV2 handles mixed content recognition internally, PaddleOCR-VL combined prompt handles formula
|
||
|
|
|
||
|
|
### External Service Integration
|
||
|
|
|
||
|
|
- PaddleOCR-VL: Connect to vLLM server at configurable URL (default: `http://localhost:8080/v1`)
|
||
|
|
- DocLayout-YOLO: Load model from pre-downloaded path (not downloaded in container)
|
||
|
|
|
||
|
|
### Docker Strategy
|
||
|
|
|
||
|
|
- Base image: NVIDIA CUDA with Python 3.11
|
||
|
|
- Pre-install OpenCV dependencies (`libgl1-mesa-glx`, `libglib2.0-0`)
|
||
|
|
- Mount model directory for DocLayout-YOLO weights
|
||
|
|
- Expose port 8053
|
||
|
|
- Use Uvicorn with multiple workers
|
||
|
|
|
||
|
|
## Risks / Trade-offs
|
||
|
|
|
||
|
|
| Risk | Mitigation |
|
||
|
|
| --------------------------------- | ------------------------------------------------------------------ |
|
||
|
|
| PaddleOCR-VL service unavailable | Health check endpoint, retry logic with exponential backoff |
|
||
|
|
| Large image memory consumption | Configure max image size, resize before processing |
|
||
|
|
| DocLayout-YOLO model loading time | Load model once at startup, keep in memory |
|
||
|
|
| GPU memory contention | DocLayout-YOLO uses GPU; PaddleOCR-VL runs on separate vLLM server |
|
||
|
|
|
||
|
|
## Configuration
|
||
|
|
|
||
|
|
Environment variables:
|
||
|
|
|
||
|
|
- `PADDLEOCR_VL_URL`: vLLM server URL (default: `http://localhost:8000/v1`)
|
||
|
|
- `DOCLAYOUT_MODEL_PATH`: Path to DocLayout-YOLO weights
|
||
|
|
- `PP_DOCLAYOUT_MODEL_DIR`: Path to PP-DocLayoutV3 model directory
|
||
|
|
- `MAX_IMAGE_SIZE_MB`: Maximum upload size (default: 10)
|
||
|
|
|
||
|
|
## Open Questions
|
||
|
|
|
||
|
|
- Should we add async queue for large batch processing? (Defer to future change)
|
||
|
|
- Do we need WebSocket for progress updates? (Defer to future change)
|