openspec/changes/add-doc-processing-api/design.md

## Context

This is the initial implementation of the DocProcesser service. The system integrates multiple external models and services:

- DocLayout-YOLO for document layout analysis
- PaddleOCR-VL with PP-DocLayoutV2 for text and formula recognition (deployed via vLLM)
- markdown_2_docx for document conversion

Target deployment: Ubuntu machine with RTX 5080 GPU (16GB VRAM), Python 3.11.0.

## Goals / Non-Goals

**Goals:**

- Clean FastAPI project structure following best practices
- Image preprocessing with OpenCV (30% padding)
- Layout-aware OCR routing using DocLayout-YOLO
- Text and formula recognition via PaddleOCR-VL
- Markdown to DOCX conversion
- GPU-enabled Docker deployment

**Non-Goals:**

- Authentication/authorization (can be added later)
- Rate limiting
- Persistent storage
- Training or fine-tuning models

## Decisions

### Project Structure

Follow FastAPI best practices with modular organization:

```
app/
├── api/
│   └── v1/
│       ├── endpoints/
│       │   ├── image.py      # Image OCR endpoint
│       │   └── convert.py    # Markdown to DOCX endpoint
│       └── router.py
├── core/
│   └── config.py             # Settings and environment config
|—— model/
|   |—— DocLayout
|   |—— PP-DocLayout
├── services/
│   ├── image_processor.py    # OpenCV preprocessing
│   ├── layout_detector.py    # DocLayout-YOLO wrapper
│   ├── ocr_service.py        # PaddleOCR-VL client
│   └── docx_converter.py     # markdown_2_docx wrapper
├── schemas/
│   ├── image.py              # Request/response models for image OCR
│   └── convert.py            # Request/response models for conversion
└── main.py                   # FastAPI app initialization
```

**Rationale:** Separation of concerns between API layer, business logic (services), and data models (schemas).

### Image Preprocessing

- Use OpenCV `cv2.copyMakeBorder()` to add 30% whitespace padding
- Padding color: white `[255, 255, 255]`
- This matches DocLayout-YOLO's demo.py pattern

### Layout Detection Flow

1. DocLayout-YOLO detects layout regions (plain text, formulas, tables, figures)
2. Exsit plain text, routes to PaddleOCR-VL with PP-DocLayoutV2, othewise routes to PaddleOCR-VL with prompt
3. PaddleOCR-VL combined PP-DocLayoutV2 handles mixed content recognition internally, PaddleOCR-VL combined prompt handles formula

### External Service Integration

- PaddleOCR-VL: Connect to vLLM server at configurable URL (default: `http://localhost:8080/v1`)
- DocLayout-YOLO: Load model from pre-downloaded path (not downloaded in container)

### Docker Strategy

- Base image: NVIDIA CUDA with Python 3.11
- Pre-install OpenCV dependencies (`libgl1-mesa-glx`, `libglib2.0-0`)
- Mount model directory for DocLayout-YOLO weights
- Expose port 8053
- Use Uvicorn with multiple workers

## Risks / Trade-offs

| Risk                              | Mitigation                                                         |
| --------------------------------- | ------------------------------------------------------------------ |
| PaddleOCR-VL service unavailable  | Health check endpoint, retry logic with exponential backoff        |
| Large image memory consumption    | Configure max image size, resize before processing                 |
| DocLayout-YOLO model loading time | Load model once at startup, keep in memory                         |
| GPU memory contention             | DocLayout-YOLO uses GPU; PaddleOCR-VL runs on separate vLLM server |

## Configuration

Environment variables:

- `PADDLEOCR_VL_URL`: vLLM server URL (default: `http://localhost:8000/v1`)
- `DOCLAYOUT_MODEL_PATH`: Path to DocLayout-YOLO weights
- `PP_DOCLAYOUT_MODEL_DIR`: Path to PP-DocLayoutV3 model directory
- `MAX_IMAGE_SIZE_MB`: Maximum upload size (default: 10)

## Open Questions

- Should we add async queue for large batch processing? (Defer to future change)
- Do we need WebSocket for progress updates? (Defer to future change)
init repo 2025-12-29 17:34:58 +08:00			`## Context`

			`This is the initial implementation of the DocProcesser service. The system integrates multiple external models and services:`

			`- DocLayout-YOLO for document layout analysis`
			`- PaddleOCR-VL with PP-DocLayoutV2 for text and formula recognition (deployed via vLLM)`
			`- markdown_2_docx for document conversion`

			`Target deployment: Ubuntu machine with RTX 5080 GPU (16GB VRAM), Python 3.11.0.`

			`## Goals / Non-Goals`

			`Goals:`

			`- Clean FastAPI project structure following best practices`
			`- Image preprocessing with OpenCV (30% padding)`
			`- Layout-aware OCR routing using DocLayout-YOLO`
			`- Text and formula recognition via PaddleOCR-VL`
			`- Markdown to DOCX conversion`
			`- GPU-enabled Docker deployment`

			`Non-Goals:`

			`- Authentication/authorization (can be added later)`
			`- Rate limiting`
			`- Persistent storage`
			`- Training or fine-tuning models`

			`## Decisions`

			`### Project Structure`

			`Follow FastAPI best practices with modular organization:`

			```
			`app/`
			`├── api/`
			`│ └── v1/`
			`│ ├── endpoints/`
			`│ │ ├── image.py # Image OCR endpoint`
			`│ │ └── convert.py # Markdown to DOCX endpoint`
			`│ └── router.py`
			`├── core/`
			`│ └── config.py # Settings and environment config`
			`\|—— model/`
			`\| \|—— DocLayout`
			`\| \|—— PP-DocLayout`
			`├── services/`
			`│ ├── image_processor.py # OpenCV preprocessing`
			`│ ├── layout_detector.py # DocLayout-YOLO wrapper`
			`│ ├── ocr_service.py # PaddleOCR-VL client`
			`│ └── docx_converter.py # markdown_2_docx wrapper`
			`├── schemas/`
			`│ ├── image.py # Request/response models for image OCR`
			`│ └── convert.py # Request/response models for conversion`
			`└── main.py # FastAPI app initialization`
			```

			`Rationale: Separation of concerns between API layer, business logic (services), and data models (schemas).`

			`### Image Preprocessing`

			- Use OpenCV `cv2.copyMakeBorder()` to add 30% whitespace padding
			- Padding color: white `[255, 255, 255]`
			`- This matches DocLayout-YOLO's demo.py pattern`

			`### Layout Detection Flow`

			`1. DocLayout-YOLO detects layout regions (plain text, formulas, tables, figures)`
			`2. Exsit plain text, routes to PaddleOCR-VL with PP-DocLayoutV2, othewise routes to PaddleOCR-VL with prompt`
			`3. PaddleOCR-VL combined PP-DocLayoutV2 handles mixed content recognition internally, PaddleOCR-VL combined prompt handles formula`

			`### External Service Integration`

			- PaddleOCR-VL: Connect to vLLM server at configurable URL (default: `http://localhost:8080/v1`)
			`- DocLayout-YOLO: Load model from pre-downloaded path (not downloaded in container)`

			`### Docker Strategy`

			`- Base image: NVIDIA CUDA with Python 3.11`
			- Pre-install OpenCV dependencies (`libgl1-mesa-glx`, `libglib2.0-0`)
			`- Mount model directory for DocLayout-YOLO weights`
			`- Expose port 8053`
			`- Use Uvicorn with multiple workers`

			`## Risks / Trade-offs`

			`\| Risk \| Mitigation \|`
			`\| --------------------------------- \| ------------------------------------------------------------------ \|`
			`\| PaddleOCR-VL service unavailable \| Health check endpoint, retry logic with exponential backoff \|`
			`\| Large image memory consumption \| Configure max image size, resize before processing \|`
			`\| DocLayout-YOLO model loading time \| Load model once at startup, keep in memory \|`
			`\| GPU memory contention \| DocLayout-YOLO uses GPU; PaddleOCR-VL runs on separate vLLM server \|`

			`## Configuration`

			`Environment variables:`

			- `PADDLEOCR_VL_URL`: vLLM server URL (default: `http://localhost:8000/v1`)
			- `DOCLAYOUT_MODEL_PATH`: Path to DocLayout-YOLO weights
			- `PP_DOCLAYOUT_MODEL_DIR`: Path to PP-DocLayoutV3 model directory
			- `MAX_IMAGE_SIZE_MB`: Maximum upload size (default: 10)

			`## Open Questions`

			`- Should we add async queue for large batch processing? (Defer to future change)`
			`- Do we need WebSocket for progress updates? (Defer to future change)`