openspec/changes/add-doc-processing-api/specs/image-ocr/spec.md

## ADDED Requirements

### Requirement: Image Input Acceptance

The system SHALL accept images via the `POST /api/v1/image/ocr` endpoint, with exactly one of the following fields in the request body:

- `image_url`: A publicly accessible URL to the image
- `image_base64`: Base64-encoded image data

The system SHALL return an error if neither input is provided or if both are provided simultaneously.

#### Scenario: Image URL provided

- **WHEN** a valid `image_url` is provided in the request body
- **THEN** the system SHALL download the image and process it
- **AND** return OCR results in the response

#### Scenario: Base64 image provided

- **WHEN** a valid `image_base64` string is provided in the request body
- **THEN** the system SHALL decode the image and process it
- **AND** return OCR results in the response

#### Scenario: Invalid input

- **WHEN** neither `image_url` nor `image_base64` is provided, or both are provided
- **THEN** the system SHALL return HTTP 422 with a validation error
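The exactly-one-input rule can be sketched as a small check. This is a minimal illustration, not the service's actual validation layer: `ValidationError` and `select_image_source` are hypothetical names, and the sketch assumes the API layer maps the raised error to HTTP 422.

```python
from typing import Optional


class ValidationError(ValueError):
    """Hypothetical error type; the API layer is assumed to map it to HTTP 422."""


def select_image_source(
    image_url: Optional[str], image_base64: Optional[str]
) -> tuple[str, str]:
    """Return ("url", value) or ("base64", value); reject zero or two inputs."""
    has_url = bool(image_url)
    has_b64 = bool(image_base64)
    if has_url == has_b64:
        # Both present (ambiguous) or both missing (nothing to process).
        raise ValidationError("provide exactly one of image_url or image_base64")
    return ("url", image_url) if has_url else ("base64", image_base64)
```

Treating empty strings as missing is a design choice of this sketch; the spec does not say how empty values are handled.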

---

### Requirement: Image Preprocessing with Padding

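The system SHALL preprocess all input images by adding white padding around the image borders using OpenCV, expanding the canvas by 30% of the longer side in total.

The padding is calculated as `padding = int(max(height, width) * 0.15)` and applied to each of the four sides, so both dimensions grow by `2 * padding` (30% of the longer side).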

The padding color SHALL be white (`RGB: 255, 255, 255`).

#### Scenario: Image padding applied

- **WHEN** an image of dimensions 1000x800 pixels is received
- **THEN** the system SHALL add 150 pixels of white padding on each side (`int(1000 * 0.15)`)
- **AND** the resulting image dimensions SHALL be 1300x1100 pixels
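The padding arithmetic above can be checked with a short sketch. `padding_for`, `padded_size`, and `pad_image` are hypothetical helper names. `pad_image` assumes the image is a NumPy array as returned by `cv2.imread`, and it imports OpenCV lazily so the size helpers work without it.

```python
def padding_for(height: int, width: int, ratio: float = 0.15) -> int:
    """Per-side padding: 15% of the longer dimension, truncated to an int."""
    return int(max(height, width) * ratio)


def padded_size(height: int, width: int, ratio: float = 0.15) -> tuple[int, int]:
    """Return (height, width) after padding all four sides."""
    pad = padding_for(height, width, ratio)
    return height + 2 * pad, width + 2 * pad


def pad_image(image):
    """Add a white border to an OpenCV image (assumed NumPy array, H x W [x C])."""
    import cv2  # imported here so the size helpers above need no OpenCV install

    pad = padding_for(*image.shape[:2])
    # White is (255, 255, 255) in both RGB and OpenCV's BGR channel order.
    return cv2.copyMakeBorder(
        image, pad, pad, pad, pad, cv2.BORDER_CONSTANT, value=(255, 255, 255)
    )
```

For the 1000x800 example (width 1000, height 800), `padded_size(800, 1000)` gives `(1100, 1300)`, i.e. a 1300x1100 image.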

---

### Requirement: Layout Detection with DocLayout-YOLO

The system SHALL use the DocLayout-YOLO model to detect document layout regions, including:

- Plain text blocks
- Formulas/equations
- Tables
- Figures

The model SHALL be loaded from a pre-configured local path (not downloaded at runtime).

#### Scenario: Layout detection success

- **WHEN** a padded image is passed to DocLayout-YOLO
- **THEN** the system SHALL return detected regions with bounding boxes and class labels
- **AND** confidence scores for each detection

#### Scenario: Model not available

- **WHEN** the DocLayout-YOLO model file is not found at the configured path
- **THEN** the system SHALL fail startup with a clear error message
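A fail-fast startup check for the model file might look like the sketch below. `ModelNotFoundError` and `require_model_file` are hypothetical names, and the actual model loading call is deliberately left out because the spec does not define it.

```python
from pathlib import Path


class ModelNotFoundError(RuntimeError):
    """Hypothetical startup error raised when the model weights are missing."""


def require_model_file(configured_path: str) -> Path:
    """Return the model path, or raise before the service starts serving."""
    path = Path(configured_path)
    if not path.is_file():
        # No runtime download fallback: the spec requires a local, pre-configured file.
        raise ModelNotFoundError(
            f"DocLayout-YOLO model file not found at configured path: {path}"
        )
    return path
```

Calling this once during application startup, before any request handler is registered, makes a misconfiguration surface as a startup failure and not as a per-request error.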

---

### Requirement: OCR Processing with PaddleOCR-VL

The system SHALL send images to PaddleOCR-VL (via vLLM backend) for text and formula recognition.

PaddleOCR-VL SHALL be configured with PP-DocLayoutV2 for document layout understanding.

The system SHALL handle both plain text and formula/math content.

#### Scenario: Plain text recognition

- **WHEN** DocLayout-YOLO detects plain text regions
- **THEN** the system SHALL send the image to PaddleOCR-VL
- **AND** return recognized text content

#### Scenario: Formula recognition

- **WHEN** DocLayout-YOLO detects formula/equation regions
- **THEN** the system SHALL send the image to PaddleOCR-VL
- **AND** return formula content in LaTeX format

#### Scenario: Mixed content handling

- **WHEN** DocLayout-YOLO detects both text and formula regions
- **THEN** the system SHALL process all regions via PaddleOCR-VL with PP-DocLayoutV2
- **AND** return combined results preserving document structure

#### Scenario: PaddleOCR-VL service unavailable

- **WHEN** the PaddleOCR-VL vLLM server is unreachable
- **THEN** the system SHALL return HTTP 503 with a service-unavailable error
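One way to map an unreachable backend to HTTP 503 is sketched below. It assumes the backend call is passed in as a callable and that connection failures surface as Python's built-in `ConnectionError` or `TimeoutError`. Real HTTP client libraries raise their own exception types, which would need to be caught as well. The `detail` key and the function name are illustrative, not part of the spec.

```python
from typing import Callable


def run_ocr_or_503(call_backend: Callable[[], dict]) -> tuple[int, dict]:
    """Return (status_code, body); translate connection failures into 503."""
    try:
        return 200, call_backend()
    except (ConnectionError, TimeoutError) as exc:
        # Backend unreachable: report it instead of leaking a generic 500.
        return 503, {"detail": f"PaddleOCR-VL service unavailable: {exc}"}
```

Other exceptions are intentionally not caught here, so genuine bugs are not reported as availability problems.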

---

### Requirement: Multi-Format Output

The system SHALL return OCR results in multiple formats:

- `latex`: LaTeX representation of the content
- `markdown`: Markdown representation of the content
- `mathml`: MathML representation for mathematical content

#### Scenario: Successful OCR response

- **WHEN** image processing completes successfully
- **THEN** the response SHALL include:
  - `latex`: string containing LaTeX output
  - `markdown`: string containing Markdown output
  - `mathml`: string containing MathML output (empty string if no math detected)
- **AND** HTTP status code SHALL be 200

#### Scenario: Response structure

- **WHEN** the OCR endpoint returns successfully
- **THEN** the response body SHALL be JSON with structure:

```json
{
  "latex": "...",
  "markdown": "...",
  "mathml": "...",
  "layout_info": {
    "regions": [
      {"type": "text|formula|table|figure", "bbox": [x1, y1, x2, y2], "confidence": 0.95}
    ]
  }
}
```
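As an illustration of the structure above, the following sketch assembles a response body and serializes it to JSON. `Region` and `build_response` are hypothetical names, and the sample values are made up for the example.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Region:
    type: str          # one of "text", "formula", "table", "figure"
    bbox: list[float]  # [x1, y1, x2, y2]
    confidence: float


def build_response(
    latex: str, markdown: str, regions: list[Region], mathml: str = ""
) -> dict:
    """Assemble the response body; mathml defaults to "" when no math is detected."""
    return {
        "latex": latex,
        "markdown": markdown,
        "mathml": mathml,
        "layout_info": {"regions": [asdict(r) for r in regions]},
    }


body = build_response(
    latex="E = mc^2",
    markdown="$E = mc^2$",
    regions=[Region(type="formula", bbox=[10, 20, 200, 60], confidence=0.95)],
)
print(json.dumps(body, indent=2))
```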