init repo

2025-12-29 17:34:58 +08:00
commit 874fd383cc
36 changed files with 2641 additions and 0 deletions
--- a/openspec/changes/add-doc-processing-api/specs/image-ocr/spec.md
+++ b/openspec/changes/add-doc-processing-api/specs/image-ocr/spec.md
@@ -0,0 +1,137 @@
+## ADDED Requirements
+
+### Requirement: Image Input Acceptance
+
+The system SHALL accept images via the `POST /api/v1/image/ocr` endpoint with exactly one of:
+
+- `image_url`: A publicly accessible URL to the image
+- `image_base64`: Base64-encoded image data
+
+The system SHALL return an error if neither input is provided or if both are provided simultaneously.
+
+#### Scenario: Image URL provided
+
+- **WHEN** a valid `image_url` is provided in the request body
+- **THEN** the system SHALL download the image and process it
+- **AND** return OCR results in the response
+
+#### Scenario: Base64 image provided
+
+- **WHEN** a valid `image_base64` string is provided in the request body
+- **THEN** the system SHALL decode the image and process it
+- **AND** return OCR results in the response
+
+#### Scenario: Invalid input
+
+- **WHEN** neither `image_url` nor `image_base64` is provided, or both are provided
+- **THEN** the system SHALL return HTTP 422 with a validation error
+
+---
+
+### Requirement: Image Preprocessing with Padding
+
+The system SHALL preprocess all input images by adding white padding around the image borders using OpenCV.
+
+The padding on each side is `padding = int(max(height, width) * 0.15)`, so each dimension grows by `2 * padding`, i.e. 30% of the longer side.
+
+The padding color SHALL be white (`RGB: 255, 255, 255`).
+
+#### Scenario: Image padding applied
+
+- **WHEN** an image of dimensions 1000x800 pixels is received
+- **THEN** the system SHALL add approximately 150 pixels of white padding on each side
+- **AND** the resulting image dimensions SHALL be approximately 1300x1100 pixels
+
+---
+
+### Requirement: Layout Detection with DocLayout-YOLO
+
+The system SHALL use the DocLayout-YOLO model to detect document layout regions including:
+
+- Plain text blocks
+- Formulas/equations
+- Tables
+- Figures
+
+The model SHALL be loaded from a pre-configured local path (not downloaded at runtime).
+
+#### Scenario: Layout detection success
+
+- **WHEN** a padded image is passed to DocLayout-YOLO
+- **THEN** the system SHALL return detected regions with bounding boxes and class labels
+- **AND** confidence scores for each detection
+
+#### Scenario: Model not available
+
+- **WHEN** the DocLayout-YOLO model file is not found at the configured path
+- **THEN** the system SHALL fail startup with a clear error message
+
+---
+
+### Requirement: OCR Processing with PaddleOCR-VL
+
+The system SHALL send images to PaddleOCR-VL (served via a vLLM backend) for text and formula recognition.
+
+PaddleOCR-VL SHALL be configured with PP-DocLayoutV2 for document layout understanding.
+
+The system SHALL handle both plain text and formula/math content.
+
+#### Scenario: Plain text recognition
+
+- **WHEN** DocLayout-YOLO detects plain text regions
+- **THEN** the system SHALL send the image to PaddleOCR-VL
+- **AND** return recognized text content
+
+#### Scenario: Formula recognition
+
+- **WHEN** DocLayout-YOLO detects formula/equation regions
+- **THEN** the system SHALL send the image to PaddleOCR-VL
+- **AND** return formula content in LaTeX format
+
+#### Scenario: Mixed content handling
+
+- **WHEN** DocLayout-YOLO detects both text and formula regions
+- **THEN** the system SHALL process all regions via PaddleOCR-VL with PP-DocLayoutV2
+- **AND** return combined results preserving document structure
+
+#### Scenario: PaddleOCR-VL service unavailable
+
+- **WHEN** the PaddleOCR-VL vLLM server is unreachable
+- **THEN** the system SHALL return HTTP 503 with a service-unavailable error
+
+---
+
+### Requirement: Multi-Format Output
+
+The system SHALL return OCR results in multiple formats:
+
+- `latex`: LaTeX representation of the content
+- `markdown`: Markdown representation of the content
+- `mathml`: MathML representation for mathematical content
+
+#### Scenario: Successful OCR response
+
+- **WHEN** image processing completes successfully
+- **THEN** the response SHALL include:
+  - `latex`: string containing LaTeX output
+  - `markdown`: string containing Markdown output
+  - `mathml`: string containing MathML output (empty string if no math detected)
+- **AND** HTTP status code SHALL be 200
+
+#### Scenario: Response structure
+
+- **WHEN** the OCR endpoint returns successfully
+- **THEN** the response body SHALL be JSON with structure:
+
+```json
+{
+  "latex": "...",
+  "markdown": "...",
+  "mathml": "...",
+  "layout_info": {
+    "regions": [
+      {"type": "text|formula|table|figure", "bbox": [x1, y1, x2, y2], "confidence": 0.95}
+    ]
+  }
+}
+```
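The input rule and the padding arithmetic in the spec above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: the function names `validate_ocr_request`, `compute_padding`, and `padded_size` are hypothetical, and the mapping of `ValueError` to HTTP 422 is assumed to happen in the API layer. The actual border would presumably be drawn with OpenCV (for example `cv2.copyMakeBorder` with `cv2.BORDER_CONSTANT`), which is left out here so the sketch has no dependencies.

```python
from typing import Optional, Tuple


def validate_ocr_request(image_url: Optional[str], image_base64: Optional[str]) -> str:
    """Return the name of the input to use.

    Raises ValueError when neither or both inputs are present; the caller is
    assumed to translate that into an HTTP 422 response.
    """
    has_url = bool(image_url)
    has_b64 = bool(image_base64)
    if has_url == has_b64:  # both missing or both present
        raise ValueError("provide exactly one of image_url or image_base64")
    return "image_url" if has_url else "image_base64"


def compute_padding(height: int, width: int, ratio: float = 0.15) -> int:
    """Padding in pixels added to each side, per the spec's formula."""
    return int(max(height, width) * ratio)


def padded_size(height: int, width: int, ratio: float = 0.15) -> Tuple[int, int]:
    """(height, width) after the same padding is added on all four sides."""
    pad = compute_padding(height, width, ratio)
    return height + 2 * pad, width + 2 * pad


if __name__ == "__main__":
    # The spec's scenario: a 1000x800 image (width x height).
    print(compute_padding(800, 1000))
    print(padded_size(800, 1000))
    print(validate_ocr_request("https://example.com/page.png", None))
```

Because the padding is derived from the longer side and applied equally on all four sides, the shorter dimension grows by more than 30% of its own length (800 to 1100 in the scenario above).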