doc_processer/openspec/changes/add-doc-processing-api/specs/image-ocr/spec.md at 874fd383ccdafbcdeb205c774b30a19fca242eda

YogeLiu/doc_processer

liuyuanchuang 874fd383cc init repo

2025-12-29 17:34:58 +08:00

ADDED Requirements

Requirement: Image Input Acceptance

The system SHALL accept images via the POST /api/v1/image/ocr endpoint, supplied as exactly one of:

image_url: A publicly accessible URL to the image
image_base64: Base64-encoded image data

The system SHALL return an error if neither input is provided or if both are provided simultaneously.

Scenario: Image URL provided

WHEN a valid image_url is provided in the request body
THEN the system SHALL download the image and process it
AND return OCR results in the response

Scenario: Base64 image provided

WHEN a valid image_base64 string is provided in the request body
THEN the system SHALL decode the image and process it
AND return OCR results in the response

Scenario: Invalid input

WHEN neither image_url nor image_base64 is provided
THEN the system SHALL return HTTP 422 with a validation error
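
The "exactly one input" rule above can be sketched as a small validation helper. This is a minimal illustration using only the standard library. The names `InvalidImageInput` and `select_image_source` are hypothetical and do not come from the spec, and the caller is assumed to translate the exception into the HTTP 422 response.

```python
import base64
import binascii


class InvalidImageInput(ValueError):
    """Hypothetical error type; the API layer would map it to HTTP 422."""


def select_image_source(body: dict) -> tuple[str, str]:
    """Return (kind, value), where kind is 'url' or 'base64'.

    Enforces that exactly one of image_url / image_base64 is present.
    """
    url = body.get("image_url")
    b64 = body.get("image_base64")
    if bool(url) == bool(b64):
        # Covers both "neither provided" and "both provided".
        raise InvalidImageInput("provide exactly one of image_url or image_base64")
    if url:
        return ("url", url)
    try:
        # validate=True rejects characters outside the base64 alphabet.
        base64.b64decode(b64, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise InvalidImageInput("image_base64 is not valid base64") from exc
    return ("base64", b64)
```

Checking the base64 payload at validation time means malformed data is rejected before any image decoding is attempted.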

Requirement: Image Preprocessing with Padding

The system SHALL preprocess all input images by adding whitespace padding around the image borders using OpenCV, expanding the longer image dimension by 30%.

The padding is computed as padding = int(max(height, width) * 0.15) and applied to each of the four sides, so the longer dimension grows by 30% in total.

The padding color SHALL be white (RGB: 255, 255, 255).

Scenario: Image padding applied

WHEN an image of dimensions 1000x800 pixels is received
THEN the system SHALL add approximately 150 pixels of white padding on each side
AND the resulting image dimensions SHALL be approximately 1300x1100 pixels
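
The arithmetic in this scenario can be checked with a short sketch. The helper names `padding_size`, `padded_dimensions`, and `pad_image` are illustrative assumptions, not names from the spec. `cv2.copyMakeBorder` with `cv2.BORDER_CONSTANT` is one way to add a constant-colour border in OpenCV, and it is imported lazily so the arithmetic helpers run without OpenCV installed.

```python
def padding_size(height: int, width: int, ratio: float = 0.15) -> int:
    """Padding in pixels per side, following the spec formula."""
    return int(max(height, width) * ratio)


def padded_dimensions(height: int, width: int) -> tuple[int, int]:
    """Return (height, width) after padding all four sides."""
    pad = padding_size(height, width)
    return height + 2 * pad, width + 2 * pad


def pad_image(image):
    """Add a white border to an OpenCV image array (H x W or H x W x C)."""
    import cv2  # lazy import: only needed when an actual image is padded

    height, width = image.shape[:2]
    pad = padding_size(height, width)
    return cv2.copyMakeBorder(
        image, pad, pad, pad, pad,
        borderType=cv2.BORDER_CONSTANT,
        value=(255, 255, 255),  # white in both RGB and BGR channel order
    )
```

For the 1000x800 example, `padding_size(800, 1000)` gives 150 and `padded_dimensions(800, 1000)` gives `(1100, 1300)`, that is, 1300 pixels wide by 1100 pixels high.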

Requirement: Layout Detection with DocLayout-YOLO

The system SHALL use the DocLayout-YOLO model to detect document layout regions including:

Plain text blocks
Formulas/equations
Tables
Figures

The model SHALL be loaded from a pre-configured local path (not downloaded at runtime).

Scenario: Layout detection success

WHEN a padded image is passed to DocLayout-YOLO
THEN the system SHALL return detected regions with bounding boxes and class labels
AND confidence scores for each detection

Scenario: Model not available

WHEN the DocLayout-YOLO model file is not found at the configured path
THEN the system SHALL fail startup with a clear error message
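
A fail-fast startup check for the model file might look like the sketch below. `ModelNotFoundError` and `resolve_model_path` are hypothetical names, and the sketch deliberately stops before loading the model, since the spec does not say which loader API is used.

```python
from pathlib import Path


class ModelNotFoundError(RuntimeError):
    """Hypothetical error raised at startup when the model file is missing."""


def resolve_model_path(configured_path: str) -> Path:
    """Return the model path, or raise with a clear message if it is absent."""
    path = Path(configured_path)
    if not path.is_file():
        raise ModelNotFoundError(
            f"DocLayout-YOLO model file not found at configured path: {path}"
        )
    return path
```

Calling this once during application startup, before the server begins accepting requests, means a missing model aborts startup instead of causing per-request failures.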

Requirement: OCR Processing with PaddleOCR-VL

The system SHALL send images to PaddleOCR-VL (via vLLM backend) for text and formula recognition.

PaddleOCR-VL SHALL be configured with PP-DocLayoutV2 for document layout understanding.

The system SHALL handle both plain text and formula/math content.

Scenario: Plain text recognition

WHEN DocLayout-YOLO detects plain text regions
THEN the system SHALL send the image to PaddleOCR-VL
AND return recognized text content

Scenario: Formula recognition

WHEN DocLayout-YOLO detects formula/equation regions
THEN the system SHALL send the image to PaddleOCR-VL
AND return formula content in LaTeX format

Scenario: Mixed content handling

WHEN DocLayout-YOLO detects both text and formula regions
THEN the system SHALL process all regions via PaddleOCR-VL with PP-DocLayoutV2
AND return combined results preserving document structure

Scenario: PaddleOCR-VL service unavailable

WHEN the PaddleOCR-VL vLLM server is unreachable
THEN the system SHALL return HTTP 503 with a service-unavailable error
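
The unavailable-backend scenario can be illustrated with a small wrapper that turns connection failures into a 503 result. This is only a sketch: `ocr_or_503` is a hypothetical name, the error body shape is an assumption, and the backend client is assumed to raise Python's built-in `ConnectionError` or `TimeoutError` (a real HTTP client may use its own exception types).

```python
from typing import Callable


def ocr_or_503(call_backend: Callable[[], dict]) -> tuple[int, dict]:
    """Run the backend call and return (status_code, body)."""
    try:
        return 200, call_backend()
    except (ConnectionError, TimeoutError) as exc:
        # Backend unreachable or timed out: report service unavailable.
        return 503, {"error": "service_unavailable", "detail": str(exc)}
```

Passing the backend call as a callable keeps the error mapping independent of whichever HTTP client talks to the vLLM server.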

Requirement: Multi-Format Output

The system SHALL return OCR results in multiple formats:

latex: LaTeX representation of the content
markdown: Markdown representation of the content
mathml: MathML representation for mathematical content

Scenario: Successful OCR response

WHEN image processing completes successfully
THEN the response SHALL include:
- latex: string containing LaTeX output
- markdown: string containing Markdown output
- mathml: string containing MathML output (empty string if no math detected)
AND HTTP status code SHALL be 200

Scenario: Response structure

WHEN the OCR endpoint returns successfully
THEN the response body SHALL be JSON with structure:

{
  "latex": "...",
  "markdown": "...",
  "mathml": "...",
  "layout_info": {
    "regions": [
      {"type": "text|formula|table|figure", "bbox": [x1, y1, x2, y2], "confidence": 0.95}
    ]
  }
}
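
One way to build a body with this structure is sketched below using standard-library dataclasses. The class names `Region` and `OcrResponse` are illustrative assumptions. The field names follow the JSON structure above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Region:
    type: str            # one of: text, formula, table, figure
    bbox: list[float]    # [x1, y1, x2, y2]
    confidence: float


@dataclass
class OcrResponse:
    latex: str
    markdown: str
    mathml: str = ""     # empty string when no math is detected
    regions: list[Region] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize to the JSON layout described in the scenario."""
        body = {
            "latex": self.latex,
            "markdown": self.markdown,
            "mathml": self.mathml,
            "layout_info": {"regions": [asdict(r) for r in self.regions]},
        }
        return json.dumps(body)
```

For example, `OcrResponse(latex="x^2", markdown="$x^2$", regions=[Region("formula", [10, 20, 110, 60], 0.95)]).to_json()` yields a body whose `mathml` field is an empty string and whose `layout_info.regions` list holds one entry.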
