doc_processer/openspec/changes/add-doc-processing-api/specs/image-ocr/spec.md
liuyuanchuang 874fd383cc init repo
2025-12-29 17:34:58 +08:00

ADDED Requirements

Requirement: Image Input Acceptance

The system SHALL accept images via the POST /api/v1/image/ocr endpoint, supplied as exactly one of:

  • image_url: A publicly accessible URL to the image
  • image_base64: Base64-encoded image data

The system SHALL return an error if neither input is provided or if both are provided simultaneously.

Scenario: Image URL provided

  • WHEN a valid image_url is provided in the request body
  • THEN the system SHALL download the image and process it
  • AND return OCR results in the response

Scenario: Base64 image provided

  • WHEN a valid image_base64 string is provided in the request body
  • THEN the system SHALL decode the image and process it
  • AND return OCR results in the response

Scenario: Invalid input

  • WHEN neither image_url nor image_base64 is provided, or both are provided
  • THEN the system SHALL return HTTP 422 with a validation error
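
The input rule above (exactly one of image_url or image_base64) can be sketched as a small validator. This is a minimal illustration, not the project's implementation: the names validate_image_input and ValidationError are hypothetical, and a real service would probably express the same check in its request-model layer so that the framework itself produces the HTTP 422 response.

```python
import base64
import binascii
from typing import Optional


class ValidationError(ValueError):
    """Hypothetical error type; a web layer would map it to HTTP 422."""


def validate_image_input(image_url: Optional[str] = None,
                         image_base64: Optional[str] = None) -> str:
    """Return "url" or "base64" depending on which single input was supplied."""
    has_url = bool(image_url)
    has_b64 = bool(image_base64)
    if has_url == has_b64:
        # True in two cases: neither field is set, or both are set.
        raise ValidationError("provide exactly one of image_url or image_base64")
    if has_b64:
        try:
            # validate=True rejects characters outside the Base64 alphabet.
            base64.b64decode(image_base64, validate=True)
        except (binascii.Error, ValueError) as exc:
            raise ValidationError("image_base64 is not valid Base64") from exc
        return "base64"
    return "url"
```

The check only confirms that the payload is well-formed Base64; whether the decoded bytes are a readable image would still have to be verified when the image is loaded.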

Requirement: Image Preprocessing with Padding

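The system SHALL preprocess all input images by adding white padding around the image borders using OpenCV, expanding the canvas by 30% of the longer image side.

The padding calculation: padding = int(max(height, width) * 0.15) pixels on each of the four sides, so both width and height grow by 2 * padding (30% of the longer side).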

The padding color SHALL be white (RGB: 255, 255, 255).

Scenario: Image padding applied

  • WHEN an image of dimensions 1000x800 pixels is received
  • THEN the system SHALL add approximately 150 pixels of white padding on each side
  • AND the resulting image dimensions SHALL be approximately 1300x1100 pixels
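
The padding arithmetic in this requirement can be checked with a short sketch. The helper names (compute_padding, padded_shape, add_white_padding) are hypothetical. The OpenCV step uses cv2.copyMakeBorder with a constant white border, which is one plausible way to meet the requirement rather than the confirmed implementation; cv2 is imported lazily so the arithmetic helpers run without OpenCV installed.

```python
from typing import Tuple

PADDING_RATIO = 0.15  # per-side ratio taken from the requirement above


def compute_padding(height: int, width: int, ratio: float = PADDING_RATIO) -> int:
    """Padding in pixels per side: int(max(height, width) * ratio)."""
    return int(max(height, width) * ratio)


def padded_shape(height: int, width: int, ratio: float = PADDING_RATIO) -> Tuple[int, int]:
    """Return (height, width) after padding all four sides equally."""
    pad = compute_padding(height, width, ratio)
    return height + 2 * pad, width + 2 * pad


def add_white_padding(image):
    """Pad an image array (as loaded by OpenCV) with a white border."""
    import cv2  # lazy import: only needed when an image is actually padded

    height, width = image.shape[:2]
    pad = compute_padding(height, width)
    return cv2.copyMakeBorder(
        image, pad, pad, pad, pad,
        cv2.BORDER_CONSTANT,
        value=(255, 255, 255),  # white is identical in RGB and BGR order
    )


# Scenario from the spec: a 1000x800 (width x height) image.
print(compute_padding(800, 1000))  # 150
print(padded_shape(800, 1000))     # (1100, 1300)
```

Because the padding is derived from the longer side, the shorter side grows by more than 30% of its own length (800 to 1100 in the scenario).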

Requirement: Layout Detection with DocLayout-YOLO

The system SHALL use the DocLayout-YOLO model to detect document layout regions including:

  • Plain text blocks
  • Formulas/equations
  • Tables
  • Figures

The model SHALL be loaded from a pre-configured local path (not downloaded at runtime).

Scenario: Layout detection success

  • WHEN a padded image is passed to DocLayout-YOLO
  • THEN the system SHALL return detected regions with bounding boxes and class labels
  • AND confidence scores for each detection

Scenario: Model not available

  • WHEN the DocLayout-YOLO model file is not found at the configured path
  • THEN the system SHALL fail startup with a clear error message
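
The "Model not available" scenario amounts to a fail-fast check at startup. A minimal sketch, assuming the configured path points to a single weights file; require_model_file and ModelNotFoundError are hypothetical names, and loading the model itself is deliberately left out:

```python
from pathlib import Path


class ModelNotFoundError(RuntimeError):
    """Hypothetical startup error raised when the model file is missing."""


def require_model_file(model_path: str) -> Path:
    """Return the model path if it is an existing file, else abort startup."""
    path = Path(model_path)
    if not path.is_file():
        raise ModelNotFoundError(
            f"DocLayout-YOLO model file not found at '{path}'; "
            "point the configured model path at an existing local file"
        )
    return path
```

Calling this once during application startup, before the server begins accepting requests, means a misconfigured path surfaces immediately instead of on the first OCR request.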

Requirement: OCR Processing with PaddleOCR-VL

The system SHALL send images to PaddleOCR-VL (via a vLLM backend) for text and formula recognition.

PaddleOCR-VL SHALL be configured with PP-DocLayoutV2 for document layout understanding.

The system SHALL handle both plain text and formula/math content.

Scenario: Plain text recognition

  • WHEN DocLayout-YOLO detects plain text regions
  • THEN the system SHALL send the image to PaddleOCR-VL
  • AND return recognized text content

Scenario: Formula recognition

  • WHEN DocLayout-YOLO detects formula/equation regions
  • THEN the system SHALL send the image to PaddleOCR-VL
  • AND return formula content in LaTeX format

Scenario: Mixed content handling

  • WHEN DocLayout-YOLO detects both text and formula regions
  • THEN the system SHALL process all regions via PaddleOCR-VL with PP-DocLayoutV2
  • AND return combined results preserving document structure

Scenario: PaddleOCR-VL service unavailable

  • WHEN the PaddleOCR-VL vLLM server is unreachable
  • THEN the system SHALL return HTTP 503 with a service-unavailable error
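
The unavailable-backend scenario can be sketched as a thin wrapper that turns connection failures into a 503 result. This is an illustration under assumptions: call_ocr_backend, the send callable, and the error payload shape are all hypothetical, and the exception list covers only standard-library errors, so the exception types of whichever HTTP client is actually used would need to be added.

```python
import urllib.error
from typing import Any, Callable, Dict, Tuple


def call_ocr_backend(send: Callable[[], Dict[str, Any]]) -> Tuple[int, Dict[str, Any]]:
    """Run the backend request and return (http_status, response_body)."""
    try:
        return 200, send()
    except (ConnectionError, TimeoutError, urllib.error.URLError) as exc:
        # The vLLM server could not be reached: report 503 to the caller.
        return 503, {
            "error": "service_unavailable",
            "detail": f"PaddleOCR-VL backend unreachable: {exc}",
        }


def unreachable_backend() -> Dict[str, Any]:
    """Stand-in for a request to a server that is down."""
    raise ConnectionRefusedError("connection refused")


status, body = call_ocr_backend(unreachable_backend)
print(status)  # 503
```

Passing the request in as a callable keeps the mapping logic independent of the HTTP client and easy to exercise without a running vLLM server.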

Requirement: Multi-Format Output

The system SHALL return OCR results in multiple formats:

  • latex: LaTeX representation of the content
  • markdown: Markdown representation of the content
  • mathml: MathML representation for mathematical content

Scenario: Successful OCR response

  • WHEN image processing completes successfully
  • THEN the response SHALL include:
    • latex: string containing LaTeX output
    • markdown: string containing Markdown output
    • mathml: string containing MathML output (empty string if no math detected)
  • AND HTTP status code SHALL be 200

Scenario: Response structure

  • WHEN the OCR endpoint returns successfully
  • THEN the response body SHALL be JSON with structure:
{
  "latex": "...",
  "markdown": "...",
  "mathml": "...",
  "layout_info": {
    "regions": [
      {"type": "text|formula|table|figure", "bbox": [x1, y1, x2, y2], "confidence": 0.95}
    ]
  }
}
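
The response structure above can be assembled and sanity-checked with a small builder. This is a sketch only: build_ocr_response and its validation rules are hypothetical, and it reads "text|formula|table|figure" in the structure above as meaning exactly one of those four values.

```python
import json
from typing import Any, Dict, List, Optional

ALLOWED_REGION_TYPES = {"text", "formula", "table", "figure"}


def build_ocr_response(latex: str, markdown: str, mathml: str = "",
                       regions: Optional[List[Dict[str, Any]]] = None) -> Dict[str, Any]:
    """Build the response body; mathml stays an empty string when no math is found."""
    checked = []
    for region in regions or []:
        if region["type"] not in ALLOWED_REGION_TYPES:
            raise ValueError(f"unknown region type: {region['type']!r}")
        if len(region["bbox"]) != 4:
            raise ValueError("bbox must be [x1, y1, x2, y2]")
        checked.append({
            "type": region["type"],
            "bbox": [float(v) for v in region["bbox"]],
            "confidence": float(region["confidence"]),
        })
    return {
        "latex": latex,
        "markdown": markdown,
        "mathml": mathml,
        "layout_info": {"regions": checked},
    }


example = build_ocr_response(
    latex="E = mc^2",
    markdown="$E = mc^2$",
    regions=[{"type": "formula", "bbox": [10, 20, 200, 60], "confidence": 0.95}],
)
print(json.dumps(example, indent=2))
```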