init repo

2025-12-29 17:34:58 +08:00
commit 874fd383cc
36 changed files with 2641 additions and 0 deletions
--- a/openspec/changes/add-doc-processing-api/design.md
+++ b/openspec/changes/add-doc-processing-api/design.md
@@ -0,0 +1,107 @@
+## Context
+
+This is the initial implementation of the DocProcesser service. The system integrates multiple external models and services:
+
+- DocLayout-YOLO for document layout analysis
+- PaddleOCR-VL with PP-DocLayoutV2 for text and formula recognition (deployed via vLLM)
+- markdown_2_docx for document conversion
+
+Target deployment: Ubuntu machine with RTX 5080 GPU (16GB VRAM), Python 3.11.0.
+
+## Goals / Non-Goals
+
+**Goals:**
+
+- Clean FastAPI project structure following best practices
+- Image preprocessing with OpenCV (30% padding)
+- Layout-aware OCR routing using DocLayout-YOLO
+- Text and formula recognition via PaddleOCR-VL
+- Markdown to DOCX conversion
+- GPU-enabled Docker deployment
+
+**Non-Goals:**
+
+- Authentication/authorization (can be added later)
+- Rate limiting
+- Persistent storage
+- Training or fine-tuning models
+
+## Decisions
+
+### Project Structure
+
+Follow FastAPI best practices with modular organization:
+
+```
+app/
+├── api/
+│   └── v1/
+│       ├── endpoints/
+│       │   ├── image.py      # Image OCR endpoint
+│       │   └── convert.py    # Markdown to DOCX endpoint
+│       └── router.py
+├── core/
+│   └── config.py             # Settings and environment config
+├── model/
+│   ├── DocLayout
+│   └── PP-DocLayout
+├── services/
+│   ├── image_processor.py    # OpenCV preprocessing
+│   ├── layout_detector.py    # DocLayout-YOLO wrapper
+│   ├── ocr_service.py        # PaddleOCR-VL client
+│   └── docx_converter.py     # markdown_2_docx wrapper
+├── schemas/
+│   ├── image.py              # Request/response models for image OCR
+│   └── convert.py            # Request/response models for conversion
+└── main.py                   # FastAPI app initialization
+```
+
+**Rationale:** Separation of concerns between API layer, business logic (services), and data models (schemas).
+
+### Image Preprocessing
+
+- Use OpenCV `cv2.copyMakeBorder()` to add 30% whitespace padding
+- Padding color: white `[255, 255, 255]`
+- This matches DocLayout-YOLO's demo.py pattern
+
+### Layout Detection Flow
+
+1. DocLayout-YOLO detects layout regions (plain text, formulas, tables, figures)
+2. If any plain-text region exists, route the image to PaddleOCR-VL with PP-DocLayoutV2; otherwise, route it to PaddleOCR-VL with a formula-recognition prompt
+3. PaddleOCR-VL combined with PP-DocLayoutV2 handles mixed-content recognition internally; PaddleOCR-VL with the prompt handles formulas
+
+### External Service Integration
+
+- PaddleOCR-VL: Connect to vLLM server at configurable URL (default: `http://localhost:8080/v1`)
+- DocLayout-YOLO: Load model from pre-downloaded path (not downloaded in container)
+
+### Docker Strategy
+
+- Base image: NVIDIA CUDA with Python 3.11
+- Pre-install OpenCV dependencies (`libgl1-mesa-glx`, `libglib2.0-0`)
+- Mount model directory for DocLayout-YOLO weights
+- Expose port 8053
+- Use Uvicorn with multiple workers
+
+## Risks / Trade-offs
+
+| Risk                              | Mitigation                                                         |
+| --------------------------------- | ------------------------------------------------------------------ |
+| PaddleOCR-VL service unavailable  | Health check endpoint, retry logic with exponential backoff        |
+| Large image memory consumption    | Configure max image size, resize before processing                 |
+| DocLayout-YOLO model loading time | Load model once at startup, keep in memory                         |
+| GPU memory contention             | DocLayout-YOLO uses GPU; PaddleOCR-VL runs on separate vLLM server |
+
+## Configuration
+
+Environment variables:
+
+- `PADDLEOCR_VL_URL`: vLLM server URL (default: `http://localhost:8080/v1`)
+- `DOCLAYOUT_MODEL_PATH`: Path to DocLayout-YOLO weights
+- `PP_DOCLAYOUT_MODEL_DIR`: Path to PP-DocLayoutV2 model directory
+- `MAX_IMAGE_SIZE_MB`: Maximum upload size (default: 10)
+
+## Open Questions
+
+- Should we add async queue for large batch processing? (Defer to future change)
+- Do we need WebSocket for progress updates? (Defer to future change)
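Review note on design.md, "Layout Detection Flow": the routing rule can be sketched as a small pure function. This is a minimal sketch, not the committed implementation. The label `plain_text` and the mode names `mixed_recognition` and `formula_recognition` are taken from proposal.md; the function name and the assumption that DocLayout-YOLO results are reduced to a list of label strings are hypothetical.

```python
from typing import Iterable

# Label name as used in proposal.md; the actual DocLayout-YOLO class
# labels may be spelled differently (assumption).
PLAIN_TEXT_LABEL = "plain_text"


def choose_recognition_mode(labels: Iterable[str]) -> str:
    """Pick the OCR route from the detected layout labels.

    Any plain-text region sends the whole image to PaddleOCR-VL with
    PP-DocLayoutV2 ("mixed_recognition"); otherwise the image goes to
    PaddleOCR-VL with the formula prompt ("formula_recognition").
    """
    if any(label == PLAIN_TEXT_LABEL for label in labels):
        return "mixed_recognition"
    return "formula_recognition"
```

Under this reading of "otherwise", an image with no detected regions also falls through to `formula_recognition`; the design does not say whether that case should be handled separately.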
--- a/openspec/changes/add-doc-processing-api/proposal.md
+++ b/openspec/changes/add-doc-processing-api/proposal.md
@@ -0,0 +1,31 @@
+# Change: Add Document Processing API
+
+## Why
+
+DocProcesser needs a FastAPI backend to accept images (via URL or base64) and convert them to LaTeX/Markdown/MathML, plus a markdown-to-DOCX conversion endpoint. This establishes the core functionality of the project.
+
+## What Changes
+
+- **BREAKING**: Initial project setup (new FastAPI project structure)
+- Add image-to-OCR API endpoint (`POST /doc_process/v1/image/ocr`)
+  - Accept `image_url` or `image_base64` input
+  - Preprocess with OpenCV (30% whitespace padding)
+  - Use DocLayout-YOLO for layout detection
+  - Route to PaddleOCR-VL (with PP-DocLayoutV2) for text/formula recognition
+  - If a `plain_text` element exists, use PP-DocLayoutV2 to recognize the image as `mixed_recognition`; otherwise, call the PaddleOCR-VL API directly with the Formula Recognition prompt as `formula_recognition`.
+  - Following the markdown_2_docx code, convert the Markdown to LaTeX and MathML for `mixed_recognition`, and convert the LaTeX to Markdown and MathML for `formula_recognition`
+  - Return LaTeX, Markdown, and MathML outputs
+- Add markdown-to-DOCX API endpoint (`POST /doc_process/v1/convert/docx`)
+  - Accept markdown content
+  - Use the markdown_2_docx library as the reference for conversion; its address is http://github.com/YogeLiu/markdown_2_docxdd.
+  - Return DOCX file
+- Add Dockerfile for GPU-enabled deployment (RTX 5080, port 8053)
+
+## Impact
+
+- Affected specs: `image-ocr`, `markdown-docx`
+- Affected code: New project structure under `app/`
+- External dependencies:
+  - DocLayout-YOLO (pre-downloaded model, not fetched in container)
+  - PaddleOCR-VL with vLLM backend (external service at localhost:8080)
+  - markdown_2_docx library
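Review note on proposal.md: the endpoint accepts `image_url` or `image_base64`, and the image-ocr spec in this change requires an error when neither or both are supplied. A minimal sketch of that exactly-one-of check follows. The function name and return values are hypothetical; in the actual service this would more likely live in a Pydantic model validator so that a failure surfaces as HTTP 422.

```python
from typing import Optional


def select_image_source(
    image_url: Optional[str] = None,
    image_base64: Optional[str] = None,
) -> str:
    """Return which input to use, enforcing that exactly one is present."""
    has_url = bool(image_url)
    has_b64 = bool(image_base64)
    # Equal flags mean either both inputs or neither input was supplied.
    if has_url == has_b64:
        raise ValueError("Provide exactly one of image_url or image_base64")
    return "url" if has_url else "base64"
```

Empty strings are treated the same as missing values here, which is an assumption the spec does not state either way.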
--- a/openspec/changes/add-doc-processing-api/specs/image-ocr/spec.md
+++ b/openspec/changes/add-doc-processing-api/specs/image-ocr/spec.md
@@ -0,0 +1,137 @@
+## ADDED Requirements
+
+### Requirement: Image Input Acceptance
+
+The system SHALL accept images via `POST /api/v1/image/ocr` endpoint with either:
+
+- `image_url`: A publicly accessible URL to the image
+- `image_base64`: Base64-encoded image data
+
+The system SHALL return an error if neither input is provided or if both are provided simultaneously.
+
+#### Scenario: Image URL provided
+
+- **WHEN** a valid `image_url` is provided in the request body
+- **THEN** the system SHALL download the image and process it
+- **AND** return OCR results in the response
+
+#### Scenario: Base64 image provided
+
+- **WHEN** a valid `image_base64` string is provided in the request body
+- **THEN** the system SHALL decode the image and process it
+- **AND** return OCR results in the response
+
+#### Scenario: Invalid input
+
+- **WHEN** neither `image_url` nor `image_base64` is provided, or both are provided
+- **THEN** the system SHALL return HTTP 422 with validation error
+
+---
+
+### Requirement: Image Preprocessing with Padding
+
+The system SHALL preprocess all input images by adding 30% whitespace padding around the image borders using OpenCV.
+
+The padding is calculated as `padding = int(max(height, width) * 0.15)` on each side, so the two sides together add 30% of the longer dimension.
+
+The padding color SHALL be white (`RGB: 255, 255, 255`).
+
+#### Scenario: Image padding applied
+
+- **WHEN** an image of dimensions 1000x800 pixels is received
+- **THEN** the system SHALL add approximately 150 pixels of white padding on each side
+- **AND** the resulting image dimensions SHALL be approximately 1300x1100 pixels
+
+---
+
+### Requirement: Layout Detection with DocLayout-YOLO
+
+The system SHALL use DocLayout-YOLO model to detect document layout regions including:
+
+- Plain text blocks
+- Formulas/equations
+- Tables
+- Figures
+
+The model SHALL be loaded from a pre-configured local path (not downloaded at runtime).
+
+#### Scenario: Layout detection success
+
+- **WHEN** a padded image is passed to DocLayout-YOLO
+- **THEN** the system SHALL return detected regions with bounding boxes and class labels
+- **AND** confidence scores for each detection
+
+#### Scenario: Model not available
+
+- **WHEN** the DocLayout-YOLO model file is not found at the configured path
+- **THEN** the system SHALL fail startup with a clear error message
+
+---
+
+### Requirement: OCR Processing with PaddleOCR-VL
+
+The system SHALL send images to PaddleOCR-VL (via vLLM backend) for text and formula recognition.
+
+PaddleOCR-VL SHALL be configured with PP-DocLayoutV2 for document layout understanding.
+
+The system SHALL handle both plain text and formula/math content.
+
+#### Scenario: Plain text recognition
+
+- **WHEN** DocLayout-YOLO detects plain text regions
+- **THEN** the system SHALL send the image to PaddleOCR-VL
+- **AND** return recognized text content
+
+#### Scenario: Formula recognition
+
+- **WHEN** DocLayout-YOLO detects formula/equation regions
+- **THEN** the system SHALL send the image to PaddleOCR-VL
+- **AND** return formula content in LaTeX format
+
+#### Scenario: Mixed content handling
+
+- **WHEN** DocLayout-YOLO detects both text and formula regions
+- **THEN** the system SHALL process all regions via PaddleOCR-VL with PP-DocLayoutV2
+- **AND** return combined results preserving document structure
+
+#### Scenario: PaddleOCR-VL service unavailable
+
+- **WHEN** the PaddleOCR-VL vLLM server is unreachable
+- **THEN** the system SHALL return HTTP 503 with service unavailable error
+
+---
+
+### Requirement: Multi-Format Output
+
+The system SHALL return OCR results in multiple formats:
+
+- `latex`: LaTeX representation of the content
+- `markdown`: Markdown representation of the content
+- `mathml`: MathML representation for mathematical content
+
+#### Scenario: Successful OCR response
+
+- **WHEN** image processing completes successfully
+- **THEN** the response SHALL include:
+  - `latex`: string containing LaTeX output
+  - `markdown`: string containing Markdown output
+  - `mathml`: string containing MathML output (empty string if no math detected)
+- **AND** HTTP status code SHALL be 200
+
+#### Scenario: Response structure
+
+- **WHEN** the OCR endpoint returns successfully
+- **THEN** the response body SHALL be JSON with structure:
+
+```json
+{
+  "latex": "...",
+  "markdown": "...",
+  "mathml": "...",
+  "layout_info": {
+    "regions": [
+      {"type": "text|formula|table|figure", "bbox": [x1, y1, x2, y2], "confidence": 0.95}
+    ]
+  }
+}
+```
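Review note on the image-ocr spec, "Image Preprocessing with Padding": the padding rule and its worked scenario can be checked with a few lines of arithmetic. This is a sketch using only the formula stated in the spec; the function names are hypothetical, and the OpenCV call appears only in a comment because the spec does not fix its exact arguments.

```python
from typing import Tuple

PADDING_RATIO = 0.15  # per side, as stated in the spec


def compute_padding(height: int, width: int) -> int:
    """Padding in pixels for each side: int(max(height, width) * 0.15)."""
    return int(max(height, width) * PADDING_RATIO)


def padded_size(height: int, width: int) -> Tuple[int, int]:
    """Image size (height, width) after padding all four sides equally."""
    pad = compute_padding(height, width)
    return height + 2 * pad, width + 2 * pad


# With OpenCV, the same padding could be applied roughly as (assumption):
#   cv2.copyMakeBorder(img, pad, pad, pad, pad,
#                      cv2.BORDER_CONSTANT, value=(255, 255, 255))
```

For the 1000x800 scenario, `compute_padding(800, 1000)` gives 150 and `padded_size(800, 1000)` gives `(1100, 1300)`, matching the 1300x1100 result in the spec. Because the padding is derived from the longer side, the shorter side grows by more than 30% of its own length.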
--- a/openspec/changes/add-doc-processing-api/specs/markdown-docx/spec.md
+++ b/openspec/changes/add-doc-processing-api/specs/markdown-docx/spec.md
@@ -0,0 +1,93 @@
+## ADDED Requirements
+
+### Requirement: Markdown Input Acceptance
+
+The system SHALL accept markdown content via `POST /api/v1/convert/docx` endpoint.
+
+The request body SHALL contain:
+- `markdown`: string containing the markdown content to convert
+
+#### Scenario: Valid markdown provided
+
+- **WHEN** valid markdown content is provided in the request body
+- **THEN** the system SHALL process and convert it to DOCX format
+
+#### Scenario: Empty markdown
+
+- **WHEN** an empty `markdown` string is provided
+- **THEN** the system SHALL return HTTP 422 with validation error
+
+---
+
+### Requirement: DOCX Conversion
+
+The system SHALL convert markdown content to DOCX format using the markdown_2_docx library.
+
+The conversion SHALL preserve:
+- Headings (H1-H6)
+- Paragraphs
+- Bold and italic formatting
+- Lists (ordered and unordered)
+- Code blocks
+- Tables
+- Images (if embedded as base64 or accessible URLs)
+
+#### Scenario: Basic markdown conversion
+
+- **WHEN** markdown with headings, paragraphs, and formatting is provided
+- **THEN** the system SHALL generate a valid DOCX file
+- **AND** the DOCX SHALL preserve the document structure
+
+#### Scenario: Complex markdown with tables
+
+- **WHEN** markdown containing tables is provided
+- **THEN** the system SHALL convert tables to Word table format
+- **AND** preserve table structure and content
+
+#### Scenario: Markdown with math formulas
+
+- **WHEN** markdown containing LaTeX math expressions is provided
+- **THEN** the system SHALL convert math to OMML (Office Math Markup Language) format
+- **AND** render correctly in Microsoft Word
+
+---
+
+### Requirement: DOCX File Response
+
+The system SHALL return the generated DOCX file as a binary download.
+
+The response SHALL include:
+- Content-Type: `application/vnd.openxmlformats-officedocument.wordprocessingml.document`
+- Content-Disposition: `attachment; filename="output.docx"`
+
+#### Scenario: Successful conversion response
+
+- **WHEN** markdown conversion completes successfully
+- **THEN** the response SHALL be the DOCX file binary
+- **AND** HTTP status code SHALL be 200
+- **AND** appropriate headers for file download SHALL be set
+
+#### Scenario: Custom filename
+
+- **WHEN** an optional `filename` parameter is provided in the request
+- **THEN** the Content-Disposition header SHALL use the provided filename
+- **AND** append `.docx` extension if not present
+
+---
+
+### Requirement: Error Handling
+
+The system SHALL provide clear error responses for conversion failures.
+
+#### Scenario: Conversion failure
+
+- **WHEN** markdown_2_docx fails to convert the content
+- **THEN** the system SHALL return HTTP 500 with error details
+- **AND** the error message SHALL describe the failure reason
+
+#### Scenario: Malformed markdown
+
+- **WHEN** severely malformed markdown is provided
+- **THEN** the system SHALL attempt best-effort conversion
+- **AND** log a warning about potential formatting issues
+
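Review note on the markdown-docx spec, "DOCX File Response": the default and custom filename scenarios can be sketched as below. The helper names are hypothetical, and the sketch assumes the filename needs no further sanitization (quotes, path separators, or non-ASCII characters would need extra handling in a real header).

```python
from typing import Optional

DEFAULT_NAME = "output"  # matches the default filename="output.docx" in the spec


def docx_filename(filename: Optional[str] = None) -> str:
    """Return the download filename, appending .docx if it is missing."""
    name = (filename or "").strip() or DEFAULT_NAME
    if not name.lower().endswith(".docx"):
        name += ".docx"
    return name


def content_disposition(filename: Optional[str] = None) -> str:
    """Build the Content-Disposition header value for the DOCX download."""
    return f'attachment; filename="{docx_filename(filename)}"'
```

Calling `content_disposition()` with no argument yields the header value given in the spec, and `docx_filename("report")` yields `report.docx`.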
--- a/openspec/changes/add-doc-processing-api/tasks.md
+++ b/openspec/changes/add-doc-processing-api/tasks.md
@@ -0,0 +1,34 @@
+## 1. Project Scaffolding
+
+- [x] 1.1 Create FastAPI project structure (`app/`, `api/`, `core/`, `services/`, `schemas/`)
+- [x] 1.2 Use uv to manage dependencies (fastapi, uvicorn, opencv-python, python-multipart, pydantic, httpx)
+- [x] 1.3 Create `app/main.py` with FastAPI app initialization
+- [x] 1.4 Create `app/core/config.py` with Pydantic Settings
+
+## 2. Image OCR API
+
+- [x] 2.1 Create request/response schemas in `app/schemas/image.py`
+- [x] 2.2 Implement image preprocessing service with OpenCV padding (`app/services/image_processor.py`)
+- [x] 2.3 Implement DocLayout-YOLO wrapper (`app/services/layout_detector.py`)
+- [x] 2.4 Implement PaddleOCR-VL client (`app/services/ocr_service.py`)
+- [x] 2.5 Create image OCR endpoint (`app/api/v1/endpoints/image.py`)
+- [x] 2.6 Wire up router and test endpoint
+
+## 3. Markdown to DOCX API
+
+- [x] 3.1 Create request/response schemas in `app/schemas/convert.py`
+- [x] 3.2 Integrate markdown_2_docx library (`app/services/docx_converter.py`)
+- [x] 3.3 Create conversion endpoint (`app/api/v1/endpoints/convert.py`)
+- [x] 3.4 Wire up router and test endpoint
+
+## 4. Deployment
+
+- [x] 4.1 Create Dockerfile with CUDA base image for RTX 5080
+- [x] 4.2 Create docker-compose.yml (optional, for local development)
+- [x] 4.3 Document deployment steps in README
+
+## 5. Validation
+
+- [ ] 5.1 Test image OCR endpoint with sample images
+- [ ] 5.2 Test markdown to DOCX conversion
+- [ ] 5.3 Verify Docker build and GPU access