# DocProcesser

Document processing API built with FastAPI. Converts images to LaTeX/Markdown/MathML and Markdown to DOCX.

## Features

- **Image OCR API** (`POST /doc_process/v1/image/ocr`)
  - Accept images via URL or base64
  - Automatic layout detection using DocLayout-YOLO
  - Text and formula recognition via PaddleOCR-VL
  - Output in LaTeX, Markdown, and MathML formats

- **Markdown to DOCX API** (`POST /doc_process/v1/convert/docx`)
  - Convert markdown content to Word documents
  - Preserve formatting, tables, and code blocks

## Prerequisites

- Python 3.11+
- NVIDIA GPU with CUDA support (RTX 5080 recommended)
- PaddleOCR-VL service running via vLLM (default: `http://localhost:8080/v1`)
- Pre-downloaded models:
  - DocLayout-YOLO
  - PP-DocLayoutV2

## Quick Start

### 1. Install Dependencies

Using [uv](https://github.com/astral-sh/uv):

```bash
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e .
```

### 2. Download Models

Download the required models and place them in the `models/` directory:

```bash
mkdir -p models/DocLayout models/PP-DocLayout

# DocLayout-YOLO (from HuggingFace)
# https://huggingface.co/juliozhao/DocLayout-YOLO-DocStructBench
# Place the .pt file in models/DocLayout/

# PP-DocLayoutV2 (from PaddlePaddle)
# Place the model files in models/PP-DocLayout/
```

### 3. Configure Environment

Create a `.env` file:

```bash
# PaddleOCR-VL vLLM server URL
PADDLEOCR_VL_URL=http://localhost:8080/v1

# Model paths
DOCLAYOUT_MODEL_PATH=models/DocLayout/doclayout_yolo_docstructbench_imgsz1024.pt
PP_DOCLAYOUT_MODEL_DIR=models/PP-DocLayout

# Server settings
HOST=0.0.0.0
PORT=8053
```

### 4. Run the Server

```bash
uvicorn app.main:app --host 0.0.0.0 --port 8053
```

## Docker Deployment

### Build and Run with GPU

```bash
# Build the image
docker build -t doc-processer .

# Run with GPU support
docker run --gpus all -p 8053:8053 \
  -v ./models/DocLayout:/app/models/DocLayout:ro \
  -v ./models/PP-DocLayout:/app/models/PP-DocLayout:ro \
  -e PADDLEOCR_VL_URL=http://host.docker.internal:8080/v1 \
  doc-processer
```

### Using Docker Compose

```bash
# Start the service with GPU
docker-compose up -d doc-processer

# Or without GPU (CPU mode)
docker-compose --profile cpu up -d doc-processer-cpu
```

## API Usage

### Image OCR

```bash
# Using image URL
curl -X POST http://localhost:8053/doc_process/v1/image/ocr \
  -H "Content-Type: application/json" \
  -d '{"image_url": "https://example.com/document.png"}'

# Using base64 image
curl -X POST http://localhost:8053/doc_process/v1/image/ocr \
  -H "Content-Type: application/json" \
  -d '{"image_base64": "iVBORw0KGgo..."}'
```

Response:

```json
{
  "latex": "\\section{Title}...",
  "markdown": "# Title\n...",
  "mathml": "...",
  "layout_info": {
    "regions": [
      {"type": "text", "bbox": [10, 20, 100, 50], "confidence": 0.95}
    ],
    "has_plain_text": true,
    "has_formula": false
  },
  "recognition_mode": "mixed_recognition"
}
```

### Markdown to DOCX

```bash
curl -X POST http://localhost:8053/doc_process/v1/convert/docx \
  -H "Content-Type: application/json" \
  -d '{"markdown": "# Hello World\n\nThis is a test.", "filename": "output"}' \
  --output output.docx
```

## Project Structure

```
doc_processer/
├── app/
│   ├── api/v1/
│   │   ├── endpoints/
│   │   │   ├── image.py          # Image OCR endpoint
│   │   │   └── convert.py        # Markdown to DOCX endpoint
│   │   └── router.py
│   ├── core/
│   │   ├── config.py             # Settings
│   │   └── dependencies.py       # DI providers
│   ├── services/
│   │   ├── image_processor.py    # OpenCV preprocessing
│   │   ├── layout_detector.py    # DocLayout-YOLO
│   │   ├── ocr_service.py        # PaddleOCR-VL client
│   │   └── docx_converter.py     # Markdown to DOCX
│   ├── schemas/
│   │   ├── image.py
│   │   └── convert.py
│   └── main.py
├── models/                       # Pre-downloaded models (git-ignored)
├── Dockerfile
├── docker-compose.yml
├── pyproject.toml
└── README.md
```

## Processing Pipeline

### Image OCR Flow

1. **Input**: Accept `image_url` or `image_base64`
2. **Preprocessing**: Add 30% whitespace padding using OpenCV
3. **Layout Detection**: DocLayout-YOLO detects regions (text, formula, table, figure)
4. **Recognition**:
   - If plain text detected → PP-DocLayoutV2 for mixed content recognition
   - Otherwise → PaddleOCR-VL with formula prompt
5. **Output Conversion**: Generate LaTeX, Markdown, and MathML

## Hardware Requirements

- **Minimum**: 8GB GPU VRAM
- **Recommended**: RTX 5080 16GB or equivalent
- **CPU**: 4+ cores
- **RAM**: 16GB+

## License

MIT
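
## Example: Python Client

The Image OCR endpoint described under API Usage can also be called from Python. The following is a minimal sketch using only the standard library, not part of the project itself. The function names `build_ocr_request` and `ocr_image` are hypothetical, the 60-second timeout is an arbitrary choice, and the sketch assumes `image_base64` takes plain base64 without a data-URI prefix, as the curl example suggests.

```python
import base64
import json
import urllib.request

# Endpoint path and default port are taken from this README;
# adjust them if your deployment differs.
OCR_PATH = "/doc_process/v1/image/ocr"
DEFAULT_BASE_URL = "http://localhost:8053"


def build_ocr_request(
    image_bytes: bytes, base_url: str = DEFAULT_BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the image as base64."""
    # Assumption: the API expects raw base64 text in the "image_base64" field.
    payload = {"image_base64": base64.b64encode(image_bytes).decode("ascii")}
    return urllib.request.Request(
        url=base_url.rstrip("/") + OCR_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ocr_image(
    image_bytes: bytes, base_url: str = DEFAULT_BASE_URL, timeout: float = 60.0
) -> dict:
    """Send the request and return the parsed JSON response body."""
    request = build_ocr_request(image_bytes, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, reading a local file as bytes and passing it to `ocr_image(...)` should return a dictionary with the `latex`, `markdown`, and `mathml` keys shown in the sample response. Keeping request construction separate from sending makes the payload easy to inspect without a live server.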