fix: layout detection & format conversion robustness

Three targeted fixes for layout processing issues:

1. formula_number type mapping (layout_detector.py)
   - Changed formula_number region type from "formula" to "text"
   - Ensures the region is processed with the Text Recognition prompt, preventing $$-wrapped output
   - Prevents malformed \tag{$$...\n$$} in merged formulas

2. Reading order (ocr_service.py)
   - Sort layout regions by (y1, x1) after detection
   - Ensures top-to-bottom, left-to-right processing order
   - Fixes paragraph ordering issues in output

3. Formula number cleaning (glm_postprocess.py)
   - clean_formula_number() now strips $$, $, \[...\], and \(...\) delimiters
   - Handles edge case where vLLM still returns math-mode wrapped content
   - Prevents delimiter leakage into \tag{} placeholders
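
The reading-order fix in item 2 can be sketched as follows. This is a minimal illustration, not the code in ocr_service.py: the `Region` dataclass and the `sort_reading_order` name are hypothetical, and it assumes each region carries a bounding box whose top-left corner is (x1, y1) in pixel coordinates.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    """Hypothetical layout region; only the fields needed for sorting."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def sort_reading_order(regions: List[Region]) -> List[Region]:
    """Return regions sorted top-to-bottom, then left-to-right."""
    return sorted(regions, key=lambda r: (r.y1, r.x1))


# Detector output in arbitrary order (made-up coordinates).
regions = [
    Region("text", x1=300, y1=50, x2=500, y2=90),
    Region("formula", x1=40, y1=200, x2=500, y2=260),
    Region("text", x1=40, y1=50, x2=280, y2=90),
]
ordered = sort_reading_order(regions)
order_summary = [(r.label, r.x1, r.y1) for r in ordered]
```

A plain (y1, x1) key treats regions whose top edges differ by even one pixel as separate rows, so it suits single-column pages best.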

Also adds logging:
- Warning when empty formula content is skipped
- Warning when region crop is too small (< 10×10 px)
- Warning when vLLM parallel call fails
- Warning when format conversion fails
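
The small-crop warning could look roughly like the sketch below. The helper name `is_crop_usable`, the `MIN_CROP_SIZE` constant, and the log message are hypothetical, and the sketch assumes a crop counts as too small when either side is under 10 px; the actual check in the changed files may differ.

```python
import logging
from typing import Tuple

logger = logging.getLogger(__name__)

# Assumed threshold, taken from the "< 10×10 px" note in the commit message.
MIN_CROP_SIZE = 10


def is_crop_usable(bbox: Tuple[int, int, int, int]) -> bool:
    """Return False and log a warning when the crop is too small to process."""
    x1, y1, x2, y2 = bbox
    width, height = x2 - x1, y2 - y1
    if width < MIN_CROP_SIZE or height < MIN_CROP_SIZE:
        logger.warning(
            "Skipping region %s: crop too small (%dx%d px)", bbox, width, height
        )
        return False
    return True
```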

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
liuyuanchuang
2026-03-09 17:57:05 +08:00
parent bd1c118cb2
commit cff14904bf
3 changed files with 39 additions and 6 deletions


@@ -13,8 +13,11 @@ Covers:
from __future__ import annotations
import logging
import re
import json
logger = logging.getLogger(__name__)
from collections import Counter
from copy import deepcopy
from typing import Any, Dict, List, Optional, Tuple
@@ -94,8 +97,18 @@ def clean_repeated_content(
def clean_formula_number(number_content: str) -> str:
    """Strip parentheses from a formula number string, e.g. '(1)' → '1'."""
    """Strip delimiters from a formula number string, e.g. '(1)' → '1'.
    Also strips math-mode delimiters ($$, $, \\[...\\]) that vLLM may add
    when the region is processed with a formula prompt.
    """
    s = number_content.strip()
    # Strip display math delimiters
    for start, end in [("$$", "$$"), (r"\[", r"\]"), ("$", "$"), (r"\(", r"\)")]:
        if s.startswith(start) and s.endswith(end) and len(s) > len(start) + len(end):
            s = s[len(start):-len(end)].strip()
            break
    # Strip CJK/ASCII parentheses
    if s.startswith("(") and s.endswith(")"):
        return s[1:-1]
    if s.startswith("（") and s.endswith("）"):
@@ -253,6 +266,9 @@ class GLMResultFormatter:
    if content.startswith(s) and content.endswith(e):
        content = content[len(s) : -len(e)].strip()
        break
if not content:
    logger.warning("Skipping formula region with empty content after stripping delimiters")
    return ""
content = "$$\n" + content + "\n$$"
# Text formatting
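
For reference, the behavior of the updated clean_formula_number() can be checked with a standalone sketch. The loop and the ASCII-parenthesis branch follow the hunk above; the body of the full-width-parenthesis branch and the final `return s` are assumptions, since the hunk is cut off before them.

```python
def clean_formula_number(number_content: str) -> str:
    """Strip math-mode delimiters, then one layer of parentheses."""
    s = number_content.strip()
    # Try each delimiter pair in order; "$$" must come before "$".
    for start, end in [("$$", "$$"), (r"\[", r"\]"), ("$", "$"), (r"\(", r"\)")]:
        if s.startswith(start) and s.endswith(end) and len(s) > len(start) + len(end):
            s = s[len(start):-len(end)].strip()
            break
    # ASCII parentheses
    if s.startswith("(") and s.endswith(")"):
        return s[1:-1]
    # Full-width (CJK) parentheses; this branch body is assumed
    if s.startswith("（") and s.endswith("）"):
        return s[1:-1]
    # Assumed fallthrough: return the string unchanged
    return s


examples = {
    "(1)": clean_formula_number("(1)"),
    "$$(2)$$": clean_formula_number("$$(2)$$"),
    r"\[ (3.1) \]": clean_formula_number(r"\[ (3.1) \]"),
    "$(4)$": clean_formula_number("$(4)$"),
    "（5）": clean_formula_number("（5）"),
}
```

Because of the length guard, a string consisting only of delimiters (for example "$$") is left untouched rather than reduced to an empty string.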