fix: layout detection & format conversion robustness

Three targeted fixes for layout processing issues:

1. formula_number type mapping (layout_detector.py)
   - Changed formula_number region type from "formula" to "text"
   - Ensures Text Recognition prompt, preventing $$-wrapped output
   - Prevents malformed \tag{$$...\n$$} in merged formulas

2. Reading order (ocr_service.py)
   - Sort layout regions by (y1, x1) after detection
   - Ensures top-to-bottom, left-to-right processing order
   - Fixes paragraph ordering issues in output

3. Formula number cleaning (glm_postprocess.py)
   - clean_formula_number() now strips $$, $, \[...\] delimiters
   - Handles edge case where vLLM still returns math-mode wrapped content
   - Prevents delimiter leakage into \tag{} placeholders

Also adds logging:
- Warning when empty formula content is skipped
- Warning when region crop is too small (< 10×10 px)
- Warning when vLLM parallel call fails
- Warning when format conversion fails

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
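
The reading-order fix in ocr_service.py can be illustrated with a minimal sketch. The region shape used here (a dict with a "bbox" list of [x1, y1, x2, y2]) and the helper name sort_regions_reading_order are assumptions for illustration, not the actual code in ocr_service.py:

```python
from typing import Any, Dict, List


def sort_regions_reading_order(regions: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return regions sorted top-to-bottom, then left-to-right.

    Hypothetical helper: each region is assumed to carry a "bbox" entry
    laid out as [x1, y1, x2, y2].
    """
    # Sort key is (y1, x1): the top edge first, the left edge as tie-breaker.
    return sorted(regions, key=lambda r: (r["bbox"][1], r["bbox"][0]))


regions = [
    {"type": "text", "bbox": [300, 40, 560, 80]},
    {"type": "text", "bbox": [20, 200, 560, 260]},
    {"type": "title", "bbox": [20, 40, 280, 80]},
]
ordered = sort_regions_reading_order(regions)
print([r["bbox"] for r in ordered])
```

A plain (y1, x1) sort is sensitive to small vertical offsets between regions on the same row and does not model multi-column pages, so it should be read as the simple ordering described above, not as a general layout-analysis method.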
2026-03-09 17:57:05 +08:00
parent bd1c118cb2
commit cff14904bf
3 changed files with 39 additions and 6 deletions
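
The delimiter stripping added to clean_formula_number() in the glm_postprocess.py hunk below can be tried in isolation. This is a standalone sketch that mirrors the logic visible in the diff; the function names strip_math_delimiters and clean_number are hypothetical, and the final fallback (returning the string unchanged) is an assumption, since the tail of the real function is not shown:

```python
from typing import List, Tuple

# Delimiter pairs in the same order as the diff: "$$" is tried before "$"
# so that display math is not half-stripped by the single-dollar rule.
MATH_DELIMITERS: List[Tuple[str, str]] = [
    ("$$", "$$"),
    (r"\[", r"\]"),
    ("$", "$"),
    (r"\(", r"\)"),
]


def strip_math_delimiters(s: str) -> str:
    """Remove one layer of math-mode delimiters, if present."""
    s = s.strip()
    for start, end in MATH_DELIMITERS:
        # The length check prevents a bare "$$" from matching as "$" + "" + "$".
        if s.startswith(start) and s.endswith(end) and len(s) > len(start) + len(end):
            return s[len(start):-len(end)].strip()
    return s


def clean_number(number_content: str) -> str:
    """Sketch of the cleaning flow: math delimiters first, then parentheses."""
    s = strip_math_delimiters(number_content)
    for open_, close in [("(", ")"), ("（", "）")]:
        if s.startswith(open_) and s.endswith(close):
            return s[1:-1]
    return s  # assumed fallback; not visible in the diff


for raw in ["(1)", "$$(1)$$", "$(2.3)$", r"\[ (4) \]", "（5）"]:
    print(repr(raw), "->", repr(clean_number(raw)))
```

Only one layer of math delimiters is removed per call, matching the break in the diff's loop.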
--- a/app/services/glm_postprocess.py
+++ b/app/services/glm_postprocess.py
@@ -13,8 +13,11 @@ Covers:

 from __future__ import annotations

+import logging
 import re
 import json
 from collections import Counter
 from copy import deepcopy
 from typing import Any, Dict, List, Optional, Tuple
+
+logger = logging.getLogger(__name__)
@@ -94,8 +97,18 @@ def clean_repeated_content(


 def clean_formula_number(number_content: str) -> str:
-    """Strip parentheses from a formula number string, e.g. '(1)' → '1'."""
+    """Strip delimiters from a formula number string, e.g. '(1)' → '1'.
+
+    Also strips math-mode delimiters ($$, $, \\[...\\], \\(...\\)) that vLLM
+    may add when the region is processed with a formula prompt.
+    """
    s = number_content.strip()
+    # Strip one layer of math-mode delimiters (display and inline); "$$" is tried before "$"
+    for start, end in [("$$", "$$"), (r"\[", r"\]"), ("$", "$"), (r"\(", r"\)")]:
+        if s.startswith(start) and s.endswith(end) and len(s) > len(start) + len(end):
+            s = s[len(start):-len(end)].strip()
+            break
+    # Strip CJK/ASCII parentheses
    if s.startswith("(") and s.endswith(")"):
        return s[1:-1]
    if s.startswith("（") and s.endswith("）"):
@@ -253,6 +266,9 @@ class GLMResultFormatter:
                if content.startswith(s) and content.endswith(e):
                    content = content[len(s) : -len(e)].strip()
                    break
+            if not content:
+                logger.warning("Skipping formula region with empty content after stripping delimiters")
+                return ""
            content = "$$\n" + content + "\n$$"

        # Text formatting