fix: layout detection & format conversion robustness
Three targeted fixes for layout processing issues:
1. formula_number type mapping (layout_detector.py)
- Changed formula_number region type from "formula" to "text"
- Ensures Text Recognition prompt, preventing $$-wrapped output
- Prevents malformed \tag{$$...\n$$} in merged formulas
2. Reading order (ocr_service.py)
- Sort layout regions by (y1, x1) after detection
- Ensures top-to-bottom, left-to-right processing order
- Fixes paragraph ordering issues in output
3. Formula number cleaning (glm_postprocess.py)
- clean_formula_number() now strips $$, $, \[...\] delimiters
- Handles edge case where vLLM still returns math-mode wrapped content
- Prevents delimiter leakage into \tag{} placeholders
Also adds logging:
- Warning when empty formula content is skipped
- Warning when region crop is too small (< 10×10 px)
- Warning when vLLM parallel call fails
- Warning when format conversion fails
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
This commit is contained in:
@@ -13,8 +13,11 @@ Covers:
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import re
|
||||
import json
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
from collections import Counter
|
||||
from copy import deepcopy
|
||||
from typing import Any, Dict, List, Optional, Tuple
|
||||
@@ -94,8 +97,18 @@ def clean_repeated_content(
|
||||
|
||||
|
||||
def clean_formula_number(number_content: str) -> str:
|
||||
"""Strip parentheses from a formula number string, e.g. '(1)' → '1'."""
|
||||
"""Strip delimiters from a formula number string, e.g. '(1)' → '1'.
|
||||
|
||||
Also strips math-mode delimiters ($$, $, \\[...\\]) that vLLM may add
|
||||
when the region is processed with a formula prompt.
|
||||
"""
|
||||
s = number_content.strip()
|
||||
# Strip display math delimiters
|
||||
for start, end in [("$$", "$$"), (r"\[", r"\]"), ("$", "$"), (r"\(", r"\)")]:
|
||||
if s.startswith(start) and s.endswith(end) and len(s) > len(start) + len(end):
|
||||
s = s[len(start):-len(end)].strip()
|
||||
break
|
||||
# Strip CJK/ASCII parentheses
|
||||
if s.startswith("(") and s.endswith(")"):
|
||||
return s[1:-1]
|
||||
if s.startswith("(") and s.endswith(")"):
|
||||
@@ -253,6 +266,9 @@ class GLMResultFormatter:
|
||||
if content.startswith(s) and content.endswith(e):
|
||||
content = content[len(s) : -len(e)].strip()
|
||||
break
|
||||
if not content:
|
||||
logger.warning("Skipping formula region with empty content after stripping delimiters")
|
||||
return ""
|
||||
content = "$$\n" + content + "\n$$"
|
||||
|
||||
# Text formatting
|
||||
|
||||
Reference in New Issue
Block a user