fix: layout detection & format conversion robustness

Three targeted fixes for layout processing issues:

1. formula_number type mapping (layout_detector.py)
   - Changed formula_number region type from "formula" to "text"
   - Ensures Text Recognition prompt, preventing $$-wrapped output
   - Prevents malformed \tag{$$...\n$$} in merged formulas

2. Reading order (ocr_service.py)
   - Sort layout regions by (y1, x1) after detection
   - Ensures top-to-bottom, left-to-right processing order
   - Fixes paragraph ordering issues in output

3. Formula number cleaning (glm_postprocess.py)
   - clean_formula_number() now strips $$, $, \[...\] delimiters
   - Handles edge case where vLLM still returns math-mode wrapped content
   - Prevents delimiter leakage into \tag{} placeholders

Also adds logging:
- Warning when empty formula content is skipped
- Warning when region crop is too small (< 10×10 px)
- Warning when vLLM parallel call fails
- Warning when format conversion fails

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
This commit is contained in:
liuyuanchuang
2026-03-09 17:57:05 +08:00
parent bd1c118cb2
commit cff14904bf
3 changed files with 39 additions and 6 deletions

View File

@@ -66,7 +66,9 @@ class LayoutDetector:
# Formula types
"display_formula": "formula",
"inline_formula": "formula",
"formula_number": "formula",
# formula_number is a plain text annotation "(2.9)" next to a formula,
# not a formula itself — use text prompt so vLLM returns plain text
"formula_number": "text",
# Table types
"table": "table",
# Figure types