63 Commits

Author SHA1 Message Date
liuyuanchuang
39e72a5743 fix: encode non-ASCII filename in Content-Disposition header
Use RFC 5987 filename*=UTF-8'' percent-encoding to support Chinese and
other Unicode characters in DOCX download filenames.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-13 17:41:18 +08:00
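A minimal sketch of the RFC 5987 encoding this commit describes, not the repository's actual code. The function name `content_disposition` and the ASCII `fallback` name are hypothetical; only the `filename*=UTF-8''` percent-encoding form comes from the commit message.

```python
from urllib.parse import quote, unquote


def content_disposition(filename: str, fallback: str = "download.docx") -> str:
    """Build an attachment header value with an RFC 5987 encoded filename.

    `fallback` is a hypothetical ASCII-only name for clients that ignore filename*.
    """
    # safe="" percent-encodes every byte outside the unreserved set,
    # including "/", which quote() leaves alone by default.
    encoded = quote(filename, safe="")
    return f"attachment; filename=\"{fallback}\"; filename*=UTF-8''{encoded}"


header = content_disposition("报告.docx")
# The header value is pure ASCII, and the original name is recoverable.
recovered = unquote(header.split("UTF-8''", 1)[1])
```

Keeping a plain `filename=` parameter next to `filename*=` is a common compatibility choice: the header stays ASCII-only while clients that understand the extended form pick the Unicode name.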
aee1a1bf3b fix: single dollar symbol 2026-03-12 23:20:14 +08:00
ff82021467 optimize: formula is recognized as text 2026-03-12 22:30:27 +08:00
liuyuanchuang
11e9ed780d Merge branch 'main' of https://code.texpixel.com/YogeLiu/doc_processer 2026-03-12 12:41:43 +08:00
liuyuanchuang
d1050acbdc fix: logger path 2026-03-12 12:41:26 +08:00
16399f0929 fix: logger path 2026-03-12 12:38:18 +08:00
liuyuanchuang
92b56d61d8 feat: add log for export api 2026-03-12 11:40:19 +08:00
bb1cf66137 fix: optimize title to formula 2026-03-10 21:45:43 +08:00
a9d3a35dd7 chore: optimize prompt 2026-03-10 21:36:35 +08:00
d98fa7237c Merge pull request 'fix: remove padding from GLMOCREndToEndService and clean up ruff violations' (#2) from fix/tag into main
Reviewed-on: #2
2026-03-10 19:56:43 +08:00
liuyuanchuang
30d2c2f45b fix: remove padding from GLMOCREndToEndService and clean up ruff violations
- Drop image padding in GLMOCREndToEndService.recognize(); use raw image directly
- Fix F821: replace undefined `padded` references with `image`
- Fix F601 duplicate dict key "&#x2260;" in converter
- Fix F841 unused `image_cls_ids` variable in layout_postprocess
- Fix E702 semicolon-separated statements in layout_postprocess
- Fix UP031: replace percent-format with f-string in logging_config
- Auto-fix 44 additional ruff violations (import order, UP035/UP045/UP006, F401, F541)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 19:52:22 +08:00
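To illustrate two of the ruff rules named above, here is a small sketch. The dictionary values and the log message are hypothetical and are not taken from the converter or from logging_config; only the rule codes and the "&#x2260;" key come from the commit message.

```python
# F601: a dict literal with a repeated key keeps only the last value,
# so the earlier entry is silently discarded.
entities = {
    "&#x2260;": r"\neq",  # hypothetical value, overwritten below
    "&#x2260;": r"\ne",   # duplicate key: this value wins
}

# UP031: percent-formatting can be replaced by an equivalent f-string.
name = "app"  # hypothetical logger name
old_style = "log file: %s.log" % name
new_style = f"log file: {name}.log"
```

Both forms of the log message produce the same string, so the UP031 change is purely stylistic, whereas F601 usually points at a real mistake.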
liuyuanchuang
f8173f7c0a feat: optimize padding and formula fallback 2026-03-10 09:54:54 +08:00
liuyuanchuang
cff14904bf fix: layout detection & format conversion robustness
Three targeted fixes for layout processing issues:

1. formula_number type mapping (layout_detector.py)
   - Changed formula_number region type from "formula" to "text"
   - Ensures the Text Recognition prompt is used, preventing $$-wrapped output
   - Prevents malformed \tag{$$...\n$$} in merged formulas

2. Reading order (ocr_service.py)
   - Sort layout regions by (y1, x1) after detection
   - Ensures top-to-bottom, left-to-right processing order
   - Fixes paragraph ordering issues in output

3. Formula number cleaning (glm_postprocess.py)
   - clean_formula_number() now strips $$, $, \[...\] delimiters
   - Handles edge case where vLLM still returns math-mode wrapped content
   - Prevents delimiter leakage into \tag{} placeholders

Also adds logging:
- Warning when empty formula content is skipped
- Warning when region crop is too small (< 10×10 px)
- Warning when vLLM parallel call fails
- Warning when format conversion fails

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-03-09 17:57:05 +08:00
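Fixes 2 and 3 above can be sketched as follows. This is an illustration under stated assumptions, not the code in ocr_service.py or glm_postprocess.py: the region layout (a dict with a `bbox` of `(x1, y1, x2, y2)`) and the helper `sort_reading_order` are hypothetical, and the body of `clean_formula_number` is a guess at the behaviour the commit message describes.

```python
def sort_reading_order(regions):
    """Sort regions top-to-bottom, then left-to-right, by (y1, x1).

    Assumes each region is a dict whose "bbox" is (x1, y1, x2, y2).
    """
    return sorted(regions, key=lambda r: (r["bbox"][1], r["bbox"][0]))


def clean_formula_number(text: str) -> str:
    """Strip one layer of $$...$$, $...$ or \\[...\\] around a formula number."""
    text = text.strip()
    # Check "$$" before "$" so a display-math wrapper is removed whole.
    for left, right in (("$$", "$$"), ("$", "$"), ("\\[", "\\]")):
        wrapped = text.startswith(left) and text.endswith(right)
        if wrapped and len(text) >= len(left) + len(right):
            return text[len(left):len(text) - len(right)].strip()
    return text


regions = [
    {"id": "b", "bbox": (300, 100, 500, 140)},
    {"id": "c", "bbox": (40, 200, 500, 240)},
    {"id": "a", "bbox": (40, 100, 280, 140)},
]
ordered_ids = [r["id"] for r in sort_reading_order(regions)]
```

A plain (y1, x1) sort treats regions as being on the same row only when their y1 values are exactly equal, so slightly skewed scans may need a row tolerance.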
liuyuanchuang
bd1c118cb2 chore: update ignore file 2026-03-09 17:19:53 +08:00
liuyuanchuang
6dfaf9668b feat: add glm-ocr core 2026-03-09 17:13:19 +08:00
liuyuanchuang
d74130914c feat: use padding mode 2026-02-26 17:01:23 +08:00
liuyuanchuang
fd91819af0 feat: no padding image 2026-02-25 09:52:45 +08:00
liuyuanchuang
a568149164 fix: update paddle-ocr url 2026-02-09 22:26:31 +08:00
liuyuanchuang
f64bf25f67 fix: image variable not defined 2026-02-09 22:23:52 +08:00
liuyuanchuang
8114abc27a feat: rm csv file 2026-02-09 22:19:12 +08:00
liuyuanchuang
7799e39298 fix: image as element 2026-02-09 22:18:30 +08:00
liuyuanchuang
5504bbbf1e fix: glm max tokens 2026-02-07 21:38:41 +08:00
liuyuanchuang
1a4d54ce34 fix: post handle for ocr 2026-02-07 21:28:46 +08:00
liuyuanchuang
f514f98142 feat: add padding 2026-02-07 16:53:09 +08:00
liuyuanchuang
d86107976a feat: update threshold 2026-02-07 13:26:57 +08:00
liuyuanchuang
de66ae24af build: update package 2026-02-07 09:58:00 +08:00
liuyuanchuang
2a962a6271 feat: update dockerfile 2026-02-07 09:40:34 +08:00
liuyuanchuang
fa10d8194a fix: downgrade threshold 2026-02-07 09:34:15 +08:00
liuyuanchuang
05a39d8b2e fix: update type comment 2026-02-07 09:27:51 +08:00
liuyuanchuang
aec030b071 feat: add log 2026-02-07 09:26:45 +08:00
liuyuanchuang
23e2160668 fix: get setting param 2026-02-07 09:11:43 +08:00
liuyuanchuang
f0ad0a4c77 feat: add glm ocr 2026-02-06 15:06:50 +08:00
liuyuanchuang
c372a4afbe fix: update port in dockerfile 2026-02-05 22:20:01 +08:00
liuyuanchuang
36172ba4ff fix: update port 2026-02-05 22:08:04 +08:00
liuyuanchuang
a3ca04856f fix: rm space 2026-02-05 21:50:12 +08:00
liuyuanchuang
eb68843e2c feat: update model name 2026-02-05 21:26:23 +08:00
liuyuanchuang
c93eba2839 refactor: add log 2026-02-05 20:50:04 +08:00
liuyuanchuang
15986c8966 feat: update paddleocr-vl port 2026-02-05 20:43:24 +08:00
liuyuanchuang
4de9aefa68 feat: add paddleocr-vl 2026-02-05 20:33:43 +08:00
liuyuanchuang
767006ee38 Merge branch 'feature/converter' 2026-02-05 18:00:20 +08:00
liuyuanchuang
83e9bf0fb1 feat: add rm fake title 2026-02-05 17:59:54 +08:00
d841e7321a Merge pull request 'feature/converter' (#1) from feature/converter into main
Reviewed-on: #1
2026-02-05 13:48:21 +08:00
liuyuanchuang
cee93ab616 feat: rm space in markdown 2026-02-05 13:32:13 +08:00
liuyuanchuang
280a8cdaeb fix: markdown post handle 2026-02-05 13:18:55 +08:00
liuyuanchuang
808d29bd45 refactor: rm test file 2026-02-04 17:33:42 +08:00
liuyuanchuang
cd790231ec fix: rm other attr 2026-02-04 16:56:20 +08:00
liuyuanchuang
f1229483bf fix: rm other attr in mathml 2026-02-04 16:12:22 +08:00
liuyuanchuang
35419b2102 fix: mineru post handle 2026-02-04 16:07:04 +08:00
liuyuanchuang
61fd5441b7 fix: add post markdown 2026-02-04 16:04:18 +08:00
liuyuanchuang
720cd05add fix: handle mathml preprocess 2026-02-04 15:52:04 +08:00