liuyuanchuang
39e72a5743
fix: encode non-ASCII filename in Content-Disposition header
Use RFC 5987 filename*=UTF-8'' percent-encoding to support Chinese and
other Unicode characters in DOCX download filenames.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-13 17:41:18 +08:00
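The RFC 5987 encoding named in the commit above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function name `content_disposition` and the ASCII fallback name `download.docx` are assumptions made for the example.

```python
from urllib.parse import quote


def content_disposition(filename: str, fallback: str = "download.docx") -> str:
    """Build an attachment Content-Disposition value with an RFC 5987 filename*.

    `fallback` is a hypothetical ASCII-only name for clients that ignore filename*.
    """
    # Percent-encode the UTF-8 bytes of the name; safe="" also encodes "/".
    encoded = quote(filename, safe="")
    # filename*=UTF-8''<percent-encoded> carries the real (possibly non-ASCII) name.
    return f"attachment; filename=\"{fallback}\"; filename*=UTF-8''{encoded}"


header_value = content_disposition("报告.docx")
```

The resulting value contains only ASCII characters, so it can be placed in an HTTP header without the encoding errors that a raw Chinese filename would trigger.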
aee1a1bf3b
fix: single dollar symbol
2026-03-12 23:20:14 +08:00
ff82021467
optimize: recognize formula as text
2026-03-12 22:30:27 +08:00
liuyuanchuang
11e9ed780d
Merge branch 'main' of https://code.texpixel.com/YogeLiu/doc_processer
2026-03-12 12:41:43 +08:00
liuyuanchuang
d1050acbdc
fix: logger path
2026-03-12 12:41:26 +08:00
16399f0929
fix: logger path
2026-03-12 12:38:18 +08:00
liuyuanchuang
92b56d61d8
feat: add log for export api
2026-03-12 11:40:19 +08:00
bb1cf66137
fix: optimize title to formula
2026-03-10 21:45:43 +08:00
a9d3a35dd7
chore: optimize prompt
2026-03-10 21:36:35 +08:00
d98fa7237c
Merge pull request 'fix: remove padding from GLMOCREndToEndService and clean up ruff violations' (#2) from fix/tag into main
Reviewed-on: #2
2026-03-10 19:56:43 +08:00
liuyuanchuang
30d2c2f45b
fix: remove padding from GLMOCREndToEndService and clean up ruff violations
- Drop image padding in GLMOCREndToEndService.recognize(); use raw image directly
- Fix F821 undefined `padded` references replaced with `image`
- Fix F601 duplicate dict key "≠" in converter
- Fix F841 unused `image_cls_ids` variable in layout_postprocess
- Fix E702 semicolon-separated statements in layout_postprocess
- Fix UP031 percent-format replaced with f-string in logging_config
- Auto-fix 44 additional ruff violations (import order, UP035/UP045/UP006, F401, F541)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 19:52:22 +08:00
liuyuanchuang
f8173f7c0a
feat: optimize padding and formula fallback
2026-03-10 09:54:54 +08:00
liuyuanchuang
cff14904bf
fix: layout detection & format conversion robustness
Three targeted fixes for layout processing issues:
1. formula_number type mapping (layout_detector.py)
   - Changed formula_number region type from "formula" to "text"
   - Ensures the Text Recognition prompt is used, preventing $$-wrapped output
   - Prevents malformed \tag{$$...\n$$} in merged formulas
2. Reading order (ocr_service.py)
   - Sort layout regions by (y1, x1) after detection
   - Ensures top-to-bottom, left-to-right processing order
   - Fixes paragraph ordering issues in output
3. Formula number cleaning (glm_postprocess.py)
   - clean_formula_number() now strips $$, $, \[...\] delimiters
   - Handles edge case where vLLM still returns math-mode wrapped content
   - Prevents delimiter leakage into \tag{} placeholders
Also adds logging:
- Warning when empty formula content is skipped
- Warning when region crop is too small (< 10×10 px)
- Warning when vLLM parallel call fails
- Warning when format conversion fails
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-03-09 17:57:05 +08:00
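Fixes 2 and 3 in the commit above can be illustrated with a short sketch. It is not the repository's implementation: the region layout (a dict with a `bbox` tuple of `(x1, y1, x2, y2)`) and the helper name `sort_reading_order` are assumptions, and `clean_formula_number` here only reproduces the behaviour the commit message describes.

```python
def sort_reading_order(regions):
    """Order regions top-to-bottom, then left-to-right, by (y1, x1).

    Assumes each region is a dict with "bbox" = (x1, y1, x2, y2); this shape
    is hypothetical and chosen only for the example.
    """
    return sorted(regions, key=lambda r: (r["bbox"][1], r["bbox"][0]))


# Delimiter pairs to strip; "$$" is checked before "$" so it is removed whole.
_DELIMITERS = (("$$", "$$"), ("\\[", "\\]"), ("$", "$"))


def clean_formula_number(text: str) -> str:
    """Strip one layer of $$...$$, \\[...\\], or $...$ around a formula number."""
    s = text.strip()
    for opener, closer in _DELIMITERS:
        long_enough = len(s) >= len(opener) + len(closer)
        if long_enough and s.startswith(opener) and s.endswith(closer):
            return s[len(opener):len(s) - len(closer)].strip()
    return s


regions = [
    {"id": "c", "bbox": (50, 100, 90, 120)},
    {"id": "b", "bbox": (10, 100, 40, 120)},
    {"id": "a", "bbox": (10, 20, 90, 40)},
]
ordered_ids = [r["id"] for r in sort_reading_order(regions)]
```

Sorting on the raw `(y1, x1)` pair is the simplest ordering that matches the commit description; regions on the same visual line whose `y1` values differ by a few pixels would need a tolerance, which this sketch does not attempt.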
liuyuanchuang
bd1c118cb2
chore: update ignore file
2026-03-09 17:19:53 +08:00
liuyuanchuang
6dfaf9668b
feat: add glm-ocr core
2026-03-09 17:13:19 +08:00
liuyuanchuang
d74130914c
feat: use padding mode
2026-02-26 17:01:23 +08:00
liuyuanchuang
fd91819af0
feat: no padding image
2026-02-25 09:52:45 +08:00
liuyuanchuang
a568149164
fix: update paddle-ocr url
2026-02-09 22:26:31 +08:00
liuyuanchuang
f64bf25f67
fix: image variable not defined
2026-02-09 22:23:52 +08:00
liuyuanchuang
8114abc27a
feat: rm csv file
2026-02-09 22:19:12 +08:00
liuyuanchuang
7799e39298
fix: image as element
2026-02-09 22:18:30 +08:00
liuyuanchuang
5504bbbf1e
fix: glm max tokens
2026-02-07 21:38:41 +08:00
liuyuanchuang
1a4d54ce34
fix: post handle for ocr
2026-02-07 21:28:46 +08:00
liuyuanchuang
f514f98142
feat: add padding
2026-02-07 16:53:09 +08:00
liuyuanchuang
d86107976a
feat: update threshold
2026-02-07 13:26:57 +08:00
liuyuanchuang
de66ae24af
build: update package
2026-02-07 09:58:00 +08:00
liuyuanchuang
2a962a6271
feat: update dockerfile
2026-02-07 09:40:34 +08:00
liuyuanchuang
fa10d8194a
fix: downgrade threshold
2026-02-07 09:34:15 +08:00
liuyuanchuang
05a39d8b2e
fix: update type comment
2026-02-07 09:27:51 +08:00
liuyuanchuang
aec030b071
feat: add log
2026-02-07 09:26:45 +08:00
liuyuanchuang
23e2160668
fix: get setting param
2026-02-07 09:11:43 +08:00
liuyuanchuang
f0ad0a4c77
feat: add glm ocr
2026-02-06 15:06:50 +08:00
liuyuanchuang
c372a4afbe
fix: update port in dockerfile
2026-02-05 22:20:01 +08:00
liuyuanchuang
36172ba4ff
fix: update port
2026-02-05 22:08:04 +08:00
liuyuanchuang
a3ca04856f
fix: rm space
2026-02-05 21:50:12 +08:00
liuyuanchuang
eb68843e2c
feat: update model name
2026-02-05 21:26:23 +08:00
liuyuanchuang
c93eba2839
refact: add log
2026-02-05 20:50:04 +08:00
liuyuanchuang
15986c8966
feat: update paddleocr-vl port
2026-02-05 20:43:24 +08:00
liuyuanchuang
4de9aefa68
feat: add paddleocr-vl
2026-02-05 20:33:43 +08:00
liuyuanchuang
767006ee38
Merge branch 'feature/converter'
2026-02-05 18:00:20 +08:00
liuyuanchuang
83e9bf0fb1
feat: add rm fake title
2026-02-05 17:59:54 +08:00
d841e7321a
Merge pull request 'feature/converter' (#1) from feature/converter into main
Reviewed-on: #1
2026-02-05 13:48:21 +08:00
liuyuanchuang
cee93ab616
feat: rm space in markdown
2026-02-05 13:32:13 +08:00
liuyuanchuang
280a8cdaeb
fix: markdown post handle
2026-02-05 13:18:55 +08:00
liuyuanchuang
808d29bd45
refact: rm test file
2026-02-04 17:33:42 +08:00
liuyuanchuang
cd790231ec
fix: rm other attr
2026-02-04 16:56:20 +08:00
liuyuanchuang
f1229483bf
fix: rm other attr in mathml
2026-02-04 16:12:22 +08:00
liuyuanchuang
35419b2102
fix: mineru post handle
2026-02-04 16:07:04 +08:00
liuyuanchuang
61fd5441b7
fix: add post markdown
2026-02-04 16:04:18 +08:00
liuyuanchuang
720cd05add
fix: handle mathml preprocess
2026-02-04 15:52:04 +08:00