Commit Graph

57 Commits

Author SHA1 Message Date
liuyuanchuang
5ba835ab44 fix: add deadsnakes PPA for python3.10 on Ubuntu 24.04
Ubuntu 24.04 ships Python 3.12 by default.
python3.10-venv/dev/distutils are not in standard repos.
Must add ppa:deadsnakes/ppa in both builder and runtime stages.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 11:37:32 +08:00
liuyuanchuang
7c7d4bf36a fix: restore wheels/ COPY without invalid shell operators
COPY does not support shell operators (||, 2>/dev/null).
Keep wheels/ for paddlepaddle whl installation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-10 11:36:28 +08:00
liuyuanchuang
ef98f37525 feat: aggressive image optimization for PPDocLayoutV3 only
- Remove doclayout-yolo (~4.8GB, torch/torchvision/triton)
- Replace opencv-python with opencv-python-headless (~200MB)
- Strip debug symbols from .so files (~300-800MB)
- Remove paddle C++ headers (~22MB)
- Use cuda:base instead of runtime (~3GB savings)
- Simplify dependencies: remove doc-parser extras
- Clean venv aggressively: no pip, setuptools, include/, share/

Expected size reduction:
  Before: 17GB
  After:  ~3GB (82% reduction)

Breakdown:
  - CUDA base: 0.4GB
  - Paddle: 0.7GB
  - PaddleOCR: 0.8GB
  - OpenCV-headless: 0.2GB
  - Other deps: 0.6GB
  Total: ~2.7-3GB

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-03-10 11:33:50 +08:00
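
The size figures in the commit message above can be sanity-checked with a few lines of arithmetic. This is only a sketch: the numbers are copied from the commit message, and the names `reduction_percent` and `breakdown_gb` are hypothetical helpers, not part of the repository.

```python
def reduction_percent(before_gb: float, after_gb: float) -> float:
    """Return the relative size reduction as a percentage."""
    return (before_gb - after_gb) / before_gb * 100


# Per-component estimates quoted in the commit message (GB).
breakdown_gb = {
    "CUDA base": 0.4,
    "Paddle": 0.7,
    "PaddleOCR": 0.8,
    "OpenCV-headless": 0.2,
    "Other deps": 0.6,
}

# Sum of the components; rounds to 2.7, the low end of "~2.7-3GB".
total_gb = sum(breakdown_gb.values())

if __name__ == "__main__":
    print(f"breakdown total: {total_gb:.1f} GB")
    print(f"17 GB -> 3 GB:   {reduction_percent(17, 3):.0f}% reduction")
    print(f"17 GB -> {total_gb:.1f} GB: {reduction_percent(17, total_gb):.0f}% reduction")
```

The quoted 82% corresponds to the conservative 3GB estimate; if the image lands at the 2.7GB breakdown total, the reduction is slightly larger.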
liuyuanchuang
95c497829f fix: remove VOLUME declaration to prevent anonymous volumes
- Remove VOLUME directive that was creating anonymous volumes
- Keep directory creation (mkdir) for runtime mount points
- Users must explicitly mount volumes with -v flags
- This prevents hidden volume bloat from docker run

Usage:
  docker run --gpus all -p 8053:8053 \
    -v /home/yoge/.cache/modelscope:/root/.cache/modelscope:ro \
    -v /home/yoge/.cache/huggingface:/root/.cache/huggingface:ro \
    -v /home/yoge/.paddlex:/root/.paddlex:ro \
    doc_processer:latest

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-03-10 11:12:01 +08:00
liuyuanchuang
6579cf55f5 feat: optimize Docker image with multi-stage build
- Use multi-stage build to exclude build dependencies from final image
- Separate builder stage using devel image from runtime stage using smaller base image
- Clean venv: remove __pycache__, .pyc files, and test directories
- Remove embedded model files (243MB) from app/model/ - mount at runtime instead
- Expected size reduction: 18.9GB → 2-3GB (80-90% reduction)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-03-10 10:41:32 +08:00
liuyuanchuang
f8173f7c0a feat: optimize padding and formula fallback 2026-03-10 09:54:54 +08:00
liuyuanchuang
cff14904bf fix: layout detection & format conversion robustness
Three targeted fixes for layout processing issues:

1. formula_number type mapping (layout_detector.py)
   - Changed formula_number region type from "formula" to "text"
   - Ensures the Text Recognition prompt is used, preventing $$-wrapped output
   - Prevents malformed \tag{$$...\n$$} in merged formulas

2. Reading order (ocr_service.py)
   - Sort layout regions by (y1, x1) after detection
   - Ensures top-to-bottom, left-to-right processing order
   - Fixes paragraph ordering issues in output

3. Formula number cleaning (glm_postprocess.py)
   - clean_formula_number() now strips $$, $, \[...\] delimiters
   - Handles edge case where vLLM still returns math-mode wrapped content
   - Prevents delimiter leakage into \tag{} placeholders

Also adds logging:
- Warning when empty formula content is skipped
- Warning when region crop is too small (< 10×10 px)
- Warning when vLLM parallel call fails
- Warning when format conversion fails

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2026-03-09 17:57:05 +08:00
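
Fixes 2 and 3 in the commit message above can be illustrated with a minimal sketch. It assumes a hypothetical `Region` record with `x1`/`y1` corner coordinates, and `clean_formula_number_sketch` is an illustrative stand-in, not the repository's actual `clean_formula_number()` in glm_postprocess.py, whose implementation is not shown in this log.

```python
import re
from typing import List, NamedTuple


class Region(NamedTuple):
    # Hypothetical layout region; the real project's fields may differ.
    label: str
    x1: int
    y1: int
    x2: int
    y2: int


def sort_reading_order(regions: List[Region]) -> List[Region]:
    """Order regions top-to-bottom, then left-to-right, by (y1, x1)."""
    return sorted(regions, key=lambda r: (r.y1, r.x1))


# Delimiter pairs named in the commit message: $$...$$, $...$, \[...\]
_DELIMITER_PATTERNS = [
    re.compile(r"^\$\$(.*)\$\$$", re.DOTALL),
    re.compile(r"^\$(.*)\$$", re.DOTALL),
    re.compile(r"^\\\[(.*)\\\]$", re.DOTALL),
]


def clean_formula_number_sketch(text: str) -> str:
    """Strip surrounding math-mode delimiters, repeating until none remain."""
    s = text.strip()
    stripped = True
    while stripped:
        stripped = False
        for pattern in _DELIMITER_PATTERNS:
            match = pattern.match(s)
            if match:
                s = match.group(1).strip()
                stripped = True
                break
    return s


if __name__ == "__main__":
    regions = [
        Region("text", 50, 10, 90, 20),
        Region("text", 10, 10, 40, 20),
        Region("formula", 10, 5, 90, 9),
    ]
    print([r.label for r in sort_reading_order(regions)])
    print(clean_formula_number_sketch("$$(1)$$"))
```

A plain (y1, x1) sort is the ordering the commit describes; it does not handle multi-column pages, where regions in different columns can share similar y1 values.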
liuyuanchuang
bd1c118cb2 chore: update ignore file 2026-03-09 17:19:53 +08:00
liuyuanchuang
6dfaf9668b feat: add glm-ocr core 2026-03-09 17:13:19 +08:00
liuyuanchuang
d74130914c feat: use padding mode 2026-02-26 17:01:23 +08:00
liuyuanchuang
fd91819af0 feat: no padding image 2026-02-25 09:52:45 +08:00
liuyuanchuang
a568149164 fix: update paddle-ocr url 2026-02-09 22:26:31 +08:00
liuyuanchuang
f64bf25f67 fix: image variable not defined 2026-02-09 22:23:52 +08:00
liuyuanchuang
8114abc27a feat: rm csv file 2026-02-09 22:19:12 +08:00
liuyuanchuang
7799e39298 fix: image as element 2026-02-09 22:18:30 +08:00
liuyuanchuang
5504bbbf1e fix: glm max tokens 2026-02-07 21:38:41 +08:00
liuyuanchuang
1a4d54ce34 fix: post handle for ocr 2026-02-07 21:28:46 +08:00
liuyuanchuang
f514f98142 feat: add padding 2026-02-07 16:53:09 +08:00
liuyuanchuang
d86107976a feat: update threshold 2026-02-07 13:26:57 +08:00
liuyuanchuang
de66ae24af build: update package 2026-02-07 09:58:00 +08:00
liuyuanchuang
2a962a6271 feat: update dockerfile 2026-02-07 09:40:34 +08:00
liuyuanchuang
fa10d8194a fix: downgrade threshold 2026-02-07 09:34:15 +08:00
liuyuanchuang
05a39d8b2e fix: update type comment 2026-02-07 09:27:51 +08:00
liuyuanchuang
aec030b071 feat: add log 2026-02-07 09:26:45 +08:00
liuyuanchuang
23e2160668 fix: get setting param 2026-02-07 09:11:43 +08:00
liuyuanchuang
f0ad0a4c77 feat: add glm ocr 2026-02-06 15:06:50 +08:00
liuyuanchuang
c372a4afbe fix: update port in dockerfile 2026-02-05 22:20:01 +08:00
liuyuanchuang
36172ba4ff fix: update port 2026-02-05 22:08:04 +08:00
liuyuanchuang
a3ca04856f fix: rm space 2026-02-05 21:50:12 +08:00
liuyuanchuang
eb68843e2c feat: update model name 2026-02-05 21:26:23 +08:00
liuyuanchuang
c93eba2839 refactor: add log 2026-02-05 20:50:04 +08:00
liuyuanchuang
15986c8966 feat: update paddleocr-vl port 2026-02-05 20:43:24 +08:00
liuyuanchuang
4de9aefa68 feat: add paddleocr-vl 2026-02-05 20:33:43 +08:00
liuyuanchuang
767006ee38 Merge branch 'feature/converter' 2026-02-05 18:00:20 +08:00
liuyuanchuang
83e9bf0fb1 feat: add rm fake title 2026-02-05 17:59:54 +08:00
d841e7321a Merge pull request 'feature/converter' (#1) from feature/converter into main
Reviewed-on: #1
2026-02-05 13:48:21 +08:00
liuyuanchuang
cee93ab616 feat: rm space in markdown 2026-02-05 13:32:13 +08:00
liuyuanchuang
280a8cdaeb fix: markdown post handle 2026-02-05 13:18:55 +08:00
liuyuanchuang
808d29bd45 refactor: rm test file 2026-02-04 17:33:42 +08:00
liuyuanchuang
cd790231ec fix: rm other attr 2026-02-04 16:56:20 +08:00
liuyuanchuang
f1229483bf fix: rm other attr in mathml 2026-02-04 16:12:22 +08:00
liuyuanchuang
35419b2102 fix: mineru post handle 2026-02-04 16:07:04 +08:00
liuyuanchuang
61fd5441b7 fix: add post markdown 2026-02-04 16:04:18 +08:00
liuyuanchuang
720cd05add fix: handle mathml preprocess 2026-02-04 15:52:04 +08:00
liuyuanchuang
56a02eb6da fix: update mathml 2026-02-04 15:49:13 +08:00
liuyuanchuang
e31017cfe7 fix: add preprocess 2026-02-04 12:45:34 +08:00
liuyuanchuang
69f9a70ae5 feat: add omml api 2026-02-04 12:35:14 +08:00
liuyuanchuang
27f25d9f4d feat: update port config 2026-02-04 12:06:17 +08:00
liuyuanchuang
526c1f3a0d feat: optimize the format convert 2026-02-04 12:00:06 +08:00
10dbd59161 fix: matrix not rendered in docx 2026-01-14 14:18:00 +08:00