feat: add PDF document recognition with 10-page pre-hook

- Migrate recognition_results table to JSON schema (meta_data + content), replacing flat latex/markdown/mathml/mml columns
- Add TaskTypePDF constant and update all formula read/write paths
- Add PDFRecognitionService using pdftoppm (Poppler) for CGO-free page rendering; limits processing to first 10 pages (pre-hook)
- Reuse existing downstream OCR endpoint (cloud.texpixel.com) for each page image; stores results as [{page_number, markdown}] JSON array
- Add Redis queue + distributed lock for PDF worker goroutine
- Add REST endpoints: POST /v1/pdf/recognition, GET /v1/pdf/recognition/:task_no
- Add .pdf to OSS upload file type whitelist
- Add migrations/pdf_recognition.sql for safe data migration

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
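
For readers unfamiliar with Poppler, the page-rendering step with its 10-page cap could look roughly like the sketch below. This is an illustration only, not the code from this commit: the names `pdftoppmArgs`, `renderFirstPages`, and `maxPDFPages`, the 150 DPI resolution, and the choice of PNG output are all assumptions. Only the `pdftoppm` flags `-f`, `-l`, `-r`, and `-png` are taken from the Poppler CLI itself.

```go
package main

import (
	"context"
	"os/exec"
	"strconv"
)

// maxPDFPages mirrors the 10-page limit named in the commit message.
// The constant name is hypothetical.
const maxPDFPages = 10

// pdftoppmArgs builds the pdftoppm argument list for rendering pages
// 1..lastPage of pdfPath as PNG files named <outPrefix>-<n>.png.
// The 150 DPI resolution is an assumed value, not taken from the commit.
func pdftoppmArgs(pdfPath, outPrefix string, lastPage int) []string {
	return []string{
		"-f", "1", // first page to render
		"-l", strconv.Itoa(lastPage), // last page to render
		"-r", "150", // resolution in DPI (assumption)
		"-png", // emit PNG instead of the default PPM
		pdfPath,
		outPrefix,
	}
}

// renderFirstPages shells out to pdftoppm, so no CGO binding is needed.
// It returns an error if the binary is missing or exits non-zero.
func renderFirstPages(ctx context.Context, pdfPath, outPrefix string) error {
	cmd := exec.CommandContext(ctx, "pdftoppm",
		pdftoppmArgs(pdfPath, outPrefix, maxPDFPages)...)
	return cmd.Run()
}
```

Keeping the argument construction separate from the `exec` call makes the page limit testable without Poppler installed; a caller would pass a context with a timeout so a malformed PDF cannot block the worker indefinitely.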
2026-03-31 14:17:44 +08:00
parent 876e64366b
commit 9d712c921a
14 changed files with 760 additions and 67 deletions
--- a/go.mod
+++ b/go.mod
@@ -1,6 +1,6 @@
 module gitea.com/texpixel/document_ai

-go 1.20
+go 1.23.0

 require (
 	github.com/alibabacloud-go/darabonba-openapi v0.2.1
@@ -75,7 +75,7 @@ require (
 	golang.org/x/arch v0.8.0 // indirect
 	golang.org/x/exp v0.0.0-20230905200255-921286631fa9 // indirect
 	golang.org/x/net v0.25.0 // indirect
-	golang.org/x/sys v0.20.0 // indirect
+	golang.org/x/sys v0.33.0 // indirect
 	golang.org/x/text v0.20.0 // indirect
 	golang.org/x/time v0.5.0 // indirect
 	google.golang.org/protobuf v1.34.1 // indirect