doc_ai_backed/docs/superpowers/plans/2026-03-30-pdf-recognition-glm-ocr.md
2026-03-31 22:32:14 +08:00

# PDF Recognition Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
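**Goal:** Support page-by-page OCR of PDF files (at most 10 pages) and, in the same change, restructure the `recognition_results` table to store content as JSON so that it serves both the formula-recognition and PDF-recognition scenarios.
**Architecture:** `recognition_results` stores one row per task: the `meta_data` JSON column holds metadata (`total_num`), and the `content` JSON column holds the recognition output (formula: `{latex, markdown, mathml, mml}`; PDF: `[{page_number, markdown}, ...]`). PDF pipeline: download → render pages with `go-fitz` → pre-hook caps processing at the first 10 pages → call the existing downstream OCR endpoint page by page → assemble JSON → write to DB.
**Tech Stack:** Go 1.20, Gin, GORM/MySQL, Redis, Aliyun OSS, `github.com/gen2brain/go-fitz` v0.24.0, and the existing downstream OCR endpoint at `cloud.texpixel.com`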
---
## Table Schema Design
```
recognition_results
├── id          BIGINT PK
├── task_id     BIGINT INDEX
├── task_type   VARCHAR(16)   -- FORMULA / PDF
├── meta_data   JSON          -- {"total_num": 1}
├── content     JSON          -- see the format notes below
├── created_at  DATETIME
└── updated_at  DATETIME

content format (by task_type):
  FORMULA: {"latex":"E=mc^2","markdown":"$$E=mc^2$$","mathml":"<math>...</math>","mml":""}
  PDF:     [{"page_number":1,"markdown":"# 第一章\n..."},{"page_number":2,"markdown":"..."}]
```
The legacy columns `latex / markdown / mathml / mml` are **all dropped**; the `content` JSON column takes over their role.
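Because `content` has two different shapes, a reader of the table has to pick the target type from `task_type` before unmarshalling. The following is a minimal sketch of that dispatch using only `encoding/json`; the helper name `DecodeContent` and the local `PDFPage` type are illustrative assumptions and are not part of the plan's DAO.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FormulaContent and PDFPage mirror the two content shapes described above.
type FormulaContent struct {
	Latex    string `json:"latex"`
	Markdown string `json:"markdown"`
	MathML   string `json:"mathml"`
	MML      string `json:"mml"`
}

type PDFPage struct {
	PageNumber int    `json:"page_number"`
	Markdown   string `json:"markdown"`
}

// DecodeContent is a hypothetical helper: it chooses the Go type to
// unmarshal into from task_type and rejects unknown task types.
func DecodeContent(taskType, content string) (any, error) {
	switch taskType {
	case "FORMULA":
		var fc FormulaContent
		if err := json.Unmarshal([]byte(content), &fc); err != nil {
			return nil, fmt.Errorf("decode formula content: %w", err)
		}
		return fc, nil
	case "PDF":
		var pages []PDFPage
		if err := json.Unmarshal([]byte(content), &pages); err != nil {
			return nil, fmt.Errorf("decode pdf content: %w", err)
		}
		return pages, nil
	default:
		return nil, fmt.Errorf("unknown task type %q", taskType)
	}
}

func main() {
	v, err := DecodeContent("PDF", `[{"page_number":1,"markdown":"# Title"}]`)
	fmt.Println(v, err)
}
```

Decoding a PDF array as a formula (or the reverse) fails with a type error instead of silently producing empty fields, which is the property the later `GetPDFTask` type check relies on.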
---
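## File Change List
| Action | File path | Responsibility |
|--------|-----------|----------------|
| Create | `migrations/pdf_recognition.sql` | ALTER recognition_results: drop the legacy columns, add the meta_data/content JSON columns |
| Modify | `internal/storage/dao/task.go` | Add the TaskTypePDF constant |
| Modify | `internal/storage/dao/result.go` | Restructure the RecognitionResult struct, add content-type helper structs, update the DAO methods |
| Create | `internal/model/pdf/request.go` | PDF recognition request/response DTOs |
| Create | `internal/storage/cache/pdf.go` | Redis queue operations (PDF-specific) |
| Modify | `internal/service/recognition_service.go` | Update processFormulaTask / GetFormualTask to use the new JSON format |
| Modify | `internal/service/task.go` | Update GetTaskList / ExportTask to read from the `content` JSON (see Task 5) |
| Create | `internal/service/pdf_recognition_service.go` | PDF recognition business logic |
| Create | `api/v1/pdf/handler.go` | HTTP handlers |
| Modify | `api/router.go` | Register the PDF routes |
| Modify | `api/v1/oss/handler.go` | Add .pdf to the file-type whitelist (the size limit is governed by the OSS Policy Token and is out of scope; see Task 10) |
| Modify | `go.mod` / `go.sum` | Add the go-fitz dependency |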
---
## Environment Prerequisite: Install MuPDF (go-fitz CGo dependency)
```bash
# macOS
brew install mupdf
# Ubuntu/Debian
sudo apt-get install -y libmupdf-dev
# Verify
pkg-config --modversion mupdf
```
---
## Task 1: Database Migration: Restructure recognition_results
**Files:**
- Create: `migrations/pdf_recognition.sql`
- [ ] **Step 1: Create the migration file**
```sql
-- migrations/pdf_recognition.sql
-- 1. Drop the legacy single-value columns (back up existing data beforehand)
ALTER TABLE `recognition_results`
  DROP COLUMN `latex`,
  DROP COLUMN `markdown`,
  DROP COLUMN `mathml`,
  DROP COLUMN `mml`;

-- 2. Add the JSON columns
ALTER TABLE `recognition_results`
  ADD COLUMN `meta_data` JSON DEFAULT NULL COMMENT '元数据 {"total_num":1}' AFTER `task_type`,
  ADD COLUMN `content` JSON DEFAULT NULL COMMENT '识别内容 JSON' AFTER `meta_data`;
```
- [ ] **Step 2: Run the migration**
```bash
mysql -u root -p doc_ai < migrations/pdf_recognition.sql
```
Expected: Query OK
- [ ] **Step 3: Verify the table structure**
```bash
mysql -u root -p doc_ai -e "DESCRIBE recognition_results;"
```
Expected: the columns are id, task_id, task_type, meta_data, content, created_at, updated_at, with no latex/markdown/mathml/mml
- [ ] **Step 4: Commit**
```bash
git add migrations/pdf_recognition.sql
git commit -m "feat: migrate recognition_results to JSON content schema"
```
---
## Task 2: Add the go-fitz Dependency
**Files:**
- Modify: `go.mod`
- [ ] **Step 1: Install the dependency**
```bash
go get github.com/gen2brain/go-fitz@v0.24.0
```
Expected: go: added github.com/gen2brain/go-fitz v0.24.0
- [ ] **Step 2: Verify**
```bash
go build ./...
```
- [ ] **Step 3: Commit**
```bash
git add go.mod go.sum
git commit -m "feat: add go-fitz for PDF page rendering"
```
---
## Task 3: Extend the Constants
**Files:**
- Modify: `internal/storage/dao/task.go`
- [ ] **Step 1: Add TaskTypePDF**
Locate the const block and change:
```go
	TaskTypeLayout TaskType = "LAYOUT"
```
to:
```go
	TaskTypeLayout TaskType = "LAYOUT"
	TaskTypePDF    TaskType = "PDF"
```
- [ ] **Step 2: Verify**
```bash
go build ./internal/storage/dao/...
```
- [ ] **Step 3: Commit**
```bash
git add internal/storage/dao/task.go
git commit -m "feat: add TaskTypePDF constant"
```
---
## Task 4: DAO: Restructure RecognitionResult
**Files:**
- Modify: `internal/storage/dao/result.go`
- [ ] **Step 1: Replace the entire contents of result.go with the new struct**
```go
package dao

import (
	"encoding/json"
	"errors"

	"gorm.io/gorm"
)

// FormulaContent is the structure of the content field for formula recognition.
type FormulaContent struct {
	Latex    string `json:"latex"`
	Markdown string `json:"markdown"`
	MathML   string `json:"mathml"`
	MML      string `json:"mml"`
}

// PDFPageContent is the recognition result of a single PDF page.
type PDFPageContent struct {
	PageNumber int    `json:"page_number"`
	Markdown   string `json:"markdown"`
}

// ResultMetaData is the structure of the recognition_results.meta_data field.
type ResultMetaData struct {
	TotalNum int `json:"total_num"`
}

// RecognitionResult is the model for the recognition_results table.
type RecognitionResult struct {
	BaseModel
	TaskID   int64    `gorm:"column:task_id;bigint;not null;default:0;index;comment:任务ID" json:"task_id"`
	TaskType TaskType `gorm:"column:task_type;varchar(16);not null;comment:任务类型;default:''" json:"task_type"`
	MetaData string   `gorm:"column:meta_data;type:json;comment:元数据" json:"meta_data"`
	Content  string   `gorm:"column:content;type:json;comment:识别内容JSON" json:"content"`
}

// SetMetaData serializes meta and stores it in the MetaData field.
func (r *RecognitionResult) SetMetaData(meta ResultMetaData) error {
	b, err := json.Marshal(meta)
	if err != nil {
		return err
	}
	r.MetaData = string(b)
	return nil
}

// GetFormulaContent deserializes the formula result from the Content field.
func (r *RecognitionResult) GetFormulaContent() (*FormulaContent, error) {
	var c FormulaContent
	if err := json.Unmarshal([]byte(r.Content), &c); err != nil {
		return nil, err
	}
	return &c, nil
}

// GetPDFContent deserializes the per-page PDF results from the Content field.
func (r *RecognitionResult) GetPDFContent() ([]PDFPageContent, error) {
	var pages []PDFPageContent
	if err := json.Unmarshal([]byte(r.Content), &pages); err != nil {
		return nil, err
	}
	return pages, nil
}

// MarshalFormulaContent serializes a formula result to a JSON string (to be written to Content).
func MarshalFormulaContent(c FormulaContent) (string, error) {
	b, err := json.Marshal(c)
	return string(b), err
}

// MarshalPDFContent serializes the per-page PDF results to a JSON string (to be written to Content).
func MarshalPDFContent(pages []PDFPageContent) (string, error) {
	b, err := json.Marshal(pages)
	return string(b), err
}

type RecognitionResultDao struct{}

func NewRecognitionResultDao() *RecognitionResultDao {
	return &RecognitionResultDao{}
}

func (dao *RecognitionResultDao) Create(tx *gorm.DB, data RecognitionResult) error {
	return tx.Create(&data).Error
}

// GetByTaskID returns (nil, nil) when no row exists for the task.
func (dao *RecognitionResultDao) GetByTaskID(tx *gorm.DB, taskID int64) (*RecognitionResult, error) {
	result := &RecognitionResult{}
	err := tx.Where("task_id = ?", taskID).First(result).Error
	if errors.Is(err, gorm.ErrRecordNotFound) {
		return nil, nil
	}
	return result, err
}

func (dao *RecognitionResultDao) Update(tx *gorm.DB, id int64, updates map[string]interface{}) error {
	return tx.Model(&RecognitionResult{}).Where("id = ?", id).Updates(updates).Error
}
```
- [ ] **Step 2: Verify it compiles**
```bash
go build ./internal/storage/dao/...
```
Expected: no errors
- [ ] **Step 3: Commit**
```bash
git add internal/storage/dao/result.go
git commit -m "refactor: RecognitionResult to JSON content schema (meta_data + content)"
```
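One detail of the serialization helpers is worth checking before they are used: `encoding/json` encodes a nil slice as `null`, not `[]`, so `MarshalPDFContent(nil)` would store `null` in the `content` column. The sketch below shows that behaviour next to a hypothetical normalizing variant (`marshalPages` is an assumed name, not part of the plan) that always emits an array.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type PDFPageContent struct {
	PageNumber int    `json:"page_number"`
	Markdown   string `json:"markdown"`
}

// marshalPages is a hypothetical variant of MarshalPDFContent that
// normalizes a nil slice to an empty one so the stored JSON is "[]".
func marshalPages(pages []PDFPageContent) (string, error) {
	if pages == nil {
		pages = []PDFPageContent{}
	}
	b, err := json.Marshal(pages)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	var none []PDFPageContent
	raw, _ := json.Marshal(none) // plain encoding/json behaviour for a nil slice
	normalized, _ := marshalPages(none)
	fmt.Println(string(raw), normalized) // prints: null []
}
```

Whether to normalize is a design choice; it only matters if a consumer of `content` distinguishes `null` from an empty array.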
---
## Task 5: Update Formula Recognition + TaskService to Use the New JSON Format
**Files:**
- Modify: `internal/service/recognition_service.go`
- Modify: `internal/service/task.go`
> **Note:** The migration drops the `latex/markdown/mathml/mml` columns, and both `GetTaskList` (:98-101) and `ExportTask` (:151) in `task.go` read these fields directly. They must be updated together in the same commit; otherwise the service breaks immediately after the migration.
- [ ] **Step 1: Modify recognition_service.go: the processFormulaTask write path**
Locate the `resultDao.Create` call inside `processFormulaTask` (around line 542):
```go
// Old code
err = resultDao.Create(tx, dao.RecognitionResult{
	TaskID:   taskID,
	TaskType: dao.TaskTypeFormula,
	Latex:    ocrResp.Latex,
	Markdown: ocrResp.Markdown,
	MathML:   ocrResp.MathML,
	MML:      ocrResp.MML,
})
```
Replace it with:
```go
// New code
contentJSON, err := dao.MarshalFormulaContent(dao.FormulaContent{
	Latex:    ocrResp.Latex,
	Markdown: ocrResp.Markdown,
	MathML:   ocrResp.MathML,
	MML:      ocrResp.MML,
})
if err != nil {
	log.Error(ctx, "func", "processFormulaTask", "msg", "序列化公式内容失败", "error", err)
	return err
}
result := dao.RecognitionResult{
	TaskID:   taskID,
	TaskType: dao.TaskTypeFormula,
	Content:  contentJSON,
}
if err = result.SetMetaData(dao.ResultMetaData{TotalNum: 1}); err != nil {
	log.Error(ctx, "func", "processFormulaTask", "msg", "序列化MetaData失败", "error", err)
	return err
}
err = resultDao.Create(tx, result)
if err != nil {
	log.Error(ctx, "func", "processFormulaTask", "msg", "保存任务结果失败", "error", err)
	return err
}
```
- [ ] **Step 2: Modify recognition_service.go: the processVLFormulaTask write path**
Locate the calls to `resultDao.Create` / `resultDao.Update` inside `processVLFormulaTask` (around lines 665-678):
On create:
```go
contentJSON, err := dao.MarshalFormulaContent(dao.FormulaContent{Latex: latex})
if err != nil {
	log.Error(ctx, "func", "processVLFormulaTask", "msg", "序列化公式内容失败", "error", err)
	return err
}
newResult := dao.RecognitionResult{TaskID: taskID, TaskType: dao.TaskTypeFormula, Content: contentJSON}
// Check the error instead of discarding it, as processFormulaTask does.
if err = newResult.SetMetaData(dao.ResultMetaData{TotalNum: 1}); err != nil {
	log.Error(ctx, "func", "processVLFormulaTask", "msg", "序列化MetaData失败", "error", err)
	return err
}
err = resultDao.Create(dao.DB.WithContext(ctx), newResult)
```
On update:
```go
contentJSON, err := dao.MarshalFormulaContent(dao.FormulaContent{Latex: latex})
if err != nil {
	log.Error(ctx, "func", "processVLFormulaTask", "msg", "序列化公式内容失败", "error", err)
	return err
}
err = resultDao.Update(dao.DB.WithContext(ctx), result.ID, map[string]interface{}{"content": contentJSON})
```
- [ ] **Step 3: Modify recognition_service.go: the GetFormualTask read path**
Locate `GetFormualTask` (around line 134) and the code that reads the legacy fields:
```go
// Old code: reads taskRet.Latex / taskRet.Markdown / taskRet.MathML / taskRet.MML directly
markdown := taskRet.Markdown
if markdown == "" {
	markdown = fmt.Sprintf("$$%s$$", taskRet.Latex)
}
return &formula.GetFormulaTaskResponse{
	TaskNo:   taskNo,
	Latex:    taskRet.Latex,
	Markdown: markdown,
	MathML:   taskRet.MathML,
	MML:      taskRet.MML,
	Status:   int(task.Status),
}, nil
```
Replace it with:
```go
// New code
formulaContent, err := taskRet.GetFormulaContent()
if err != nil {
	log.Error(ctx, "func", "GetFormualTask", "msg", "解析公式内容失败", "error", err)
	return nil, common.NewError(common.CodeSystemError, "解析识别结果失败", err)
}
markdown := formulaContent.Markdown
if markdown == "" {
	markdown = fmt.Sprintf("$$%s$$", formulaContent.Latex)
}
return &formula.GetFormulaTaskResponse{
	TaskNo:   taskNo,
	Latex:    formulaContent.Latex,
	Markdown: markdown,
	MathML:   formulaContent.MathML,
	MML:      formulaContent.MML,
	Status:   int(task.Status),
}, nil
```
- [ ] **Step 4: Modify task.go: GetTaskList result reads (:91-119)**
Locate the code in `GetTaskList` that assembles the DTO:
```go
// Old code
var latex, markdown, mathML, mml string
recognitionResult := recognitionResultMap[item.ID]
if recognitionResult != nil {
	latex = recognitionResult.Latex
	markdown = recognitionResult.Markdown
	mathML = recognitionResult.MathML
	mml = recognitionResult.MML
}
```
Replace it with:
```go
// New code: deserialize content according to task_type
var latex, markdown, mathML, mml string
recognitionResult := recognitionResultMap[item.ID]
if recognitionResult != nil && recognitionResult.TaskType == dao.TaskTypeFormula {
	if fc, err := recognitionResult.GetFormulaContent(); err == nil {
		latex = fc.Latex
		markdown = fc.Markdown
		mathML = fc.MathML
		mml = fc.MML
	}
}
// For PDF tasks the TaskListDTO does not expand content for now; the list page shows only the status.
```
- [ ] **Step 5: Modify task.go: ExportTask markdown read (:140-155)**
Locate the code in `ExportTask` that reads the markdown:
```go
// Old code
markdown := recognitionResult.Markdown
if markdown == "" {
	log.Error(ctx, "func", "ExportTask", "msg", "markdown not found")
	return nil, "", errors.New("markdown not found")
}
```
Replace it with:
```go
// New code: parse content according to task_type
var markdown string
switch recognitionResult.TaskType {
case dao.TaskTypeFormula:
	fc, err := recognitionResult.GetFormulaContent()
	if err != nil || fc.Markdown == "" {
		log.Error(ctx, "func", "ExportTask", "msg", "公式结果解析失败或markdown为空", "error", err)
		return nil, "", errors.New("markdown not found")
	}
	markdown = fc.Markdown
default:
	log.Error(ctx, "func", "ExportTask", "msg", "不支持的导出任务类型", "task_type", recognitionResult.TaskType)
	return nil, "", errors.New("unsupported task type for export")
}
```
- [ ] **Step 6: Verify it compiles**
```bash
go build ./internal/service/...
```
Expected: no errors
- [ ] **Step 7: Commit**
```bash
git add internal/service/recognition_service.go internal/service/task.go
git commit -m "refactor: adapt all recognition result reads/writes to JSON content schema"
```
---
## Task 6: Cache: PDF Redis Queue
**Files:**
- Create: `internal/storage/cache/pdf.go`
- [ ] **Step 1: Create pdf.go**
```go
// internal/storage/cache/pdf.go
package cache

import (
	"context"
	"strconv"
)

const (
	PDFRecognitionTaskQueue = "pdf_recognition_queue"
	PDFRecognitionDistLock  = "pdf_recognition_dist_lock"
)

// PushPDFTask pushes a task ID onto the head of the PDF queue.
func PushPDFTask(ctx context.Context, taskID int64) (int64, error) {
	return RedisClient.LPush(ctx, PDFRecognitionTaskQueue, taskID).Result()
}

// PopPDFTask pops a task ID from the tail of the PDF queue.
// A timeout of 0 makes BRPOP block until a task arrives.
func PopPDFTask(ctx context.Context) (int64, error) {
	result, err := RedisClient.BRPop(ctx, 0, PDFRecognitionTaskQueue).Result()
	if err != nil {
		return 0, err
	}
	// result[0] is the key name, result[1] is the popped value.
	return strconv.ParseInt(result[1], 10, 64)
}

// GetPDFDistributedLock tries to acquire the PDF consumer lock via SETNX.
func GetPDFDistributedLock(ctx context.Context) (bool, error) {
	return RedisClient.SetNX(ctx, PDFRecognitionDistLock, "locked", DefaultLockTimeout).Result()
}
```
- [ ] **Step 2: Verify**
```bash
go build ./internal/storage/cache/...
```
- [ ] **Step 3: Commit**
```bash
git add internal/storage/cache/pdf.go
git commit -m "feat: add PDF recognition Redis queue"
```
---
## Task 7: Model: PDF Request/Response DTOs
**Files:**
- Create: `internal/model/pdf/request.go`
- [ ] **Step 1: Create the file**
```go
// internal/model/pdf/request.go
package pdf

// CreatePDFRecognitionRequest is the request body for creating a PDF recognition task.
type CreatePDFRecognitionRequest struct {
	FileURL  string `json:"file_url" binding:"required"`
	FileHash string `json:"file_hash" binding:"required"`
	FileName string `json:"file_name" binding:"required"`
	UserID   int64  `json:"user_id"`
}

// GetPDFTaskRequest carries the URI parameters.
type GetPDFTaskRequest struct {
	TaskNo string `uri:"task_no" binding:"required"`
}

// CreatePDFTaskResponse is the response for task creation.
type CreatePDFTaskResponse struct {
	TaskNo string `json:"task_no"`
	Status int    `json:"status"`
}

// PDFPageResult is a single-page result (corresponds to dao.PDFPageContent).
type PDFPageResult struct {
	PageNumber int    `json:"page_number"`
	Markdown   string `json:"markdown"`
}

// GetPDFTaskResponse reports the task status and, once completed, the results.
type GetPDFTaskResponse struct {
	TaskNo     string          `json:"task_no"`
	Status     int             `json:"status"`      // 0=PENDING 1=PROCESSING 2=COMPLETED 3=FAILED
	TotalPages int             `json:"total_pages"` // number of pages actually processed
	Pages      []PDFPageResult `json:"pages"`       // populated when status=2
}
```
- [ ] **Step 2: Verify**
```bash
go build ./internal/model/pdf/...
```
- [ ] **Step 3: Commit**
```bash
git add internal/model/pdf/request.go
git commit -m "feat: add PDF recognition request/response models"
```
---
## Task 8: Service — PDFRecognitionService
**Files:**
- Create: `internal/service/pdf_recognition_service.go`
- [ ] **Step 1: Create the service file**
```go
// internal/service/pdf_recognition_service.go
package service

import (
	"bytes"
	"context"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"

	"gitea.com/texpixel/document_ai/internal/model/formula"
	pdfmodel "gitea.com/texpixel/document_ai/internal/model/pdf"
	"gitea.com/texpixel/document_ai/internal/storage/cache"
	"gitea.com/texpixel/document_ai/internal/storage/dao"
	"gitea.com/texpixel/document_ai/pkg/common"
	"gitea.com/texpixel/document_ai/pkg/httpclient"
	"gitea.com/texpixel/document_ai/pkg/log"
	"gitea.com/texpixel/document_ai/pkg/oss"
	"gitea.com/texpixel/document_ai/pkg/requestid"
	"gitea.com/texpixel/document_ai/pkg/utils"
	"github.com/gen2brain/go-fitz"
	"gorm.io/gorm"
)

const (
	pdfMaxPages    = 10
	pdfOCREndpoint = "https://cloud.texpixel.com:10443/doc_process/v1/image/ocr"
)

// PDFRecognitionService handles PDF recognition tasks.
type PDFRecognitionService struct {
	db         *gorm.DB
	queueLimit chan struct{}
	stopChan   chan struct{}
	httpClient *httpclient.Client
}

func NewPDFRecognitionService() *PDFRecognitionService {
	s := &PDFRecognitionService{
		db:         dao.DB,
		queueLimit: make(chan struct{}, 3),
		stopChan:   make(chan struct{}),
		httpClient: httpclient.NewClient(nil),
	}
	utils.SafeGo(func() {
		lock, err := cache.GetPDFDistributedLock(context.Background())
		if err != nil || !lock {
			log.Error(context.Background(), "func", "NewPDFRecognitionService", "msg", "获取PDF分布式锁失败", "error", err)
			return
		}
		s.processPDFQueue(context.Background())
	})
	return s
}

// CreatePDFTask creates a recognition task and enqueues it.
func (s *PDFRecognitionService) CreatePDFTask(ctx context.Context, req *pdfmodel.CreatePDFRecognitionRequest) (*dao.RecognitionTask, error) {
	task := &dao.RecognitionTask{
		UserID:   req.UserID,
		TaskUUID: utils.NewUUID(),
		TaskType: dao.TaskTypePDF,
		Status:   dao.TaskStatusPending,
		FileURL:  req.FileURL,
		FileName: req.FileName,
		FileHash: req.FileHash,
		IP:       common.GetIPFromContext(ctx),
	}
	if err := dao.NewRecognitionTaskDao().Create(dao.DB.WithContext(ctx), task); err != nil {
		log.Error(ctx, "func", "CreatePDFTask", "msg", "创建任务失败", "error", err)
		return nil, common.NewError(common.CodeDBError, "创建任务失败", err)
	}
	if _, err := cache.PushPDFTask(ctx, task.ID); err != nil {
		log.Error(ctx, "func", "CreatePDFTask", "msg", "推入队列失败", "error", err)
		return nil, common.NewError(common.CodeSystemError, "推入队列失败", err)
	}
	return task, nil
}

// GetPDFTask returns the task status and, once completed, the results.
func (s *PDFRecognitionService) GetPDFTask(ctx context.Context, taskNo string) (*pdfmodel.GetPDFTaskResponse, error) {
	sess := dao.DB.WithContext(ctx)
	task, err := dao.NewRecognitionTaskDao().GetByTaskNo(sess, taskNo)
	if err != nil {
		if err == gorm.ErrRecordNotFound {
			return nil, common.NewError(common.CodeNotFound, "任务不存在", err)
		}
		return nil, common.NewError(common.CodeDBError, "查询任务失败", err)
	}
	// Type check: prevent a formula task from being parsed as a PDF.
	if task.TaskType != dao.TaskTypePDF {
		return nil, common.NewError(common.CodeNotFound, "任务不存在", nil)
	}
	resp := &pdfmodel.GetPDFTaskResponse{
		TaskNo: taskNo,
		Status: int(task.Status),
	}
	if task.Status != dao.TaskStatusCompleted {
		return resp, nil
	}
	result, err := dao.NewRecognitionResultDao().GetByTaskID(sess, task.ID)
	if err != nil || result == nil {
		return nil, common.NewError(common.CodeDBError, "查询识别结果失败", err)
	}
	pages, err := result.GetPDFContent()
	if err != nil {
		return nil, common.NewError(common.CodeSystemError, "解析识别结果失败", err)
	}
	resp.TotalPages = len(pages)
	for _, p := range pages {
		resp.Pages = append(resp.Pages, pdfmodel.PDFPageResult{
			PageNumber: p.PageNumber,
			Markdown:   p.Markdown,
		})
	}
	return resp, nil
}

// processPDFQueue consumes the queue continuously.
func (s *PDFRecognitionService) processPDFQueue(ctx context.Context) {
	for {
		select {
		case <-s.stopChan:
			return
		default:
			s.processOnePDFTask(ctx)
		}
	}
}

func (s *PDFRecognitionService) processOnePDFTask(ctx context.Context) {
	s.queueLimit <- struct{}{}
	defer func() { <-s.queueLimit }()

	taskID, err := cache.PopPDFTask(ctx)
	if err != nil {
		log.Error(ctx, "func", "processOnePDFTask", "msg", "获取任务失败", "error", err)
		return
	}
	task, err := dao.NewRecognitionTaskDao().GetTaskByID(dao.DB.WithContext(ctx), taskID)
	if err != nil || task == nil {
		log.Error(ctx, "func", "processOnePDFTask", "msg", "任务不存在", "task_id", taskID)
		return
	}
	ctx = context.WithValue(ctx, utils.RequestIDKey, task.TaskUUID)
	requestid.SetRequestID(task.TaskUUID, func() {
		if err := s.processPDFTask(ctx, taskID, task.FileURL); err != nil {
			log.Error(ctx, "func", "processOnePDFTask", "msg", "处理PDF任务失败", "error", err)
		}
	})
}

// processPDFTask is the core pipeline: download → pre-hook → per-page OCR → write to DB.
func (s *PDFRecognitionService) processPDFTask(ctx context.Context, taskID int64, fileURL string) error {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Minute)
	defer cancel()

	taskDao := dao.NewRecognitionTaskDao()
	resultDao := dao.NewRecognitionResultDao()

	// The deferred update records the final status; it uses a fresh context
	// so that it still runs after the task context has timed out.
	isSuccess := false
	defer func() {
		status, remark := dao.TaskStatusFailed, "任务处理失败"
		if isSuccess {
			status, remark = dao.TaskStatusCompleted, ""
		}
		_ = taskDao.Update(dao.DB.WithContext(context.Background()),
			map[string]interface{}{"id": taskID},
			map[string]interface{}{"status": status, "completed_at": time.Now(), "remark": remark},
		)
	}()

	// Mark the task as processing.
	if err := taskDao.Update(dao.DB.WithContext(ctx),
		map[string]interface{}{"id": taskID},
		map[string]interface{}{"status": dao.TaskStatusProcessing},
	); err != nil {
		return fmt.Errorf("更新任务状态失败: %w", err)
	}

	// Download the PDF.
	reader, err := oss.DownloadFile(ctx, fileURL)
	if err != nil {
		return fmt.Errorf("下载PDF失败: %w", err)
	}
	defer reader.Close()
	pdfBytes, err := io.ReadAll(reader)
	if err != nil {
		return fmt.Errorf("读取PDF数据失败: %w", err)
	}

	// Open the PDF.
	doc, err := fitz.NewFromMemory(pdfBytes)
	if err != nil {
		return fmt.Errorf("解析PDF失败: %w", err)
	}
	defer doc.Close()

	// pre-hook: process at most the first 10 pages.
	totalInDoc := doc.NumPage()
	processPages := totalInDoc
	if processPages > pdfMaxPages {
		processPages = pdfMaxPages
		log.Info(ctx, "func", "processPDFTask", "msg", "PDF超过10页只处理前10页",
			"task_id", taskID, "doc_total", totalInDoc)
	}
	log.Info(ctx, "func", "processPDFTask", "msg", "开始处理PDF",
		"task_id", taskID, "process_pages", processPages)

	// Render and OCR page by page, collecting the results.
	var pages []dao.PDFPageContent
	for pageNum := 0; pageNum < processPages; pageNum++ {
		imgBytes, err := doc.ImagePNG(pageNum, 150) // 150 DPI
		if err != nil {
			return fmt.Errorf("渲染第%d页失败: %w", pageNum+1, err)
		}
		ocrResult, err := s.callOCR(ctx, imgBytes)
		if err != nil {
			return fmt.Errorf("OCR第%d页失败: %w", pageNum+1, err)
		}
		pages = append(pages, dao.PDFPageContent{
			PageNumber: pageNum + 1,
			Markdown:   ocrResult.Markdown,
		})
		log.Info(ctx, "func", "processPDFTask", "msg", "页面OCR完成",
			"page", pageNum+1, "total", processPages)
	}

	// Serialize and write to the DB (a single row).
	contentJSON, err := dao.MarshalPDFContent(pages)
	if err != nil {
		return fmt.Errorf("序列化PDF内容失败: %w", err)
	}
	dbResult := dao.RecognitionResult{
		TaskID:   taskID,
		TaskType: dao.TaskTypePDF,
		Content:  contentJSON,
	}
	if err := dbResult.SetMetaData(dao.ResultMetaData{TotalNum: processPages}); err != nil {
		return fmt.Errorf("序列化MetaData失败: %w", err)
	}
	if err := resultDao.Create(dao.DB.WithContext(ctx), dbResult); err != nil {
		return fmt.Errorf("保存PDF结果失败: %w", err)
	}
	isSuccess = true
	return nil
}

// callOCR calls the same downstream OCR endpoint that formula recognition uses.
func (s *PDFRecognitionService) callOCR(ctx context.Context, imgBytes []byte) (*formula.ImageOCRResponse, error) {
	reqBody := map[string]string{
		"image_base64": base64.StdEncoding.EncodeToString(imgBytes),
	}
	jsonData, err := json.Marshal(reqBody)
	if err != nil {
		return nil, err
	}
	headers := map[string]string{
		"Content-Type":           "application/json",
		utils.RequestIDHeaderKey: utils.GetRequestIDFromContext(ctx),
	}
	resp, err := s.httpClient.RequestWithRetry(ctx, http.MethodPost, pdfOCREndpoint, bytes.NewReader(jsonData), headers)
	if err != nil {
		return nil, fmt.Errorf("请求OCR接口失败: %w", err)
	}
	defer resp.Body.Close()
	// Treat any non-200 downstream response as a failure, so that an error
	// response body is never stored as a recognition result.
	if resp.StatusCode != http.StatusOK {
		body, _ := io.ReadAll(resp.Body)
		return nil, fmt.Errorf("OCR接口返回非200状态: %d, body: %s", resp.StatusCode, string(body))
	}
	var ocrResp formula.ImageOCRResponse
	if err := json.NewDecoder(resp.Body).Decode(&ocrResp); err != nil {
		return nil, fmt.Errorf("解析OCR响应失败: %w", err)
	}
	return &ocrResp, nil
}

// Stop signals the queue consumer to exit.
func (s *PDFRecognitionService) Stop() {
	close(s.stopChan)
}
```
- [ ] **Step 2: Verify it compiles**
```bash
go build ./internal/service/...
```
Expected: no errors
- [ ] **Step 3: Commit**
```bash
git add internal/service/pdf_recognition_service.go
git commit -m "feat: add PDFRecognitionService with 10-page pre-hook"
```
---
## Task 9: Handler — api/v1/pdf/handler.go
**Files:**
- Create: `api/v1/pdf/handler.go`
- [ ] **Step 1: Create the handler**
```go
// api/v1/pdf/handler.go
package pdf

import (
	"net/http"
	"path/filepath"
	"strings"

	pdfmodel "gitea.com/texpixel/document_ai/internal/model/pdf"
	"gitea.com/texpixel/document_ai/internal/service"
	"gitea.com/texpixel/document_ai/pkg/common"
	"gitea.com/texpixel/document_ai/pkg/constant"
	"github.com/gin-gonic/gin"
)

type PDFEndpoint struct {
	pdfService *service.PDFRecognitionService
}

func NewPDFEndpoint() *PDFEndpoint {
	return &PDFEndpoint{
		pdfService: service.NewPDFRecognitionService(),
	}
}

func (e *PDFEndpoint) CreateTask(c *gin.Context) {
	var req pdfmodel.CreatePDFRecognitionRequest
	// ShouldBindJSON leaves the response untouched on failure; BindJSON would
	// write a 400 status itself before the JSON error body below is sent.
	if err := c.ShouldBindJSON(&req); err != nil {
		c.JSON(http.StatusOK, common.ErrorResponse(c, common.CodeParamError, "参数错误"))
		return
	}
	req.UserID = c.GetInt64(constant.ContextUserID)
	if strings.ToLower(filepath.Ext(req.FileName)) != ".pdf" {
		c.JSON(http.StatusOK, common.ErrorResponse(c, common.CodeParamError, "仅支持PDF文件"))
		return
	}
	task, err := e.pdfService.CreatePDFTask(c, &req)
	if err != nil {
		c.JSON(http.StatusOK, common.ErrorResponse(c, common.CodeSystemError, err.Error()))
		return
	}
	c.JSON(http.StatusOK, common.SuccessResponse(c, &pdfmodel.CreatePDFTaskResponse{
		TaskNo: task.TaskUUID,
		Status: int(task.Status),
	}))
}

func (e *PDFEndpoint) GetTaskStatus(c *gin.Context) {
	var req pdfmodel.GetPDFTaskRequest
	if err := c.ShouldBindUri(&req); err != nil {
		c.JSON(http.StatusOK, common.ErrorResponse(c, common.CodeParamError, "参数错误"))
		return
	}
	resp, err := e.pdfService.GetPDFTask(c, req.TaskNo)
	if err != nil {
		// Pass the BusinessError code through so that a not-found case returns
		// CodeNotFound instead of being wrapped as CodeSystemError.
		if bizErr, ok := err.(*common.BusinessError); ok {
			c.JSON(http.StatusOK, common.ErrorResponse(c, int(bizErr.Code), bizErr.Message))
			return
		}
		c.JSON(http.StatusOK, common.ErrorResponse(c, common.CodeSystemError, err.Error()))
		return
	}
	c.JSON(http.StatusOK, common.SuccessResponse(c, resp))
}
```
- [ ] **Step 2: Verify**
```bash
go build ./api/...
```
- [ ] **Step 3: Commit**
```bash
git add api/v1/pdf/handler.go
git commit -m "feat: add PDF recognition HTTP handler"
```
---
## Task 10: Router + OSS Handler
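> **Note on the OSS size limit:** The current `GetSignatureURL` handler performs no file-size validation (it has no `file_size` input parameter); the size limit is enforced by the `content-length-range` condition of the Aliyun OSS Policy Token. Raising the upload limit for PDFs would require changing the Policy Token generation logic in `pkg/oss`, which is outside the scope of this Task. This Task only handles the file-type whitelist.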
**Files:**
- Modify: `api/router.go`
- Modify: `api/v1/oss/handler.go`
- [ ] **Step 1: Add the PDF import and routes in router.go**
Add to the import block:
```go
"gitea.com/texpixel/document_ai/api/v1/pdf"
```
Append to the end of the v1 block in `SetupRouter`:
```go
pdfRouter := v1.Group("/pdf", common.GetAuthMiddleware())
{
	endpoint := pdf.NewPDFEndpoint()
	pdfRouter.POST("/recognition", endpoint.CreateTask)
	pdfRouter.GET("/recognition/:task_no", endpoint.GetTaskStatus)
}
```
- [ ] **Step 2: Add .pdf to the whitelist in oss/handler.go**
Locate (`handler.go:73`):
```go
if !utils.InArray(extend, []string{".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp"}) {
```
Change it to:
```go
if !utils.InArray(extend, []string{".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp", ".pdf"}) {
```
- [ ] **Step 3: Verify the whole project compiles**
```bash
go build ./...
```
Expected: no errors
- [ ] **Step 4: Smoke-test the routes**
```bash
go run main.go &
curl -X GET http://localhost:8024/v1/pdf/recognition/fake-task-no \
-H "Authorization: Bearer YOUR_TOKEN"
```
Expected: `{"code":404,"message":"任务不存在",...}`. `GetByTaskNo` returns `ErrRecordNotFound` → the service returns a `CodeNotFound` BusinessError → the handler passes the error code through.
- [ ] **Step 5: Commit**
```bash
git add api/router.go api/v1/oss/handler.go
git commit -m "feat: register PDF routes and allow .pdf upload in OSS handler"
```
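The whitelist check compares the file extension against a fixed list. A minimal sketch of the same idea as a set lookup is shown below; it lower-cases the extension the way the PDF handler in Task 9 does. Whether `oss/handler.go` normalizes case before calling `utils.InArray` is not shown in this plan, so that part is an assumption, and `isAllowedUpload` is a hypothetical name.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// allowedExt is the upload whitelist after this task, including .pdf.
var allowedExt = map[string]bool{
	".jpg": true, ".jpeg": true, ".png": true, ".gif": true,
	".bmp": true, ".tiff": true, ".webp": true, ".pdf": true,
}

// isAllowedUpload reports whether the file name's final extension,
// compared case-insensitively, is on the whitelist.
func isAllowedUpload(name string) bool {
	return allowedExt[strings.ToLower(filepath.Ext(name))]
}

func main() {
	for _, name := range []string{"Report.PDF", "scan.png", "archive.pdf.zip", "noext"} {
		fmt.Println(name, isAllowedUpload(name))
	}
}
```

Only the final extension is considered, so a name such as `archive.pdf.zip` is rejected.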
---
## Frontend Interaction Flow
```
1. POST /v1/oss/signature_url { file_name: "doc.pdf", file_hash, file_size }
   → { sign_url, path: "formula/uuid.pdf" }
2. PUT sign_url (upload the PDF directly to OSS)
3. POST /v1/pdf/recognition { file_url, file_hash, file_name: "doc.pdf" }
   → { task_no: "uuid", status: 0 }
4. GET /v1/pdf/recognition/:task_no   (poll every 3 seconds)
   → status=1 { task_no, status:1, total_pages:0, pages:null }
5. When status=2:
   {
     "task_no": "uuid",
     "status": 2,
     "total_pages": 8,   ← number of pages actually processed (at most 10)
     "pages": [
       { "page_number": 1, "markdown": "# 第一章\n..." },
       { "page_number": 2, "markdown": "## 1.1\n..." }
     ]
   }
```
---
## Database Samples
```sql
-- Example row for a PDF task in the recognition_results table
INSERT INTO recognition_results (task_id, task_type, meta_data, content) VALUES (
123,
'PDF',
'{"total_num":8}',
'[{"page_number":1,"markdown":"# 第一章\n正文..."},{"page_number":2,"markdown":"## 1.1\n..."}]'
);
-- Example row for a FORMULA task
INSERT INTO recognition_results (task_id, task_type, meta_data, content) VALUES (
456,
'FORMULA',
'{"total_num":1}',
'{"latex":"E=mc^2","markdown":"$$E=mc^2$$","mathml":"<math>...</math>","mml":""}'
);
```
---
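## Self-Check Checklist
- [x] **Breaking change fully covered**: after the migration drops the legacy columns, `recognition_service.go` (3 read/write sites) and `task.go` (2 read sites: GetTaskList + ExportTask) are all updated in the same commit, so there is no intermediate state in which the service crashes
- [x] **Single-row storage**: the results for all pages of a PDF are stored as one JSON array in a single row; no new table is added
- [x] **pre-hook**: processPDFTask clamps processPages to ≤ 10 before the page loop and logs the truncation
- [x] **OCR endpoint reuse**: PDF and formula recognition call the same downstream endpoint with an identical request format (image_base64)
- [x] **GetPDFTask type check**: after loading the task, TaskType == PDF is verified; a mismatch returns CodeNotFound, so a formula task is never parsed as a PDF
- [x] **callOCR StatusCode check**: a non-200 downstream response returns an error immediately, without decoding the body as an OCR result, so an error response is never stored as a recognition result
- [x] **Handler error-code passthrough**: GetTaskStatus checks for `*common.BusinessError` and passes its Code field through, so a not-found case correctly returns code=404
- [x] **meta_data.total_num**: formula = 1; PDF = number of pages actually processed
- [x] **Error recovery**: a defer guarantees the task status is updated to FAILED on failure
- [x] **Timeout**: PDF tasks time out after 10 minutes (10 pages × ~45 s)
- [x] **OSS size limit**: the handler has no code-side size check; the limit is governed by the content-length-range of the OSS Policy Token; this plan only extends the file-type whitelist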