PDF Recognition Implementation Plan
For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.
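Goal: Support page-by-page OCR recognition of PDFs (at most 10 pages), and in the same change refactor the recognition_results table to a JSON content structure that serves both formula recognition and PDF recognition.
Architecture: recognition_results stores one row per task: the meta_data JSON column holds metadata (total_num) and the content JSON column holds the recognized content (formula: {latex, markdown, mathml, mml}; PDF: [{page_number, markdown}, ...]). PDF pipeline: download → render pages with go-fitz → pre-hook caps processing at the first 10 pages → call the existing downstream OCR endpoint once per page → assemble JSON → write to the DB.
Tech Stack: Go 1.20, Gin, GORM/MySQL, Redis, Aliyun OSS, github.com/gen2brain/go-fitz v0.24.0, the existing downstream OCR endpoint at cloud.texpixel.com
Table schema design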
recognition_results
├── id BIGINT PK
├── task_id BIGINT INDEX
├── task_type VARCHAR(16) -- FORMULA / PDF
├── meta_data JSON -- {"total_num": 1}
├── content JSON -- see the notes below
├── created_at DATETIME
└── updated_at DATETIME
content format (by task_type):
FORMULA: {"latex":"E=mc^2","markdown":"$$E=mc^2$$","mathml":"<math>...</math>","mml":""}
PDF: [{"page_number":1,"markdown":"# 第一章\n..."},{"page_number":2,"markdown":"..."}]
The old columns latex / markdown / mathml / mml are all dropped; the content JSON column takes over their role.
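Because content has two different shapes, a reader has to choose the decoder from task_type. The following is a minimal sketch of that dispatch, assuming local copies of the two content shapes; `summarize` is a hypothetical helper for illustration and is not part of the plan's code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local copies of the content shapes described above (illustrative only).
type FormulaContent struct {
	Latex    string `json:"latex"`
	Markdown string `json:"markdown"`
	MathML   string `json:"mathml"`
	MML      string `json:"mml"`
}

type PDFPageContent struct {
	PageNumber int    `json:"page_number"`
	Markdown   string `json:"markdown"`
}

// summarize is a hypothetical helper: it picks the decoder from task_type
// and returns a one-line description of the decoded content.
func summarize(taskType, content string) (string, error) {
	switch taskType {
	case "FORMULA":
		var fc FormulaContent
		if err := json.Unmarshal([]byte(content), &fc); err != nil {
			return "", err
		}
		return "formula: " + fc.Latex, nil
	case "PDF":
		var pages []PDFPageContent
		if err := json.Unmarshal([]byte(content), &pages); err != nil {
			return "", err
		}
		return fmt.Sprintf("pdf: %d page(s)", len(pages)), nil
	default:
		return "", fmt.Errorf("unknown task_type %q", taskType)
	}
}

func main() {
	s, err := summarize("PDF", `[{"page_number":1,"markdown":"# A"},{"page_number":2,"markdown":"B"}]`)
	fmt.Println(s, err)
}
```

Decoding with the wrong shape (for example a FORMULA object into the PDF slice) returns an error rather than silently producing empty data, which is why callers should check task_type first.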
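File change list
| Action | File path | Responsibility |
|---|---|---|
| Create | migrations/pdf_recognition.sql | ALTER recognition_results: drop the old columns, add the meta_data / content JSON columns |
| Modify | internal/storage/dao/task.go | Add the TaskTypePDF constant |
| Modify | internal/storage/dao/result.go | Refactor the RecognitionResult struct; add content-type helper structs; update the DAO methods |
| Create | internal/model/pdf/request.go | PDF recognition request/response DTOs |
| Create | internal/storage/cache/pdf.go | Redis queue operations (PDF only) |
| Modify | internal/service/recognition_service.go | Update processFormulaTask / processVLFormulaTask / GetFormualTask to use the new JSON format |
| Modify | internal/service/task.go | Update GetTaskList / ExportTask to read from the content JSON |
| Create | internal/service/pdf_recognition_service.go | PDF recognition business logic |
| Create | api/v1/pdf/handler.go | HTTP handlers |
| Modify | api/router.go | Register the PDF routes |
| Modify | api/v1/oss/handler.go | Add .pdf to the file-type whitelist (the size limit is governed by the OSS Policy Token and is out of scope; see Task 10) |
| Modify | go.mod / go.sum | Add the go-fitz dependency |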
Environment prerequisite: install MuPDF (go-fitz CGo dependency)
# macOS
brew install mupdf
# Ubuntu/Debian
sudo apt-get install -y libmupdf-dev
# Verify
pkg-config --modversion mupdf
Task 1: Database migration (refactor recognition_results)
Files:
- Create: migrations/pdf_recognition.sql
- Step 1: Create the migration file
-- migrations/pdf_recognition.sql
-- 1. Drop the old per-format columns (back up existing data first if it is still needed)
ALTER TABLE `recognition_results`
DROP COLUMN `latex`,
DROP COLUMN `markdown`,
DROP COLUMN `mathml`,
DROP COLUMN `mml`;
-- 2. Add the JSON columns
ALTER TABLE `recognition_results`
ADD COLUMN `meta_data` JSON DEFAULT NULL COMMENT '元数据 {"total_num":1}' AFTER `task_type`,
ADD COLUMN `content` JSON DEFAULT NULL COMMENT '识别内容 JSON' AFTER `meta_data`;
- Step 2: Run the migration
mysql -u root -p doc_ai < migrations/pdf_recognition.sql
Expected: Query OK
- Step 3: Verify the table schema
mysql -u root -p doc_ai -e "DESCRIBE recognition_results;"
Expected: the columns are id, task_id, task_type, meta_data, content, created_at, updated_at (no latex/markdown/mathml/mml)
- Step 4: Commit
git add migrations/pdf_recognition.sql
git commit -m "feat: migrate recognition_results to JSON content schema"
Task 2: Add the go-fitz dependency
Files:
- Modify: go.mod
- Step 1: Install the dependency
go get github.com/gen2brain/go-fitz@v0.24.0
Expected: go: added github.com/gen2brain/go-fitz v0.24.0
- Step 2: Verify
go build ./...
- Step 3: Commit
git add go.mod go.sum
git commit -m "feat: add go-fitz for PDF page rendering"
Task 3: Extend the constants
Files:
- Modify: internal/storage/dao/task.go
- Step 1: Add TaskTypePDF
Find the const block and change:
TaskTypeLayout TaskType = "LAYOUT"
to:
TaskTypeLayout TaskType = "LAYOUT"
TaskTypePDF TaskType = "PDF"
- Step 2: Verify
go build ./internal/storage/dao/...
- Step 3: Commit
git add internal/storage/dao/task.go
git commit -m "feat: add TaskTypePDF constant"
Task 4: DAO (refactor RecognitionResult)
Files:
- Modify: internal/storage/dao/result.go
- Step 1: Replace the entire contents of result.go with the new struct definitions
package dao
import (
"encoding/json"
"gorm.io/gorm"
)
// FormulaContent is the structure of the content field for formula recognition
type FormulaContent struct {
Latex string `json:"latex"`
Markdown string `json:"markdown"`
MathML string `json:"mathml"`
MML string `json:"mml"`
}
// PDFPageContent is the recognition result of a single PDF page
type PDFPageContent struct {
PageNumber int `json:"page_number"`
Markdown string `json:"markdown"`
}
// ResultMetaData is the structure of the recognition_results.meta_data field
type ResultMetaData struct {
TotalNum int `json:"total_num"`
}
// RecognitionResult is the model for the recognition_results table
type RecognitionResult struct {
BaseModel
TaskID int64 `gorm:"column:task_id;type:bigint;not null;default:0;index;comment:任务ID" json:"task_id"`
TaskType TaskType `gorm:"column:task_type;type:varchar(16);not null;comment:任务类型;default:''" json:"task_type"`
MetaData string `gorm:"column:meta_data;type:json;comment:元数据" json:"meta_data"`
Content string `gorm:"column:content;type:json;comment:识别内容JSON" json:"content"`
}
// SetMetaData serializes meta and stores it in the MetaData field
func (r *RecognitionResult) SetMetaData(meta ResultMetaData) error {
b, err := json.Marshal(meta)
if err != nil {
return err
}
r.MetaData = string(b)
return nil
}
// GetFormulaContent deserializes a formula result from the Content field
func (r *RecognitionResult) GetFormulaContent() (*FormulaContent, error) {
var c FormulaContent
if err := json.Unmarshal([]byte(r.Content), &c); err != nil {
return nil, err
}
return &c, nil
}
// GetPDFContent deserializes the per-page PDF results from the Content field
func (r *RecognitionResult) GetPDFContent() ([]PDFPageContent, error) {
var pages []PDFPageContent
if err := json.Unmarshal([]byte(r.Content), &pages); err != nil {
return nil, err
}
return pages, nil
}
// MarshalFormulaContent serializes a formula result to a JSON string (to be written to Content)
func MarshalFormulaContent(c FormulaContent) (string, error) {
b, err := json.Marshal(c)
return string(b), err
}
// MarshalPDFContent serializes per-page PDF results to a JSON string (to be written to Content)
func MarshalPDFContent(pages []PDFPageContent) (string, error) {
b, err := json.Marshal(pages)
return string(b), err
}
type RecognitionResultDao struct{}
func NewRecognitionResultDao() *RecognitionResultDao {
return &RecognitionResultDao{}
}
func (dao *RecognitionResultDao) Create(tx *gorm.DB, data RecognitionResult) error {
return tx.Create(&data).Error
}
// GetByTaskID returns the result row for a task, or (nil, nil) when no row exists
func (dao *RecognitionResultDao) GetByTaskID(tx *gorm.DB, taskID int64) (*RecognitionResult, error) {
result := &RecognitionResult{}
err := tx.Where("task_id = ?", taskID).First(result).Error
if err == gorm.ErrRecordNotFound {
return nil, nil
}
if err != nil {
return nil, err
}
return result, nil
}
func (dao *RecognitionResultDao) Update(tx *gorm.DB, id int64, updates map[string]interface{}) error {
return tx.Model(&RecognitionResult{}).Where("id = ?", id).Updates(updates).Error
}
- Step 2: Verify the build
go build ./internal/storage/dao/...
Expected: no errors
- Step 3: Commit
git add internal/storage/dao/result.go
git commit -m "refactor: RecognitionResult to JSON content schema (meta_data + content)"
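Task 5: Update formula recognition and TaskService to use the new JSON format
Files:
- Modify: internal/service/recognition_service.go
- Modify: internal/service/task.go
Note: the migration drops the latex/markdown/mathml/mml columns, and both GetTaskList (:98-101) and ExportTask (:151) in task.go read those fields directly. They must be updated in the same commit as recognition_service.go; otherwise the service breaks as soon as the migration is applied.
- Step 1: Update recognition_service.go: the processFormulaTask write path
Find the resultDao.Create call inside processFormulaTask (around line 542):
// Old code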
err = resultDao.Create(tx, dao.RecognitionResult{
TaskID: taskID,
TaskType: dao.TaskTypeFormula,
Latex: ocrResp.Latex,
Markdown: ocrResp.Markdown,
MathML: ocrResp.MathML,
MML: ocrResp.MML,
})
Replace it with:
// New code
contentJSON, err := dao.MarshalFormulaContent(dao.FormulaContent{
Latex: ocrResp.Latex,
Markdown: ocrResp.Markdown,
MathML: ocrResp.MathML,
MML: ocrResp.MML,
})
if err != nil {
log.Error(ctx, "func", "processFormulaTask", "msg", "序列化公式内容失败", "error", err)
return err
}
result := dao.RecognitionResult{
TaskID: taskID,
TaskType: dao.TaskTypeFormula,
Content: contentJSON,
}
if err = result.SetMetaData(dao.ResultMetaData{TotalNum: 1}); err != nil {
log.Error(ctx, "func", "processFormulaTask", "msg", "序列化MetaData失败", "error", err)
return err
}
err = resultDao.Create(tx, result)
if err != nil {
log.Error(ctx, "func", "processFormulaTask", "msg", "保存任务结果失败", "error", err)
return err
}
- Step 2: Update recognition_service.go: the processVLFormulaTask write path
Find the resultDao.Create / resultDao.Update calls inside processVLFormulaTask (around lines 665-678):
On create:
contentJSON, err := dao.MarshalFormulaContent(dao.FormulaContent{Latex: latex})
if err != nil {
log.Error(ctx, "func", "processVLFormulaTask", "msg", "序列化公式内容失败", "error", err)
return err
}
newResult := dao.RecognitionResult{TaskID: taskID, TaskType: dao.TaskTypeFormula, Content: contentJSON}
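if err = newResult.SetMetaData(dao.ResultMetaData{TotalNum: 1}); err != nil {
log.Error(ctx, "func", "processVLFormulaTask", "msg", "序列化MetaData失败", "error", err)
return err
}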
err = resultDao.Create(dao.DB.WithContext(ctx), newResult)
On update:
contentJSON, err := dao.MarshalFormulaContent(dao.FormulaContent{Latex: latex})
if err != nil {
log.Error(ctx, "func", "processVLFormulaTask", "msg", "序列化公式内容失败", "error", err)
return err
}
err = resultDao.Update(dao.DB.WithContext(ctx), result.ID, map[string]interface{}{"content": contentJSON})
- Step 3: Update recognition_service.go: the GetFormualTask read path
Find GetFormualTask (around line 134) and locate the code that reads the old fields:
// Old code: reads taskRet.Latex / taskRet.Markdown / taskRet.MathML / taskRet.MML directly
markdown := taskRet.Markdown
if markdown == "" {
markdown = fmt.Sprintf("$$%s$$", taskRet.Latex)
}
return &formula.GetFormulaTaskResponse{
TaskNo: taskNo,
Latex: taskRet.Latex,
Markdown: markdown,
MathML: taskRet.MathML,
MML: taskRet.MML,
Status: int(task.Status),
}, nil
Replace it with:
// New code
formulaContent, err := taskRet.GetFormulaContent()
if err != nil {
log.Error(ctx, "func", "GetFormualTask", "msg", "解析公式内容失败", "error", err)
return nil, common.NewError(common.CodeSystemError, "解析识别结果失败", err)
}
markdown := formulaContent.Markdown
if markdown == "" {
markdown = fmt.Sprintf("$$%s$$", formulaContent.Latex)
}
return &formula.GetFormulaTaskResponse{
TaskNo: taskNo,
Latex: formulaContent.Latex,
Markdown: markdown,
MathML: formulaContent.MathML,
MML: formulaContent.MML,
Status: int(task.Status),
}, nil
- Step 4: Update task.go: result reads in GetTaskList (:91-119)
Find the DTO assembly code in GetTaskList:
// Old code
var latex, markdown, mathML, mml string
recognitionResult := recognitionResultMap[item.ID]
if recognitionResult != nil {
latex = recognitionResult.Latex
markdown = recognitionResult.Markdown
mathML = recognitionResult.MathML
mml = recognitionResult.MML
}
Replace it with:
// New code: deserialize content according to task_type
var latex, markdown, mathML, mml string
recognitionResult := recognitionResultMap[item.ID]
if recognitionResult != nil && recognitionResult.TaskType == dao.TaskTypeFormula {
if fc, err := recognitionResult.GetFormulaContent(); err == nil {
latex = fc.Latex
markdown = fc.Markdown
mathML = fc.MathML
mml = fc.MML
}
}
// For PDF tasks the TaskListDTO does not expand content for now (the list page only shows status)
- Step 5: Update task.go: the markdown read in ExportTask (:140-155)
Find the code in ExportTask that reads markdown:
// Old code
markdown := recognitionResult.Markdown
if markdown == "" {
log.Error(ctx, "func", "ExportTask", "msg", "markdown not found")
return nil, "", errors.New("markdown not found")
}
Replace it with:
// New code: parse content according to task_type
var markdown string
switch recognitionResult.TaskType {
case dao.TaskTypeFormula:
fc, err := recognitionResult.GetFormulaContent()
if err != nil || fc.Markdown == "" {
log.Error(ctx, "func", "ExportTask", "msg", "公式结果解析失败或markdown为空", "error", err)
return nil, "", errors.New("markdown not found")
}
markdown = fc.Markdown
default:
log.Error(ctx, "func", "ExportTask", "msg", "不支持的导出任务类型", "task_type", recognitionResult.TaskType)
return nil, "", errors.New("unsupported task type for export")
}
- Step 6: Verify the build
go build ./internal/service/...
Expected: no errors
- Step 7: Commit
git add internal/service/recognition_service.go internal/service/task.go
git commit -m "refactor: adapt all recognition result reads/writes to JSON content schema"
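The markdown fallback in GetFormualTask matters for rows written by processVLFormulaTask, which stores only latex in content. A minimal sketch of that fallback in isolation, assuming a trimmed local copy of FormulaContent; `displayMarkdown` is a hypothetical name used only for illustration.

```go
package main

import "fmt"

// FormulaContent is a trimmed local copy with only the fields the fallback needs.
type FormulaContent struct {
	Latex    string
	Markdown string
}

// displayMarkdown returns the stored markdown, or wraps the LaTeX in $$...$$
// when no markdown was stored (the same rule GetFormualTask applies).
func displayMarkdown(fc FormulaContent) string {
	if fc.Markdown != "" {
		return fc.Markdown
	}
	return fmt.Sprintf("$$%s$$", fc.Latex)
}

func main() {
	fmt.Println(displayMarkdown(FormulaContent{Latex: "E=mc^2"}))
	fmt.Println(displayMarkdown(FormulaContent{Latex: "x", Markdown: "inline $x$"}))
}
```

Note that ExportTask, as written above, does not apply this fallback: it treats an empty markdown as an error.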
Task 6: Cache (PDF Redis queue)
Files:
- Create: internal/storage/cache/pdf.go
- Step 1: Create pdf.go
// internal/storage/cache/pdf.go
package cache
import (
"context"
"strconv"
)
const (
PDFRecognitionTaskQueue = "pdf_recognition_queue"
PDFRecognitionDistLock = "pdf_recognition_dist_lock"
)
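// PushPDFTask enqueues a task ID at the head of the PDF queue and returns the new queue length
func PushPDFTask(ctx context.Context, taskID int64) (int64, error) {
return RedisClient.LPush(ctx, PDFRecognitionTaskQueue, taskID).Result()
}
// PopPDFTask blocks until a task ID is available at the tail of the queue (timeout 0 waits indefinitely)
func PopPDFTask(ctx context.Context) (int64, error) {
result, err := RedisClient.BRPop(ctx, 0, PDFRecognitionTaskQueue).Result()
if err != nil {
return 0, err
}
// BRPop returns [key, value]; the task ID is the value
return strconv.ParseInt(result[1], 10, 64)
}
// GetPDFDistributedLock tries to take the PDF consumer lock; it returns true only for the caller that set the key
func GetPDFDistributedLock(ctx context.Context) (bool, error) {
return RedisClient.SetNX(ctx, PDFRecognitionDistLock, "locked", DefaultLockTimeout).Result()
}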
- Step 2: Verify
go build ./internal/storage/cache/...
- Step 3: Commit
git add internal/storage/cache/pdf.go
git commit -m "feat: add PDF recognition Redis queue"
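Pushing with LPUSH and popping with BRPOP makes the Redis list behave as a FIFO queue: new IDs enter at the head and the consumer takes from the tail. The sketch below models only that ordering with an in-memory slice; `listQueue` is a hypothetical stand-in, not a Redis client, and it does not model blocking.

```go
package main

import "fmt"

// listQueue is an in-memory stand-in for a Redis list (illustrative only).
type listQueue struct{ items []int64 }

// lpush inserts at the head, like LPUSH, and returns the new length.
func (q *listQueue) lpush(id int64) int {
	q.items = append([]int64{id}, q.items...)
	return len(q.items)
}

// rpop removes from the tail, like RPOP; ok is false when the list is empty.
func (q *listQueue) rpop() (id int64, ok bool) {
	if len(q.items) == 0 {
		return 0, false
	}
	last := len(q.items) - 1
	id = q.items[last]
	q.items = q.items[:last]
	return id, true
}

func main() {
	q := &listQueue{}
	for _, id := range []int64{101, 102, 103} {
		q.lpush(id)
	}
	for {
		id, ok := q.rpop()
		if !ok {
			break
		}
		fmt.Println(id) // prints 101, then 102, then 103
	}
}
```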
Task 7: Model (PDF request/response DTOs)
Files:
- Create: internal/model/pdf/request.go
- Step 1: Create the file
// internal/model/pdf/request.go
package pdf
// CreatePDFRecognitionRequest creates a PDF recognition task
type CreatePDFRecognitionRequest struct {
FileURL string `json:"file_url" binding:"required"`
FileHash string `json:"file_hash" binding:"required"`
FileName string `json:"file_name" binding:"required"`
UserID int64 `json:"user_id"`
}
// GetPDFTaskRequest holds the URI parameters
type GetPDFTaskRequest struct {
TaskNo string `uri:"task_no" binding:"required"`
}
// CreatePDFTaskResponse is the response to task creation
type CreatePDFTaskResponse struct {
TaskNo string `json:"task_no"`
Status int `json:"status"`
}
// PDFPageResult is a single-page result (mirrors dao.PDFPageContent)
type PDFPageResult struct {
PageNumber int `json:"page_number"`
Markdown string `json:"markdown"`
}
// GetPDFTaskResponse reports task status and results
type GetPDFTaskResponse struct {
TaskNo string `json:"task_no"`
Status int `json:"status"` // 0=PENDING 1=PROCESSING 2=COMPLETED 3=FAILED
TotalPages int `json:"total_pages"` // number of pages actually processed
Pages []PDFPageResult `json:"pages"` // populated when status=2
}
- Step 2: Verify
go build ./internal/model/pdf/...
- Step 3: Commit
git add internal/model/pdf/request.go
git commit -m "feat: add PDF recognition request/response models"
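One detail of the response DTO is worth checking before clients depend on it: encoding/json writes a nil slice as null, so a response meant to carry "pages": [] needs an empty, non-nil slice. A short sketch using local copies of the DTOs; `encode` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local copies of the DTOs defined above.
type PDFPageResult struct {
	PageNumber int    `json:"page_number"`
	Markdown   string `json:"markdown"`
}

type GetPDFTaskResponse struct {
	TaskNo     string          `json:"task_no"`
	Status     int             `json:"status"`
	TotalPages int             `json:"total_pages"`
	Pages      []PDFPageResult `json:"pages"`
}

// encode marshals a response and returns "" on failure (demo helper).
func encode(resp GetPDFTaskResponse) string {
	b, err := json.Marshal(resp)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// Pages left nil: encoded as null.
	fmt.Println(encode(GetPDFTaskResponse{TaskNo: "uuid", Status: 1}))
	// Pages set to an empty slice: encoded as [].
	fmt.Println(encode(GetPDFTaskResponse{TaskNo: "uuid", Status: 1, Pages: []PDFPageResult{}}))
}
```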
Task 8: Service (PDFRecognitionService)
Files:
- Create: internal/service/pdf_recognition_service.go
- Step 1: Create the service file
// internal/service/pdf_recognition_service.go
package service
import (
"bytes"
"context"
"encoding/base64"
"encoding/json"
"fmt"
"io"
"net/http"
"time"
"github.com/gen2brain/go-fitz"
"gitea.com/texpixel/document_ai/internal/model/formula"
pdfmodel "gitea.com/texpixel/document_ai/internal/model/pdf"
"gitea.com/texpixel/document_ai/internal/storage/cache"
"gitea.com/texpixel/document_ai/internal/storage/dao"
"gitea.com/texpixel/document_ai/pkg/common"
"gitea.com/texpixel/document_ai/pkg/httpclient"
"gitea.com/texpixel/document_ai/pkg/log"
"gitea.com/texpixel/document_ai/pkg/oss"
"gitea.com/texpixel/document_ai/pkg/requestid"
"gitea.com/texpixel/document_ai/pkg/utils"
"gorm.io/gorm"
)
const (
pdfMaxPages = 10
pdfOCREndpoint = "https://cloud.texpixel.com:10443/doc_process/v1/image/ocr"
)
// PDFRecognitionService handles PDF recognition tasks
type PDFRecognitionService struct {
db *gorm.DB
queueLimit chan struct{}
stopChan chan struct{}
httpClient *httpclient.Client
}
func NewPDFRecognitionService() *PDFRecognitionService {
s := &PDFRecognitionService{
db: dao.DB,
queueLimit: make(chan struct{}, 3),
stopChan: make(chan struct{}),
httpClient: httpclient.NewClient(nil),
}
utils.SafeGo(func() {
lock, err := cache.GetPDFDistributedLock(context.Background())
if err != nil || !lock {
log.Error(context.Background(), "func", "NewPDFRecognitionService", "msg", "获取PDF分布式锁失败")
return
}
s.processPDFQueue(context.Background())
})
return s
}
// CreatePDFTask creates a recognition task and enqueues it
func (s *PDFRecognitionService) CreatePDFTask(ctx context.Context, req *pdfmodel.CreatePDFRecognitionRequest) (*dao.RecognitionTask, error) {
task := &dao.RecognitionTask{
UserID: req.UserID,
TaskUUID: utils.NewUUID(),
TaskType: dao.TaskTypePDF,
Status: dao.TaskStatusPending,
FileURL: req.FileURL,
FileName: req.FileName,
FileHash: req.FileHash,
IP: common.GetIPFromContext(ctx),
}
if err := dao.NewRecognitionTaskDao().Create(dao.DB.WithContext(ctx), task); err != nil {
log.Error(ctx, "func", "CreatePDFTask", "msg", "创建任务失败", "error", err)
return nil, common.NewError(common.CodeDBError, "创建任务失败", err)
}
if _, err := cache.PushPDFTask(ctx, task.ID); err != nil {
log.Error(ctx, "func", "CreatePDFTask", "msg", "推入队列失败", "error", err)
return nil, common.NewError(common.CodeSystemError, "推入队列失败", err)
}
return task, nil
}
// GetPDFTask returns task status and results
func (s *PDFRecognitionService) GetPDFTask(ctx context.Context, taskNo string) (*pdfmodel.GetPDFTaskResponse, error) {
sess := dao.DB.WithContext(ctx)
task, err := dao.NewRecognitionTaskDao().GetByTaskNo(sess, taskNo)
if err != nil {
if err == gorm.ErrRecordNotFound {
return nil, common.NewError(common.CodeNotFound, "任务不存在", err)
}
return nil, common.NewError(common.CodeDBError, "查询任务失败", err)
}
// Type check: prevent a formula task from being parsed as a PDF
if task.TaskType != dao.TaskTypePDF {
return nil, common.NewError(common.CodeNotFound, "任务不存在", nil)
}
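resp := &pdfmodel.GetPDFTaskResponse{
TaskNo: taskNo,
Status: int(task.Status),
Pages: []pdfmodel.PDFPageResult{}, // non-nil so unfinished tasks encode "pages": [] rather than null
}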
if task.Status != dao.TaskStatusCompleted {
return resp, nil
}
result, err := dao.NewRecognitionResultDao().GetByTaskID(sess, task.ID)
if err != nil || result == nil {
return nil, common.NewError(common.CodeDBError, "查询识别结果失败", err)
}
pages, err := result.GetPDFContent()
if err != nil {
return nil, common.NewError(common.CodeSystemError, "解析识别结果失败", err)
}
resp.TotalPages = len(pages)
for _, p := range pages {
resp.Pages = append(resp.Pages, pdfmodel.PDFPageResult{
PageNumber: p.PageNumber,
Markdown: p.Markdown,
})
}
return resp, nil
}
// processPDFQueue consumes the queue continuously
func (s *PDFRecognitionService) processPDFQueue(ctx context.Context) {
for {
select {
case <-s.stopChan:
return
default:
s.processOnePDFTask(ctx)
}
}
}
func (s *PDFRecognitionService) processOnePDFTask(ctx context.Context) {
s.queueLimit <- struct{}{}
defer func() { <-s.queueLimit }()
taskID, err := cache.PopPDFTask(ctx)
if err != nil {
log.Error(ctx, "func", "processOnePDFTask", "msg", "获取任务失败", "error", err)
return
}
task, err := dao.NewRecognitionTaskDao().GetTaskByID(dao.DB.WithContext(ctx), taskID)
if err != nil || task == nil {
log.Error(ctx, "func", "processOnePDFTask", "msg", "任务不存在", "task_id", taskID)
return
}
ctx = context.WithValue(ctx, utils.RequestIDKey, task.TaskUUID)
requestid.SetRequestID(task.TaskUUID, func() {
if err := s.processPDFTask(ctx, taskID, task.FileURL); err != nil {
log.Error(ctx, "func", "processOnePDFTask", "msg", "处理PDF任务失败", "error", err)
}
})
}
// processPDFTask is the core pipeline: download → pre-hook → per-page OCR → write to DB
func (s *PDFRecognitionService) processPDFTask(ctx context.Context, taskID int64, fileURL string) error {
ctx, cancel := context.WithTimeout(ctx, 10*time.Minute)
defer cancel()
taskDao := dao.NewRecognitionTaskDao()
resultDao := dao.NewRecognitionResultDao()
isSuccess := false
defer func() {
status, remark := dao.TaskStatusFailed, "任务处理失败"
if isSuccess {
status, remark = dao.TaskStatusCompleted, ""
}
_ = taskDao.Update(dao.DB.WithContext(context.Background()),
map[string]interface{}{"id": taskID},
map[string]interface{}{"status": status, "completed_at": time.Now(), "remark": remark},
)
}()
// Mark the task as processing
if err := taskDao.Update(dao.DB.WithContext(ctx),
map[string]interface{}{"id": taskID},
map[string]interface{}{"status": dao.TaskStatusProcessing},
); err != nil {
return fmt.Errorf("更新任务状态失败: %w", err)
}
// Download the PDF
reader, err := oss.DownloadFile(ctx, fileURL)
if err != nil {
return fmt.Errorf("下载PDF失败: %w", err)
}
defer reader.Close()
pdfBytes, err := io.ReadAll(reader)
if err != nil {
return fmt.Errorf("读取PDF数据失败: %w", err)
}
// Open the PDF
doc, err := fitz.NewFromMemory(pdfBytes)
if err != nil {
return fmt.Errorf("解析PDF失败: %w", err)
}
defer doc.Close()
// pre-hook: process at most the first 10 pages
totalInDoc := doc.NumPage()
processPages := totalInDoc
if processPages > pdfMaxPages {
processPages = pdfMaxPages
log.Info(ctx, "func", "processPDFTask", "msg", "PDF超过10页,只处理前10页",
"task_id", taskID, "doc_total", totalInDoc)
}
log.Info(ctx, "func", "processPDFTask", "msg", "开始处理PDF",
"task_id", taskID, "process_pages", processPages)
// Render and OCR page by page, collecting the results
var pages []dao.PDFPageContent
for pageNum := 0; pageNum < processPages; pageNum++ {
imgBytes, err := doc.ImagePNG(pageNum, 150) // 150 DPI
if err != nil {
return fmt.Errorf("渲染第%d页失败: %w", pageNum+1, err)
}
ocrResult, err := s.callOCR(ctx, imgBytes)
if err != nil {
return fmt.Errorf("OCR第%d页失败: %w", pageNum+1, err)
}
pages = append(pages, dao.PDFPageContent{
PageNumber: pageNum + 1,
Markdown: ocrResult.Markdown,
})
log.Info(ctx, "func", "processPDFTask", "msg", "页面OCR完成",
"page", pageNum+1, "total", processPages)
}
// Serialize and write to the DB (single row)
contentJSON, err := dao.MarshalPDFContent(pages)
if err != nil {
return fmt.Errorf("序列化PDF内容失败: %w", err)
}
dbResult := dao.RecognitionResult{
TaskID: taskID,
TaskType: dao.TaskTypePDF,
Content: contentJSON,
}
if err := dbResult.SetMetaData(dao.ResultMetaData{TotalNum: processPages}); err != nil {
return fmt.Errorf("序列化MetaData失败: %w", err)
}
if err := resultDao.Create(dao.DB.WithContext(ctx), dbResult); err != nil {
return fmt.Errorf("保存PDF结果失败: %w", err)
}
isSuccess = true
return nil
}
// callOCR calls the same downstream OCR endpoint used by formula recognition
func (s *PDFRecognitionService) callOCR(ctx context.Context, imgBytes []byte) (*formula.ImageOCRResponse, error) {
reqBody := map[string]string{
"image_base64": base64.StdEncoding.EncodeToString(imgBytes),
}
jsonData, err := json.Marshal(reqBody)
if err != nil {
return nil, err
}
headers := map[string]string{
"Content-Type": "application/json",
utils.RequestIDHeaderKey: utils.GetRequestIDFromContext(ctx),
}
resp, err := s.httpClient.RequestWithRetry(ctx, http.MethodPost, pdfOCREndpoint, bytes.NewReader(jsonData), headers)
if err != nil {
return nil, fmt.Errorf("请求OCR接口失败: %w", err)
}
defer resp.Body.Close()
// Treat any non-200 downstream status as a failure so an error body is never stored as a recognition result
if resp.StatusCode != http.StatusOK {
body, _ := io.ReadAll(resp.Body)
return nil, fmt.Errorf("OCR接口返回非200状态: %d, body: %s", resp.StatusCode, string(body))
}
var ocrResp formula.ImageOCRResponse
if err := json.NewDecoder(resp.Body).Decode(&ocrResp); err != nil {
return nil, fmt.Errorf("解析OCR响应失败: %w", err)
}
return &ocrResp, nil
}
func (s *PDFRecognitionService) Stop() {
close(s.stopChan)
}
- Step 2: Verify the build
go build ./internal/service/...
Expected: no errors
- Step 3: Commit
git add internal/service/pdf_recognition_service.go
git commit -m "feat: add PDFRecognitionService with 10-page pre-hook"
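The pre-hook in processPDFTask is a simple clamp, and it is easy to get the zero-based render index and the one-based page_number mixed up. A minimal sketch of the same logic as a pure function, which makes it testable without MuPDF; `pagePlan` and `maxPages` are hypothetical names, with `maxPages` mirroring pdfMaxPages.

```go
package main

import "fmt"

const maxPages = 10 // mirrors pdfMaxPages in the service

// pagePlan returns the zero-based page indexes to render and whether the
// document was truncated to the first maxPages pages.
func pagePlan(totalInDoc int) (indexes []int, truncated bool) {
	n := totalInDoc
	if n > maxPages {
		n, truncated = maxPages, true
	}
	if n < 0 {
		n = 0
	}
	indexes = make([]int, 0, n)
	for i := 0; i < n; i++ {
		indexes = append(indexes, i)
	}
	return indexes, truncated
}

func main() {
	indexes, truncated := pagePlan(25)
	for _, idx := range indexes {
		// The renderer takes the zero-based idx; the stored page_number is idx+1.
		fmt.Printf("render index %d -> page_number %d\n", idx, idx+1)
	}
	fmt.Println("truncated:", truncated)
}
```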
Task 9: Handler (api/v1/pdf/handler.go)
Files:
- Create: api/v1/pdf/handler.go
- Step 1: Create the handler
// api/v1/pdf/handler.go
package pdf
import (
"net/http"
"path/filepath"
"strings"
pdfmodel "gitea.com/texpixel/document_ai/internal/model/pdf"
"gitea.com/texpixel/document_ai/internal/service"
"gitea.com/texpixel/document_ai/pkg/common"
"gitea.com/texpixel/document_ai/pkg/constant"
"github.com/gin-gonic/gin"
)
type PDFEndpoint struct {
pdfService *service.PDFRecognitionService
}
func NewPDFEndpoint() *PDFEndpoint {
return &PDFEndpoint{
pdfService: service.NewPDFRecognitionService(),
}
}
func (e *PDFEndpoint) CreateTask(c *gin.Context) {
var req pdfmodel.CreatePDFRecognitionRequest
if err := c.BindJSON(&req); err != nil {
c.JSON(http.StatusOK, common.ErrorResponse(c, common.CodeParamError, "参数错误"))
return
}
req.UserID = c.GetInt64(constant.ContextUserID)
if strings.ToLower(filepath.Ext(req.FileName)) != ".pdf" {
c.JSON(http.StatusOK, common.ErrorResponse(c, common.CodeParamError, "仅支持PDF文件"))
return
}
task, err := e.pdfService.CreatePDFTask(c, &req)
if err != nil {
c.JSON(http.StatusOK, common.ErrorResponse(c, common.CodeSystemError, err.Error()))
return
}
c.JSON(http.StatusOK, common.SuccessResponse(c, &pdfmodel.CreatePDFTaskResponse{
TaskNo: task.TaskUUID,
Status: int(task.Status),
}))
}
func (e *PDFEndpoint) GetTaskStatus(c *gin.Context) {
var req pdfmodel.GetPDFTaskRequest
if err := c.ShouldBindUri(&req); err != nil {
c.JSON(http.StatusOK, common.ErrorResponse(c, common.CodeParamError, "参数错误"))
return
}
resp, err := e.pdfService.GetPDFTask(c, req.TaskNo)
if err != nil {
// Pass the BusinessError code through so a missing task returns CodeNotFound instead of being wrapped as CodeSystemError
if bizErr, ok := err.(*common.BusinessError); ok {
c.JSON(http.StatusOK, common.ErrorResponse(c, int(bizErr.Code), bizErr.Message))
return
}
c.JSON(http.StatusOK, common.ErrorResponse(c, common.CodeSystemError, err.Error()))
return
}
c.JSON(http.StatusOK, common.SuccessResponse(c, resp))
}
- Step 2: Verify
go build ./api/...
- Step 3: Commit
git add api/v1/pdf/handler.go
git commit -m "feat: add PDF recognition HTTP handler"
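The handler uses a direct type assertion to detect a BusinessError, which stops matching if a caller ever wraps the error with %w. The sketch below shows the same code passthrough written with errors.As, which also sees through wrapping. `BusinessError` here is a local stand-in modeling only the Code and Message fields the handler reads, and the value of `codeSystemError` is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// BusinessError is a local stand-in for common.BusinessError.
type BusinessError struct {
	Code    int
	Message string
}

func (e *BusinessError) Error() string { return e.Message }

const codeSystemError = 500 // assumed value; the real constant lives in pkg/common

// errPlain is an ordinary error with no business code attached.
var errPlain = errors.New("boom")

// responseCode returns the business code and message when err is, or wraps,
// a *BusinessError; otherwise it falls back to the system-error code.
func responseCode(err error) (int, string) {
	var bizErr *BusinessError
	if errors.As(err, &bizErr) {
		return bizErr.Code, bizErr.Message
	}
	return codeSystemError, err.Error()
}

// wrap adds context the way service code might with %w.
func wrap(err error) error { return fmt.Errorf("get pdf task: %w", err) }

func main() {
	notFound := &BusinessError{Code: 404, Message: "任务不存在"}
	fmt.Println(responseCode(notFound))
	fmt.Println(responseCode(wrap(notFound)))
	fmt.Println(responseCode(errPlain))
}
```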
Task 10: Router + OSS Handler
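Note on the OSS size limit: the current GetSignatureURL handler performs no file-size validation (it has no file_size parameter); the size limit is enforced by the content-length-range condition of the Aliyun OSS Policy Token. Raising the upload size limit for PDFs requires changing the Policy Token generation logic in pkg/oss, which is outside the scope of this task. This task only handles the file-type whitelist.
Files:
- Modify: api/router.go
- Modify: api/v1/oss/handler.go
- Step 1: Add the PDF import and routes to router.go
Add to the import block:
"gitea.com/texpixel/document_ai/api/v1/pdf"
Add at the end of the v1 block in SetupRouter: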
pdfRouter := v1.Group("/pdf", common.GetAuthMiddleware())
{
endpoint := pdf.NewPDFEndpoint()
pdfRouter.POST("/recognition", endpoint.CreateTask)
pdfRouter.GET("/recognition/:task_no", endpoint.GetTaskStatus)
}
- Step 2: Add .pdf to the whitelist in oss/handler.go
Find (handler.go:73):
if !utils.InArray(extend, []string{".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp"}) {
Change it to:
if !utils.InArray(extend, []string{".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp", ".pdf"}) {
- Step 3: Verify the full build
go build ./...
Expected: no errors
- Step 4: Smoke-test the route
go run main.go &
curl -X GET http://localhost:8024/v1/pdf/recognition/fake-task-no \
-H "Authorization: Bearer YOUR_TOKEN"
Expected: {"code":404,"message":"任务不存在",...}. GetByTaskNo returns ErrRecordNotFound → the service returns a CodeNotFound BusinessError → the handler passes the error code through
- Step 5: Commit
git add api/router.go api/v1/oss/handler.go
git commit -m "feat: register PDF routes and allow .pdf upload in OSS handler"
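The whitelist change can be sanity-checked in isolation. The sketch below reproduces the extension check as a standalone function; `isAllowedUpload` is a hypothetical name, and lowercasing the extension is an assumption about how the existing handler derives `extend`.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// allowedExts is the whitelist after this task, including .pdf.
var allowedExts = map[string]bool{
	".jpg": true, ".jpeg": true, ".png": true, ".gif": true,
	".bmp": true, ".tiff": true, ".webp": true, ".pdf": true,
}

// isAllowedUpload reports whether the file name carries a whitelisted
// extension, compared case-insensitively.
func isAllowedUpload(fileName string) bool {
	return allowedExts[strings.ToLower(filepath.Ext(fileName))]
}

func main() {
	for _, name := range []string{"doc.pdf", "SCAN.PDF", "photo.jpeg", "tool.exe", "noext"} {
		fmt.Printf("%-10s %v\n", name, isAllowedUpload(name))
	}
}
```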
Frontend interaction flow
1. POST /v1/oss/signature_url { file_name: "doc.pdf", file_hash }
→ { sign_url, path: "formula/uuid.pdf" }
2. PUT sign_url (upload the PDF directly to OSS)
3. POST /v1/pdf/recognition { file_url, file_hash, file_name: "doc.pdf" }
→ { task_no: "uuid", status: 0 }
4. GET /v1/pdf/recognition/:task_no (poll every 3 seconds)
→ status=1 { task_no, status:1, total_pages:0, pages:[] }
5. When status=2:
{
"task_no": "uuid",
"status": 2,
"total_pages": 8, ← pages actually processed (at most 10)
"pages": [
{ "page_number": 1, "markdown": "# 第一章\n..." },
{ "page_number": 2, "markdown": "## 1.1\n..." }
]
}
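Step 4 of the flow is a polling loop that stops on a terminal status. A minimal sketch of that loop, written in Go for consistency with the rest of the plan even though the real client is the frontend. `fetchFunc` abstracts the GET request, and `pollUntilDone` and `demoPoll` are hypothetical names; the demo uses a 1 ms interval where a real client would use about 3 seconds.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

const (
	statusCompleted = 2
	statusFailed    = 3
)

// fetchFunc stands in for GET /v1/pdf/recognition/:task_no and returns the task status.
type fetchFunc func(ctx context.Context) (int, error)

// pollUntilDone calls fetch immediately and then once per interval until the
// task reaches COMPLETED or FAILED, fetch fails, or ctx is cancelled.
func pollUntilDone(ctx context.Context, interval time.Duration, fetch fetchFunc) (int, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		status, err := fetch(ctx)
		if err != nil {
			return 0, err
		}
		if status == statusCompleted || status == statusFailed {
			return status, nil
		}
		select {
		case <-ctx.Done():
			return status, ctx.Err()
		case <-ticker.C:
		}
	}
}

// demoPoll replays a scripted sequence of statuses and reports the final
// status plus the number of requests made.
func demoPoll(statuses []int) (final int, calls int, err error) {
	fetch := func(context.Context) (int, error) {
		if calls >= len(statuses) {
			return 0, errors.New("script exhausted")
		}
		s := statuses[calls]
		calls++
		return s, nil
	}
	final, err = pollUntilDone(context.Background(), time.Millisecond, fetch)
	return final, calls, err
}

func main() {
	final, calls, err := demoPoll([]int{0, 1, 1, 2})
	fmt.Println(final, calls, err) // 2 4 <nil>
}
```

Passing a context with a deadline bounds the total wait, which is useful because the server-side task itself can run for up to 10 minutes.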
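Database samples
-- Example row for a PDF task in recognition_results
-- (the backslash is doubled so that MySQL stores the JSON escape \n instead of a raw newline, which would be invalid JSON)
INSERT INTO recognition_results (task_id, task_type, meta_data, content) VALUES (
123,
'PDF',
'{"total_num":2}',
'[{"page_number":1,"markdown":"# 第一章\\n正文..."},{"page_number":2,"markdown":"## 1.1\\n..."}]'
);
-- Example row for a FORMULA task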
INSERT INTO recognition_results (task_id, task_type, meta_data, content) VALUES (
456,
'FORMULA',
'{"total_num":1}',
'{"latex":"E=mc^2","markdown":"$$E=mc^2$$","mathml":"<math>...</math>","mml":""}'
);
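Self-check list
- Full coverage of the breaking change: once the migration drops the old columns, recognition_service.go (3 read/write sites) and task.go (2 read sites: GetTaskList and ExportTask) are all updated in one commit, so no intermediate state can crash
- Single-row storage: all page results of a PDF are stored as one JSON array in a single row; no new table is added
- pre-hook: processPDFTask clamps processPages to ≤ 10 before the page loop and logs when it truncates
- OCR endpoint reuse: PDF and formula recognition call the same downstream endpoint with an identical request format (image_base64)
- GetPDFTask type check: after loading the task, verify TaskType == PDF; a mismatch returns CodeNotFound so a formula task is never parsed as a PDF
- callOCR StatusCode check: a non-200 downstream response returns an error immediately and its body is never parsed as a result, so error responses are not stored as recognition output
- Handler error-code passthrough: GetTaskStatus checks for *common.BusinessError and passes its Code through, so a missing task returns code=404
- meta_data.total_num: 1 for formulas; the number of pages actually processed for PDFs
- Error recovery: a defer guarantees the task status is set to FAILED on any failure
- Timeout: each PDF task has a 10-minute timeout (10 pages × ~45 s ≈ 7.5 minutes)
- OSS size limit: the handler has no code-side size validation; the limit is enforced by the content-length-range condition of the OSS Policy Token, and this plan only extends the file-type whitelist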
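The timeout item can be checked with a line of arithmetic. The sketch below computes the worst-case OCR time from the checklist's figures (10 pages at roughly 45 seconds each) and the headroom left within the 10-minute budget; `worstCaseOCRTime` is a hypothetical helper, and the 45-second figure is the plan's own estimate rather than a measurement.

```go
package main

import (
	"fmt"
	"time"
)

// worstCaseOCRTime returns the total OCR time if every page takes perPage.
func worstCaseOCRTime(pages int, perPage time.Duration) time.Duration {
	return time.Duration(pages) * perPage
}

func main() {
	budget := 10 * time.Minute
	need := worstCaseOCRTime(10, 45*time.Second)
	fmt.Println(need, budget-need) // 7m30s 2m30s
}
```

The remaining 2.5 minutes have to cover the download, page rendering, and any retries inside RequestWithRetry, so a consistently slower OCR endpoint would call for a larger timeout.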