Files
doc_ai_backed/docs/superpowers/plans/2026-03-30-pdf-recognition-glm-ocr.md
2026-03-31 22:32:14 +08:00

32 KiB
Raw Blame History

PDF Recognition Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: 支持 PDF 逐页 OCR 识别最多10页同步重构 recognition_results 表为 JSON 内容结构,兼容公式识别和 PDF 识别两种场景。

Architecture: recognition_results 每个任务存一行:meta_data JSON 存元信息(total_numcontent JSON 存识别内容(公式:{latex, markdown, mml}PDF[{page_number, markdown}, ...]。PDF 处理链:下载 → go-fitz 分页渲染 → pre-hook 限前10页 → 逐页调用现有下游 OCR 接口 → 组装 JSON → 写入 DB。

Tech Stack: Go 1.20, Gin, GORM/MySQL, Redis, Aliyun OSS, github.com/gen2brain/go-fitz v0.24.0, 现有下游 OCR 接口 cloud.texpixel.com


表结构设计

recognition_results
├── id              BIGINT PK
├── task_id         BIGINT INDEX
├── task_type       VARCHAR(16)    -- FORMULA / PDF
├── meta_data       JSON           -- {"total_num": 1}
├── content         JSON           -- 见下方说明
├── created_at      DATETIME
└── updated_at      DATETIME

content 格式(按 task_type
  FORMULA: {"latex":"E=mc^2","markdown":"$$E=mc^2$$","mml":"<math>..."}
  PDF:     [{"page_number":1,"markdown":"# 第一章\n..."},{"page_number":2,"markdown":"..."}]

旧字段 latex / markdown / mathml / mml 全部删除,由 content JSON 承接。


文件变更清单

操作 文件路径 职责
Create migrations/pdf_recognition.sql ALTER recognition_results删旧字段加 meta_data/content JSON
Modify internal/storage/dao/task.go 增加 TaskTypePDF 常量
Modify internal/storage/dao/result.go 重构 RecognitionResult struct新增内容类型辅助结构更新 DAO 方法
Create internal/model/pdf/request.go PDF 识别请求/响应 DTO
Create internal/storage/cache/pdf.go Redis 队列操作PDF 专用)
Modify internal/service/recognition_service.go 更新 processFormulaTask / GetFormualTask 使用新 JSON 格式
Create internal/service/pdf_recognition_service.go PDF 识别业务逻辑
Create api/v1/pdf/handler.go HTTP 处理器
Modify api/router.go 注册 PDF 路由
Modify api/v1/oss/handler.go 文件类型白名单加 .pdf大小限制放宽至 50MB
Modify go.mod / go.sum 添加 go-fitz 依赖

环境前置:安装 MuPDFgo-fitz CGo 依赖)

# macOS
brew install mupdf

# Ubuntu/Debian
sudo apt-get install -y libmupdf-dev

# 验证
pkg-config --modversion mupdf

Task 1: 数据库迁移 — 重构 recognition_results

Files:

  • Create: migrations/pdf_recognition.sql

  • Step 1: 创建迁移文件

-- migrations/pdf_recognition.sql

-- 1. 删除旧的单字段列(已有数据可提前备份)
ALTER TABLE `recognition_results`
  DROP COLUMN `latex`,
  DROP COLUMN `markdown`,
  DROP COLUMN `mathml`,
  DROP COLUMN `mml`;

-- 2. 增加 JSON 字段
ALTER TABLE `recognition_results`
  ADD COLUMN `meta_data` JSON DEFAULT NULL COMMENT '元数据 {"total_num":1}' AFTER `task_type`,
  ADD COLUMN `content`   JSON DEFAULT NULL COMMENT '识别内容 JSON' AFTER `meta_data`;
  • Step 2: 执行迁移
mysql -u root -p doc_ai < migrations/pdf_recognition.sql

Expected: Query OK

  • Step 3: 验证表结构
mysql -u root -p doc_ai -e "DESCRIBE recognition_results;"

Expected: 字段为 id, task_id, task_type, meta_data, content, created_at, updated_at无 latex/markdown/mathml/mml

  • Step 4: Commit
git add migrations/pdf_recognition.sql
git commit -m "feat: migrate recognition_results to JSON content schema"

Task 2: 添加 go-fitz 依赖

Files:

  • Modify: go.mod

  • Step 1: 安装依赖

go get github.com/gen2brain/go-fitz@v0.24.0

Expected: go: added github.com/gen2brain/go-fitz v0.24.0

  • Step 2: 验证
go build ./...
  • Step 3: Commit
git add go.mod go.sum
git commit -m "feat: add go-fitz for PDF page rendering"

Task 3: 常量扩展

Files:

  • Modify: internal/storage/dao/task.go

  • Step 1: 添加 TaskTypePDF

找到 const 块,将:

TaskTypeLayout  TaskType = "LAYOUT"

改为:

TaskTypeLayout  TaskType = "LAYOUT"
TaskTypePDF     TaskType = "PDF"
  • Step 2: 验证
go build ./internal/storage/dao/...
  • Step 3: Commit
git add internal/storage/dao/task.go
git commit -m "feat: add TaskTypePDF constant"

Task 4: DAO — 重构 RecognitionResult

Files:

  • Modify: internal/storage/dao/result.go

  • Step 1: 用新 struct 完整替换 result.go 内容

package dao

import (
	"encoding/json"

	"gorm.io/gorm"
)

// FormulaContent 公式识别的 content 字段结构
type FormulaContent struct {
	Latex    string `json:"latex"`
	Markdown string `json:"markdown"`
	MathML   string `json:"mathml"`
	MML      string `json:"mml"`
}

// PDFPageContent PDF 单页识别结果
type PDFPageContent struct {
	PageNumber int    `json:"page_number"`
	Markdown   string `json:"markdown"`
}

// ResultMetaData recognition_results.meta_data 字段结构
type ResultMetaData struct {
	TotalNum int `json:"total_num"`
}

// RecognitionResult recognition_results 表模型
type RecognitionResult struct {
	BaseModel
	TaskID   int64    `gorm:"column:task_id;bigint;not null;default:0;index;comment:任务ID" json:"task_id"`
	TaskType TaskType `gorm:"column:task_type;varchar(16);not null;comment:任务类型;default:''" json:"task_type"`
	MetaData string   `gorm:"column:meta_data;type:json;comment:元数据" json:"meta_data"`
	Content  string   `gorm:"column:content;type:json;comment:识别内容JSON" json:"content"`
}

// SetMetaData 序列化并写入 MetaData 字段
func (r *RecognitionResult) SetMetaData(meta ResultMetaData) error {
	b, err := json.Marshal(meta)
	if err != nil {
		return err
	}
	r.MetaData = string(b)
	return nil
}

// GetFormulaContent 从 Content 字段反序列化公式结果
func (r *RecognitionResult) GetFormulaContent() (*FormulaContent, error) {
	var c FormulaContent
	if err := json.Unmarshal([]byte(r.Content), &c); err != nil {
		return nil, err
	}
	return &c, nil
}

// GetPDFContent 从 Content 字段反序列化 PDF 分页结果
func (r *RecognitionResult) GetPDFContent() ([]PDFPageContent, error) {
	var pages []PDFPageContent
	if err := json.Unmarshal([]byte(r.Content), &pages); err != nil {
		return nil, err
	}
	return pages, nil
}

// MarshalFormulaContent 将公式结果序列化为 JSON 字符串(供写入 Content
func MarshalFormulaContent(c FormulaContent) (string, error) {
	b, err := json.Marshal(c)
	return string(b), err
}

// MarshalPDFContent 将 PDF 分页结果序列化为 JSON 字符串(供写入 Content
func MarshalPDFContent(pages []PDFPageContent) (string, error) {
	b, err := json.Marshal(pages)
	return string(b), err
}

type RecognitionResultDao struct{}

func NewRecognitionResultDao() *RecognitionResultDao {
	return &RecognitionResultDao{}
}

func (dao *RecognitionResultDao) Create(tx *gorm.DB, data RecognitionResult) error {
	return tx.Create(&data).Error
}

func (dao *RecognitionResultDao) GetByTaskID(tx *gorm.DB, taskID int64) (*RecognitionResult, error) {
	result := &RecognitionResult{}
	err := tx.Where("task_id = ?", taskID).First(result).Error
	if err != nil && err == gorm.ErrRecordNotFound {
		return nil, nil
	}
	return result, err
}

func (dao *RecognitionResultDao) Update(tx *gorm.DB, id int64, updates map[string]interface{}) error {
	return tx.Model(&RecognitionResult{}).Where("id = ?", id).Updates(updates).Error
}
  • Step 2: 验证编译
go build ./internal/storage/dao/...

Expected: 无报错

  • Step 3: Commit
git add internal/storage/dao/result.go
git commit -m "refactor: RecognitionResult to JSON content schema (meta_data + content)"

Task 5: 更新公式识别 + TaskService — 适配新 JSON 格式

Files:

  • Modify: internal/service/recognition_service.go
  • Modify: internal/service/task.go

注意:迁移删除了 latex/markdown/mathml/mml 列,task.goGetTaskList:98-101ExportTask:151都直接读这些字段必须在同一个 commit 里一起更新,否则迁移后立即崩溃。

  • Step 1: 修改 recognition_service.go — processFormulaTask 写入

找到 processFormulaTask 内调用 resultDao.Create 的代码约第542行

// 旧代码
err = resultDao.Create(tx, dao.RecognitionResult{
    TaskID:   taskID,
    TaskType: dao.TaskTypeFormula,
    Latex:    ocrResp.Latex,
    Markdown: ocrResp.Markdown,
    MathML:   ocrResp.MathML,
    MML:      ocrResp.MML,
})

替换为:

// 新代码
contentJSON, err := dao.MarshalFormulaContent(dao.FormulaContent{
    Latex:    ocrResp.Latex,
    Markdown: ocrResp.Markdown,
    MathML:   ocrResp.MathML,
    MML:      ocrResp.MML,
})
if err != nil {
    log.Error(ctx, "func", "processFormulaTask", "msg", "序列化公式内容失败", "error", err)
    return err
}
result := dao.RecognitionResult{
    TaskID:   taskID,
    TaskType: dao.TaskTypeFormula,
    Content:  contentJSON,
}
if err = result.SetMetaData(dao.ResultMetaData{TotalNum: 1}); err != nil {
    log.Error(ctx, "func", "processFormulaTask", "msg", "序列化MetaData失败", "error", err)
    return err
}
err = resultDao.Create(tx, result)
if err != nil {
    log.Error(ctx, "func", "processFormulaTask", "msg", "保存任务结果失败", "error", err)
    return err
}
  • Step 2: 修改 recognition_service.go — processVLFormulaTask 写入

找到 processVLFormulaTask 内对 resultDao.Create / resultDao.Update 的调用约第665-678行

创建时:

contentJSON, err := dao.MarshalFormulaContent(dao.FormulaContent{Latex: latex})
if err != nil {
    log.Error(ctx, "func", "processVLFormulaTask", "msg", "序列化公式内容失败", "error", err)
    return err
}
newResult := dao.RecognitionResult{TaskID: taskID, TaskType: dao.TaskTypeFormula, Content: contentJSON}
_ = newResult.SetMetaData(dao.ResultMetaData{TotalNum: 1})
err = resultDao.Create(dao.DB.WithContext(ctx), newResult)

更新时:

contentJSON, err := dao.MarshalFormulaContent(dao.FormulaContent{Latex: latex})
if err != nil {
    log.Error(ctx, "func", "processVLFormulaTask", "msg", "序列化公式内容失败", "error", err)
    return err
}
err = resultDao.Update(dao.DB.WithContext(ctx), result.ID, map[string]interface{}{"content": contentJSON})
  • Step 3: 修改 recognition_service.go — GetFormualTask 读取

找到 GetFormualTask约第134行将读取旧字段的代码

// 旧代码:直接读 taskRet.Latex / taskRet.Markdown / taskRet.MathML / taskRet.MML
markdown := taskRet.Markdown
if markdown == "" {
    markdown = fmt.Sprintf("$$%s$$", taskRet.Latex)
}
return &formula.GetFormulaTaskResponse{
    TaskNo:   taskNo,
    Latex:    taskRet.Latex,
    Markdown: markdown,
    MathML:   taskRet.MathML,
    MML:      taskRet.MML,
    Status:   int(task.Status),
}, nil

替换为:

// 新代码
formulaContent, err := taskRet.GetFormulaContent()
if err != nil {
    log.Error(ctx, "func", "GetFormualTask", "msg", "解析公式内容失败", "error", err)
    return nil, common.NewError(common.CodeSystemError, "解析识别结果失败", err)
}
markdown := formulaContent.Markdown
if markdown == "" {
    markdown = fmt.Sprintf("$$%s$$", formulaContent.Latex)
}
return &formula.GetFormulaTaskResponse{
    TaskNo:   taskNo,
    Latex:    formulaContent.Latex,
    Markdown: markdown,
    MathML:   formulaContent.MathML,
    MML:      formulaContent.MML,
    Status:   int(task.Status),
}, nil
  • Step 4: 修改 task.go — GetTaskList 读取结果(:91-119

找到 GetTaskList 中组装 DTO 的代码:

// 旧代码
var latex, markdown, mathML, mml string
recognitionResult := recognitionResultMap[item.ID]
if recognitionResult != nil {
    latex = recognitionResult.Latex
    markdown = recognitionResult.Markdown
    mathML = recognitionResult.MathML
    mml = recognitionResult.MML
}

替换为:

// 新代码:按 task_type 反序列化 content
var latex, markdown, mathML, mml string
recognitionResult := recognitionResultMap[item.ID]
if recognitionResult != nil && recognitionResult.TaskType == dao.TaskTypeFormula {
    if fc, err := recognitionResult.GetFormulaContent(); err == nil {
        latex = fc.Latex
        markdown = fc.Markdown
        mathML = fc.MathML
        mml = fc.MML
    }
}
// PDF 类型的 TaskListDTO 暂不展开 content列表页只显示状态
  • Step 5: 修改 task.go — ExportTask 读取 markdown:140-155

找到 ExportTask 中读取 markdown 的代码:

// 旧代码
markdown := recognitionResult.Markdown
if markdown == "" {
    log.Error(ctx, "func", "ExportTask", "msg", "markdown not found")
    return nil, "", errors.New("markdown not found")
}

替换为:

// 新代码:按 task_type 解析 content
var markdown string
switch recognitionResult.TaskType {
case dao.TaskTypeFormula:
    fc, err := recognitionResult.GetFormulaContent()
    if err != nil || fc.Markdown == "" {
        log.Error(ctx, "func", "ExportTask", "msg", "公式结果解析失败或markdown为空", "error", err)
        return nil, "", errors.New("markdown not found")
    }
    markdown = fc.Markdown
default:
    log.Error(ctx, "func", "ExportTask", "msg", "不支持的导出任务类型", "task_type", recognitionResult.TaskType)
    return nil, "", errors.New("unsupported task type for export")
}
  • Step 6: 验证编译
go build ./internal/service/...

Expected: 无报错

  • Step 7: Commit
git add internal/service/recognition_service.go internal/service/task.go
git commit -m "refactor: adapt all recognition result reads/writes to JSON content schema"

Task 6: Cache — PDF Redis 队列

Files:

  • Create: internal/storage/cache/pdf.go

  • Step 1: 创建 pdf.go

// internal/storage/cache/pdf.go
package cache

import (
	"context"
	"strconv"
)

const (
	PDFRecognitionTaskQueue = "pdf_recognition_queue"
	PDFRecognitionDistLock  = "pdf_recognition_dist_lock"
)

func PushPDFTask(ctx context.Context, taskID int64) (int64, error) {
	return RedisClient.LPush(ctx, PDFRecognitionTaskQueue, taskID).Result()
}

func PopPDFTask(ctx context.Context) (int64, error) {
	result, err := RedisClient.BRPop(ctx, 0, PDFRecognitionTaskQueue).Result()
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(result[1], 10, 64)
}

func GetPDFDistributedLock(ctx context.Context) (bool, error) {
	return RedisClient.SetNX(ctx, PDFRecognitionDistLock, "locked", DefaultLockTimeout).Result()
}
  • Step 2: 验证
go build ./internal/storage/cache/...
  • Step 3: Commit
git add internal/storage/cache/pdf.go
git commit -m "feat: add PDF recognition Redis queue"

Task 7: Model — PDF 请求/响应 DTO

Files:

  • Create: internal/model/pdf/request.go

  • Step 1: 创建文件

// internal/model/pdf/request.go
package pdf

// CreatePDFRecognitionRequest 创建PDF识别任务
type CreatePDFRecognitionRequest struct {
	FileURL  string `json:"file_url" binding:"required"`
	FileHash string `json:"file_hash" binding:"required"`
	FileName string `json:"file_name" binding:"required"`
	UserID   int64  `json:"user_id"`
}

// GetPDFTaskRequest URI 参数
type GetPDFTaskRequest struct {
	TaskNo string `uri:"task_no" binding:"required"`
}

// CreatePDFTaskResponse 创建任务响应
type CreatePDFTaskResponse struct {
	TaskNo string `json:"task_no"`
	Status int    `json:"status"`
}

// PDFPageResult 单页结果(与 dao.PDFPageContent 对应)
type PDFPageResult struct {
	PageNumber int    `json:"page_number"`
	Markdown   string `json:"markdown"`
}

// GetPDFTaskResponse 查询任务状态和结果
type GetPDFTaskResponse struct {
	TaskNo     string          `json:"task_no"`
	Status     int             `json:"status"`       // 0=PENDING 1=PROCESSING 2=COMPLETED 3=FAILED
	TotalPages int             `json:"total_pages"`  // 实际处理的页数
	Pages      []PDFPageResult `json:"pages"`        // status=2 时填充
}
  • Step 2: 验证
go build ./internal/model/pdf/...
  • Step 3: Commit
git add internal/model/pdf/request.go
git commit -m "feat: add PDF recognition request/response models"

Task 8: Service — PDFRecognitionService

Files:

  • Create: internal/service/pdf_recognition_service.go

  • Step 1: 创建服务文件

// internal/service/pdf_recognition_service.go
package service

import (
	"bytes"
	"context"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"

	"github.com/gen2brain/go-fitz"

	"gitea.com/texpixel/document_ai/internal/model/formula"
	pdfmodel "gitea.com/texpixel/document_ai/internal/model/pdf"
	"gitea.com/texpixel/document_ai/internal/storage/cache"
	"gitea.com/texpixel/document_ai/internal/storage/dao"
	"gitea.com/texpixel/document_ai/pkg/common"
	"gitea.com/texpixel/document_ai/pkg/httpclient"
	"gitea.com/texpixel/document_ai/pkg/log"
	"gitea.com/texpixel/document_ai/pkg/oss"
	"gitea.com/texpixel/document_ai/pkg/requestid"
	"gitea.com/texpixel/document_ai/pkg/utils"
	"gorm.io/gorm"
)

const (
	pdfMaxPages    = 10
	pdfOCREndpoint = "https://cloud.texpixel.com:10443/doc_process/v1/image/ocr"
)

// PDFRecognitionService 处理 PDF 识别任务
type PDFRecognitionService struct {
	db         *gorm.DB
	queueLimit chan struct{}
	stopChan   chan struct{}
	httpClient *httpclient.Client
}

func NewPDFRecognitionService() *PDFRecognitionService {
	s := &PDFRecognitionService{
		db:         dao.DB,
		queueLimit: make(chan struct{}, 3),
		stopChan:   make(chan struct{}),
		httpClient: httpclient.NewClient(nil),
	}

	utils.SafeGo(func() {
		lock, err := cache.GetPDFDistributedLock(context.Background())
		if err != nil || !lock {
			log.Error(context.Background(), "func", "NewPDFRecognitionService", "msg", "获取PDF分布式锁失败")
			return
		}
		s.processPDFQueue(context.Background())
	})

	return s
}

// CreatePDFTask 创建识别任务并入队
func (s *PDFRecognitionService) CreatePDFTask(ctx context.Context, req *pdfmodel.CreatePDFRecognitionRequest) (*dao.RecognitionTask, error) {
	task := &dao.RecognitionTask{
		UserID:   req.UserID,
		TaskUUID: utils.NewUUID(),
		TaskType: dao.TaskTypePDF,
		Status:   dao.TaskStatusPending,
		FileURL:  req.FileURL,
		FileName: req.FileName,
		FileHash: req.FileHash,
		IP:       common.GetIPFromContext(ctx),
	}

	if err := dao.NewRecognitionTaskDao().Create(dao.DB.WithContext(ctx), task); err != nil {
		log.Error(ctx, "func", "CreatePDFTask", "msg", "创建任务失败", "error", err)
		return nil, common.NewError(common.CodeDBError, "创建任务失败", err)
	}

	if _, err := cache.PushPDFTask(ctx, task.ID); err != nil {
		log.Error(ctx, "func", "CreatePDFTask", "msg", "推入队列失败", "error", err)
		return nil, common.NewError(common.CodeSystemError, "推入队列失败", err)
	}

	return task, nil
}

// GetPDFTask 查询任务状态和结果
func (s *PDFRecognitionService) GetPDFTask(ctx context.Context, taskNo string) (*pdfmodel.GetPDFTaskResponse, error) {
	sess := dao.DB.WithContext(ctx)
	task, err := dao.NewRecognitionTaskDao().GetByTaskNo(sess, taskNo)
	if err != nil {
		if err == gorm.ErrRecordNotFound {
			return nil, common.NewError(common.CodeNotFound, "任务不存在", err)
		}
		return nil, common.NewError(common.CodeDBError, "查询任务失败", err)
	}

	// 类型校验:防止公式任务被当成 PDF 解析
	if task.TaskType != dao.TaskTypePDF {
		return nil, common.NewError(common.CodeNotFound, "任务不存在", nil)
	}

	resp := &pdfmodel.GetPDFTaskResponse{
		TaskNo: taskNo,
		Status: int(task.Status),
	}

	if task.Status != dao.TaskStatusCompleted {
		return resp, nil
	}

	result, err := dao.NewRecognitionResultDao().GetByTaskID(sess, task.ID)
	if err != nil || result == nil {
		return nil, common.NewError(common.CodeDBError, "查询识别结果失败", err)
	}

	pages, err := result.GetPDFContent()
	if err != nil {
		return nil, common.NewError(common.CodeSystemError, "解析识别结果失败", err)
	}

	resp.TotalPages = len(pages)
	for _, p := range pages {
		resp.Pages = append(resp.Pages, pdfmodel.PDFPageResult{
			PageNumber: p.PageNumber,
			Markdown:   p.Markdown,
		})
	}

	return resp, nil
}

// processPDFQueue 持续消费队列
func (s *PDFRecognitionService) processPDFQueue(ctx context.Context) {
	for {
		select {
		case <-s.stopChan:
			return
		default:
			s.processOnePDFTask(ctx)
		}
	}
}

func (s *PDFRecognitionService) processOnePDFTask(ctx context.Context) {
	s.queueLimit <- struct{}{}
	defer func() { <-s.queueLimit }()

	taskID, err := cache.PopPDFTask(ctx)
	if err != nil {
		log.Error(ctx, "func", "processOnePDFTask", "msg", "获取任务失败", "error", err)
		return
	}

	task, err := dao.NewRecognitionTaskDao().GetTaskByID(dao.DB.WithContext(ctx), taskID)
	if err != nil || task == nil {
		log.Error(ctx, "func", "processOnePDFTask", "msg", "任务不存在", "task_id", taskID)
		return
	}

	ctx = context.WithValue(ctx, utils.RequestIDKey, task.TaskUUID)
	requestid.SetRequestID(task.TaskUUID, func() {
		if err := s.processPDFTask(ctx, taskID, task.FileURL); err != nil {
			log.Error(ctx, "func", "processOnePDFTask", "msg", "处理PDF任务失败", "error", err)
		}
	})
}

// processPDFTask 核心处理:下载 → pre-hook → 逐页OCR → 写入DB
func (s *PDFRecognitionService) processPDFTask(ctx context.Context, taskID int64, fileURL string) error {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Minute)
	defer cancel()

	taskDao := dao.NewRecognitionTaskDao()
	resultDao := dao.NewRecognitionResultDao()

	isSuccess := false
	defer func() {
		status, remark := dao.TaskStatusFailed, "任务处理失败"
		if isSuccess {
			status, remark = dao.TaskStatusCompleted, ""
		}
		_ = taskDao.Update(dao.DB.WithContext(context.Background()),
			map[string]interface{}{"id": taskID},
			map[string]interface{}{"status": status, "completed_at": time.Now(), "remark": remark},
		)
	}()

	// 更新为处理中
	if err := taskDao.Update(dao.DB.WithContext(ctx),
		map[string]interface{}{"id": taskID},
		map[string]interface{}{"status": dao.TaskStatusProcessing},
	); err != nil {
		return fmt.Errorf("更新任务状态失败: %w", err)
	}

	// 下载 PDF
	reader, err := oss.DownloadFile(ctx, fileURL)
	if err != nil {
		return fmt.Errorf("下载PDF失败: %w", err)
	}
	defer reader.Close()

	pdfBytes, err := io.ReadAll(reader)
	if err != nil {
		return fmt.Errorf("读取PDF数据失败: %w", err)
	}

	// 打开 PDF
	doc, err := fitz.NewFromMemory(pdfBytes)
	if err != nil {
		return fmt.Errorf("解析PDF失败: %w", err)
	}
	defer doc.Close()

	// pre-hook: 限制最多处理前 10 页
	totalInDoc := doc.NumPage()
	processPages := totalInDoc
	if processPages > pdfMaxPages {
		processPages = pdfMaxPages
		log.Info(ctx, "func", "processPDFTask", "msg", "PDF超过10页只处理前10页",
			"task_id", taskID, "doc_total", totalInDoc)
	}

	log.Info(ctx, "func", "processPDFTask", "msg", "开始处理PDF",
		"task_id", taskID, "process_pages", processPages)

	// 逐页渲染 + OCR结果收集
	var pages []dao.PDFPageContent
	for pageNum := 0; pageNum < processPages; pageNum++ {
		imgBytes, err := doc.ImagePNG(pageNum, 150) // 150 DPI
		if err != nil {
			return fmt.Errorf("渲染第%d页失败: %w", pageNum+1, err)
		}

		ocrResult, err := s.callOCR(ctx, imgBytes)
		if err != nil {
			return fmt.Errorf("OCR第%d页失败: %w", pageNum+1, err)
		}

		pages = append(pages, dao.PDFPageContent{
			PageNumber: pageNum + 1,
			Markdown:   ocrResult.Markdown,
		})
		log.Info(ctx, "func", "processPDFTask", "msg", "页面OCR完成",
			"page", pageNum+1, "total", processPages)
	}

	// 序列化并写入 DB单行
	contentJSON, err := dao.MarshalPDFContent(pages)
	if err != nil {
		return fmt.Errorf("序列化PDF内容失败: %w", err)
	}

	dbResult := dao.RecognitionResult{
		TaskID:   taskID,
		TaskType: dao.TaskTypePDF,
		Content:  contentJSON,
	}
	if err := dbResult.SetMetaData(dao.ResultMetaData{TotalNum: processPages}); err != nil {
		return fmt.Errorf("序列化MetaData失败: %w", err)
	}
	if err := resultDao.Create(dao.DB.WithContext(ctx), dbResult); err != nil {
		return fmt.Errorf("保存PDF结果失败: %w", err)
	}

	isSuccess = true
	return nil
}

// callOCR 调用与公式识别相同的下游 OCR 接口
func (s *PDFRecognitionService) callOCR(ctx context.Context, imgBytes []byte) (*formula.ImageOCRResponse, error) {
	reqBody := map[string]string{
		"image_base64": base64.StdEncoding.EncodeToString(imgBytes),
	}
	jsonData, err := json.Marshal(reqBody)
	if err != nil {
		return nil, err
	}

	headers := map[string]string{
		"Content-Type":           "application/json",
		utils.RequestIDHeaderKey: utils.GetRequestIDFromContext(ctx),
	}

	resp, err := s.httpClient.RequestWithRetry(ctx, http.MethodPost, pdfOCREndpoint, bytes.NewReader(jsonData), headers)
	if err != nil {
		return nil, fmt.Errorf("请求OCR接口失败: %w", err)
	}
	defer resp.Body.Close()

	// 下游非 2xx 视为失败,避免把错误响应 body 当成识别结果存库
	if resp.StatusCode != http.StatusOK {
		body, _ := io.ReadAll(resp.Body)
		return nil, fmt.Errorf("OCR接口返回非200状态: %d, body: %s", resp.StatusCode, string(body))
	}

	var ocrResp formula.ImageOCRResponse
	if err := json.NewDecoder(resp.Body).Decode(&ocrResp); err != nil {
		return nil, fmt.Errorf("解析OCR响应失败: %w", err)
	}

	return &ocrResp, nil
}

func (s *PDFRecognitionService) Stop() {
	close(s.stopChan)
}
  • Step 2: 验证编译
go build ./internal/service/...

Expected: 无报错

  • Step 3: Commit
git add internal/service/pdf_recognition_service.go
git commit -m "feat: add PDFRecognitionService with 10-page pre-hook"

Task 9: Handler — api/v1/pdf/handler.go

Files:

  • Create: api/v1/pdf/handler.go

  • Step 1: 创建 handler

// api/v1/pdf/handler.go
package pdf

import (
	"net/http"
	"path/filepath"
	"strings"

	pdfmodel "gitea.com/texpixel/document_ai/internal/model/pdf"
	"gitea.com/texpixel/document_ai/internal/service"
	"gitea.com/texpixel/document_ai/pkg/common"
	"gitea.com/texpixel/document_ai/pkg/constant"

	"github.com/gin-gonic/gin"
)

type PDFEndpoint struct {
	pdfService *service.PDFRecognitionService
}

func NewPDFEndpoint() *PDFEndpoint {
	return &PDFEndpoint{
		pdfService: service.NewPDFRecognitionService(),
	}
}

func (e *PDFEndpoint) CreateTask(c *gin.Context) {
	var req pdfmodel.CreatePDFRecognitionRequest
	if err := c.BindJSON(&req); err != nil {
		c.JSON(http.StatusOK, common.ErrorResponse(c, common.CodeParamError, "参数错误"))
		return
	}
	req.UserID = c.GetInt64(constant.ContextUserID)

	if strings.ToLower(filepath.Ext(req.FileName)) != ".pdf" {
		c.JSON(http.StatusOK, common.ErrorResponse(c, common.CodeParamError, "仅支持PDF文件"))
		return
	}

	task, err := e.pdfService.CreatePDFTask(c, &req)
	if err != nil {
		c.JSON(http.StatusOK, common.ErrorResponse(c, common.CodeSystemError, err.Error()))
		return
	}

	c.JSON(http.StatusOK, common.SuccessResponse(c, &pdfmodel.CreatePDFTaskResponse{
		TaskNo: task.TaskUUID,
		Status: int(task.Status),
	}))
}

func (e *PDFEndpoint) GetTaskStatus(c *gin.Context) {
	var req pdfmodel.GetPDFTaskRequest
	if err := c.ShouldBindUri(&req); err != nil {
		c.JSON(http.StatusOK, common.ErrorResponse(c, common.CodeParamError, "参数错误"))
		return
	}

	resp, err := e.pdfService.GetPDFTask(c, req.TaskNo)
	if err != nil {
		// 透传 BusinessError 的错误码,让 404 返回 CodeNotFound 而不是统一包成 CodeSystemError
		if bizErr, ok := err.(*common.BusinessError); ok {
			c.JSON(http.StatusOK, common.ErrorResponse(c, int(bizErr.Code), bizErr.Message))
			return
		}
		c.JSON(http.StatusOK, common.ErrorResponse(c, common.CodeSystemError, err.Error()))
		return
	}

	c.JSON(http.StatusOK, common.SuccessResponse(c, resp))
}
  • Step 2: 验证
go build ./api/...
  • Step 3: Commit
git add api/v1/pdf/handler.go
git commit -m "feat: add PDF recognition HTTP handler"

Task 10: Router + OSS Handler

OSS 大小限制说明:当前 GetSignatureURL handler 不做文件大小校验(没有 file_size 入参),大小限制由 Aliyun OSS Policy Token 的 content-length-range 条件控制。如需放宽 PDF 上传的大小上限,需修改 pkg/oss 中生成 Policy Token 的逻辑(在本 Task 范围之外)。本 Task 只处理文件类型白名单。

Files:

  • Modify: api/router.go

  • Modify: api/v1/oss/handler.go

  • Step 1: 在 router.go 添加 PDF import 和路由

import 块添加:

"gitea.com/texpixel/document_ai/api/v1/pdf"

SetupRouter 的 v1 块末尾添加:

pdfRouter := v1.Group("/pdf", common.GetAuthMiddleware())
{
    endpoint := pdf.NewPDFEndpoint()
    pdfRouter.POST("/recognition", endpoint.CreateTask)
    pdfRouter.GET("/recognition/:task_no", endpoint.GetTaskStatus)
}
  • Step 2: 在 oss/handler.go 的白名单中添加 .pdf

找到(handler.go:73

if !utils.InArray(extend, []string{".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp"}) {

改为:

if !utils.InArray(extend, []string{".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp", ".pdf"}) {
  • Step 3: 验证整体编译
go build ./...

Expected: 无报错

  • Step 4: 冒烟测试路由
go run main.go &
curl -X GET http://localhost:8024/v1/pdf/recognition/fake-task-no \
  -H "Authorization: Bearer YOUR_TOKEN"

Expected: {"code":404,"message":"任务不存在",...} — GetByTaskNo 返回 ErrRecordNotFound → service 返回 CodeNotFound BusinessError → handler 透传错误码

  • Step 5: Commit
git add api/router.go api/v1/oss/handler.go
git commit -m "feat: register PDF routes and allow .pdf upload in OSS handler"

前端交互流程

1. POST /v1/oss/signature_url  { file_name: "doc.pdf", file_hash, file_size }
   → { sign_url, path: "formula/uuid.pdf" }

2. PUT sign_url  (直传 PDF 到 OSS

3. POST /v1/pdf/recognition  { file_url, file_hash, file_name: "doc.pdf" }
   → { task_no: "uuid", status: 0 }

4. GET /v1/pdf/recognition/:task_no  每3秒轮询
   → status=1 { task_no, status:1, total_pages:0, pages:[] }

5. status=2 时:
   {
     "task_no": "uuid",
     "status": 2,
     "total_pages": 8,       ← 实际处理页数最多10
     "pages": [
       { "page_number": 1, "markdown": "# 第一章\n..." },
       { "page_number": 2, "markdown": "## 1.1\n..." }
     ]
   }

数据库样例

-- recognition_results 表中 PDF 任务的一行示例
INSERT INTO recognition_results (task_id, task_type, meta_data, content) VALUES (
  123,
  'PDF',
  '{"total_num":8}',
  '[{"page_number":1,"markdown":"# 第一章\n正文..."},{"page_number":2,"markdown":"## 1.1\n..."}]'
);

-- FORMULA 任务的一行示例
INSERT INTO recognition_results (task_id, task_type, meta_data, content) VALUES (
  456,
  'FORMULA',
  '{"total_num":1}',
  '{"latex":"E=mc^2","markdown":"$$E=mc^2$$","mathml":"<math>...</math>","mml":""}'
);

自检清单

  • Breaking change 全覆盖: 迁移删旧列后,recognition_service.go3处写/读)和 task.goGetTaskList + ExportTask 2处读在同一 commit 里全部更新,不存在中间状态崩溃窗口
  • 单行存储: PDF 所有页面的结果存为一行的 JSON array不增加新表
  • pre-hook: processPDFTask 开头 clamp processPages ≤ 10写日志说明
  • OCR 接口复用: PDF 与公式识别调用同一下游端点请求格式image_base64完全相同
  • GetPDFTask 类型校验: 获取任务后校验 TaskType == PDF类型不符返回 CodeNotFound防止公式任务被当 PDF 解析
  • callOCR StatusCode 检查: 下游非 200 立即返回 error不解析 body防止把错误响应存为识别结果
  • Handler 错误码透传: GetTaskStatus 检查 *common.BusinessError,透传 Code 字段404 正确返回 code=404
  • meta_data.total_num: 公式=1PDF=实际处理页数
  • 错误恢复: defer 保证异常时任务状态更新为 FAILED
  • 超时: PDF 任务 10 分钟超时10页 × ~45秒
  • OSS 大小限制: handler 无代码侧大小校验,限制由 OSS Policy Token 的 content-length-range 控制;本计划只扩展文件类型白名单