---
title: "Why Your PDF Formulas Come Out Wrong (and How to Fix It)"
description: The most common reasons PDF formula extraction produces errors, and exactly how to fix each one
slug: pdf-formula-issues
date: 2026-02-15
tags: [troubleshooting, PDF, tips]
---
# Why Your PDF Formulas Come Out Wrong (and How to Fix It)
PDF formula extraction should be simple — upload, get LaTeX, done. But sometimes the output looks garbled, symbols are missing, or the extractor says no formulas were found. Here's a breakdown of the most common causes and how to fix each one.
## Problem 1: The PDF is a Scan
**Symptoms:** Symbols look correct on screen, but the extraction output is garbage or empty.

**Why it happens:** A scanned PDF is just a collection of page images, with no real text layer. Any text your PDF reader lets you select comes from OCR performed at scan time, which is often poor quality; otherwise, what you see is only the image itself.

**Fix:** Run TexPixel's image-based pipeline instead. Export individual pages as PNG at 300 DPI using any PDF viewer (File → Export in Preview with PNG as the format, or Adobe Acrobat's Export PDF feature), then upload the PNG directly. Image-based recognition handles scans correctly; direct PDF text extraction does not.
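If you have many pages to export, the export step can be scripted. The sketch below assumes Poppler's `pdftoppm` command-line tool is installed, which is not something this post otherwise relies on. The helper names `pdftoppm_command` and `export_pages` are illustrative, and the resulting PNGs still need to be uploaded to TexPixel yourself.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, out_prefix, dpi=300, first_page=None, last_page=None):
    """Build a pdftoppm command that renders PDF pages to PNG files."""
    cmd = ["pdftoppm", "-png", "-r", str(dpi)]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to render (1-based)
    if last_page is not None:
        cmd += ["-l", str(last_page)]   # last page to render
    cmd += [str(pdf_path), str(out_prefix)]
    return cmd


def export_pages(pdf_path, out_dir, dpi=300, first_page=None, last_page=None):
    """Run pdftoppm and return the PNGs it wrote (assumes Poppler is on PATH)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    prefix = out_dir / "page"  # pdftoppm appends "-<page number>.png"
    subprocess.run(
        pdftoppm_command(pdf_path, prefix, dpi, first_page, last_page),
        check=True,  # raise if pdftoppm reports an error
    )
    return sorted(out_dir.glob("page-*.png"))
```

For example, `export_pages("scan.pdf", "pages", first_page=2, last_page=3)` would render only pages 2 and 3 at 300 DPI; the file name `scan.pdf` is a placeholder.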
## Problem 2: Low-DPI Scan
**Symptoms:** Some symbols are recognized correctly; others are replaced with wrong characters or dropped entirely.

**Why it happens:** Below about 150 DPI, strokes in small symbols like `\prime`, `\cdot`, or subscript characters are only a few pixels wide, which is too blurry to distinguish reliably.

**Fix:** Rescan at 300 DPI. Many flatbed scanners default to 200 DPI; going to 300 produces dramatically better results. The image has 2.25 times as many pixels, which is usually a manageable increase in file size. For phone scans, use a dedicated scanner app (e.g., Adobe Scan, Microsoft Lens), which applies automatic sharpening and perspective correction.
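To see why resolution matters, it helps to convert a stroke's physical width into pixels. The sketch below assumes a 0.2 mm stroke, which is an illustrative figure rather than a measurement, and uses a hypothetical helper named `stroke_width_px`.

```python
MM_PER_INCH = 25.4


def stroke_width_px(stroke_mm, dpi):
    """Convert a printed stroke width in millimetres to pixels at a scan DPI."""
    return stroke_mm / MM_PER_INCH * dpi


# Assumed 0.2 mm stroke (illustrative only), compared across common settings.
widths = {dpi: round(stroke_width_px(0.2, dpi), 1) for dpi in (150, 200, 300)}
```

Under that assumption, `widths` comes out to about 1.2 pixels at 150 DPI, 1.6 at 200 DPI, and 2.4 at 300 DPI, so doubling the DPI doubles the pixels available to each stroke.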
## Problem 3: Password-Protected PDF
**Symptoms:** "No formulas found", or the upload fails entirely.

**Why it happens:** Encrypted PDFs require a password to access their content stream. TexPixel cannot process the content of a locked file.

**Fix:** Remove the password protection before uploading. In Preview (Mac), open the file with the password, then choose File → Export as PDF; the exported file won't have the password. In Adobe Reader, use File → Print → Save as PDF.
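Encryption can often be spotted before uploading. A PDF that uses standard encryption declares an `/Encrypt` entry in its trailer, so a rough check is to look for that key in the raw bytes. This is a heuristic sketch with hypothetical helper names, not a description of how TexPixel detects locked files, and it can report a false positive if the string happens to appear elsewhere in the file.

```python
from pathlib import Path


def looks_encrypted(pdf_bytes: bytes) -> bool:
    """Heuristic: a PDF header plus an /Encrypt key suggests an encrypted file."""
    is_pdf = b"%PDF-" in pdf_bytes[:1024]  # header should sit near the start
    return is_pdf and b"/Encrypt" in pdf_bytes


def file_looks_encrypted(path) -> bool:
    """Apply the same heuristic to a file on disk."""
    return looks_encrypted(Path(path).read_bytes())
```

A `True` result is a hint to remove the password first; a `False` result does not guarantee that the upload will succeed.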
## Problem 4: Formulas Stored as Vector Paths
**Symptoms:** The PDF looks perfect, but extraction returns nothing or incorrect text.

**Why it happens:** Some PDF generators (certain Word versions, some online LaTeX renderers) convert math into vector paths or rasterize it into images. The formulas are essentially drawings, not characters, so there's no character stream to extract.

**Fix:** Export the page as a high-resolution PNG (300 DPI), then upload it as an image. TexPixel's visual recognition pipeline handles vector-rendered formulas well.
## Problem 5: Multi-Column Layout
**Symptoms:** Formulas from two columns are merged or interleaved in the output.

**Why it happens:** PDF text streams don't always encode reading order correctly, especially in two-column academic papers.

**Fix:** Crop to a single column before uploading. Export the page as an image, use any image editor to crop it into left and right halves, then upload each half separately.
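The two crop rectangles are easy to compute if you assume the columns split at the horizontal center of the page, which holds for many two-column papers but not all. The hypothetical `column_boxes` helper below returns boxes in the `(left, upper, right, lower)` order that Pillow's `Image.crop` expects. Pillow itself is an assumption here; any image editor works just as well.

```python
def column_boxes(width, height, overlap=0):
    """Return (left_box, right_box) crop rectangles for a two-column page.

    Assumes the gutter sits at the horizontal center of the page.
    `overlap` widens each half by that many pixels past the center line.
    """
    mid = width // 2
    left_box = (0, 0, mid + overlap, height)
    right_box = (mid - overlap, 0, width, height)
    return left_box, right_box


# A US Letter page exported at 300 DPI is 2550 x 3300 pixels.
left_box, right_box = column_boxes(2550, 3300)
# With Pillow (assumed installed): page.crop(left_box).save("left.png")
```

If the gutter is off-center, measure the column boundary in an image editor and adjust the boxes by hand.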
## Problem 6: Handwritten Annotations
**Symptoms:** Handwritten notes over a printed formula confuse the output.

**Why it happens:** TexPixel sees both the printed formula and the handwritten annotations together. It may try to recognize the annotations as part of the formula.

**Fix:** Crop tightly to just the printed formula, excluding any handwriting around it.
## Quick Diagnostic Checklist
Before uploading a problematic PDF:
- [ ] Is it a scan or a born-digital PDF?
- [ ] If a scan, what DPI was it scanned at?
- [ ] Is it password-protected?
- [ ] Does it have a two-column layout?
- [ ] Are there handwritten annotations?

Working through this list resolves the issue 90% of the time.
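The checklist can also be read as a small decision procedure. The sketch below maps each answer to the matching fix from this post; the function name `suggest_fixes` and the `min_dpi` threshold of 300 are assumptions made for illustration, not part of TexPixel.

```python
def suggest_fixes(is_scan, scan_dpi=None, password_protected=False,
                  two_column=False, handwritten=False, min_dpi=300):
    """Map checklist answers to the fixes described in this post."""
    fixes = []
    if password_protected:
        fixes.append("Remove the password protection before uploading.")
    if is_scan:
        # Only suggest a rescan when the DPI is known and below the target.
        if scan_dpi is not None and scan_dpi < min_dpi:
            fixes.append(f"Rescan at {min_dpi} DPI.")
        fixes.append("Export pages as PNG and upload them as images.")
    if two_column:
        fixes.append("Crop to a single column and upload each half separately.")
    if handwritten:
        fixes.append("Crop tightly to the printed formula, excluding handwriting.")
    return fixes
```

For a 150 DPI two-column scan, `suggest_fixes(True, scan_dpi=150, two_column=True)` returns the rescan, image-export, and column-crop suggestions in that order.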
[Upload your PDF →](/app)