
yoge 76f1bde56d feat: add 5 new blog posts (en + zh)

- how-ai-reads-math: plain-English explainer of the recognition pipeline
- student-workflow: lecture-to-LaTeX workflow for students
- pdf-formula-issues: troubleshooting guide for PDF extraction errors
- copy-math-to-word: 3 methods for getting formulas into Word, ranked
- researcher-workflow: digitizing handwritten research notes at scale

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-03-26 16:46:31 +08:00

Why Your PDF Formulas Come Out Wrong (and How to Fix It)

PDF formula extraction should be simple — upload, get LaTeX, done. But sometimes the output looks garbled, symbols are missing, or the extractor says no formulas were found. Here's a breakdown of the most common causes and how to fix each one.

Problem 1: The PDF is a Scan

Symptoms: Symbols look correct on screen but extraction output is garbage or empty.

Why it happens: A scanned PDF is a collection of page images. It has either no text layer at all or one produced by OCR at scan time, which is often poor quality, especially for math. The symbols you see in your PDF reader come from the image, not from extractable text.

Fix: Use TexPixel's image-based pipeline instead. Export individual pages as PNG at 300 DPI from any PDF viewer (in Preview, File → Export with the format set to PNG and the resolution set to 300 pixels/inch; in Adobe Acrobat, the Export PDF feature with an image format selected), then upload the PNG directly. Image-based recognition handles scans correctly; direct PDF text extraction does not.
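If you have many pages to export, a command-line renderer can be quicker than a PDF viewer. Here is a minimal sketch that assumes the `pdftoppm` tool from Poppler is installed; the function names and file names are made up for illustration, and none of this is part of TexPixel.

```python
import subprocess


def pdftoppm_command(pdf_path, out_prefix, dpi=300, first_page=None, last_page=None):
    """Build a pdftoppm command that renders PDF pages to PNG at the given DPI."""
    cmd = ["pdftoppm", "-png", "-r", str(dpi)]
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to render
    if last_page is not None:
        cmd += ["-l", str(last_page)]   # last page to render
    cmd += [pdf_path, out_prefix]
    return cmd


def export_pages(pdf_path, out_prefix, **kwargs):
    """Run pdftoppm (assumed to be on PATH); raises CalledProcessError on failure."""
    subprocess.run(pdftoppm_command(pdf_path, out_prefix, **kwargs), check=True)
```

Calling `export_pages("notes.pdf", "page", first_page=3, last_page=3)` would render only page 3, writing a PNG whose name starts with `page`.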

Problem 2: Low-DPI Scan

Symptoms: Some symbols recognized correctly, others replaced with wrong characters or dropped entirely.

Why it happens: Below about 150 DPI, the strokes in small symbols like \prime, \cdot, or subscript characters are only a pixel or two wide, too coarse to distinguish reliably.

Fix: Rescan at 300 DPI. Many flatbed scanners default to a lower setting such as 200 DPI; bumping to 300 produces dramatically better results. A 300 DPI scan has 2.25 times as many pixels as a 200 DPI one, so expect somewhat larger files. For phone scans, use a dedicated scanner app (e.g., Adobe Scan, Microsoft Lens), which applies automatic sharpening and perspective correction.
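To see why resolution matters, it helps to work out how many pixels a thin stroke actually gets. The sketch below assumes a printed stroke about 0.2 mm wide, which is an illustrative figure rather than a measurement; real stroke widths depend on the font and size.

```python
MM_PER_INCH = 25.4


def stroke_width_px(stroke_mm, dpi):
    """Width in pixels of a stroke that is stroke_mm wide on paper."""
    return stroke_mm / MM_PER_INCH * dpi


def pixel_count_ratio(new_dpi, old_dpi):
    """How many times more pixels a page has at new_dpi than at old_dpi."""
    return (new_dpi / old_dpi) ** 2


if __name__ == "__main__":
    for dpi in (100, 150, 200, 300):
        # 0.2 mm is an assumed stroke width, used only for illustration
        print(dpi, "DPI:", round(stroke_width_px(0.2, dpi), 2), "px")
    print("300 vs 200 DPI pixel count:", pixel_count_ratio(300, 200))
```

Under that assumption, the stroke is barely more than one pixel wide at 150 DPI and a bit over two pixels at 300 DPI. The raw pixel count grows with the square of the DPI, although compressed file size usually grows by less than that.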

Problem 3: Password-Protected PDF

Symptoms: "No formulas found" or upload fails entirely.

Why it happens: Encrypted PDFs require a password to access their content stream. TexPixel cannot process the content of a locked file.

Fix: Remove the password protection before uploading. In Preview (Mac), open with the password, then File → Export as PDF — the exported file won't have the password. In Adobe Reader, use File → Print → Save as PDF.
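If you are not sure whether a file is encrypted, a rough check is to look for an `/Encrypt` entry, which encrypted PDFs carry in their trailer. The sketch below is only a heuristic with hypothetical function names: it can report a false positive if that byte sequence happens to appear elsewhere in the file, and it also flags PDFs that only restrict permissions without asking for a password to open.

```python
def looks_encrypted(pdf_bytes):
    """Heuristic: True if the raw PDF bytes contain an /Encrypt entry."""
    return b"/Encrypt" in pdf_bytes


def is_probably_encrypted(path):
    """Read a PDF from disk and apply the heuristic above."""
    with open(path, "rb") as fh:
        return looks_encrypted(fh.read())
```

A dedicated PDF library gives a more reliable answer; this is just a quick first look before uploading.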

Problem 4: Formulas Stored as Vector Paths

Symptoms: PDF looks perfect, but extraction returns nothing or incorrect text.

Why it happens: Some PDF generators (certain Word versions, some online LaTeX renderers) embed math as raster images or vector outlines rather than text. The formulas are essentially drawings, so there is no character stream to extract.

Fix: Export the page as a high-resolution PNG (300 DPI), then upload as an image. TexPixel's visual recognition pipeline handles vector-rendered formulas well.

Problem 5: Multi-Column Layout

Symptoms: Formulas from two columns are merged or interleaved in the output.

Why it happens: PDF text streams don't always encode reading order correctly, especially in two-column academic papers.

Fix: Crop to a single column before uploading. Export the page as a PNG, use any image editor to crop it into left and right halves, then upload each half separately.
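For batches of pages, the left and right crop rectangles can be computed instead of drawn by hand. This is a minimal sketch that assumes the two columns are split at the horizontal centre of the page; the `trim` parameter is a hypothetical knob for shaving pixels off the gutter. The boxes use the (left, upper, right, lower) order that Pillow's `Image.crop` expects, though the function itself needs no imaging library.

```python
def column_boxes(width, height, trim=0):
    """Return (left_box, right_box) crop rectangles for a two-column page.

    Each box is (left, upper, right, lower) in pixels. `trim` removes that
    many pixels on each side of the centre line to drop the gutter.
    """
    mid = width // 2
    left_box = (0, 0, max(0, mid - trim), height)
    right_box = (min(width, mid + trim), 0, width, height)
    return left_box, right_box


if __name__ == "__main__":
    # Roughly an A4 page rendered at 300 DPI
    print(column_boxes(2480, 3508, trim=20))
```

Pages with an off-centre gutter or a full-width title block will need the split point adjusted by hand.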

Problem 6: Handwritten Annotations

Symptoms: Handwritten notes over a printed formula confuse the output.

Why it happens: TexPixel sees both the printed formula and the handwritten annotations together. It may try to recognize the annotations as part of the formula.

Fix: Crop tightly to just the printed formula, excluding any handwriting around it.

Quick Diagnostic Checklist

Before uploading a problematic PDF:

Is it a scan or a born-digital PDF?
If a scan, what DPI was it scanned at?
Is it password-protected?
Does it have a two-column layout?
Are there handwritten annotations?

Working through this list resolves the issue 90% of the time.
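The checklist can also be written down as a small lookup from answers to fixes. This is only a sketch: the class and field names are invented for illustration, and the suggestions simply restate the fixes described above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PdfTraits:
    is_scan: bool = False
    scan_dpi: Optional[int] = None  # None if unknown or not a scan
    password_protected: bool = False
    two_column: bool = False
    handwritten_annotations: bool = False


def suggest_fixes(traits: PdfTraits) -> List[str]:
    """Map checklist answers to the fixes described in this post."""
    fixes = []
    if traits.password_protected:
        fixes.append("Remove the password protection before uploading.")
    if traits.is_scan:
        if traits.scan_dpi is not None and traits.scan_dpi < 300:
            fixes.append("Rescan at 300 DPI.")
        fixes.append("Export pages as PNG and upload them as images.")
    if traits.two_column:
        fixes.append("Crop to a single column and upload each half separately.")
    if traits.handwritten_annotations:
        fixes.append("Crop tightly to the printed formula.")
    return fixes
```

For example, `suggest_fixes(PdfTraits(is_scan=True, scan_dpi=150))` suggests rescanning and then uploading the pages as images.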

Upload your PDF →
