doc_ai_frontend/content/blog/en/2026-02-15-pdf-formula-issues.md


---
title: "I Tried to Extract Formulas from My Professor's PDF. Here's What I Learned."
description: A real-world account of what goes wrong with PDF formula extraction — and why most problems come down to one of three root causes
slug: pdf-formula-issues
date: 2026-02-15
tags: [troubleshooting, PDF]
---
# I Tried to Extract Formulas from My Professor's PDF. Here's What I Learned.
Last semester I was working through a 200-page lecture notes PDF — the kind that gets scanned from printed transparencies, emailed as a file attachment, and opens with a slightly-off angle on every page. I wanted to pull the key equations into my own notes. What followed was an education in how PDFs actually store (or don't store) mathematical content.
## The First Surprise: Not All PDFs Are the Same
I naively assumed "PDF with formulas" meant "formulas I can extract." Not true.
There are at least three fundamentally different kinds of PDFs floating around in academic circles, and they behave completely differently:
**Born-digital PDFs** (generated from LaTeX, Word, or typesetting software) usually contain a real text layer: characters placed on the page with fonts and coordinates. Extraction from these is fast and 95%+ accurate, because the formula structure is essentially already there.
**Scanned PDFs** are just photographs of printed pages packaged into a container. There's no text layer. Extraction works through image recognition, and accuracy depends entirely on scan quality. My professor's notes were this kind.
**Hybrid PDFs** have a text layer added by OCR software after scanning. Quality varies wildly — sometimes great, sometimes the "text" layer is completely wrong. These are the most unpredictable.
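
The three kinds can be told apart with two measurements per page: how much text the text layer holds, and how much of the page is covered by a raster image. Here is a minimal sketch of that decision. `PageStats`, `classify_page`, and both thresholds are hypothetical names and values chosen for illustration; the measurements themselves would come from whatever PDF library you use.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page measurements taken with a PDF library of your choice."""
    text_chars: int        # characters found in the page's text layer
    image_coverage: float  # fraction of the page covered by raster images, 0.0 to 1.0


def classify_page(stats: PageStats, min_chars: int = 20, full_page: float = 0.9) -> str:
    """Return 'born-digital', 'scanned', or 'hybrid' using rough, assumed thresholds."""
    has_text = stats.text_chars >= min_chars
    is_page_image = stats.image_coverage >= full_page
    if is_page_image and has_text:
        return "hybrid"       # a scan with an OCR text layer laid over it
    if is_page_image:
        return "scanned"      # pixels only, nothing to read directly
    return "born-digital"     # no full-page scan; content is text or vector drawing


print(classify_page(PageStats(text_chars=0, image_coverage=1.0)))
```

A page with no text and no full-page image still lands in "born-digital" here, which is deliberate: that is the case where formulas were drawn as shapes rather than typed as characters.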
## The Three Root Causes of Most Failures
After a lot of trial and error, I found that failed extractions almost always come back to one of three things:
**1. Resolution.** The scan was done at 150 DPI instead of 300. At low resolution, small symbols — subscripts, primes, dots — become a few pixels wide. The model can't reliably distinguish `\prime` from a stray speck. Rescanning at 300 DPI fixed more than half my problems.
**2. Encryption.** Some PDFs are password-protected or have content restrictions that prevent any tool from reading the content stream. The PDF appears to open fine, but nothing can extract from it. Removing the password (File → Export as PDF in Preview, without the password lock) solved this.
**3. Formulas stored as vector paths.** Some PDF generators draw equations as shapes rather than encoding them as characters. To any extraction tool, these formulas are invisible — just abstract geometry. The only way around this is to render the page as an image and run visual recognition on that instead.
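
The resolution problem is easy to put numbers on. A point is 1/72 of an inch, so a symbol's height in pixels is its size in points divided by 72, times the scan DPI. The sketch below runs that arithmetic; the 6 pt subscript and 2 pt prime mark are assumed sizes for illustration, not measurements from my professor's notes.

```python
POINTS_PER_INCH = 72  # typographic points per inch


def size_in_pixels(size_pt: float, dpi: int) -> float:
    """Convert a printed size in points to pixels at a given scan resolution."""
    return size_pt / POINTS_PER_INCH * dpi


# Assumed sizes for illustration: a 6 pt subscript and a prime mark about 2 pt tall.
for label, size_pt in [("subscript (6 pt)", 6.0), ("prime mark (2 pt)", 2.0)]:
    for dpi in (150, 300):
        print(f"{label} at {dpi} DPI: {size_in_pixels(size_pt, dpi):.1f} px")
```

Under those assumptions the prime mark is about 4 pixels tall at 150 DPI and about 8 at 300. Four pixels is not much to separate a prime from a speck of dust on the scanner glass.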
## What Actually Worked
For my professor's scanned notes, this is the workflow that worked:
1. Export each page as a 300 DPI PNG using Preview
2. Upload the PNG to TexPixel
3. Get clean LaTeX back in under a second
Not the direct-PDF workflow I was hoping for, but reliable. The image-based pipeline doesn't care whether the original was scanned or born-digital — it just sees pixels and reads the math.
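
Exporting 200 pages by hand in Preview gets old quickly. As an alternative to what I actually did, the export step can be scripted with `pdftoppm` from Poppler, which renders each page to an image at a chosen resolution. This is a sketch under assumptions: it requires `pdftoppm` to be installed, the function names and the `page` output prefix are my own, and the upload step is left out because it depends on the tool you send the images to.

```python
import shutil
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path: str, out_prefix: str, dpi: int = 300) -> list[str]:
    """Build a pdftoppm call that writes one PNG per page at the given DPI."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, out_prefix]


def export_pages(pdf_path: str, out_dir: str, dpi: int = 300) -> list[Path]:
    """Render every page of the PDF to PNG and return the files, sorted by name."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm not found; install Poppler first")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if pdftoppm exits with an error
    subprocess.run(pdftoppm_command(pdf_path, str(out / "page"), dpi), check=True)
    return sorted(out.glob("page-*.png"))


print(" ".join(pdftoppm_command("lecture-notes.pdf", "pages/page")))
```

Calling `export_pages("lecture-notes.pdf", "pages")` would leave one PNG per page in `pages/`, ready to upload one at a time.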
## The Bigger Lesson
PDF is a presentation format, not a data format. It's optimized for how things look, not for what they mean. Mathematical notation in particular gets mangled in transit — rendered, rasterized, path-converted — in ways that destroy the underlying structure.
The most reliable signal is always the image. When in doubt, export to PNG and let visual recognition do the work.
---
For a systematic reference on PDF types, file limits, and what TexPixel can handle, see the [PDF Extraction documentation →](/docs/pdf-extraction)