yoge 99e1314bf9 refact: eliminate blog/docs content overlap

- Delete blog/copy-math-to-word (EN+ZH) — identical to docs/copy-to-word
- Rewrite blog/pdf-formula-issues as narrative troubleshooting story;
  operational steps now link out to docs/pdf-extraction
- Add "Further reading" cross-links: 4 docs → relevant blog posts
- Add "See also" cross-links: 3 blog posts → relevant docs

Docs = product reference; Blog = narrative/use cases/opinions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-03-26 16:52:27 +08:00

3.6 KiB


I Tried to Extract Formulas from My Professor's PDF. Here's What I Learned.

Last semester I was working through a 200-page lecture notes PDF, the kind that was scanned from printed transparencies, emailed around as an attachment, and sits at a slightly-off angle on every page. I wanted to pull the key equations into my own notes. What followed was an education in how PDFs actually store (or don't store) mathematical content.

The First Surprise: Not All PDFs Are the Same

I naively assumed "PDF with formulas" meant "formulas I can extract." Not true.

There are at least three fundamentally different kinds of PDFs floating around in academic circles, and they behave completely differently:

Born-digital PDFs (generated from LaTeX, Word, or typesetting software) usually store each symbol as a font glyph with a character code and an exact position on the page. Extraction from these is fast and 95%+ accurate, because the characters and layout a tool needs are already in the file.

Scanned PDFs are just photographs of printed pages packaged into a container. There's no text layer. Extraction works through image recognition, and accuracy depends entirely on scan quality. My professor's notes were this kind.

Hybrid PDFs have a text layer added by OCR software after scanning. Quality varies wildly — sometimes great, sometimes the "text" layer is completely wrong. These are the most unpredictable.
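Here's a rough sketch of how the three types could be told apart automatically. It assumes you have already measured two things per page with whatever PDF library you use: how many characters the text layer holds, and how much of the page is covered by a raster image. The `PageStats` type, the `classify_pdf` function, and both thresholds are my own illustrative choices, not something any particular tool does.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Per-page measurements (hypothetical; gather them with any PDF library)."""
    text_chars: int        # characters found in the page's text layer
    image_coverage: float  # fraction of the page covered by raster images, 0.0 to 1.0


def classify_pdf(pages: list[PageStats],
                 min_chars: int = 20,
                 full_page_image: float = 0.9) -> str:
    """Guess the PDF type from page statistics. Thresholds are illustrative guesses."""
    if not pages:
        raise ValueError("need at least one page")
    scanned_pages = sum(p.image_coverage >= full_page_image for p in pages)
    text_pages = sum(p.text_chars >= min_chars for p in pages)
    mostly_scanned = scanned_pages > len(pages) / 2
    mostly_text = text_pages > len(pages) / 2
    if mostly_scanned and mostly_text:
        return "hybrid"        # page images with an OCR text layer on top
    if mostly_scanned:
        return "scanned"       # page images, no usable text layer
    if mostly_text:
        return "born-digital"  # real text, no full-page images
    return "unknown"


print(classify_pdf([PageStats(0, 1.0), PageStats(3, 0.97)]))  # scanned
print(classify_pdf([PageStats(1500, 0.0)]))                   # born-digital
print(classify_pdf([PageStats(1200, 0.98)]))                  # hybrid
```

A "hybrid" result only says that a text layer exists. It says nothing about whether that layer is right, which is exactly why this type is the unpredictable one.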

The Three Root Causes of Most Failures

After a lot of trial and error, I found that failed extractions almost always come back to one of three things:

1. Resolution. The scan was done at 150 DPI instead of 300. At low resolution, small symbols — subscripts, primes, dots — become a few pixels wide. The model can't reliably distinguish \prime from a stray speck. Rescanning at 300 DPI fixed more than half my problems.

2. Encryption. Some PDFs are password-protected or carry content restrictions that stop tools from reading the content streams. The PDF appears to open fine, but nothing can extract from it. Re-exporting an unprotected copy (File → Export as PDF in Preview, without setting a password) solved this.

3. Formulas stored as vector paths. Some PDF generators draw equations as shapes rather than encoding them as characters. To any extraction tool, these formulas are invisible — just abstract geometry. The only way around this is to render the page as an image and run visual recognition on that instead.
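Causes 2 and 3 can be checked roughly with nothing but the standard library. The sketch below rests on two facts about the PDF format: an encrypted file has an `/Encrypt` entry in its trailer dictionary, and a page's content stream uses different operators for showing text (`Tj`, `TJ`) than for painting paths (`f`, `S`, and friends). The function names are mine, the tokenizer is deliberately crude (it does not handle nested parentheses in strings), and `count_ops` assumes the content stream has already been decompressed and decoded to text by some other tool.

```python
import re

TEXT_SHOW_OPS = {"Tj", "TJ", "'", '"'}                             # draw characters from a font
PATH_PAINT_OPS = {"S", "s", "f", "F", "f*", "B", "B*", "b", "b*"}  # stroke or fill a path

# String operands such as (x^2) or <0041> can contain operator-like text,
# so blank them out (along with array brackets) before splitting into tokens.
_OPERANDS = re.compile(r"\((?:\\.|[^\\()])*\)|<[0-9A-Fa-f\s]*>|[\[\]]")


def looks_encrypted(pdf_bytes: bytes) -> bool:
    """Heuristic: look for the /Encrypt key anywhere in the raw file bytes."""
    if b"%PDF-" not in pdf_bytes[:1024]:
        raise ValueError("not a PDF file")
    return b"/Encrypt" in pdf_bytes


def count_ops(content_stream: str) -> tuple[int, int]:
    """Return (text-showing operators, path-painting operators) in a decoded stream."""
    tokens = _OPERANDS.sub(" ", content_stream).split()
    text_ops = sum(t in TEXT_SHOW_OPS for t in tokens)
    path_ops = sum(t in PATH_PAINT_OPS for t in tokens)
    return text_ops, path_ops


# Two tiny hand-written content streams for illustration.
text_page = "BT /F1 12 Tf 72 700 Td (E = mc^2) Tj ET"
path_page = "0 0 m 10 0 l 10 10 l h f 20 0 m 30 10 l S"

print(count_ops(text_page))  # (1, 0): the formula is real text
print(count_ops(path_page))  # (0, 2): only shapes, nothing to extract as characters
```

A page full of equations that reports zero text operators and plenty of path operators is a strong hint that the formulas were drawn as geometry, and that rendering to an image is the way forward.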

What Actually Worked

For my professor's scanned notes, the workflow that worked:

1. Export each page as a 300 DPI PNG using Preview
2. Upload the PNG to TexPixel
3. Get clean LaTeX back in under a second

Not the direct-PDF workflow I was hoping for, but reliable. The image-based pipeline doesn't care whether the original was scanned or born-digital — it just sees pixels and reads the math.
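The 300 DPI figure is easier to appreciate with a little arithmetic. This sketch converts a physical size to pixels at a given resolution, and works out the effective resolution of a full-page image from its pixel width. The 1 mm subscript height is an assumed, illustrative size, and the function names are mine.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72  # PDF page sizes are measured in points


def pixels_for(size_mm: float, dpi: int) -> float:
    """Pixels spanned by a feature of the given physical size at a scan resolution."""
    return size_mm / MM_PER_INCH * dpi


def effective_dpi(image_width_px: int, page_width_pt: float) -> float:
    """Resolution of a full-page image, from its pixel width and the page width in points."""
    return image_width_px / (page_width_pt / POINTS_PER_INCH)


for dpi in (150, 300):
    print(f"{dpi} DPI: a 1 mm subscript spans about {pixels_for(1.0, dpi):.1f} px")

# A US Letter page is 612 pt (8.5 in) wide.
print(f"1275 px across a Letter page = {effective_dpi(1275, 612):.0f} DPI")
```

At 150 DPI that subscript is about 5.9 pixels tall; at 300 DPI it is about 11.8. The second function is handy for checking an exported PNG: a Letter-sized page should come out around 2550 pixels wide if the export really was 300 DPI.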

The Bigger Lesson

PDF is a presentation format, not a data format. It's optimized for how things look, not for what they mean. Mathematical notation in particular gets mangled in transit — rendered, rasterized, path-converted — in ways that destroy the underlying structure.

The most reliable signal is always the image. When in doubt, export to PNG and let visual recognition do the work.

For a systematic reference on PDF types, file limits, and what TexPixel can handle, see the PDF Extraction documentation →
