refact: eliminate blog/docs content overlap

- Delete blog/copy-math-to-word (EN+ZH) — identical to docs/copy-to-word
- Rewrite blog/pdf-formula-issues as narrative troubleshooting story;
  operational steps now link out to docs/pdf-extraction
- Add "Further reading" cross-links: 4 docs → relevant blog posts
- Add "See also" cross-links: 3 blog posts → relevant docs

Docs = product reference; Blog = narrative/use cases/opinions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 16:52:27 +08:00
parent 76f1bde56d
commit 99e1314bf9
18 changed files with 82 additions and 242 deletions


@@ -1,73 +1,53 @@
---
title: "Why Your PDF Formulas Come Out Wrong (and How to Fix It)"
description: The most common reasons PDF formula extraction produces errors, and exactly how to fix each one
title: "I Tried to Extract Formulas from My Professor's PDF. Here's What I Learned."
description: A real-world account of what goes wrong with PDF formula extraction — and why most problems come down to one of three root causes
slug: pdf-formula-issues
date: 2026-02-15
tags: [troubleshooting, PDF, tips]
tags: [troubleshooting, PDF]
---
# Why Your PDF Formulas Come Out Wrong (and How to Fix It)
# I Tried to Extract Formulas from My Professor's PDF. Here's What I Learned.
PDF formula extraction should be simple — upload, get LaTeX, done. But sometimes the output looks garbled, symbols are missing, or the extractor says no formulas were found. Here's a breakdown of the most common causes and how to fix each one.
Last semester I was working through a 200-page PDF of lecture notes: the kind that gets scanned from printed transparencies, emailed as an attachment, and opens with every page slightly askew. I wanted to pull the key equations into my own notes. What followed was an education in how PDFs actually store (or don't store) mathematical content.
## Problem 1: The PDF is a Scan
## The First Surprise: Not All PDFs Are the Same
**Symptoms:** Symbols look correct on screen but extraction output is garbage or empty.
I naively assumed "PDF with formulas" meant "formulas I can extract." Not true.
**Why it happens:** A scanned PDF is just a collection of images; there's no real text layer. Any text your PDF reader lets you select comes from OCR performed at scan time (often poor quality); otherwise you are looking at the image itself.
There are at least three fundamentally different kinds of PDFs floating around in academic circles, and they behave completely differently:
**Fix:** Run TexPixel's image-based pipeline instead. Export individual pages as PNG at 300 DPI using any PDF viewer (File → Export as Image in Preview, or Adobe Acrobat's Export PDF feature), then upload the PNG directly. Image-based recognition handles scans correctly; direct PDF text extraction does not.
**Born-digital PDFs** (generated from LaTeX, Word, or typesetting software) typically encode formulas as real characters positioned on the page, not as pictures. Extraction from these is fast and 95%+ accurate, because the formula structure is essentially already there.
## Problem 2: Low-DPI Scan
**Scanned PDFs** are just photographs of printed pages packaged into a container. There's no text layer. Extraction works through image recognition, and accuracy depends entirely on scan quality. My professor's notes were this kind.
**Symptoms:** Some symbols recognized correctly, others replaced with wrong characters or dropped entirely.
**Hybrid PDFs** have a text layer added by OCR software after scanning. Quality varies wildly — sometimes great, sometimes the "text" layer is completely wrong. These are the most unpredictable.
**Why it happens:** Below about 150 DPI, strokes in small symbols like `\prime`, `\cdot`, or subscript characters become a few pixels wide — too blurry to reliably distinguish.
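The resolution effect is simple arithmetic: the number of pixels across a feature is its physical size in inches times the DPI. The sketch below works this out for a stroke width of 0.2 mm, which is an assumed value chosen for illustration, not a measurement from any particular document.

```python
MM_PER_INCH = 25.4


def pixels_across(size_mm: float, dpi: int) -> float:
    """Return how many pixels span a feature of the given physical size."""
    return size_mm / MM_PER_INCH * dpi


# Hypothetical stroke width for a small printed symbol (an assumption).
STROKE_MM = 0.2

if __name__ == "__main__":
    for dpi in (150, 200, 300):
        print(f"{dpi} DPI: {pixels_across(STROKE_MM, dpi):.1f} px across the stroke")
```

Doubling the DPI doubles the pixels across each stroke, which is why a 300 DPI scan gives the recognizer noticeably more to work with than a 150 DPI one.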
## The Three Root Causes of Most Failures
**Fix:** Rescan at 300 DPI. Most modern flatbed scanners default to 200 DPI; bumping to 300 produces dramatically better results without significantly increasing file size. For phone scans, use a dedicated scanner app (e.g., Adobe Scan, Microsoft Lens) which applies automatic sharpening and perspective correction.
After a lot of trial and error, I found that failed extractions almost always come back to one of three things:
## Problem 3: Password-Protected PDF
**1. Resolution.** The scan was done at 150 DPI instead of 300. At low resolution, small symbols — subscripts, primes, dots — become a few pixels wide. The model can't reliably distinguish `\prime` from a stray speck. Rescanning at 300 DPI fixed more than half my problems.
**Symptoms:** "No formulas found" or upload fails entirely.
**2. Encryption.** Some PDFs are password-protected or have content restrictions that prevent any tool from reading the content stream. The PDF appears to open fine, but nothing can extract from it. Removing the password (File → Export as PDF in Preview, without the password lock) solved this.
**Why it happens:** Encrypted PDFs require a password to access their content stream. TexPixel cannot process the content of a locked file.
**3. Formulas stored as vector paths.** Some PDF generators draw equations as shapes rather than encoding them as characters. To any extraction tool, these formulas are invisible — just abstract geometry. The only way around this is to render the page as an image and run visual recognition on that instead.
**Fix:** Remove the password protection before uploading. In Preview (Mac), open with the password, then File → Export as PDF — the exported file won't have the password. In Adobe Reader, use File → Print → Save as PDF.
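A quick way to check for encryption before uploading is to look for an `/Encrypt` entry, which encrypted PDFs declare in their trailer. The sketch below is a rough heuristic under that assumption: it scans the raw bytes and does not parse the file, so it can give false positives if the same byte sequence happens to appear elsewhere. The function names are hypothetical, and a real PDF parsing library would be more reliable.

```python
from pathlib import Path


def looks_encrypted(pdf_bytes: bytes) -> bool:
    """Heuristic: report True if the raw PDF bytes mention an /Encrypt entry."""
    return b"/Encrypt" in pdf_bytes


def file_looks_encrypted(path: Path) -> bool:
    """Read a PDF from disk and apply the byte-level heuristic."""
    return looks_encrypted(path.read_bytes())
```

If the check returns `True`, the password-removal steps above are worth trying first.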
## What Actually Worked
## Problem 4: Formulas Stored as Vector Paths
For my professor's scanned notes, the workflow that worked:
**Symptoms:** PDF looks perfect, but extraction returns nothing or incorrect text.
1. Export each page as a 300 DPI PNG using Preview
2. Upload the PNG to TexPixel
3. Get clean LaTeX back in under a second
**Why it happens:** Some PDF generators (certain Word versions, some online LaTeX renderers) convert math into vector paths or bitmaps, so the formulas are essentially drawings, not characters. There's no character stream to extract.
Not the direct-PDF workflow I was hoping for, but reliable. The image-based pipeline doesn't care whether the original was scanned or born-digital — it just sees pixels and reads the math.
**Fix:** Export the page as a high-resolution PNG (300 DPI), then upload as an image. TexPixel's visual recognition pipeline handles vector-rendered formulas well.
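For documents with many pages, the export step can be scripted. The sketch below swaps in Poppler's `pdftoppm` command-line tool in place of Preview or Acrobat; it assumes `pdftoppm` is installed and on the `PATH`. The helper names and file names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def pdftoppm_command(pdf: Path, out_prefix: Path, dpi: int = 300) -> List[str]:
    """Build the pdftoppm invocation that renders every page to PNG."""
    # -r sets the resolution in DPI; -png selects PNG output.
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(out_prefix)]


def export_pages(pdf: Path, out_prefix: Path, dpi: int = 300) -> None:
    """Run pdftoppm; raises CalledProcessError if the tool reports failure."""
    subprocess.run(pdftoppm_command(pdf, out_prefix, dpi), check=True)
```

Calling `export_pages(Path("notes.pdf"), Path("page"))` would write one numbered PNG per page next to the prefix, ready to upload individually.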
## The Bigger Lesson
## Problem 5: Multi-Column Layout
PDF is a presentation format, not a data format. It's optimized for how things look, not for what they mean. Mathematical notation in particular gets mangled in transit — rendered, rasterized, path-converted — in ways that destroy the underlying structure.
**Symptoms:** Formulas from two columns are merged or interleaved in the output.
The most reliable signal is always the image. When in doubt, export to PNG and let visual recognition do the work.
**Why it happens:** PDF text streams don't always encode reading order correctly, especially in two-column academic papers.
---
**Fix:** Crop to a single column before uploading. Use any image editor to crop the page into left and right halves, then upload each separately.
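The crop itself is just two rectangles. As a minimal sketch, the function below computes left and right crop boxes for a page image, assuming the gap between columns sits at the horizontal center of the page. The function name and the `gutter` parameter are hypothetical; the boxes use the `(left, upper, right, lower)` order that Pillow's `Image.crop` accepts.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, upper, right, lower) in pixels


def column_boxes(width: int, height: int, gutter: int = 0) -> Tuple[Box, Box]:
    """Split a page into left and right column boxes around a centered gutter."""
    mid = width // 2
    half_gutter = gutter // 2
    left_box = (0, 0, mid - half_gutter, height)
    right_box = (mid + half_gutter, 0, width, height)
    return left_box, right_box
```

For a US Letter page rendered at 300 DPI (2550 by 3300 pixels), `column_boxes(2550, 3300)` splits the page at x = 1275. Pages with an off-center gutter would need the midpoint adjusted by hand.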
## Problem 6: Handwritten Annotations
**Symptoms:** Handwritten notes over a printed formula confuse the output.
**Why it happens:** TexPixel sees both the printed formula and the handwritten annotations together. It may try to recognize the annotations as part of the formula.
**Fix:** Crop tightly to just the printed formula, excluding any handwriting around it.
## Quick Diagnostic Checklist
Before uploading a problematic PDF:
- [ ] Is it a scan or a born-digital PDF?
- [ ] If a scan, what DPI was it scanned at?
- [ ] Is it password-protected?
- [ ] Does it have a two-column layout?
- [ ] Are there handwritten annotations?
Working through this list resolves the issue 90% of the time.
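The checklist maps naturally onto a small decision function. The sketch below is illustrative only: `PdfInfo` and `suggest_fixes` are hypothetical names, the answers are supplied by the reader and not detected automatically, and the 300 DPI target is taken from the rescanning advice above.

```python
from dataclasses import dataclass
from typing import List, Optional

TARGET_DPI = 300  # recommended scan resolution from the advice above


@dataclass
class PdfInfo:
    """Answers to the diagnostic checklist for one PDF."""
    is_scan: bool
    scan_dpi: Optional[int]  # None if unknown or not a scan
    is_encrypted: bool
    two_column: bool
    has_handwriting: bool


def suggest_fixes(info: PdfInfo) -> List[str]:
    """Translate checklist answers into the matching fixes, in checklist order."""
    fixes: List[str] = []
    if info.is_scan:
        if info.scan_dpi is not None and info.scan_dpi < TARGET_DPI:
            fixes.append("Rescan at 300 DPI.")
        fixes.append("Export pages as PNG and upload them as images.")
    if info.is_encrypted:
        fixes.append("Remove the password protection before uploading.")
    if info.two_column:
        fixes.append("Crop to a single column before uploading.")
    if info.has_handwriting:
        fixes.append("Crop tightly to the printed formula, excluding handwriting.")
    return fixes
```

A 150 DPI two-column scan, for example, would come back with three suggestions: rescan, export as PNG, and crop to one column.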
[Upload your PDF →](/app)
For a systematic reference on PDF types, file limits, and what TexPixel can handle, see the [PDF Extraction documentation →](/docs/pdf-extraction)