|
|
|
|
@@ -1,73 +1,53 @@
|
|
|
|
|
---
|
|
|
|
|
title: "Why Your PDF Formulas Come Out Wrong (and How to Fix It)"
|
|
|
|
|
description: The most common reasons PDF formula extraction produces errors, and exactly how to fix each one
|
|
|
|
|
title: "I Tried to Extract Formulas from My Professor's PDF. Here's What I Learned."
|
|
|
|
|
description: A real-world account of what goes wrong with PDF formula extraction — and why most problems come down to one of three root causes
|
|
|
|
|
slug: pdf-formula-issues
|
|
|
|
|
date: 2026-02-15
|
|
|
|
|
tags: [troubleshooting, PDF, tips]
|
|
|
|
|
tags: [troubleshooting, PDF]
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
# Why Your PDF Formulas Come Out Wrong (and How to Fix It)
|
|
|
|
|
# I Tried to Extract Formulas from My Professor's PDF. Here's What I Learned.
|
|
|
|
|
|
|
|
|
|
PDF formula extraction should be simple — upload, get LaTeX, done. But sometimes the output looks garbled, symbols are missing, or the extractor says no formulas were found. Here's a breakdown of the most common causes and how to fix each one.
|
|
|
|
|
Last semester I was working through a 200-page PDF of lecture notes, the kind that gets scanned from printed transparencies, emailed as an attachment, and opens slightly crooked on every page. I wanted to pull the key equations into my own notes. What followed was an education in how PDFs actually store (or don't store) mathematical content.
|
|
|
|
|
|
|
|
|
|
## Problem 1: The PDF is a Scan
|
|
|
|
|
## The First Surprise: Not All PDFs Are the Same
|
|
|
|
|
|
|
|
|
|
**Symptoms:** Symbols look correct on screen but extraction output is garbage or empty.
|
|
|
|
|
I naively assumed "PDF with formulas" meant "formulas I can extract." Not true.
|
|
|
|
|
|
|
|
|
|
**Why it happens:** A scanned PDF is just a collection of page images with no real text layer. What you read on screen is the image itself; any selectable text comes from OCR run at scan time, which is often poor quality.
|
|
|
|
|
There are at least three fundamentally different kinds of PDFs floating around in academic circles, and they behave completely differently:
|
|
|
|
|
|
|
|
|
|
**Fix:** Run TexPixel's image-based pipeline instead. Export individual pages as PNG at 300 DPI using any PDF viewer (File → Export as Image in Preview, or Adobe Acrobat's Export PDF feature), then upload the PNG directly. Image-based recognition handles scans correctly; direct PDF text extraction does not.
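
If there are many pages, the export step can be scripted rather than done by hand in a viewer. The sketch below is one possible approach, not part of TexPixel: it assumes the Poppler command-line tool `pdftoppm` is installed, and the function names `build_export_command` and `export_pages` are made up for this example.

```python
import shutil
import subprocess
from pathlib import Path


def build_export_command(pdf_path, out_prefix, dpi=300):
    """Return the pdftoppm argument list that renders every page to PNG."""
    # -png selects PNG output; -r sets the rendering resolution in DPI.
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(out_prefix)]


def export_pages(pdf_path, out_dir, dpi=300):
    """Render each page of pdf_path into out_dir and return the PNG paths."""
    # Assumption: Poppler's pdftoppm is on PATH. If not, export from a viewer.
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm not found; install Poppler or export from a PDF viewer")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_export_command(pdf_path, out_dir / "page", dpi), check=True)
    # pdftoppm names its output <prefix>-<page number>.png
    return sorted(out_dir.glob("page-*.png"))
```

Calling `export_pages("notes.pdf", "pages")` (a hypothetical file name) should leave one PNG per page in `pages/`, ready to upload individually.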
|
|
|
|
|
**Born-digital PDFs** (generated from LaTeX, Word, or typesetting software) usually store the math as real characters set in embedded fonts. Extraction from these is fast and 95%+ accurate because the formula structure is essentially already there.
|
|
|
|
|
|
|
|
|
|
## Problem 2: Low-DPI Scan
|
|
|
|
|
**Scanned PDFs** are just photographs of printed pages packaged into a container. There's no text layer. Extraction works through image recognition, and accuracy depends entirely on scan quality. My professor's notes were this kind.
|
|
|
|
|
|
|
|
|
|
**Symptoms:** Some symbols recognized correctly, others replaced with wrong characters or dropped entirely.
|
|
|
|
|
**Hybrid PDFs** have a text layer added by OCR software after scanning. Quality varies wildly — sometimes great, sometimes the "text" layer is completely wrong. These are the most unpredictable.
|
|
|
|
|
|
|
|
|
|
**Why it happens:** Below about 150 DPI, strokes in small symbols like `\prime`, `\cdot`, or subscript characters become a few pixels wide — too blurry to reliably distinguish.
|
|
|
|
|
## The Three Root Causes of Most Failures
|
|
|
|
|
|
|
|
|
|
**Fix:** Rescan at 300 DPI. Many flatbed scanners default to a lower setting such as 200 DPI; going from 200 to 300 DPI gives 2.25 times as many pixels, so files grow somewhat, but recognition improves dramatically. For phone scans, use a dedicated scanner app (e.g., Adobe Scan, Microsoft Lens), which applies automatic sharpening and perspective correction.
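
To see why resolution matters so much for small symbols, it helps to convert printed sizes into pixels. The snippet below is a rough illustration only: the feature sizes (a 0.4 pt stroke, a 5 pt subscript) are assumed values picked for the example, not measurements from any particular document.

```python
POINTS_PER_INCH = 72  # typographic points per inch, as used in PDF


def size_in_pixels(size_pt, dpi):
    """Convert a printed size in points to pixels at a given scan resolution."""
    return size_pt / POINTS_PER_INCH * dpi


# Hypothetical feature sizes, chosen only for illustration.
FEATURES_PT = {"thin stroke": 0.4, "subscript height": 5.0}

for name, size_pt in FEATURES_PT.items():
    for dpi in (150, 200, 300):
        print(f"{name:>16} at {dpi} DPI: {size_in_pixels(size_pt, dpi):.1f} px")
```

Under these assumptions a thin stroke is narrower than a single pixel at 150 DPI, and doubling the resolution doubles every dimension the recognizer has to work with.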
|
|
|
|
|
After a lot of trial and error, I found that failed extractions almost always come back to one of three things:
|
|
|
|
|
|
|
|
|
|
## Problem 3: Password-Protected PDF
|
|
|
|
|
**1. Resolution.** The scan was done at 150 DPI instead of 300. At low resolution, small symbols — subscripts, primes, dots — become a few pixels wide. The model can't reliably distinguish `\prime` from a stray speck. Rescanning at 300 DPI fixed more than half my problems.
|
|
|
|
|
|
|
|
|
|
**Symptoms:** "No formulas found" or upload fails entirely.
|
|
|
|
|
**2. Encryption.** Some PDFs are password-protected or have content restrictions that prevent any tool from reading the content stream. The PDF appears to open fine, but nothing can extract from it. Removing the password (File → Export as PDF in Preview, without the password lock) solved this.
|
|
|
|
|
|
|
|
|
|
**Why it happens:** Encrypted PDFs require a password to access their content stream. TexPixel cannot process the content of a locked file.
|
|
|
|
|
**3. Formulas stored as vector paths.** Some PDF generators draw equations as shapes rather than encoding them as characters. To a text-based extraction tool, these formulas are invisible: just abstract geometry with no characters behind it. The only way around this is to render the page as an image and run visual recognition on that instead.
|
|
|
|
|
|
|
|
|
|
**Fix:** Remove the password protection before uploading. In Preview (Mac), open the file with the password, then choose File → Export as PDF; the exported file won't have the password. In Adobe Reader, File → Print → Save as PDF can work too, provided the document's permissions allow printing.
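
A quick way to check for encryption before uploading is to look for the `/Encrypt` key, which encrypted PDFs carry in their trailer. The sketch below is a crude heuristic rather than a real PDF parser, and the function names are invented for this example; a file that merely mentions `/Encrypt` somewhere in its bytes would be flagged too.

```python
def looks_encrypted(pdf_bytes):
    """Heuristic: True if the raw PDF bytes contain an /Encrypt key."""
    # Encrypted PDFs reference an encryption dictionary via /Encrypt.
    # This is a plain byte search, so false positives are possible.
    return b"/Encrypt" in pdf_bytes


def file_looks_encrypted(path):
    """Read a file from disk and apply the same heuristic."""
    with open(path, "rb") as handle:
        return looks_encrypted(handle.read())
```

If the check returns `True`, remove the password first; if it returns `False` and extraction still fails, the cause is more likely one of the other problems.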
|
|
|
|
|
## What Actually Worked
|
|
|
|
|
|
|
|
|
|
## Problem 4: Formulas Stored as Vector Paths
|
|
|
|
|
For my professor's scanned notes, the workflow that worked:
|
|
|
|
|
|
|
|
|
|
**Symptoms:** PDF looks perfect, but extraction returns nothing or incorrect text.
|
|
|
|
|
1. Export each page as a 300 DPI PNG using Preview
|
|
|
|
|
2. Upload the PNG to TexPixel
|
|
|
|
|
3. Get clean LaTeX back in under a second
|
|
|
|
|
|
|
|
|
|
**Why it happens:** Some PDF generators (certain Word versions, some online LaTeX renderers) convert math into vector paths or embed it as images, so the formulas are essentially drawings, not characters. There's no character stream to extract.
|
|
|
|
|
Not the direct-PDF workflow I was hoping for, but reliable. The image-based pipeline doesn't care whether the original was scanned or born-digital — it just sees pixels and reads the math.
|
|
|
|
|
|
|
|
|
|
**Fix:** Export the page as a high-resolution PNG (300 DPI), then upload as an image. TexPixel's visual recognition pipeline handles vector-rendered formulas well.
|
|
|
|
|
## The Bigger Lesson
|
|
|
|
|
|
|
|
|
|
## Problem 5: Multi-Column Layout
|
|
|
|
|
PDF is a presentation format, not a data format. It's optimized for how things look, not for what they mean. Mathematical notation in particular gets mangled in transit — rendered, rasterized, path-converted — in ways that destroy the underlying structure.
|
|
|
|
|
|
|
|
|
|
**Symptoms:** Formulas from two columns are merged or interleaved in the output.
|
|
|
|
|
The most reliable signal is always the image. When in doubt, export to PNG and let visual recognition do the work.
|
|
|
|
|
|
|
|
|
|
**Why it happens:** PDF text streams don't always encode reading order correctly, especially in two-column academic papers.
|
|
|
|
|
---
|
|
|
|
|
|
|
|
|
|
**Fix:** Crop to a single column before uploading. Use any image editor to crop the page into left and right halves, then upload each separately.
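
When many pages need splitting, the crop rectangles can be computed instead of dragged by hand. The helper below is a hypothetical sketch that assumes equal-width columns. It returns boxes in `(left, upper, right, lower)` order, which is the order the Pillow library's `Image.crop` accepts, should you use Pillow for the actual cropping.

```python
def column_boxes(width, height, columns=2, overlap=0):
    """Return (left, upper, right, lower) boxes splitting a page into equal columns.

    overlap widens each box by that many pixels on its inner edges, so a
    formula that touches the gutter is not cut in half.
    """
    boxes = []
    for i in range(columns):
        left = max(0, i * width // columns - overlap)
        right = min(width, (i + 1) * width // columns + overlap)
        boxes.append((left, 0, right, height))
    return boxes


# Example: a 2480 x 3508 pixel page (roughly A4 at 300 DPI), two columns.
left_box, right_box = column_boxes(2480, 3508)
```

Pages whose columns are not equal in width, or that have a full-width title block, would need the boundaries adjusted manually.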
|
|
|
|
|
|
|
|
|
|
## Problem 6: Handwritten Annotations
|
|
|
|
|
|
|
|
|
|
**Symptoms:** Handwritten notes over a printed formula confuse the output.
|
|
|
|
|
|
|
|
|
|
**Why it happens:** TexPixel sees both the printed formula and the handwritten annotations together. It may try to recognize the annotations as part of the formula.
|
|
|
|
|
|
|
|
|
|
**Fix:** Crop tightly to just the printed formula, excluding any handwriting around it.
|
|
|
|
|
|
|
|
|
|
## Quick Diagnostic Checklist
|
|
|
|
|
|
|
|
|
|
Before uploading a problematic PDF:
|
|
|
|
|
|
|
|
|
|
- [ ] Is it a scan or a born-digital PDF?
|
|
|
|
|
- [ ] If a scan, what DPI was it scanned at?
|
|
|
|
|
- [ ] Is it password-protected?
|
|
|
|
|
- [ ] Does it have a two-column layout?
|
|
|
|
|
- [ ] Are there handwritten annotations?
|
|
|
|
|
|
|
|
|
|
Working through this list resolves the issue 90% of the time.
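
The checklist can also be written down as a small triage function. This is only a sketch of the decision logic in this post: the function name and parameters are invented for illustration and do not correspond to any TexPixel API.

```python
def suggest_fixes(is_scan, dpi=None, password_protected=False,
                  two_column=False, handwritten=False):
    """Map checklist answers to the fixes described in this post."""
    fixes = []
    if password_protected:
        fixes.append("Remove the password protection before uploading.")
    if is_scan:
        fixes.append("Export pages as PNG and upload the images instead of the PDF.")
        # dpi is None when the scan resolution is unknown.
        if dpi is not None and dpi < 300:
            fixes.append("Rescan at 300 DPI.")
    if two_column:
        fixes.append("Crop each column and upload it separately.")
    if handwritten:
        fixes.append("Crop tightly to the printed formula, excluding handwriting.")
    return fixes


for fix in suggest_fixes(is_scan=True, dpi=150, two_column=True):
    print("-", fix)
```

An empty result means none of the checklist items apply, in which case vector-path formulas are worth checking next.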
|
|
|
|
|
|
|
|
|
|
[Upload your PDF →](/app)
|
|
|
|
|
For a systematic reference on PDF types, file limits, and what TexPixel can handle, see the [PDF Extraction documentation →](/docs/pdf-extraction)
|
|
|
|
|
|