refactor: eliminate blog/docs content overlap

- Delete blog/copy-math-to-word (EN+ZH) — identical to docs/copy-to-word
- Rewrite blog/pdf-formula-issues as narrative troubleshooting story;
  operational steps now link out to docs/pdf-extraction
- Add "Further reading" cross-links: 4 docs → relevant blog posts
- Add "See also" cross-links: 3 blog posts → relevant docs

Docs = product reference; Blog = narrative/use cases/opinions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 16:52:27 +08:00
parent 76f1bde56d
commit 99e1314bf9
18 changed files with 82 additions and 242 deletions


@@ -68,4 +68,6 @@ With TexPixel: photograph it, get `A = U \Sigma V^T` in one second, paste. For m
Over a semester, you'll accumulate dozens of recognized formulas. Consider organizing them: paste each into a reference `.tex` file with a short comment. By exam time, you'll have a searchable personal formula sheet that took almost no effort to build.
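As a rough illustration of the reference file described above, a minimal sketch might look like the following. The file name, section comments, and choice of formulas are assumptions made for the example; only the `A = U \Sigma V^T` formula comes from this post.

```latex
% formulas.tex (hypothetical file name): personal formula sheet
\documentclass{article}
\usepackage{amsmath}
\begin{document}

% Linear algebra: singular value decomposition
\[ A = U \Sigma V^T \]

% Algebra: quadratic formula
\[ x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \]

\end{document}
```

Because each formula sits under a plain-text comment, a text search for "singular value" or "quadratic" jumps straight to the matching entry.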
**See also:** For supported file types, size limits, and copy options, see the [Image to LaTeX documentation →](/docs/image-to-latex)
[Start digitizing your notes →](/app)


@@ -1,73 +1,53 @@
---
title: "Why Your PDF Formulas Come Out Wrong (and How to Fix It)"
description: The most common reasons PDF formula extraction produces errors, and exactly how to fix each one
title: "I Tried to Extract Formulas from My Professor's PDF. Here's What I Learned."
description: A real-world account of what goes wrong with PDF formula extraction — and why most problems come down to one of three root causes
slug: pdf-formula-issues
date: 2026-02-15
tags: [troubleshooting, PDF, tips]
tags: [troubleshooting, PDF]
---
# Why Your PDF Formulas Come Out Wrong (and How to Fix It)
# I Tried to Extract Formulas from My Professor's PDF. Here's What I Learned.
PDF formula extraction should be simple — upload, get LaTeX, done. But sometimes the output looks garbled, symbols are missing, or the extractor says no formulas were found. Here's a breakdown of the most common causes and how to fix each one.
Last semester I was working through a 200-page PDF of lecture notes: the kind that gets scanned from printed transparencies, emailed as an attachment, and opens with every page slightly skewed. I wanted to pull the key equations into my own notes. What followed was an education in how PDFs actually store (or don't store) mathematical content.
## Problem 1: The PDF is a Scan
## The First Surprise: Not All PDFs Are the Same
**Symptoms:** Symbols look correct on screen but extraction output is garbage or empty.
I naively assumed "PDF with formulas" meant "formulas I can extract." Not true.
**Why it happens:** A scanned PDF is just a collection of images — there's no actual text layer. The text you see in your PDF reader is either from OCR performed at scan time (often poor quality) or from the image itself.
There are at least three fundamentally different kinds of PDFs floating around in academic circles, and they behave completely differently:
**Fix:** Run TexPixel's image-based pipeline instead. Export individual pages as PNG at 300 DPI using any PDF viewer (File → Export as Image in Preview, or Adobe Acrobat's Export PDF feature), then upload the PNG directly. Image-based recognition handles scans correctly; direct PDF text extraction does not.
**Born-digital PDFs** (generated from LaTeX, Word, or typesetting software) usually store math as real text: glyphs with font and position information. Extraction from these is fast and 95%+ accurate because the formula structure is essentially already there.
## Problem 2: Low-DPI Scan
**Scanned PDFs** are just photographs of printed pages packaged into a container. There's no text layer. Extraction works through image recognition, and accuracy depends entirely on scan quality. My professor's notes were this kind.
**Symptoms:** Some symbols recognized correctly, others replaced with wrong characters or dropped entirely.
**Hybrid PDFs** have a text layer added by OCR software after scanning. Quality varies wildly — sometimes great, sometimes the "text" layer is completely wrong. These are the most unpredictable.
**Why it happens:** Below about 150 DPI, strokes in small symbols like `\prime`, `\cdot`, or subscript characters become a few pixels wide — too blurry to reliably distinguish.
## The Three Root Causes of Most Failures
**Fix:** Rescan at 300 DPI. Most modern flatbed scanners default to 200 DPI; bumping to 300 produces dramatically better results without significantly increasing file size. For phone scans, use a dedicated scanner app (e.g., Adobe Scan, Microsoft Lens) which applies automatic sharpening and perspective correction.
After a lot of trial and error, I found that failed extractions almost always come back to one of three things:
## Problem 3: Password-Protected PDF
**1. Resolution.** The scan was done at 150 DPI instead of 300. At low resolution, small symbols — subscripts, primes, dots — become a few pixels wide. The model can't reliably distinguish `\prime` from a stray speck. Rescanning at 300 DPI fixed more than half my problems.
**Symptoms:** "No formulas found" or upload fails entirely.
**2. Encryption.** Some PDFs are password-protected or have content restrictions that prevent any tool from reading the content stream. The PDF appears to open fine, but nothing can extract from it. Removing the password (File → Export as PDF in Preview, without the password lock) solved this.
**Why it happens:** Encrypted PDFs require a password to access their content stream. TexPixel cannot process the content of a locked file.
**3. Formulas stored as vector paths.** Some PDF generators draw equations as shapes rather than encoding them as characters. To any extraction tool, these formulas are invisible — just abstract geometry. The only way around this is to render the page as an image and run visual recognition on that instead.
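To make the resolution point above concrete, here is a small sketch of the arithmetic behind it: a printed stroke's width in pixels is its physical width in inches multiplied by the scan DPI. The 0.2 mm stroke width is an assumed, illustrative value, not a measurement from any particular document, and `stroke_width_px` is a hypothetical helper name.

```python
MM_PER_INCH = 25.4

# Assumed stroke width of a small printed symbol (e.g. a prime or subscript).
# Illustrative only; real values depend on the font and print size.
ASSUMED_STROKE_MM = 0.2


def stroke_width_px(stroke_mm: float, dpi: int) -> float:
    """Return the width in pixels of a stroke `stroke_mm` wide scanned at `dpi`."""
    return stroke_mm / MM_PER_INCH * dpi


if __name__ == "__main__":
    for dpi in (150, 200, 300):
        width = stroke_width_px(ASSUMED_STROKE_MM, dpi)
        print(f"{dpi} DPI: {width:.2f} px per stroke")
```

Doubling the DPI doubles the pixels available per stroke, which is why moving from 150 to 300 DPI makes such a visible difference for small symbols.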
**Fix:** Remove the password protection before uploading. In Preview (Mac), open with the password, then File → Export as PDF — the exported file won't have the password. In Adobe Reader, use File → Print → Save as PDF.
## What Actually Worked
## Problem 4: Formulas Stored as Vector Paths
For my professor's scanned notes, the workflow that worked:
**Symptoms:** PDF looks perfect, but extraction returns nothing or incorrect text.
1. Export each page as a 300 DPI PNG using Preview
2. Upload the PNG to TexPixel
3. Get clean LaTeX back in under a second
**Why it happens:** Some PDF generators (certain Word versions, some online LaTeX renderers) convert math into vector paths or embedded images, so the formulas are essentially drawings, not characters. There's no character stream to extract.
Not the direct-PDF workflow I was hoping for, but reliable. The image-based pipeline doesn't care whether the original was scanned or born-digital — it just sees pixels and reads the math.
**Fix:** Export the page as a high-resolution PNG (300 DPI), then upload as an image. TexPixel's visual recognition pipeline handles vector-rendered formulas well.
## The Bigger Lesson
## Problem 5: Multi-Column Layout
PDF is a presentation format, not a data format. It's optimized for how things look, not for what they mean. Mathematical notation in particular gets mangled in transit — rendered, rasterized, path-converted — in ways that destroy the underlying structure.
**Symptoms:** Formulas from two columns are merged or interleaved in the output.
The most reliable signal is always the image. When in doubt, export to PNG and let visual recognition do the work.
**Why it happens:** PDF text streams don't always encode reading order correctly, especially in two-column academic papers.
---
**Fix:** Crop to a single column before uploading. Use any image editor to crop the page into left and right halves, then upload each separately.
## Problem 6: Handwritten Annotations
**Symptoms:** Handwritten notes over a printed formula confuse the output.
**Why it happens:** TexPixel sees both the printed formula and the handwritten annotations together. It may try to recognize the annotations as part of the formula.
**Fix:** Crop tightly to just the printed formula, excluding any handwriting around it.
## Quick Diagnostic Checklist
Before uploading a problematic PDF:
- [ ] Is it a scan or a born-digital PDF?
- [ ] If a scan, what DPI was it scanned at?
- [ ] Is it password-protected?
- [ ] Does it have a two-column layout?
- [ ] Are there handwritten annotations?
Working through this list resolves the issue 90% of the time.
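The password-protection item on the checklist can be roughly pre-checked before uploading. The sketch below is a heuristic, not how TexPixel itself detects locked files: it relies on the fact that encrypted PDFs reference an `/Encrypt` dictionary from their trailer, and simply searches the raw bytes for that key. The function names are hypothetical, and a false positive is possible if the same byte sequence happens to appear elsewhere in the file.

```python
from pathlib import Path


def looks_like_pdf(path: str) -> bool:
    """Return True if the file starts with the standard PDF header."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


def looks_encrypted(path: str) -> bool:
    """Heuristic: report True if the raw file bytes contain an /Encrypt key.

    Encrypted PDFs reference an /Encrypt dictionary from the trailer, so its
    presence is a strong hint that the file is password-protected.
    """
    return b"/Encrypt" in Path(path).read_bytes()
```

A file for which `looks_encrypted` returns `True` is a candidate for the password-removal step described under Problem 3 before it is uploaded.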
[Upload your PDF →](/app)
For a systematic reference on PDF types, file limits, and what TexPixel can handle, see the [PDF Extraction documentation →](/docs/pdf-extraction)


@@ -1,74 +0,0 @@
---
title: "Copy Math to Word Without Losing Formatting — The Right Way"
description: Three methods for getting recognized formulas into Microsoft Word, ranked by quality and effort
slug: copy-math-to-word
date: 2026-03-01
tags: [tutorial, Word, export]
---
# Copy Math to Word Without Losing Formatting — The Right Way
Most people's first instinct when they need a formula in a Word document is to take a screenshot. It works — until you need to resize the document, change the font, or edit the formula. Screenshots break. Native equations don't.
Here are three ways to get TexPixel's output into Word, from best to worst.
## Method 1: DOCX Export (Best)
The cleanest option. TexPixel converts your recognized formula into a native Word equation (OMML format) and packages it in a `.docx` file.
**How:**
1. Upload your formula image to TexPixel.
2. Click **Export** → select **DOCX**.
3. Open the downloaded file in Word.
4. Select the equation, copy, paste into your target document.
**Why it's best:** The formula is fully editable in Word's built-in equation editor. Double-click it to open the editor, change any symbol, resize it — it behaves exactly like an equation you typed yourself. It also scales correctly when you change font sizes.
**Limitation:** Each upload produces one `.docx` file. If you have many formulas to insert, you'll need to repeat the process or batch them (see below).
## Method 2: Paste LaTeX into Word's Equation Editor (Good)
Word 2019+ and Microsoft 365 support pasting LaTeX directly into equations.
**How:**
1. Get the LaTeX output from TexPixel (e.g., `x = \frac{-b \pm \sqrt{b^2-4ac}}{2a}`).
2. In Word, insert a new equation: **Insert → Equation** (or press `Alt+=`).
3. Make sure the equation box is in **LaTeX mode** (click the dropdown on the right side of the equation box → select "LaTeX").
4. Paste the LaTeX string. Press **Enter** or click outside.
Word converts the LaTeX to a rendered, editable equation.
**Why it's good:** Fast for single formulas. No file download required.
**Limitation:** Word's LaTeX parser doesn't support all LaTeX commands. Obscure or complex expressions may not render correctly. Test before relying on it for important documents.
## Method 3: Image Export (Worst, But Sometimes Necessary)
Export the formula as a PNG and insert it as an image in Word.
**When to use:** Only when you need the formula in a document being shared with someone who doesn't have Word's equation editor (e.g., older Word versions, third-party editors). Or when a complex formula doesn't render correctly via Methods 1 or 2.
**Downsides:** Not editable. Doesn't scale well. Accessibility tools can't read it.
## Handling Multiple Formulas
If you have many formulas to insert into a single document:
1. Upload each formula image and collect the LaTeX strings.
2. Open a new Word document.
3. For each formula, use the **Alt+=** method above to insert them in sequence.
4. Once all formulas are inserted, copy and paste the entire equation block into your target document.
This is faster than one DOCX export per formula.
## Google Docs
Google Docs doesn't natively support LaTeX paste. Options:
- Use the **Auto-LaTeX Equations** Google Docs add-on, which renders LaTeX strings as inline images.
- Export as DOCX and open in Google Docs (equations import as images, not editable).
- Use a tool like `mathpix-markdown-it` to convert to Markdown and render in a Markdown-compatible environment.
For serious equation-heavy work, Word or Overleaf remain better choices than Google Docs.
[Export your next formula to Word →](/app)


@@ -79,4 +79,6 @@ The real value of digitization compounds over time. A well-organized LaTeX refer
Start with the past year's notebooks. The 7-hour investment pays dividends for years.
**See also:** For PDF file limits, supported types, and export options, see the [PDF Extraction documentation →](/docs/pdf-extraction)
[Start digitizing your notes →](/app)


@@ -43,3 +43,5 @@ TexPixel works best when each image contains a single formula or a closely relat
---
With these habits, you'll see noticeably better accuracy — often 95%+ even for complex handwritten expressions.
**See also:** For a systematic breakdown of what affects accuracy (DPI, contrast, formula complexity), see the [OCR Accuracy documentation →](/docs/ocr-accuracy)