refactor: eliminate blog/docs content overlap

- Delete blog/copy-math-to-word (EN+ZH): identical to docs/copy-to-word
- Rewrite blog/pdf-formula-issues as narrative troubleshooting story;
  operational steps now link out to docs/pdf-extraction
- Add "Further reading" cross-links: 4 docs → relevant blog posts
- Add "See also" cross-links: 3 blog posts → relevant docs

Docs = product reference; Blog = narrative/use cases/opinions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 16:52:27 +08:00
parent 76f1bde56d
commit 99e1314bf9
18 changed files with 82 additions and 242 deletions
--- a/content/blog/en/2026-02-01-student-workflow.md
+++ b/content/blog/en/2026-02-01-student-workflow.md
@@ -68,4 +68,6 @@ With TexPixel: photograph it, get `A = U \Sigma V^T` in one second, paste. For m

 Over a semester, you'll accumulate dozens of recognized formulas. Consider organizing them: paste each into a reference `.tex` file with a short comment. By exam time, you'll have a searchable personal formula sheet that took almost no effort to build.

+**See also:** For supported file types, size limits, and copy options, see the [Image to LaTeX documentation →](/docs/image-to-latex)
+
 [Start digitizing your notes →](/app)
--- a/content/blog/en/2026-02-15-pdf-formula-issues.md
+++ b/content/blog/en/2026-02-15-pdf-formula-issues.md
@@ -1,73 +1,53 @@
 ---
-title: "Why Your PDF Formulas Come Out Wrong (and How to Fix It)"
-description: The most common reasons PDF formula extraction produces errors, and exactly how to fix each one
+title: "I Tried to Extract Formulas from My Professor's PDF. Here's What I Learned."
+description: A real-world account of what goes wrong with PDF formula extraction — and why most problems come down to one of three root causes
 slug: pdf-formula-issues
 date: 2026-02-15
-tags: [troubleshooting, PDF, tips]
+tags: [troubleshooting, PDF]
 ---

-# Why Your PDF Formulas Come Out Wrong (and How to Fix It)
+# I Tried to Extract Formulas from My Professor's PDF. Here's What I Learned.

-PDF formula extraction should be simple — upload, get LaTeX, done. But sometimes the output looks garbled, symbols are missing, or the extractor says no formulas were found. Here's a breakdown of the most common causes and how to fix each one.
+Last semester I was working through a 200-page lecture notes PDF — the kind that gets scanned from printed transparencies, emailed as a file attachment, and opens with a slightly-off angle on every page. I wanted to pull the key equations into my own notes. What followed was an education in how PDFs actually store (or don't store) mathematical content.

-## Problem 1: The PDF is a Scan
+## The First Surprise: Not All PDFs Are the Same

-**Symptoms:** Symbols look correct on screen but extraction output is garbage or empty.
+I naively assumed "PDF with formulas" meant "formulas I can extract." Not true.

-**Why it happens:** A scanned PDF is just a collection of images — there's no actual text layer. The text you see in your PDF reader is either from OCR performed at scan time (often poor quality) or from the image itself.
+There are at least three fundamentally different kinds of PDFs floating around in academic circles, and they behave completely differently:

-**Fix:** Run TexPixel's image-based pipeline instead. Export individual pages as PNG at 300 DPI using any PDF viewer (File → Export as Image in Preview, or Adobe Acrobat's Export PDF feature), then upload the PNG directly. Image-based recognition handles scans correctly; direct PDF text extraction does not.
+**Born-digital PDFs** (generated from LaTeX, Word, or typesetting software) contain actual vector math. Extraction from these is fast and 95%+ accurate — the formula structure is essentially already there.

-## Problem 2: Low-DPI Scan
+**Scanned PDFs** are just photographs of printed pages packaged into a container. There's no text layer. Extraction works through image recognition, and accuracy depends entirely on scan quality. My professor's notes were this kind.

-**Symptoms:** Some symbols recognized correctly, others replaced with wrong characters or dropped entirely.
+**Hybrid PDFs** have a text layer added by OCR software after scanning. Quality varies wildly — sometimes great, sometimes the "text" layer is completely wrong. These are the most unpredictable.

-**Why it happens:** Below about 150 DPI, strokes in small symbols like `\prime`, `\cdot`, or subscript characters become a few pixels wide — too blurry to reliably distinguish.
+## The Three Root Causes of Most Failures

-**Fix:** Rescan at 300 DPI. Most modern flatbed scanners default to 200 DPI; bumping to 300 produces dramatically better results without significantly increasing file size. For phone scans, use a dedicated scanner app (e.g., Adobe Scan, Microsoft Lens) which applies automatic sharpening and perspective correction.
+After a lot of trial and error, I found that failed extractions almost always come back to one of three things:

-## Problem 3: Password-Protected PDF
+**1. Resolution.** The scan was done at 150 DPI instead of 300. At low resolution, small symbols — subscripts, primes, dots — become a few pixels wide. The model can't reliably distinguish `\prime` from a stray speck. Rescanning at 300 DPI fixed more than half my problems.

-**Symptoms:** "No formulas found" or upload fails entirely.
+**2. Encryption.** Some PDFs are password-protected or have content restrictions that prevent any tool from reading the content stream. The PDF appears to open fine, but nothing can extract from it. Removing the password (File → Export as PDF in Preview, without the password lock) solved this.

-**Why it happens:** Encrypted PDFs require a password to access their content stream. TexPixel cannot process the content of a locked file.
+**3. Formulas stored as vector paths.** Some PDF generators draw equations as shapes rather than encoding them as characters. To any extraction tool, these formulas are invisible — just abstract geometry. The only way around this is to render the page as an image and run visual recognition on that instead.

-**Fix:** Remove the password protection before uploading. In Preview (Mac), open with the password, then File → Export as PDF — the exported file won't have the password. In Adobe Reader, use File → Print → Save as PDF.
+## What Actually Worked

-## Problem 4: Formulas Stored as Vector Paths
+For my professor's scanned notes, the workflow that worked:

-**Symptoms:** PDF looks perfect, but extraction returns nothing or incorrect text.
+1. Export each page as a 300 DPI PNG using Preview
+2. Upload the PNG to TexPixel
+3. Get clean LaTeX back in under a second

-**Why it happens:** Some PDF generators (certain Word versions, some online LaTeX renderers) rasterize or vectorize math into paths — the formulas are essentially drawings, not characters. There's no character stream to extract.
+Not the direct-PDF workflow I was hoping for, but reliable. The image-based pipeline doesn't care whether the original was scanned or born-digital — it just sees pixels and reads the math.

-**Fix:** Export the page as a high-resolution PNG (300 DPI), then upload as an image. TexPixel's visual recognition pipeline handles vector-rendered formulas well.
+## The Bigger Lesson

-## Problem 5: Multi-Column Layout
+PDF is a presentation format, not a data format. It's optimized for how things look, not for what they mean. Mathematical notation in particular gets mangled in transit — rendered, rasterized, path-converted — in ways that destroy the underlying structure.

-**Symptoms:** Formulas from two columns are merged or interleaved in the output.
+The most reliable signal is always the image. When in doubt, export to PNG and let visual recognition do the work.

-**Why it happens:** PDF text streams don't always encode reading order correctly, especially in two-column academic papers.
+---

-**Fix:** Crop to a single column before uploading. Use any image editor to crop the page into left and right halves, then upload each separately.
-
-## Problem 6: Handwritten Annotations
-
-**Symptoms:** Handwritten notes over a printed formula confuse the output.
-
-**Why it happens:** TexPixel sees both the printed formula and the handwritten annotations together. It may try to recognize the annotations as part of the formula.
-
-**Fix:** Crop tightly to just the printed formula, excluding any handwriting around it.
-
-## Quick Diagnostic Checklist
-
-Before uploading a problematic PDF:
-
-- [ ] Is it a scan or a born-digital PDF?
-- [ ] If a scan, what DPI was it scanned at?
-- [ ] Is it password-protected?
-- [ ] Does it have a two-column layout?
-- [ ] Are there handwritten annotations?
-
-Working through this list resolves the issue 90% of the time.
-
-[Upload your PDF →](/app)
+For a systematic reference on PDF types, file limits, and what TexPixel can handle, see the [PDF Extraction documentation →](/docs/pdf-extraction)
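
Note on the resolution claim in the rewritten post ("small symbols ... become a few pixels wide"): the arithmetic can be checked with a short sketch. The feature sizes below are hypothetical values picked for illustration, not measurements from the post, and `pixels_spanned` is an illustrative helper name; the only fixed fact used is that one typographic point is 1/72 inch.

```python
POINTS_PER_INCH = 72  # one typographic (PDF/PostScript) point is 1/72 inch


def pixels_spanned(size_pt: float, dpi: int) -> float:
    """Return how many pixels a feature of size_pt points covers at a scan DPI."""
    return size_pt / POINTS_PER_INCH * dpi


# Hypothetical feature sizes in points, for illustration only.
features_pt = {
    "prime mark height": 2.0,
    "subscript height": 3.5,
    "body letter height": 7.0,
}

for name, size_pt in features_pt.items():
    low = pixels_spanned(size_pt, 150)
    high = pixels_spanned(size_pt, 300)
    print(f"{name}: {low:.1f} px at 150 DPI, {high:.1f} px at 300 DPI")
```

Doubling the DPI doubles every linear dimension, so under these assumed sizes a mark that covers roughly four pixels at 150 DPI covers roughly eight at 300 DPI, which is consistent with the post's advice to rescan at 300 DPI.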
--- a/content/blog/en/2026-03-01-copy-math-to-word.md
+++ b/content/blog/en/2026-03-01-copy-math-to-word.md
@@ -1,74 +0,0 @@
----
-title: "Copy Math to Word Without Losing Formatting — The Right Way"
-description: Three methods for getting recognized formulas into Microsoft Word, ranked by quality and effort
-slug: copy-math-to-word
-date: 2026-03-01
-tags: [tutorial, Word, export]
----
-
-# Copy Math to Word Without Losing Formatting — The Right Way
-
-Most people's first instinct when they need a formula in a Word document is to take a screenshot. It works — until you need to resize the document, change the font, or edit the formula. Screenshots break. Native equations don't.
-
-Here are three ways to get TexPixel's output into Word, from best to worst.
-
-## Method 1: DOCX Export (Best)
-
-The cleanest option. TexPixel converts your recognized formula into a native Word equation (OMML format) and packages it in a `.docx` file.
-
-**How:**
-1. Upload your formula image to TexPixel.
-2. Click **Export** → select **DOCX**.
-3. Open the downloaded file in Word.
-4. Select the equation, copy, paste into your target document.
-
-**Why it's best:** The formula is fully editable in Word's built-in equation editor. Double-click it to open the editor, change any symbol, resize it — it behaves exactly like an equation you typed yourself. It also scales correctly when you change font sizes.
-
-**Limitation:** Each upload produces one `.docx` file. If you have many formulas to insert, you'll need to repeat the process or batch them (see below).
-
-## Method 2: Paste LaTeX into Word's Equation Editor (Good)
-
-Word 2019+ and Microsoft 365 support pasting LaTeX directly into equations.
-
-**How:**
-1. Get the LaTeX output from TexPixel (e.g., `x = \frac{-b \pm \sqrt{b^2-4ac}}{2a}`).
-2. In Word, insert a new equation: **Insert → Equation** (or press `Alt+=`).
-3. Make sure the equation box is in **LaTeX mode** (click the dropdown on the right side of the equation box → select "LaTeX").
-4. Paste the LaTeX string. Press **Enter** or click outside.
-
-Word converts the LaTeX to a rendered, editable equation.
-
-**Why it's good:** Fast for single formulas. No file download required.
-
-**Limitation:** Word's LaTeX parser doesn't support all LaTeX commands. Obscure or complex expressions may not render correctly. Test before relying on it for important documents.
-
-## Method 3: Image Export (Worst, But Sometimes Necessary)
-
-Export the formula as a PNG and insert it as an image in Word.
-
-**When to use:** Only when you need the formula in a document being shared with someone who doesn't have Word's equation editor (e.g., older Word versions, third-party editors). Or when a complex formula doesn't render correctly via Methods 1 or 2.
-
-**Downsides:** Not editable. Doesn't scale well. Accessibility tools can't read it.
-
-## Handling Multiple Formulas
-
-If you have many formulas to insert into a single document:
-
-1. Upload each formula image and collect the LaTeX strings.
-2. Open a new Word document.
-3. For each formula, use the **Alt+=** method above to insert them in sequence.
-4. Once all formulas are inserted, copy and paste the entire equation block into your target document.
-
-This is faster than one DOCX export per formula.
-
-## Google Docs
-
-Google Docs doesn't natively support LaTeX paste. Options:
-
-- Use the **Auto-LaTeX Equations** Google Docs add-on, which renders LaTeX strings as inline images.
-- Export as DOCX and open in Google Docs (equations import as images, not editable).
-- Use a tool like `mathpix-markdown-it` to convert to Markdown and render in a Markdown-compatible environment.
-
-For serious equation-heavy work, Word or Overleaf remain better choices than Google Docs.
-
-[Export your next formula to Word →](/app)
--- a/content/blog/en/2026-03-08-researcher-workflow.md
+++ b/content/blog/en/2026-03-08-researcher-workflow.md
@@ -79,4 +79,6 @@ The real value of digitization compounds over time. A well-organized LaTeX refer

 Start with the past year's notebooks. The 7-hour investment pays dividends for years.

+**See also:** For PDF file limits, supported types, and export options, see the [PDF Extraction documentation →](/docs/pdf-extraction)
+
 [Start digitizing your notes →](/app)
--- a/content/blog/en/2026-03-20-handwriting-tips.md
+++ b/content/blog/en/2026-03-20-handwriting-tips.md
@@ -43,3 +43,5 @@ TexPixel works best when each image contains a single formula or a closely relat
 ---

 With these habits, you'll see noticeably better accuracy — often 95%+ even for complex handwritten expressions.
+
+**See also:** For a systematic breakdown of what affects accuracy (DPI, contrast, formula complexity), see the [OCR Accuracy documentation →](/docs/ocr-accuracy)