11 Commits

Author SHA1 Message Date
OleehyO
12e6bb4312 [deps] Pin transformers to 4.47 2025-04-21 12:24:03 +00:00
OleehyO
4292be86f2 [chore] Setup deps for doc build 2025-04-21 12:24:00 +00:00
OleehyO
3e0d236f6b [chore] Update 2025-04-21 12:24:00 +00:00
OleehyO
b653b9e784 [CD] Add documentation auto-deployment 2025-04-21 12:23:56 +00:00
OleehyO
b85979b258 [deps] Add sphinx extension deps 2025-04-21 08:38:06 +00:00
OleehyO
05e494af4b [docs] Fix typo 2025-04-21 08:21:16 +00:00
OleehyO
4b1b8d10de [chore] Change logo font 2025-04-21 08:20:16 +00:00
OleehyO
c8e08a22aa 🔧 Fix all ruff typo errors & test CI/CD workflow (#109)
* [chore] Fix ruff typo

* [robot] Fix welcome robot
2025-04-21 13:52:16 +08:00
OleehyO
4d3be22956 [CI] Fix deps installation 2025-04-21 05:17:12 +00:00
OleehyO
4e92a38682 [CD] Change trigger condition 2025-04-21 05:12:38 +00:00
OleehyO
3e5272a476 [chore] Update README_zh.md 2025-04-21 05:11:47 +00:00
17 changed files with 172 additions and 162 deletions

View File

@@ -11,18 +11,13 @@ jobs:
       - uses: actions/checkout@v4
         with:
           persist-credentials: false
-      - name: Set up Python
-        uses: actions/setup-python@v4
-        with:
-          python-version: '3.10'
-      - name: Install uv
-        run: pip install uv
-      - name: Install docs dependencies
-        run: uv pip install --system -e ".[docs]"
       - name: Build HTML
-        run: |
-          cd docs
-          make html
+        uses: ammaraskar/sphinx-action@7.0.0
+        with:
+          pre-build-command: |
+            apt-get update && apt-get install -y git
+            pip install uv
+            uv pip install --system . .[docs]
       - name: Upload artifacts
         uses: actions/upload-artifact@v4
         with:
@@ -33,4 +28,4 @@ jobs:
         if: github.ref == 'refs/heads/main'
         with:
           github_token: ${{ secrets.GITHUB_TOKEN }}
-          publish_dir: docs/build/html/
+          publish_dir: docs/build/html
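Net effect of this hunk: the hand-rolled setup-python / uv / `make html` steps are folded into the containerized `ammaraskar/sphinx-action@7.0.0`, with dependency installation moved into its `pre-build-command`. The added `apt-get install -y git` is presumably needed because the action's container image does not ship git — that reading is an inference from the diff, not stated in the commit message.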

View File

@@ -4,6 +4,10 @@ on:
   pull_request:
     types: [opened]
 
+permissions:
+  pull-requests: write
+  issues: write
+
 jobs:
   welcome:
     runs-on: ubuntu-latest
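Likely rationale: when the default `GITHUB_TOKEN` is read-only, a welcome bot cannot post comments, so it needs explicit `pull-requests: write` and `issues: write` permissions. That matches the "[robot] Fix welcome robot" commit above.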

View File

@@ -2,8 +2,6 @@ name: Publish to PyPI
 on:
   push:
-    branches:
-      - 'main'
     tags:
       - 'v*'

View File

@@ -28,8 +28,8 @@ jobs:
       - name: Install dependencies
         run: |
-          uv sync --group test
+          uv sync --extra test
       - name: Run tests with pytest
         run: |
-          uv run pytest tests/
+          uv run pytest -v tests/
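For context on the `--group` → `--extra` switch: in uv, `--group` pulls from PEP 735 `[dependency-groups]`, while `--extra` pulls from `[project.optional-dependencies]`. The change therefore implies the test dependencies are declared as a `test` extra in pyproject.toml, consistent with the `texteller[train]`-style extras used elsewhere in this changeset; `-v` simply makes pytest verbose.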

View File

@@ -56,37 +56,43 @@ TexTeller was trained with **80M image-formula pairs** (previous dataset can be
 </tr>
 </table>
 
-## 🔄 Change Log
+## 📮 Change Log
 
-- 📮[2024-06-06] **TexTeller3.0 released!** The training data has been increased to **80M** (**10x more than** TexTeller2.0 and also improved in data diversity). TexTeller3.0's new features:
+- [2024-06-06] **TexTeller3.0 released!** The training data has been increased to **80M** (**10x more than** TexTeller2.0 and also improved in data diversity). TexTeller3.0's new features:
   - Support scanned image, handwritten formulas, English(Chinese) mixed formulas.
   - OCR abilities in both Chinese and English for printed images.
-- 📮[2024-05-02] Support **paragraph recognition**.
-- 📮[2024-04-12] **Formula detection model** released!
-- 📮[2024-03-25] TexTeller2.0 released! The training data for TexTeller2.0 has been increased to 7.5M (15x more than TexTeller1.0 and also improved in data quality). The trained TexTeller2.0 demonstrated **superior performance** in the test set, especially in recognizing rare symbols, complex multi-line formulas, and matrices.
+- [2024-05-02] Support **paragraph recognition**.
+- [2024-04-12] **Formula detection model** released!
+- [2024-03-25] TexTeller2.0 released! The training data for TexTeller2.0 has been increased to 7.5M (15x more than TexTeller1.0 and also improved in data quality). The trained TexTeller2.0 demonstrated **superior performance** in the test set, especially in recognizing rare symbols, complex multi-line formulas, and matrices.
 
 > [Here](./assets/test.pdf) are more test images and a horizontal comparison of various recognition models.
 
 ## 🚀 Getting Started
 
-1. Install the project's dependencies:
+1. Install uv:
 
    ```bash
-   pip install texteller
+   pip install uv
    ```
 
-2. If your are using CUDA backend, you may need to install `onnxruntime-gpu`:
+2. Install the project's dependencies:
 
    ```bash
-   pip install texteller[onnxruntime-gpu]
+   uv pip install texteller
    ```
 
-3. Run the following command to start inference:
+3. If your are using CUDA backend, you may need to install `onnxruntime-gpu`:
+
+   ```bash
+   uv pip install texteller[onnxruntime-gpu]
+   ```
+
+4. Run the following command to start inference:
 
    ```bash
    texteller inference "/path/to/image.{jpg,png}"
@@ -164,7 +170,7 @@ Please setup your environment before training:
 1. Install the dependencies for training:
 
    ```bash
-   pip install texteller[train]
+   uv pip install texteller[train]
    ```
 
 2. Clone the repository:

View File

@@ -74,19 +74,25 @@ TexTeller 使用 **8千万图像-公式对** 进行训练(前代数据集可
 ## 🚀 快速开始
 
-1. 安装项目依赖
+1. 安装uv
 
    ```bash
-   pip install texteller
+   pip install uv
    ```
 
-2. 若使用 CUDA 后端,可能需要安装 `onnxruntime-gpu`
+2. 安装项目依赖
 
   ```bash
-   pip install texteller[onnxruntime-gpu]
+   uv pip install texteller
   ```
 
-3. 运行以下命令开始推理
+3. 若使用 CUDA 后端,可能需要安装 `onnxruntime-gpu`
+
+   ```bash
+   uv pip install texteller[onnxruntime-gpu]
+   ```
+
+4. 运行以下命令开始推理:
 
   ```bash
   texteller inference "/path/to/image.{jpg,png}"
@@ -96,7 +102,7 @@ TexTeller 使用 **8千万图像-公式对** 进行训练(前代数据集可
 ## 🌐 网页演示
 
-运行命令:
+命令行运行
 
 ```bash
 texteller web
@@ -152,7 +158,7 @@ print(response.text)
 TexTeller的公式检测模型在3415张中文资料图像和8272张[IBEM数据集](https://zenodo.org/records/4757865)图像上训练。
 
 <div align="center">
-    <img src="./assets/det_rec.png" width=250>
+    <img src="./det_rec.png" width=250>
 </div>
 
 我们在Python接口中提供了公式检测接口,详见[接口文档](https://oleehyo.github.io/TexTeller/)。
@@ -164,7 +170,7 @@ TexTeller的公式检测模型在3415张中文资料图像和8272张[IBEM数据
 1. 安装训练依赖:
 
   ```bash
-   pip install texteller[train]
+   uv pip install texteller[train]
   ```
 
 2. 克隆仓库:

View File

@@ -457,4 +457,4 @@
 <animate attributeName="cy" values="114.80243604193255;7.19374553530416" keyTimes="0;1" dur="1s" repeatCount="indefinite" begin="-0.6866227460985781s"></animate>
 <animate attributeName="r" values="9;0;0" keyTimes="0;0.6690048284116141;1" dur="1s" repeatCount="indefinite" begin="-0.6866227460985781s"></animate>
 </circle></g>
 </svg>

Before: 58 KiB
After: 58 KiB

View File

@@ -1,9 +1,10 @@
@@ -1,9 +1,10 @@
-<svg xmlns="http://www.w3.org/2000/svg" width="354" height="100" viewBox="0 0 354 100">
+<svg xmlns="http://www.w3.org/2000/svg" width="430" height="80" viewBox="0 0 430 80">
   <text
     x="50%"
     y="50%"
-    font-family="Arial, sans-serif"
+    font-family="monaco"
     font-size="55"
     text-anchor="middle"
     dominant-baseline="middle">

Before: 389 B
After: 377 B
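One caveat worth noting: `font-family="monaco"` names a specific installed font (Monaco ships with macOS) and drops the generic `sans-serif` fallback, so the logo text's rendering will vary by platform unless the font is embedded in the SVG or a fallback is appended.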

View File

@@ -12,64 +12,64 @@
 import os
 import sys
 
-sys.path.insert(0, os.path.abspath('../..'))
+sys.path.insert(0, os.path.abspath("../.."))
 
 # -- Project information -----------------------------------------------------
-project = 'TexTeller'
-copyright = '2025, TexTeller Team'
-author = 'TexTeller Team'
+project = "TexTeller"
+copyright = "2025, TexTeller Team"
+author = "TexTeller Team"
 
 # -- General configuration ---------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
 extensions = [
-    'myst_parser',
-    'sphinx.ext.duration',
-    'sphinx.ext.intersphinx',
-    'sphinx.ext.autosectionlabel',
-    'sphinx.ext.autodoc',
-    'sphinx.ext.viewcode',
-    'sphinx.ext.napoleon',
-    'sphinx.ext.autosummary',
-    'sphinx_copybutton',
+    "myst_parser",
+    "sphinx.ext.duration",
+    "sphinx.ext.intersphinx",
+    "sphinx.ext.autosectionlabel",
+    "sphinx.ext.autodoc",
+    "sphinx.ext.viewcode",
+    "sphinx.ext.napoleon",
+    "sphinx.ext.autosummary",
+    "sphinx_copybutton",
     # 'sphinx.ext.linkcode',
     # 'sphinxarg.ext',
-    'sphinx_design',
-    'nbsphinx',
+    "sphinx_design",
+    "nbsphinx",
 ]
 
-templates_path = ['_templates']
+templates_path = ["_templates"]
 exclude_patterns = []
 
 # Autodoc settings
-autodoc_member_order = 'bysource'
+autodoc_member_order = "bysource"
 add_module_names = False
-autoclass_content = 'both'
+autoclass_content = "both"
 autodoc_default_options = {
-    'members': True,
-    'member-order': 'bysource',
-    'undoc-members': True,
-    'show-inheritance': True,
-    'imported-members': True,
+    "members": True,
+    "member-order": "bysource",
+    "undoc-members": True,
+    "show-inheritance": True,
+    "imported-members": True,
 }
 
 # Intersphinx settings
 intersphinx_mapping = {
-    'python': ('https://docs.python.org/3', None),
-    'numpy': ('https://numpy.org/doc/stable', None),
-    'torch': ('https://pytorch.org/docs/stable', None),
-    'transformers': ('https://huggingface.co/docs/transformers/main/en', None),
+    "python": ("https://docs.python.org/3", None),
+    "numpy": ("https://numpy.org/doc/stable", None),
+    "torch": ("https://pytorch.org/docs/stable", None),
+    "transformers": ("https://huggingface.co/docs/transformers/main/en", None),
 }
 
-html_theme = 'sphinx_book_theme'
+html_theme = "sphinx_book_theme"
 html_theme_options = {
-    'repository_url': 'https://github.com/OleehyO/TexTeller',
-    'use_repository_button': True,
-    'use_issues_button': True,
-    'use_edit_page_button': True,
-    'use_download_button': True,
+    "repository_url": "https://github.com/OleehyO/TexTeller",
+    "use_repository_button": True,
+    "use_issues_button": True,
+    "use_edit_page_button": True,
+    "use_download_button": True,
 }
 
 html_logo = "../../assets/logo.svg"

View File

@@ -40,7 +40,7 @@ Converting an image to LaTeX:
 Processing a mixed text/formula image:
 
-.. code-block::python
+.. code-block:: python
 
     from texteller import (
         load_model, load_tokenizer, load_latexdet_model,

View File

@@ -3,8 +3,8 @@ import requests
 server_url = "http://127.0.0.1:8000/predict"
 
 img_path = "/path/to/your/image"
-with open(img_path, 'rb') as img:
-    files = {'img': img}
+with open(img_path, "rb") as img:
+    files = {"img": img}
     response = requests.post(server_url, files=files)
     print(response.text)

View File

@@ -22,7 +22,7 @@ dependencies = [
     "streamlit-paste-button>=0.1.2",
     "torch>=2.6.0",
     "torchvision>=0.21.0",
-    "transformers==4.45.2",
+    "transformers==4.47",
     "wget>=3.2",
     "optimum[onnxruntime]>=1.24.0",
     "python-multipart>=0.0.20",

View File

@@ -19,8 +19,8 @@ TEXT_LINE_START = ""
COMMENT_LINE_START = "% " COMMENT_LINE_START = "% "
# Opening and closing delimiters # Opening and closing delimiters
OPENS = ['{', '(', '['] OPENS = ["{", "(", "["]
CLOSES = ['}', ')', ']'] CLOSES = ["}", ")", "]"]
# Names of LaTeX verbatim environments # Names of LaTeX verbatim environments
VERBATIMS = ["verbatim", "Verbatim", "lstlisting", "minted", "comment"] VERBATIMS = ["verbatim", "Verbatim", "lstlisting", "minted", "comment"]
@@ -138,7 +138,7 @@ class Pattern:
contains_env_end=ENV_END in s, contains_env_end=ENV_END in s,
contains_item=ITEM in s, contains_item=ITEM in s,
contains_splitting=True, contains_splitting=True,
contains_comment='%' in s, contains_comment="%" in s,
) )
else: else:
return cls( return cls(
@@ -146,7 +146,7 @@ class Pattern:
contains_env_end=False, contains_env_end=False,
contains_item=False, contains_item=False,
contains_splitting=False, contains_splitting=False,
contains_comment='%' in s, contains_comment="%" in s,
) )
@@ -169,11 +169,11 @@ def find_comment_index(line: str, pattern: Pattern) -> Optional[int]:
in_command = False in_command = False
for i, c in enumerate(line): for i, c in enumerate(line):
if c == '\\': if c == "\\":
in_command = True in_command = True
elif in_command and not c.isalpha(): elif in_command and not c.isalpha():
in_command = False in_command = False
elif c == '%' and not in_command: elif c == "%" and not in_command:
return i return i
return None return None
@@ -390,10 +390,10 @@ def find_wrap_point(line: str, indent_length: int, args: Args) -> Optional[int]:
line_width += 1 line_width += 1
if line_width > wrap_boundary and wrap_point is not None: if line_width > wrap_boundary and wrap_point is not None:
break break
if c == ' ' and prev_char != '\\': if c == " " and prev_char != "\\":
if after_char: if after_char:
wrap_point = i wrap_point = i
elif c != '%': elif c != "%":
after_char = True after_char = True
prev_char = c prev_char = c
@@ -483,8 +483,8 @@ def split_line(line: str, state: State, file: str, args: Args, logs: List[Log])
if not match: if not match:
return line, "" return line, ""
prev = match.group('prev') prev = match.group("prev")
rest = match.group('env') rest = match.group("env")
if args.verbosity >= 3: # Trace level if args.verbosity >= 3: # Trace level
logs.append( logs.append(
@@ -517,8 +517,8 @@ def clean_text(text: str, args: Args) -> str:
text = RE_NEWLINES.sub(f"{LINE_END}{LINE_END}", text) text = RE_NEWLINES.sub(f"{LINE_END}{LINE_END}", text)
# Remove tabs if they shouldn't be used # Remove tabs if they shouldn't be used
if args.tabchar != '\t': if args.tabchar != "\t":
text = text.replace('\t', ' ' * args.tabsize) text = text.replace("\t", " " * args.tabsize)
# Remove trailing spaces # Remove trailing spaces
text = RE_TRAIL.sub(LINE_END, text) text = RE_TRAIL.sub(LINE_END, text)
@@ -577,7 +577,7 @@ def _format_latex(old_text: str, file: str, args: Args) -> Tuple[str, List[Log]]
new_text = "" new_text = ""
# Select the character used for indentation # Select the character used for indentation
indent_char = '\t' if args.tabchar == '\t' else ' ' indent_char = "\t" if args.tabchar == "\t" else " "
# Get any extra environments to be indented as lists # Get any extra environments to be indented as lists
lists_begin = [f"\\begin{{{l}}}" for l in args.lists] lists_begin = [f"\\begin{{{l}}}" for l in args.lists]
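Since the comment-detection loop is easy to misread in diff form, here is a self-contained reproduction of the `find_comment_index` logic as it appears in the new lines above (the real function also takes a `Pattern` fast-path argument, omitted here for brevity):

```python
from typing import Optional

def find_comment_index(line: str) -> Optional[int]:
    # A '%' starts a comment unless it is escaped (\%) or part of a command.
    in_command = False
    for i, c in enumerate(line):
        if c == "\\":
            in_command = True
        elif in_command and not c.isalpha():
            in_command = False
        elif c == "%" and not in_command:
            return i
    return None

print(find_comment_index(r"x = 1  % comment"))  # 7
print(find_comment_index(r"50\% of cases"))     # None (escaped %)
```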

View File

@@ -5,13 +5,13 @@ from .format import format_latex
 def _rm_dollar_surr(content):
-    pattern = re.compile(r'\\[a-zA-Z]+\$.*?\$|\$.*?\$')
+    pattern = re.compile(r"\\[a-zA-Z]+\$.*?\$|\$.*?\$")
     matches = pattern.findall(content)
     for match in matches:
-        if not re.match(r'\\[a-zA-Z]+', match):
-            new_match = match.strip('$')
-            content = content.replace(match, ' ' + new_match + ' ')
+        if not re.match(r"\\[a-zA-Z]+", match):
+            new_match = match.strip("$")
+            content = content.replace(match, " " + new_match + " ")
 
     return content
@@ -33,97 +33,97 @@ def to_katex(formula: str) -> str:
     """
     res = formula
     # remove mbox surrounding
-    res = change_all(res, r'\mbox ', r' ', r'{', r'}', r'', r'')
-    res = change_all(res, r'\mbox', r' ', r'{', r'}', r'', r'')
+    res = change_all(res, r"\mbox ", r" ", r"{", r"}", r"", r"")
+    res = change_all(res, r"\mbox", r" ", r"{", r"}", r"", r"")
     # remove hbox surrounding
-    res = re.sub(r'\\hbox to ?-? ?\d+\.\d+(pt)?\{', r'\\hbox{', res)
-    res = change_all(res, r'\hbox', r' ', r'{', r'}', r'', r' ')
+    res = re.sub(r"\\hbox to ?-? ?\d+\.\d+(pt)?\{", r"\\hbox{", res)
+    res = change_all(res, r"\hbox", r" ", r"{", r"}", r"", r" ")
     # remove raise surrounding
-    res = re.sub(r'\\raise ?-? ?\d+\.\d+(pt)?', r' ', res)
+    res = re.sub(r"\\raise ?-? ?\d+\.\d+(pt)?", r" ", res)
     # remove makebox
-    res = re.sub(r'\\makebox ?\[\d+\.\d+(pt)?\]\{', r'\\makebox{', res)
-    res = change_all(res, r'\makebox', r' ', r'{', r'}', r'', r' ')
+    res = re.sub(r"\\makebox ?\[\d+\.\d+(pt)?\]\{", r"\\makebox{", res)
+    res = change_all(res, r"\makebox", r" ", r"{", r"}", r"", r" ")
     # remove vbox surrounding, scalebox surrounding
-    res = re.sub(r'\\raisebox\{-? ?\d+\.\d+(pt)?\}\{', r'\\raisebox{', res)
-    res = re.sub(r'\\scalebox\{-? ?\d+\.\d+(pt)?\}\{', r'\\scalebox{', res)
-    res = change_all(res, r'\scalebox', r' ', r'{', r'}', r'', r' ')
-    res = change_all(res, r'\raisebox', r' ', r'{', r'}', r'', r' ')
-    res = change_all(res, r'\vbox', r' ', r'{', r'}', r'', r' ')
+    res = re.sub(r"\\raisebox\{-? ?\d+\.\d+(pt)?\}\{", r"\\raisebox{", res)
+    res = re.sub(r"\\scalebox\{-? ?\d+\.\d+(pt)?\}\{", r"\\scalebox{", res)
+    res = change_all(res, r"\scalebox", r" ", r"{", r"}", r"", r" ")
+    res = change_all(res, r"\raisebox", r" ", r"{", r"}", r"", r" ")
+    res = change_all(res, r"\vbox", r" ", r"{", r"}", r"", r" ")
     origin_instructions = [
-        r'\Huge',
-        r'\huge',
-        r'\LARGE',
-        r'\Large',
-        r'\large',
-        r'\normalsize',
-        r'\small',
-        r'\footnotesize',
-        r'\tiny',
+        r"\Huge",
+        r"\huge",
+        r"\LARGE",
+        r"\Large",
+        r"\large",
+        r"\normalsize",
+        r"\small",
+        r"\footnotesize",
+        r"\tiny",
     ]
     for old_ins, new_ins in zip(origin_instructions, origin_instructions):
-        res = change_all(res, old_ins, new_ins, r'$', r'$', '{', '}')
-    res = change_all(res, r'\mathbf', r'\bm', r'{', r'}', r'{', r'}')
-    res = change_all(res, r'\boldmath ', r'\bm', r'{', r'}', r'{', r'}')
-    res = change_all(res, r'\boldmath', r'\bm', r'{', r'}', r'{', r'}')
-    res = change_all(res, r'\boldmath ', r'\bm', r'$', r'$', r'{', r'}')
-    res = change_all(res, r'\boldmath', r'\bm', r'$', r'$', r'{', r'}')
-    res = change_all(res, r'\scriptsize', r'\scriptsize', r'$', r'$', r'{', r'}')
-    res = change_all(res, r'\emph', r'\textit', r'{', r'}', r'{', r'}')
-    res = change_all(res, r'\emph ', r'\textit', r'{', r'}', r'{', r'}')
+        res = change_all(res, old_ins, new_ins, r"$", r"$", "{", "}")
+    res = change_all(res, r"\mathbf", r"\bm", r"{", r"}", r"{", r"}")
+    res = change_all(res, r"\boldmath ", r"\bm", r"{", r"}", r"{", r"}")
+    res = change_all(res, r"\boldmath", r"\bm", r"{", r"}", r"{", r"}")
+    res = change_all(res, r"\boldmath ", r"\bm", r"$", r"$", r"{", r"}")
+    res = change_all(res, r"\boldmath", r"\bm", r"$", r"$", r"{", r"}")
+    res = change_all(res, r"\scriptsize", r"\scriptsize", r"$", r"$", r"{", r"}")
+    res = change_all(res, r"\emph", r"\textit", r"{", r"}", r"{", r"}")
+    res = change_all(res, r"\emph ", r"\textit", r"{", r"}", r"{", r"}")
     # remove bold command
-    res = change_all(res, r'\bm', r' ', r'{', r'}', r'', r'')
+    res = change_all(res, r"\bm", r" ", r"{", r"}", r"", r"")
     origin_instructions = [
-        r'\left',
-        r'\middle',
-        r'\right',
-        r'\big',
-        r'\Big',
-        r'\bigg',
-        r'\Bigg',
-        r'\bigl',
-        r'\Bigl',
-        r'\biggl',
-        r'\Biggl',
-        r'\bigm',
-        r'\Bigm',
-        r'\biggm',
-        r'\Biggm',
-        r'\bigr',
-        r'\Bigr',
-        r'\biggr',
-        r'\Biggr',
+        r"\left",
+        r"\middle",
+        r"\right",
+        r"\big",
+        r"\Big",
+        r"\bigg",
+        r"\Bigg",
+        r"\bigl",
+        r"\Bigl",
+        r"\biggl",
+        r"\Biggl",
+        r"\bigm",
+        r"\Bigm",
+        r"\biggm",
+        r"\Biggm",
+        r"\bigr",
+        r"\Bigr",
+        r"\biggr",
+        r"\Biggr",
     ]
     for origin_ins in origin_instructions:
-        res = change_all(res, origin_ins, origin_ins, r'{', r'}', r'', r'')
+        res = change_all(res, origin_ins, origin_ins, r"{", r"}", r"", r"")
-    res = re.sub(r'\\\[(.*?)\\\]', r'\1\\newline', res)
-    if res.endswith(r'\newline'):
+    res = re.sub(r"\\\[(.*?)\\\]", r"\1\\newline", res)
+    if res.endswith(r"\newline"):
         res = res[:-8]
     # remove multiple spaces
-    res = re.sub(r'(\\,){1,}', ' ', res)
-    res = re.sub(r'(\\!){1,}', ' ', res)
-    res = re.sub(r'(\\;){1,}', ' ', res)
-    res = re.sub(r'(\\:){1,}', ' ', res)
-    res = re.sub(r'\\vspace\{.*?}', '', res)
+    res = re.sub(r"(\\,){1,}", " ", res)
+    res = re.sub(r"(\\!){1,}", " ", res)
+    res = re.sub(r"(\\;){1,}", " ", res)
+    res = re.sub(r"(\\:){1,}", " ", res)
+    res = re.sub(r"\\vspace\{.*?}", "", res)
     # merge consecutive text
     def merge_texts(match):
         texts = match.group(0)
-        merged_content = ''.join(re.findall(r'\\text\{([^}]*)\}', texts))
-        return f'\\text{{{merged_content}}}'
+        merged_content = "".join(re.findall(r"\\text\{([^}]*)\}", texts))
+        return f"\\text{{{merged_content}}}"
-    res = re.sub(r'(\\text\{[^}]*\}\s*){2,}', merge_texts, res)
+    res = re.sub(r"(\\text\{[^}]*\}\s*){2,}", merge_texts, res)
-    res = res.replace(r'\bf ', '')
+    res = res.replace(r"\bf ", "")
     res = _rm_dollar_surr(res)
     # remove extra spaces (keeping only one)
-    res = re.sub(r' +', ' ', res)
+    res = re.sub(r" +", " ", res)
     # format latex
     res = res.strip()
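The `merge_texts` helper re-quoted above is also easy to gloss over in diff form. A self-contained sketch with the same regex and logic, showing what the consecutive-`\text{}` merge does:

```python
import re

def merge_texts(match: re.Match) -> str:
    # Join the contents of every \text{...} in the matched run.
    texts = match.group(0)
    merged_content = "".join(re.findall(r"\\text\{([^}]*)\}", texts))
    return f"\\text{{{merged_content}}}"

formula = r"\text{speed } \text{of } \text{light}"
print(re.sub(r"(\\text\{[^}]*\}\s*){2,}", merge_texts, formula))
# -> \text{speed of light}
```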

View File

@@ -1,3 +1,3 @@
 from .texteller import TexTeller
 
-__all__ = ['TexTeller']
+__all__ = ["TexTeller"]

View File

@@ -41,7 +41,7 @@ def readimgs(image_paths: list[str]) -> list[np.ndarray]:
         if image is None:
             raise ValueError(f"Image at {path} could not be read.")
         if image.dtype == np.uint16:
-            _logger.warning(f'Converting {path} to 8-bit, image may be lossy.')
+            _logger.warning(f"Converting {path} to 8-bit, image may be lossy.")
             image = cv2.convertScaleAbs(image, alpha=(255.0 / 65535.0))
 
         channels = 1 if len(image.shape) == 2 else image.shape[2]
@@ -112,7 +112,7 @@ def transform(images: List[Union[np.ndarray, Image.Image]]) -> List[torch.Tensor
     assert IMG_CHANNELS == 1, "Only support grayscale images for now"
     images = [
-        np.array(img.convert('RGB')) if isinstance(img, Image.Image) else img for img in images
+        np.array(img.convert("RGB")) if isinstance(img, Image.Image) else img for img in images
     ]
     images = [trim_white_border(image) for image in images]
     images = [general_transform_pipeline(image) for image in images]
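A quick note on the warning being re-quoted in the first hunk: the `cv2.convertScaleAbs` call scales 16-bit pixel values into the 8-bit range, which is a lossy mapping (hence the warning). A minimal check of that conversion, under the same scale factor as the diff:

```python
import cv2
import numpy as np

# A 16-bit ramp: black, mid-gray, and full white.
image = np.array([[0, 32768, 65535]], dtype=np.uint16)

# Same call as in the diff: scale 0..65535 down to 0..255.
image8 = cv2.convertScaleAbs(image, alpha=(255.0 / 65535.0))
print(image8.dtype, image8.tolist())  # uint8 [[0, 128, 255]]
```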

View File

@@ -21,7 +21,7 @@ def _change(input_str, old_inst, new_inst, old_surr_l, old_surr_r, new_surr_l, n
         j = start + 1
         escaped = False
         while j < n and count > 0:
-            if input_str[j] == '\\' and not escaped:
+            if input_str[j] == "\\" and not escaped:
                 escaped = True
                 j += 1
                 continue
@@ -71,10 +71,10 @@ def change_all(input_str, old_inst, new_inst, old_surr_l, old_surr_r, new_surr_l
     for p in pos[::-1]:
         res[p:] = list(
             _change(
-                ''.join(res[p:]), old_inst, new_inst, old_surr_l, old_surr_r, new_surr_l, new_surr_r
+                "".join(res[p:]), old_inst, new_inst, old_surr_l, old_surr_r, new_surr_l, new_surr_r
             )
         )
-    res = ''.join(res)
+    res = "".join(res)
     return res
@@ -121,7 +121,7 @@ def add_newlines(latex_str: str) -> str:
     # 4. Cleanup: Collapse multiple consecutive newlines into a single newline.
     #    This handles cases where the replacements above might have created \n\n.
-    processed_str = re.sub(r'\n{2,}', '\n', processed_str)
+    processed_str = re.sub(r"\n{2,}", "\n", processed_str)
 
     # Remove leading/trailing whitespace (including potential single newlines
     # at the very start/end resulting from the replacements) from the entire result.
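Finally, the cleanup line re-quoted in the last hunk is easy to gloss over; it collapses runs of blank lines left behind by the earlier replacements. A tiny standalone check of just that substitution:

```python
import re

s = "\\begin{array}\n\n\nx + y\n\n\\end{array}"
print(re.sub(r"\n{2,}", "\n", s))
# \begin{array}
# x + y
# \end{array}
```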