Compare commits
11 Commits

| Author | SHA1 | Date |
|---|---|---|
| | 12e6bb4312 | |
| | 4292be86f2 | |
| | 3e0d236f6b | |
| | b653b9e784 | |
| | b85979b258 | |
| | 05e494af4b | |
| | 4b1b8d10de | |
| | c8e08a22aa | |
| | 4d3be22956 | |
| | 4e92a38682 | |
| | 3e5272a476 | |
.github/workflows/deploy-doc.yml (vendored, 19 changes)

@@ -11,18 +11,13 @@ jobs:
       - uses: actions/checkout@v4
         with:
           persist-credentials: false
-      - name: Set up Python
-        uses: actions/setup-python@v4
-        with:
-          python-version: '3.10'
-      - name: Install uv
-        run: pip install uv
-      - name: Install docs dependencies
-        run: uv pip install --system -e ".[docs]"
       - name: Build HTML
-        run: |
-          cd docs
-          make html
+        uses: ammaraskar/sphinx-action@7.0.0
+        with:
+          pre-build-command: |
+            apt-get update && apt-get install -y git
+            pip install uv
+            uv pip install --system . .[docs]
       - name: Upload artifacts
         uses: actions/upload-artifact@v4
         with:
@@ -33,4 +28,4 @@ jobs:
         if: github.ref == 'refs/heads/main'
         with:
           github_token: ${{ secrets.GITHUB_TOKEN }}
-          publish_dir: docs/build/html/
+          publish_dir: docs/build/html
.github/workflows/pr-welcome.yml (vendored, 4 changes)

@@ -4,6 +4,10 @@ on:
   pull_request:
     types: [opened]
 
+permissions:
+  pull-requests: write
+  issues: write
+
 jobs:
   welcome:
     runs-on: ubuntu-latest
.github/workflows/publish.yml (vendored, 2 changes)

@@ -2,8 +2,6 @@ name: Publish to PyPI
 
 on:
   push:
-    branches:
-      - 'main'
     tags:
       - 'v*'
 
.github/workflows/test.yaml (vendored, 4 changes)

@@ -28,8 +28,8 @@ jobs:
 
       - name: Install dependencies
         run: |
-          uv sync --group test
+          uv sync --extra test
 
      - name: Run tests with pytest
         run: |
-          uv run pytest tests/
+          uv run pytest -v tests/
README.md (28 changes)

@@ -56,37 +56,43 @@ TexTeller was trained with **80M image-formula pairs** (previous dataset can be
 </tr>
 </table>
 
-## 🔄 Change Log
+## 📮 Change Log
 
-- 📮[2024-06-06] **TexTeller3.0 released!** The training data has been increased to **80M** (**10x more than** TexTeller2.0 and also improved in data diversity). TexTeller3.0's new features:
+- [2024-06-06] **TexTeller3.0 released!** The training data has been increased to **80M** (**10x more than** TexTeller2.0 and also improved in data diversity). TexTeller3.0's new features:
 
   - Support scanned image, handwritten formulas, English(Chinese) mixed formulas.
 
   - OCR abilities in both Chinese and English for printed images.
 
-- 📮[2024-05-02] Support **paragraph recognition**.
+- [2024-05-02] Support **paragraph recognition**.
 
-- 📮[2024-04-12] **Formula detection model** released!
+- [2024-04-12] **Formula detection model** released!
 
-- 📮[2024-03-25] TexTeller2.0 released! The training data for TexTeller2.0 has been increased to 7.5M (15x more than TexTeller1.0 and also improved in data quality). The trained TexTeller2.0 demonstrated **superior performance** in the test set, especially in recognizing rare symbols, complex multi-line formulas, and matrices.
+- [2024-03-25] TexTeller2.0 released! The training data for TexTeller2.0 has been increased to 7.5M (15x more than TexTeller1.0 and also improved in data quality). The trained TexTeller2.0 demonstrated **superior performance** in the test set, especially in recognizing rare symbols, complex multi-line formulas, and matrices.
 
 > [Here](./assets/test.pdf) are more test images and a horizontal comparison of various recognition models.
 
 ## 🚀 Getting Started
 
-1. Install the project's dependencies:
+1. Install uv:
 
    ```bash
-   pip install texteller
+   pip install uv
    ```
 
-2. If your are using CUDA backend, you may need to install `onnxruntime-gpu`:
+2. Install the project's dependencies:
 
    ```bash
-   pip install texteller[onnxruntime-gpu]
+   uv pip install texteller
    ```
 
-3. Run the following command to start inference:
+3. If your are using CUDA backend, you may need to install `onnxruntime-gpu`:
 
+   ```bash
+   uv pip install texteller[onnxruntime-gpu]
+   ```
+
+4. Run the following command to start inference:
+
    ```bash
    texteller inference "/path/to/image.{jpg,png}"
@@ -164,7 +170,7 @@ Please setup your environment before training:
 1. Install the dependencies for training:
 
    ```bash
-   pip install texteller[train]
+   uv pip install texteller[train]
   ```
 
 2. Clone the repository:
@@ -74,19 +74,25 @@ TexTeller was trained with **80M image-formula pairs** (the previous dataset can
 
 ## 🚀 Quick Start
 
-1. Install the project's dependencies:
+1. Install uv:
 
   ```bash
-   pip install texteller
+   pip install uv
   ```
 
-2. If you are using the CUDA backend, you may need to install `onnxruntime-gpu`:
+2. Install the project's dependencies:
 
   ```bash
-   pip install texteller[onnxruntime-gpu]
+   uv pip install texteller
   ```
 
-3. Run the following command to start inference:
+3. If you are using the CUDA backend, you may need to install `onnxruntime-gpu`:
 
+   ```bash
+   uv pip install texteller[onnxruntime-gpu]
+   ```
+
+4. Run the following command to start inference:
+
   ```bash
   texteller inference "/path/to/image.{jpg,png}"
@@ -96,7 +102,7 @@ TexTeller was trained with **80M image-formula pairs** (the previous dataset can
 
 ## 🌐 Web Demo
 
-Run the command:
+Run from the command line:
 
 ```bash
 texteller web
@@ -152,7 +158,7 @@ print(response.text)
 TexTeller's formula detection model was trained on 3,415 images of Chinese-language materials and 8,272 images from the [IBEM dataset](https://zenodo.org/records/4757865).
 
 <div align="center">
-  <img src="./assets/det_rec.png" width=250>
+  <img src="./det_rec.png" width=250>
 </div>
 
 A formula detection interface is provided in the Python API; see the [API documentation](https://oleehyo.github.io/TexTeller/) for details.
@@ -164,7 +170,7 @@ TexTeller's formula detection model was trained on 3,415 images of Chinese-langu
 1. Install the training dependencies:
 
   ```bash
-   pip install texteller[train]
+   uv pip install texteller[train]
   ```
 
 2. Clone the repository:
@@ -1,9 +1,10 @@
-<svg xmlns="http://www.w3.org/2000/svg" width="354" height="100" viewBox="0 0 354 100">
+<svg xmlns="http://www.w3.org/2000/svg" width="430" height="80" viewBox="0 0 430 80">
 
   <text
     x="50%"
     y="50%"
-    font-family="Arial, sans-serif"
+    font-family="monaco"
     font-size="55"
     text-anchor="middle"
     dominant-baseline="middle">

Before: Size: 389 B | After: Size: 377 B
@@ -12,64 +12,64 @@
 import os
 import sys
 
-sys.path.insert(0, os.path.abspath('../..'))
+sys.path.insert(0, os.path.abspath("../.."))
 
 # -- Project information -----------------------------------------------------
 
-project = 'TexTeller'
-copyright = '2025, TexTeller Team'
-author = 'TexTeller Team'
+project = "TexTeller"
+copyright = "2025, TexTeller Team"
+author = "TexTeller Team"
 
 # -- General configuration ---------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
 
 extensions = [
-    'myst_parser',
-    'sphinx.ext.duration',
-    'sphinx.ext.intersphinx',
-    'sphinx.ext.autosectionlabel',
-    'sphinx.ext.autodoc',
-    'sphinx.ext.viewcode',
-    'sphinx.ext.napoleon',
-    'sphinx.ext.autosummary',
-    'sphinx_copybutton',
+    "myst_parser",
+    "sphinx.ext.duration",
+    "sphinx.ext.intersphinx",
+    "sphinx.ext.autosectionlabel",
+    "sphinx.ext.autodoc",
+    "sphinx.ext.viewcode",
+    "sphinx.ext.napoleon",
+    "sphinx.ext.autosummary",
+    "sphinx_copybutton",
     # 'sphinx.ext.linkcode',
     # 'sphinxarg.ext',
-    'sphinx_design',
-    'nbsphinx',
+    "sphinx_design",
+    "nbsphinx",
 ]
 
-templates_path = ['_templates']
+templates_path = ["_templates"]
 exclude_patterns = []
 
 # Autodoc settings
-autodoc_member_order = 'bysource'
+autodoc_member_order = "bysource"
 add_module_names = False
-autoclass_content = 'both'
+autoclass_content = "both"
 autodoc_default_options = {
-    'members': True,
-    'member-order': 'bysource',
-    'undoc-members': True,
-    'show-inheritance': True,
-    'imported-members': True,
+    "members": True,
+    "member-order": "bysource",
+    "undoc-members": True,
+    "show-inheritance": True,
+    "imported-members": True,
 }
 
 # Intersphinx settings
 intersphinx_mapping = {
-    'python': ('https://docs.python.org/3', None),
-    'numpy': ('https://numpy.org/doc/stable', None),
-    'torch': ('https://pytorch.org/docs/stable', None),
-    'transformers': ('https://huggingface.co/docs/transformers/main/en', None),
+    "python": ("https://docs.python.org/3", None),
+    "numpy": ("https://numpy.org/doc/stable", None),
+    "torch": ("https://pytorch.org/docs/stable", None),
+    "transformers": ("https://huggingface.co/docs/transformers/main/en", None),
 }
 
-html_theme = 'sphinx_book_theme'
+html_theme = "sphinx_book_theme"
 
 html_theme_options = {
-    'repository_url': 'https://github.com/OleehyO/TexTeller',
-    'use_repository_button': True,
-    'use_issues_button': True,
-    'use_edit_page_button': True,
-    'use_download_button': True,
+    "repository_url": "https://github.com/OleehyO/TexTeller",
+    "use_repository_button": True,
+    "use_issues_button": True,
+    "use_edit_page_button": True,
+    "use_download_button": True,
 }
 
 html_logo = "../../assets/logo.svg"
@@ -3,8 +3,8 @@ import requests
 server_url = "http://127.0.0.1:8000/predict"
 
 img_path = "/path/to/your/image"
-with open(img_path, 'rb') as img:
-    files = {'img': img}
+with open(img_path, "rb") as img:
+    files = {"img": img}
     response = requests.post(server_url, files=files)
 
 print(response.text)
@@ -22,7 +22,7 @@ dependencies = [
     "streamlit-paste-button>=0.1.2",
     "torch>=2.6.0",
     "torchvision>=0.21.0",
-    "transformers==4.45.2",
+    "transformers==4.47",
     "wget>=3.2",
     "optimum[onnxruntime]>=1.24.0",
     "python-multipart>=0.0.20",
@@ -19,8 +19,8 @@ TEXT_LINE_START = ""
 COMMENT_LINE_START = "% "
 
 # Opening and closing delimiters
-OPENS = ['{', '(', '[']
-CLOSES = ['}', ')', ']']
+OPENS = ["{", "(", "["]
+CLOSES = ["}", ")", "]"]
 
 # Names of LaTeX verbatim environments
 VERBATIMS = ["verbatim", "Verbatim", "lstlisting", "minted", "comment"]
@@ -138,7 +138,7 @@ class Pattern:
             contains_env_end=ENV_END in s,
             contains_item=ITEM in s,
             contains_splitting=True,
-            contains_comment='%' in s,
+            contains_comment="%" in s,
         )
     else:
         return cls(
@@ -146,7 +146,7 @@ class Pattern:
             contains_env_end=False,
             contains_item=False,
             contains_splitting=False,
-            contains_comment='%' in s,
+            contains_comment="%" in s,
         )
@@ -169,11 +169,11 @@ def find_comment_index(line: str, pattern: Pattern) -> Optional[int]:
 
     in_command = False
     for i, c in enumerate(line):
-        if c == '\\':
+        if c == "\\":
             in_command = True
         elif in_command and not c.isalpha():
             in_command = False
-        elif c == '%' and not in_command:
+        elif c == "%" and not in_command:
             return i
 
     return None
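The loop in this hunk is self-contained (the quoting change does not alter behavior), so it can be exercised standalone. A minimal sketch; the function's `pattern` parameter is unused in the lines shown and is dropped here as an assumption:

```python
from typing import Optional


def find_comment_index(line: str) -> Optional[int]:
    # Walk the line while tracking whether we are inside a backslash
    # command, so an escaped percent sign (\%) is not treated as the
    # start of a LaTeX comment.
    in_command = False
    for i, c in enumerate(line):
        if c == "\\":
            in_command = True
        elif in_command and not c.isalpha():
            in_command = False
        elif c == "%" and not in_command:
            return i
    return None


print(find_comment_index(r"a \% b % comment"))  # 7 (the bare %)
```

Note that `\%` at index 2-3 is skipped: the backslash sets `in_command`, and the following `%` merely clears it.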
@@ -390,10 +390,10 @@ def find_wrap_point(line: str, indent_length: int, args: Args) -> Optional[int]:
         line_width += 1
         if line_width > wrap_boundary and wrap_point is not None:
             break
-        if c == ' ' and prev_char != '\\':
+        if c == " " and prev_char != "\\":
             if after_char:
                 wrap_point = i
-        elif c != '%':
+        elif c != "%":
             after_char = True
         prev_char = c
 
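The surrounding initialization is not shown in the hunk, so the following is a hypothetical standalone version under assumed starting values (`wrap_point = None`, `line_width = 0`, `after_char = False`): it remembers the last breakable space (one not escaped by a backslash, and not before any visible character) and stops once the width passes the wrap boundary.

```python
from typing import Optional


def find_wrap_point(line: str, wrap_boundary: int) -> Optional[int]:
    # Hypothetical standalone sketch of the loop in the hunk above;
    # the real function takes indent_length and an Args object instead.
    wrap_point = None
    line_width = 0
    after_char = False
    prev_char = ""
    for i, c in enumerate(line):
        line_width += 1
        if line_width > wrap_boundary and wrap_point is not None:
            break
        if c == " " and prev_char != "\\":
            if after_char:
                wrap_point = i
        elif c != "%":
            after_char = True
        prev_char = c
    return wrap_point


print(find_wrap_point("hello world foo", 8))  # 5 (space after "hello")
```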
@@ -483,8 +483,8 @@ def split_line(line: str, state: State, file: str, args: Args, logs: List[Log])
     if not match:
         return line, ""
 
-    prev = match.group('prev')
-    rest = match.group('env')
+    prev = match.group("prev")
+    rest = match.group("env")
 
     if args.verbosity >= 3:  # Trace level
         logs.append(
@@ -517,8 +517,8 @@ def clean_text(text: str, args: Args) -> str:
     text = RE_NEWLINES.sub(f"{LINE_END}{LINE_END}", text)
 
     # Remove tabs if they shouldn't be used
-    if args.tabchar != '\t':
-        text = text.replace('\t', ' ' * args.tabsize)
+    if args.tabchar != "\t":
+        text = text.replace("\t", " " * args.tabsize)
 
     # Remove trailing spaces
     text = RE_TRAIL.sub(LINE_END, text)
@@ -577,7 +577,7 @@ def _format_latex(old_text: str, file: str, args: Args) -> Tuple[str, List[Log]]
     new_text = ""
 
     # Select the character used for indentation
-    indent_char = '\t' if args.tabchar == '\t' else ' '
+    indent_char = "\t" if args.tabchar == "\t" else " "
 
     # Get any extra environments to be indented as lists
     lists_begin = [f"\\begin{{{l}}}" for l in args.lists]
@@ -5,13 +5,13 @@ from .format import format_latex
 
 
 def _rm_dollar_surr(content):
-    pattern = re.compile(r'\\[a-zA-Z]+\$.*?\$|\$.*?\$')
+    pattern = re.compile(r"\\[a-zA-Z]+\$.*?\$|\$.*?\$")
     matches = pattern.findall(content)
 
     for match in matches:
-        if not re.match(r'\\[a-zA-Z]+', match):
-            new_match = match.strip('$')
-            content = content.replace(match, ' ' + new_match + ' ')
+        if not re.match(r"\\[a-zA-Z]+", match):
+            new_match = match.strip("$")
+            content = content.replace(match, " " + new_match + " ")
 
     return content
 
@@ -33,97 +33,97 @@ def to_katex(formula: str) -> str:
     """
     res = formula
     # remove mbox surrounding
-    res = change_all(res, r'\mbox ', r' ', r'{', r'}', r'', r'')
-    res = change_all(res, r'\mbox', r' ', r'{', r'}', r'', r'')
+    res = change_all(res, r"\mbox ", r" ", r"{", r"}", r"", r"")
+    res = change_all(res, r"\mbox", r" ", r"{", r"}", r"", r"")
     # remove hbox surrounding
-    res = re.sub(r'\\hbox to ?-? ?\d+\.\d+(pt)?\{', r'\\hbox{', res)
-    res = change_all(res, r'\hbox', r' ', r'{', r'}', r'', r' ')
+    res = re.sub(r"\\hbox to ?-? ?\d+\.\d+(pt)?\{", r"\\hbox{", res)
+    res = change_all(res, r"\hbox", r" ", r"{", r"}", r"", r" ")
     # remove raise surrounding
-    res = re.sub(r'\\raise ?-? ?\d+\.\d+(pt)?', r' ', res)
+    res = re.sub(r"\\raise ?-? ?\d+\.\d+(pt)?", r" ", res)
     # remove makebox
-    res = re.sub(r'\\makebox ?\[\d+\.\d+(pt)?\]\{', r'\\makebox{', res)
-    res = change_all(res, r'\makebox', r' ', r'{', r'}', r'', r' ')
+    res = re.sub(r"\\makebox ?\[\d+\.\d+(pt)?\]\{", r"\\makebox{", res)
+    res = change_all(res, r"\makebox", r" ", r"{", r"}", r"", r" ")
     # remove vbox surrounding, scalebox surrounding
-    res = re.sub(r'\\raisebox\{-? ?\d+\.\d+(pt)?\}\{', r'\\raisebox{', res)
-    res = re.sub(r'\\scalebox\{-? ?\d+\.\d+(pt)?\}\{', r'\\scalebox{', res)
-    res = change_all(res, r'\scalebox', r' ', r'{', r'}', r'', r' ')
-    res = change_all(res, r'\raisebox', r' ', r'{', r'}', r'', r' ')
-    res = change_all(res, r'\vbox', r' ', r'{', r'}', r'', r' ')
+    res = re.sub(r"\\raisebox\{-? ?\d+\.\d+(pt)?\}\{", r"\\raisebox{", res)
+    res = re.sub(r"\\scalebox\{-? ?\d+\.\d+(pt)?\}\{", r"\\scalebox{", res)
+    res = change_all(res, r"\scalebox", r" ", r"{", r"}", r"", r" ")
+    res = change_all(res, r"\raisebox", r" ", r"{", r"}", r"", r" ")
+    res = change_all(res, r"\vbox", r" ", r"{", r"}", r"", r" ")
 
     origin_instructions = [
-        r'\Huge',
-        r'\huge',
-        r'\LARGE',
-        r'\Large',
-        r'\large',
-        r'\normalsize',
-        r'\small',
-        r'\footnotesize',
-        r'\tiny',
+        r"\Huge",
+        r"\huge",
+        r"\LARGE",
+        r"\Large",
+        r"\large",
+        r"\normalsize",
+        r"\small",
+        r"\footnotesize",
+        r"\tiny",
     ]
     for old_ins, new_ins in zip(origin_instructions, origin_instructions):
-        res = change_all(res, old_ins, new_ins, r'$', r'$', '{', '}')
-    res = change_all(res, r'\mathbf', r'\bm', r'{', r'}', r'{', r'}')
-    res = change_all(res, r'\boldmath ', r'\bm', r'{', r'}', r'{', r'}')
-    res = change_all(res, r'\boldmath', r'\bm', r'{', r'}', r'{', r'}')
-    res = change_all(res, r'\boldmath ', r'\bm', r'$', r'$', r'{', r'}')
-    res = change_all(res, r'\boldmath', r'\bm', r'$', r'$', r'{', r'}')
-    res = change_all(res, r'\scriptsize', r'\scriptsize', r'$', r'$', r'{', r'}')
-    res = change_all(res, r'\emph', r'\textit', r'{', r'}', r'{', r'}')
-    res = change_all(res, r'\emph ', r'\textit', r'{', r'}', r'{', r'}')
+        res = change_all(res, old_ins, new_ins, r"$", r"$", "{", "}")
+    res = change_all(res, r"\mathbf", r"\bm", r"{", r"}", r"{", r"}")
+    res = change_all(res, r"\boldmath ", r"\bm", r"{", r"}", r"{", r"}")
+    res = change_all(res, r"\boldmath", r"\bm", r"{", r"}", r"{", r"}")
+    res = change_all(res, r"\boldmath ", r"\bm", r"$", r"$", r"{", r"}")
+    res = change_all(res, r"\boldmath", r"\bm", r"$", r"$", r"{", r"}")
+    res = change_all(res, r"\scriptsize", r"\scriptsize", r"$", r"$", r"{", r"}")
+    res = change_all(res, r"\emph", r"\textit", r"{", r"}", r"{", r"}")
+    res = change_all(res, r"\emph ", r"\textit", r"{", r"}", r"{", r"}")
 
     # remove bold command
-    res = change_all(res, r'\bm', r' ', r'{', r'}', r'', r'')
+    res = change_all(res, r"\bm", r" ", r"{", r"}", r"", r"")
 
     origin_instructions = [
-        r'\left',
-        r'\middle',
-        r'\right',
-        r'\big',
-        r'\Big',
-        r'\bigg',
-        r'\Bigg',
-        r'\bigl',
-        r'\Bigl',
-        r'\biggl',
-        r'\Biggl',
-        r'\bigm',
-        r'\Bigm',
-        r'\biggm',
-        r'\Biggm',
-        r'\bigr',
-        r'\Bigr',
-        r'\biggr',
-        r'\Biggr',
+        r"\left",
+        r"\middle",
+        r"\right",
+        r"\big",
+        r"\Big",
+        r"\bigg",
+        r"\Bigg",
+        r"\bigl",
+        r"\Bigl",
+        r"\biggl",
+        r"\Biggl",
+        r"\bigm",
+        r"\Bigm",
+        r"\biggm",
+        r"\Biggm",
+        r"\bigr",
+        r"\Bigr",
+        r"\biggr",
+        r"\Biggr",
     ]
     for origin_ins in origin_instructions:
-        res = change_all(res, origin_ins, origin_ins, r'{', r'}', r'', r'')
+        res = change_all(res, origin_ins, origin_ins, r"{", r"}", r"", r"")
 
-    res = re.sub(r'\\\[(.*?)\\\]', r'\1\\newline', res)
+    res = re.sub(r"\\\[(.*?)\\\]", r"\1\\newline", res)
 
-    if res.endswith(r'\newline'):
+    if res.endswith(r"\newline"):
         res = res[:-8]
 
     # remove multiple spaces
-    res = re.sub(r'(\\,){1,}', ' ', res)
-    res = re.sub(r'(\\!){1,}', ' ', res)
-    res = re.sub(r'(\\;){1,}', ' ', res)
-    res = re.sub(r'(\\:){1,}', ' ', res)
-    res = re.sub(r'\\vspace\{.*?}', '', res)
+    res = re.sub(r"(\\,){1,}", " ", res)
+    res = re.sub(r"(\\!){1,}", " ", res)
+    res = re.sub(r"(\\;){1,}", " ", res)
+    res = re.sub(r"(\\:){1,}", " ", res)
+    res = re.sub(r"\\vspace\{.*?}", "", res)
 
     # merge consecutive text
     def merge_texts(match):
         texts = match.group(0)
-        merged_content = ''.join(re.findall(r'\\text\{([^}]*)\}', texts))
-        return f'\\text{{{merged_content}}}'
+        merged_content = "".join(re.findall(r"\\text\{([^}]*)\}", texts))
+        return f"\\text{{{merged_content}}}"
 
-    res = re.sub(r'(\\text\{[^}]*\}\s*){2,}', merge_texts, res)
+    res = re.sub(r"(\\text\{[^}]*\}\s*){2,}", merge_texts, res)
 
-    res = res.replace(r'\bf ', '')
+    res = res.replace(r"\bf ", "")
     res = _rm_dollar_surr(res)
 
     # remove extra spaces (keeping only one)
-    res = re.sub(r' +', ' ', res)
+    res = re.sub(r" +", " ", res)
 
     # format latex
     res = res.strip()
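Two of the `to_katex` rewrites above are pure regex steps and can be checked in isolation (copied verbatim from the diff, with sample inputs of our own):

```python
import re

# Display math \[...\] is rewritten to its content followed by \newline;
# a single trailing \newline (8 characters) is then stripped.
res = r"\[x^2\] and \[y\]"
res = re.sub(r"\\\[(.*?)\\\]", r"\1\\newline", res)
if res.endswith(r"\newline"):
    res = res[:-8]
print(res)  # x^2\newline and y


# Runs of two or more adjacent \text{...} groups are merged into one.
def merge_texts(match):
    texts = match.group(0)
    merged_content = "".join(re.findall(r"\\text\{([^}]*)\}", texts))
    return f"\\text{{{merged_content}}}"


merged = re.sub(r"(\\text\{[^}]*\}\s*){2,}", merge_texts, r"\text{hello }\text{world}")
print(merged)  # \text{hello world}
```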
@@ -1,3 +1,3 @@
 from .texteller import TexTeller
 
-__all__ = ['TexTeller']
+__all__ = ["TexTeller"]
@@ -41,7 +41,7 @@ def readimgs(image_paths: list[str]) -> list[np.ndarray]:
         if image is None:
             raise ValueError(f"Image at {path} could not be read.")
         if image.dtype == np.uint16:
-            _logger.warning(f'Converting {path} to 8-bit, image may be lossy.')
+            _logger.warning(f"Converting {path} to 8-bit, image may be lossy.")
             image = cv2.convertScaleAbs(image, alpha=(255.0 / 65535.0))
 
         channels = 1 if len(image.shape) == 2 else image.shape[2]
@@ -112,7 +112,7 @@ def transform(images: List[Union[np.ndarray, Image.Image]]) -> List[torch.Tensor
 
     assert IMG_CHANNELS == 1, "Only support grayscale images for now"
     images = [
-        np.array(img.convert('RGB')) if isinstance(img, Image.Image) else img for img in images
+        np.array(img.convert("RGB")) if isinstance(img, Image.Image) else img for img in images
     ]
     images = [trim_white_border(image) for image in images]
     images = [general_transform_pipeline(image) for image in images]
@@ -21,7 +21,7 @@ def _change(input_str, old_inst, new_inst, old_surr_l, old_surr_r, new_surr_l, n
     j = start + 1
     escaped = False
     while j < n and count > 0:
-        if input_str[j] == '\\' and not escaped:
+        if input_str[j] == "\\" and not escaped:
             escaped = True
             j += 1
             continue
@@ -71,10 +71,10 @@ def change_all(input_str, old_inst, new_inst, old_surr_l, old_surr_r, new_surr_l
     for p in pos[::-1]:
         res[p:] = list(
             _change(
-                ''.join(res[p:]), old_inst, new_inst, old_surr_l, old_surr_r, new_surr_l, new_surr_r
+                "".join(res[p:]), old_inst, new_inst, old_surr_l, old_surr_r, new_surr_l, new_surr_r
             )
         )
-    res = ''.join(res)
+    res = "".join(res)
     return res
 
 
@@ -121,7 +121,7 @@ def add_newlines(latex_str: str) -> str:
 
     # 4. Cleanup: Collapse multiple consecutive newlines into a single newline.
     # This handles cases where the replacements above might have created \n\n.
-    processed_str = re.sub(r'\n{2,}', '\n', processed_str)
+    processed_str = re.sub(r"\n{2,}", "\n", processed_str)
 
     # Remove leading/trailing whitespace (including potential single newlines
     # at the very start/end resulting from the replacements) from the entire result.
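The cleanup step this hunk touches can be sketched on its own. The collapse regex is taken from the diff; the final `.strip()` is an assumption, implied by the "remove leading/trailing whitespace" comment but not shown:

```python
import re

# Collapse any run of blank lines created by earlier replacements into a
# single newline, then strip leading/trailing whitespace (strip assumed).
processed_str = "\n\\alpha\n\n\n\\beta\n"
processed_str = re.sub(r"\n{2,}", "\n", processed_str)
processed_str = processed_str.strip()
print(processed_str)
```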