📄 English | 中文
https://github.com/OleehyO/TexTeller/assets/56267907/532d1471-a72e-4960-9677-ec6c19db289f
TexTeller is an end-to-end formula recognition model, capable of converting images into corresponding LaTeX formulas.
TexTeller was trained on 80M image-formula pairs (the previous dataset can be obtained here). Compared with LaTeX-OCR, which used a 100K dataset, TexTeller has stronger generalization and higher accuracy, covering most use cases.
Note
If you would like to provide feedback or suggestions for this project, feel free to start a discussion in the Discussions section.
📮 Change Log
- [2025-08-15] We have published the technical report of TexTeller. The model evaluated on the benchmark (trained from scratch, with its handwritten subset filtered against the test set) is available at https://huggingface.co/OleehyO/TexTeller_en. Please do not use the open-source TexTeller 3.0 directly to reproduce the handwritten-formula results, as that model's training data includes the test sets of these benchmarks.
- [2025-08-15] We have open-sourced the training dataset of TexTeller 3.0. Please note that the handwritten* subset of this dataset is collected from existing open-source handwritten datasets (including both their training and test sets). If you need the handwritten* subset for ablation experiments, please filter out the test labels first.
- [2024-06-06] TexTeller 3.0 released! The training data has been increased to 80M (10x more than TexTeller 2.0, with improved diversity). New features:
  - Support for scanned images, handwritten formulas, and mixed English/Chinese formulas.
  - OCR in both Chinese and English for printed images.
- [2024-05-02] Paragraph recognition supported.
- [2024-04-12] Formula detection model released!
- [2024-03-25] TexTeller 2.0 released! The training data has been increased to 7.5M (15x more than TexTeller 1.0, with improved quality). TexTeller 2.0 demonstrates superior performance on the test set, especially on rare symbols, complex multi-line formulas, and matrices.
Here are more test images and a side-by-side comparison of various recognition models.
🚀 Getting Started
1. Install uv:

   ```bash
   pip install uv
   ```

2. Install the project's dependencies:

   ```bash
   uv pip install texteller
   ```

3. If you are using the CUDA backend, you may need to install `onnxruntime-gpu`:

   ```bash
   uv pip install "texteller[onnxruntime-gpu]"
   ```

4. Run the following command to start inference:

   ```bash
   texteller inference "/path/to/image.{jpg,png}"
   ```

   See `texteller inference --help` for more details.
🌐 Web Demo
Run the following command:
```bash
texteller web
```
Open http://localhost:8501 in your browser to view the web demo.
Note
Paragraph recognition cannot restore the structure of a document; it can only recognize its content.
🖥️ Server
We use Ray Serve to provide an API server for TexTeller. To start the server, run the following command:

```bash
texteller launch
```
| Parameter | Description |
|---|---|
| `-ckpt` | Path to the weights file; defaults to TexTeller's pretrained weights. |
| `-tknz` | Path to the tokenizer; defaults to TexTeller's tokenizer. |
| `-p` | The server's service port; defaults to 8000. |
| `--num-replicas` | Number of service replicas to run on the server; defaults to 1. You can use more replicas to achieve greater throughput. |
| `--ncpu-per-replica` | Number of CPU cores used per service replica; defaults to 1. |
| `--ngpu-per-replica` | Number of GPUs used per service replica; defaults to 1. You can set this value between 0 and 1 so that multiple replicas share one GPU, improving GPU utilization. (Note: if `--num-replicas` is 2 and `--ngpu-per-replica` is 0.7, then 2 GPUs must be available.) |
| `--num-beams` | Number of beams for beam search; defaults to 1. |
| `--use-onnx` | Run inference with ONNX Runtime; disabled by default. |
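As a sanity check for fractional GPU settings, the number of physical GPUs the server needs is the total fractional demand rounded up. A minimal sketch (the helper below is our own illustration, not part of the texteller CLI):

```python
import math

def gpus_required(num_replicas: int, ngpu_per_replica: float) -> int:
    """Total physical GPUs needed to schedule all replicas.

    Each replica requests a (possibly fractional) share of a GPU,
    so the total demand is num_replicas * ngpu_per_replica,
    rounded up to whole devices.
    """
    return math.ceil(num_replicas * ngpu_per_replica)

# Matches the note above: 2 replicas at 0.7 GPU each demand 1.4 -> 2 GPUs.
print(gpus_required(2, 0.7))  # -> 2
print(gpus_required(2, 0.5))  # -> 1 (two replicas fit on one GPU)
```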
To send requests to the server:
```python
# client_demo.py
import requests

server_url = "http://127.0.0.1:8000/predict"
img_path = "/path/to/your/image"

with open(img_path, "rb") as img:
    files = {"img": img}
    response = requests.post(server_url, files=files)

print(response.text)
```
🐍 Python API
We provide several easy-to-use Python APIs for formula OCR scenarios. Please refer to our documentation to learn about the corresponding API interfaces and usage.
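Whatever interface you use, you may want to normalize the returned LaTeX string before rendering or storing it. A small generic helper (our own sketch, not part of the texteller package — adapt it to the output you actually receive):

```python
def normalize_latex(s: str) -> str:
    """Strip one layer of surrounding math delimiters and collapse whitespace.

    Generic post-processing, not part of the texteller API.
    """
    s = s.strip()
    # Try common display/inline math delimiter pairs, outermost only.
    for open_d, close_d in (("$$", "$$"), (r"\[", r"\]"), ("$", "$")):
        if s.startswith(open_d) and s.endswith(close_d) and len(s) > len(open_d) + len(close_d):
            s = s[len(open_d):-len(close_d)].strip()
            break
    # Collapse internal runs of whitespace to single spaces.
    return " ".join(s.split())

print(normalize_latex("$$ \\frac{a}{b} $$"))  # -> \frac{a}{b}
```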
🔍 Formula Detection
TexTeller's formula detection model is trained on 3,415 images of Chinese materials and 8,272 images from the IBEM dataset.
We provide a formula detection interface in the Python API. Please refer to our API documentation for more details.
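A detection model typically returns bounding boxes, and each detected region is then cropped out and passed to the recognizer. An illustrative, library-agnostic sketch using a nested-list "image" (the `(x, y, w, h)` box layout is an assumption — check the API documentation for the actual return format):

```python
def crop(image, box):
    """Crop a rectangular region from a 2D image given as a list of rows.

    `box` is (x, y, w, h) in pixel coordinates -- an assumed layout
    for illustration; the real detection API may differ.
    """
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]

# 6x4 toy "image" where pixel value encodes (row, column).
img = [[c + 10 * r for c in range(6)] for r in range(4)]
print(crop(img, (1, 2, 3, 2)))  # -> [[21, 22, 23], [31, 32, 33]]
```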
🏋️♂️ Training
Please set up your environment before training:

- Install the dependencies for training:

  ```bash
  uv pip install "texteller[train]"
  ```

- Clone the repository:

  ```bash
  git clone https://github.com/OleehyO/TexTeller.git
  ```
Dataset
We provide an example dataset in the `examples/train_texteller/dataset/train` directory; you can place your own training data there, following the format of the example dataset.
Training the Model
In the `examples/train_texteller/` directory, run the following command:

```bash
accelerate launch train.py
```

Training arguments can be adjusted in `train_config.yaml`.
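To give a sense of what such a file controls, here is a hypothetical sketch — the keys below are illustrative only and are not the repository's actual schema; consult `train_config.yaml` itself for the real options and defaults:

```yaml
# Illustrative example only -- check train_config.yaml in the repo
# for the actual keys and defaults.
learning_rate: 5.0e-5
num_train_epochs: 10
per_device_train_batch_size: 32
warmup_steps: 1000
output_dir: ./checkpoints
```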
📅 Plans
- [x] Train the model with a larger dataset
- [x] Recognition of scanned images
- [x] Support for English and Chinese scenarios
- [x] Handwritten formulas support
- [ ] PDF document recognition
- [ ] Inference acceleration

