[chore] Update README.md

OleehyO
2025-04-23 04:47:51 +00:00
parent 29f6f8960d
commit 42db737ae6
2 changed files with 169 additions and 350 deletions

README.md

@@ -2,51 +2,27 @@
<div align="center">
  <h1>
    <img src="./assets/fire.svg" width=30, height=30>
    𝚃𝚎𝚡𝚃𝚎𝚕𝚕𝚎𝚛
    <img src="./assets/fire.svg" width=30, height=30>
  </h1>

[![](https://img.shields.io/badge/API-Docs-orange.svg?logo=read-the-docs)](https://oleehyo.github.io/TexTeller/)
[![](https://img.shields.io/badge/docker-pull-green.svg?logo=docker)](https://hub.docker.com/r/oleehyo/texteller)
[![](https://img.shields.io/badge/Data-Texteller1.0-brightgreen.svg?logo=huggingface)](https://huggingface.co/datasets/OleehyO/latex-formulas)
[![](https://img.shields.io/badge/Weights-Texteller3.0-yellow.svg?logo=huggingface)](https://huggingface.co/OleehyO/TexTeller)
[![](https://img.shields.io/badge/License-Apache_2.0-blue.svg?logo=github)](https://opensource.org/licenses/Apache-2.0)
</div>

https://github.com/OleehyO/TexTeller/assets/56267907/532d1471-a72e-4960-9677-ec6c19db289f

TexTeller is an end-to-end formula recognition model capable of converting images into corresponding LaTeX formulas.

TexTeller was trained on **80M image-formula pairs** (the previous dataset is available [here](https://huggingface.co/datasets/OleehyO/latex-formulas)). Compared to [LaTeX-OCR](https://github.com/lukas-blecher/LaTeX-OCR), which used a 100K dataset, TexTeller has **stronger generalization** and **higher accuracy**, covering most use cases.

> [!NOTE]
> If you would like to provide feedback or suggestions for this project, feel free to start a discussion in the [Discussions section](https://github.com/OleehyO/TexTeller/discussions).

---
@@ -55,15 +31,12 @@ TexTeller was trained with **80M image-formula pairs** (previous dataset can be
<td>

## 🔖 Table of Contents

- [Getting Started](#-getting-started)
- [Web Demo](#-web-demo)
- [Server](#-server)
- [Python API](#-python-api)
- [Formula Detection](#-formula-detection)
- [Training](#-training)

</td>
<td>
@@ -76,18 +49,9 @@ TexTeller was trained with **80M image-formula pairs** (previous dataset can be
    </figcaption>
  </figure>
  <div>
  </div>
</div>
</td>
</tr>
</table>
@@ -110,153 +74,118 @@ TexTeller was trained with **80M image-formula pairs** (previous dataset can be
## 🚀 Getting Started

1. Install the project's dependencies:

   ```bash
   pip install texteller
   ```

2. If you are using the CUDA backend, you may need to install `onnxruntime-gpu`:

   ```bash
   pip install texteller[onnxruntime-gpu]
   ```

3. Run the following command to start inference:

   ```bash
   texteller inference "/path/to/image.{jpg,png}"
   ```

   > See `texteller inference --help` for more details.

## 🌐 Web Demo

Run the following command:

```bash
texteller web
```

Enter `http://localhost:8501` in a browser to view the web demo.

> [!NOTE]
> Paragraph recognition cannot restore a document's structure; it only recognizes its content.

## 🖥️ Server

We use [ray serve](https://github.com/ray-project/ray) to provide an API server for TexTeller. To start the server, run the following command:

```bash
texteller launch
```

| Parameter | Description |
| --------- | ----------- |
| `-ckpt` | The path to the weights file, *default is TexTeller's pretrained weights*. |
| `-tknz` | The path to the tokenizer, *default is TexTeller's tokenizer*. |
| `-p` | The server's service port, *default is 8000*. |
| `--num-replicas` | The number of service replicas to run on the server, *default is 1*. You can use more replicas to achieve greater throughput. |
| `--ncpu-per-replica` | The number of CPU cores used per service replica, *default is 1*. |
| `--ngpu-per-replica` | The number of GPUs used per service replica, *default is 1*. You can set this value between 0 and 1 to run multiple service replicas on one GPU and share it, thereby improving GPU utilization. (Note: if `--num-replicas` is 2 and `--ngpu-per-replica` is 0.7, then 2 GPUs must be available.) |
| `--num-beams` | The number of beams for beam search, *default is 1*. |
| `--use-onnx` | Perform inference using ONNX Runtime, *disabled by default*. |

To send requests to the server:

```python
# client_demo.py
import requests

server_url = "http://127.0.0.1:8000/predict"
img_path = "/path/to/your/image"

with open(img_path, 'rb') as img:
    files = {'img': img}
    response = requests.post(server_url, files=files)

print(response.text)
```
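
Replicas only increase throughput when requests actually arrive in parallel. The following is a minimal sketch, not part of the project, that fans several images out to the server concurrently; it assumes the same `/predict` route and `img` form field as the demo above, and the image paths are placeholders.

```python
# Sketch: concurrent requests so that multiple server replicas stay busy.
# Assumes the /predict route and "img" form field shown in client_demo.py;
# the image paths below are placeholders.
from concurrent.futures import ThreadPoolExecutor

import requests

SERVER_URL = "http://127.0.0.1:8000/predict"


def recognize(img_path: str) -> str:
    """POST one image to the server and return the predicted LaTeX text."""
    with open(img_path, "rb") as img:
        response = requests.post(SERVER_URL, files={"img": img})
    return response.text


image_paths = ["/path/to/img1.png", "/path/to/img2.png", "/path/to/img3.png"]
with ThreadPoolExecutor(max_workers=4) as pool:
    for latex in pool.map(recognize, image_paths):
        print(latex)
```
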
## 🐍 Python API

We provide several easy-to-use Python APIs for formula OCR. Please refer to our [documentation](https://oleehyo.github.io/TexTeller/) for the available interfaces and their usage.
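
For quick orientation, here is a minimal sketch of programmatic recognition. It assumes the package exposes `load_model`, `load_tokenizer`, and `img2latex` as described in the linked documentation; verify the exact names and signatures there.

```python
# Minimal sketch of the Python API. Assumes load_model, load_tokenizer and
# img2latex are exposed as described in the documentation; check the docs
# for the exact signatures before relying on this.
from texteller import img2latex, load_model, load_tokenizer

model = load_model()          # the first call downloads the pretrained weights
tokenizer = load_tokenizer()

# Accepts a list of image paths and returns the corresponding LaTeX strings
latex_strings = img2latex(model, tokenizer, ["/path/to/image.png"])
print(latex_strings[0])
```
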
## 🔍 Formula Detection

TexTeller's formula detection model is trained on 3,415 images of Chinese materials and 8,272 images from the [IBEM dataset](https://zenodo.org/records/4757865).

<div align="center">
  <img src="./assets/det_rec.png" width=250>
</div>

We provide a formula detection interface in the Python API. Please refer to our [API documentation](https://oleehyo.github.io/TexTeller/) for more details.

## 🏋️‍♂️ Training

Please set up your environment before training:

1. Install the dependencies for training:

   ```bash
   pip install texteller[train]
   ```

2. Clone the repository:

   ```bash
   git clone https://github.com/OleehyO/TexTeller.git
   ```

### Dataset

We provide an example dataset in the `examples/train_texteller/dataset/train` directory; you can place your own training data following the format of the example dataset.
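
For orientation, the previous revision of this README described its example dataset as an `images/` directory plus a `formulas.jsonl` file pairing each image with its formula. The sketch below assumes the new example dataset follows a similar layout; defer to the files actually shipped in `examples/train_texteller/dataset/train`:

```
dataset/train/
├── images/          # formula images, one file per sample
└── formulas.jsonl   # one image-to-LaTeX record per line
```
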

### Training the Model

In the `examples/train_texteller/` directory, run the following command:

```bash
accelerate launch train.py
```

Training arguments can be adjusted in [`train_config.yaml`](./examples/train_texteller/train_config.yaml).

## 📅 Plans
@@ -266,13 +195,11 @@ In `src/globals.py` and `src/models/ocr_model/train/train_args.py`, you can chan
- [X] ~~Handwritten formulas support~~
- [ ] PDF document recognition
- [ ] Inference acceleration

## ⭐️ Stargazers over time

[![Stargazers over time](https://starchart.cc/OleehyO/TexTeller.svg?variant=adaptive)](https://starchart.cc/OleehyO/TexTeller)

## 👥 Contributors

<a href="https://github.com/OleehyO/TexTeller/graphs/contributors"> <a href="https://github.com/OleehyO/TexTeller/graphs/contributors">
README_zh.md

@@ -1,52 +1,28 @@
📄 中文 | [English](./README.md)

<div align="center">
  <h1>
    <img src="./fire.svg" width=30, height=30>
    𝚃𝚎𝚡𝚃𝚎𝚕𝚕𝚎𝚛
    <img src="./fire.svg" width=30, height=30>
  </h1>

[![](https://img.shields.io/badge/API-文档-orange.svg?logo=read-the-docs)](https://oleehyo.github.io/TexTeller/)
[![](https://img.shields.io/badge/docker-镜像-green.svg?logo=docker)](https://hub.docker.com/r/oleehyo/texteller)
[![](https://img.shields.io/badge/数据-Texteller1.0-brightgreen.svg?logo=huggingface)](https://huggingface.co/datasets/OleehyO/latex-formulas)
[![](https://img.shields.io/badge/权重-Texteller3.0-yellow.svg?logo=huggingface)](https://huggingface.co/OleehyO/TexTeller)
[![](https://img.shields.io/badge/协议-Apache_2.0-blue.svg?logo=github)](https://opensource.org/licenses/Apache-2.0)
</div>

https://github.com/OleehyO/TexTeller/assets/56267907/532d1471-a72e-4960-9677-ec6c19db289f

TexTeller is an end-to-end formula recognition model that converts images into corresponding LaTeX formulas.

TexTeller was trained on **80 million image-formula pairs** (the previous dataset is available [here](https://huggingface.co/datasets/OleehyO/latex-formulas)). Compared to the 100K-scale dataset used by [LaTeX-OCR](https://github.com/lukas-blecher/LaTeX-OCR), TexTeller has **stronger generalization** and **higher accuracy**, covering the vast majority of use cases.

> [!NOTE]
> If you would like to offer feedback or suggestions for this project, feel free to start a discussion in the [Discussions section](https://github.com/OleehyO/TexTeller/discussions).

---
@@ -55,17 +31,12 @@ TexTeller用了**80M**个图片-公式对进行训练(过去的数据集可以
<td>

## 🔖 Table of Contents

- [Getting Started](#-getting-started)
- [Web Demo](#-web-demo)
- [Server](#-server)
- [Python API](#-python-api)
- [Formula Detection](#-formula-detection)
- [Training](#-training)

</td>
<td>
@@ -74,17 +45,10 @@ TexTeller用了**80M**个图片-公式对进行训练(过去的数据集可以
<figure>
  <img src="cover.png" width="800">
  <figcaption>
    <p>Examples of images recognizable by TexTeller</p>
  </figcaption>
</figure>
<div>
</div>
</div>
@@ -92,221 +56,149 @@ TexTeller用了**80M**个图片-公式对进行训练(过去的数据集可以
</tr>
</table>

## 📮 Change Log

- [2024-06-06] **TexTeller3.0 released!** The training data has been increased to **80M** (10× that of TexTeller2.0, with improved diversity). TexTeller3.0 adds:
  - Support for scanned images, handwritten formulas, and mixed Chinese-English formulas.
  - OCR support for printed, mixed Chinese-English content.
- [2024-05-02] Support for **paragraph recognition**.
- [2024-04-12] **Formula detection model** released!
- [2024-03-25] TexTeller2.0 released! Its training data was increased to 7.5M (about 15× that of TexTeller1.0, with improved quality). TexTeller2.0 shows **superior performance** on the test set, especially for rare symbols, complex multi-line formulas, and matrices.

> More test images and a side-by-side comparison of recognition models can be found [here](./assets/test.pdf).

## 🚀 Getting Started

1. Install the project's dependencies:

   ```bash
   pip install texteller
   ```

2. If you are using the CUDA backend, you may need to install `onnxruntime-gpu`:

   ```bash
   pip install texteller[onnxruntime-gpu]
   ```

3. Run the following command to start inference:

   ```bash
   texteller inference "/path/to/image.{jpg,png}"
   ```

   > See `texteller inference --help` for more options.

## 🌐 Web Demo

Run the following command:

```bash
texteller web
```

Enter `http://localhost:8501` in a browser to view the web demo.

> [!NOTE]
> Paragraph recognition cannot restore a document's structure; it only recognizes its content.

## 🖥️ Server

We use [ray serve](https://github.com/ray-project/ray) to provide an API server for TexTeller. To start the server:

```bash
texteller launch
```

| Parameter | Description |
| --------- | ----------- |
| `-ckpt` | Path to the weights file, *defaults to TexTeller's pretrained weights*. |
| `-tknz` | Path to the tokenizer, *defaults to TexTeller's tokenizer*. |
| `-p` | The server's port, *default is 8000*. |
| `--num-replicas` | Number of service replicas to run, *default is 1*. Use more replicas for greater throughput. |
| `--ncpu-per-replica` | Number of CPU cores per replica, *default is 1*. |
| `--ngpu-per-replica` | Number of GPUs per replica, *default is 1*. You can set a value between 0 and 1 so multiple replicas share one GPU, improving utilization. (Note: if `--num-replicas` is 2 and `--ngpu-per-replica` is 0.7, then 2 GPUs must be available.) |
| `--num-beams` | Number of beams for beam search, *default is 1*. |
| `--use-onnx` | Perform inference with ONNX Runtime, *disabled by default*. |

To send requests to the server:

```python
# client_demo.py
import requests

server_url = "http://127.0.0.1:8000/predict"
img_path = "/path/to/your/image"

with open(img_path, 'rb') as img:
    files = {'img': img}
    response = requests.post(server_url, files=files)

print(response.text)
```

## 🐍 Python API

We provide several easy-to-use Python APIs for formula OCR; please refer to the [API documentation](https://oleehyo.github.io/TexTeller/) for the available interfaces and their usage.

## 🔍 Formula Detection

TexTeller's formula detection model was trained on 3,415 images of Chinese materials and 8,272 images from the [IBEM dataset](https://zenodo.org/records/4757865).

<div align="center">
  <img src="./assets/det_rec.png" width=250>
</div>

A formula detection interface is provided in the Python API; see the [API documentation](https://oleehyo.github.io/TexTeller/) for details.

## 🏋️‍♂️ Training

Please set up your environment before training:

1. Install the dependencies for training:

   ```bash
   pip install texteller[train]
   ```

2. Clone the repository:

   ```bash
   git clone https://github.com/OleehyO/TexTeller.git
   ```

### Dataset

We provide an example dataset in the `examples/train_texteller/dataset/train` directory; you can place your own training data following the format of the example dataset.

### Training the Model

In the `examples/train_texteller/` directory, run:

```bash
accelerate launch train.py
```

Training arguments can be adjusted via [`train_config.yaml`](./examples/train_texteller/train_config.yaml).

## 📅 Plans

- [X] ~~Train the model with a larger dataset~~
- [X] ~~Scanned image support~~
- [X] ~~Chinese and English scenario support~~
- [X] ~~Handwritten formula support~~
- [ ] PDF document recognition
- [ ] Inference acceleration

## ⭐️ Stargazers over time

[![Stargazers over time](https://starchart.cc/OleehyO/TexTeller.svg?variant=adaptive)](https://starchart.cc/OleehyO/TexTeller)

## 👥 Contributors