Update README

This commit is contained in:
三洋三洋
2024-06-05 16:55:42 +00:00
parent a7044e0369
commit aa14674097
4 changed files with 284 additions and 124 deletions

README.md

@@ -7,30 +7,106 @@
<img src="./assets/fire.svg" width=30, height=30>
</h1>
<p align="center">
🤗 <a href="https://huggingface.co/OleehyO/TexTeller"> Hugging Face</a>
🤗 <a href="https://huggingface.co/OleehyO/TexTeller"> Hugging Face </a>
</p>
<!-- <p align="center">
<img src="./assets/web_demo.gif" alt="TexTeller_demo" width=800>
</p> -->
[![](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/OleehyO/TexTeller/issues)
[![](https://img.shields.io/badge/Data-Texteller1.0-brightgreen.svg)](https://huggingface.co/datasets/OleehyO/latex-formulas)
[![](https://img.shields.io/badge/Weights-Texteller3.0-yellow.svg)](https://huggingface.co/OleehyO/TexTeller)
</div>
https://github.com/OleehyO/TexTeller/assets/56267907/b23b2b2e-a663-4abb-b013-bd47238d513b
<!-- <p align="center">
TexTeller is an end-to-end formula recognition model based on ViT, capable of converting images into corresponding LaTeX formulas.
<a href="https://opensource.org/licenses/Apache-2.0">
<img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg" alt="License">
</a>
<a href="https://github.com/OleehyO/TexTeller/issues">
<img src="https://img.shields.io/badge/Maintained%3F-yes-green.svg" alt="Maintenance">
</a>
<a href="https://github.com/OleehyO/TexTeller/pulls">
<img src="https://img.shields.io/badge/Contributions-welcome-brightgreen.svg?style=flat" alt="Contributions welcome">
</a>
<a href="https://huggingface.co/datasets/OleehyO/latex-formulas">
<img src="https://img.shields.io/badge/Data-Texteller1.0-brightgreen.svg" alt="Data">
</a>
<a href="https://huggingface.co/OleehyO/TexTeller">
<img src="https://img.shields.io/badge/Weights-Texteller3.0-yellow.svg" alt="Weights">
</a>
TexTeller was trained with 7.5M image-formula pairs (dataset available [here](https://huggingface.co/datasets/OleehyO/latex-formulas)), compared to [LaTeX-OCR](https://github.com/lukas-blecher/LaTeX-OCR) which used a 100K dataset, TexTeller has **stronger generalization abilities** and **higher accuracy**, covering most use cases (**except for scanned images and handwritten formulas**).
</p> -->
> If you find this project helpful, please don't forget to give it a star⭐
https://github.com/OleehyO/TexTeller/assets/56267907/532d1471-a72e-4960-9677-ec6c19db289f
TexTeller is an end-to-end formula recognition model based on [TrOCR](https://arxiv.org/abs/2109.10282), capable of converting images into corresponding LaTeX formulas.
TexTeller was trained on **80M image-formula pairs** (the previous dataset is available [here](https://huggingface.co/datasets/OleehyO/latex-formulas)). Compared to [LaTeX-OCR](https://github.com/lukas-blecher/LaTeX-OCR), which used a 100K dataset, TexTeller has **stronger generalization abilities** and **higher accuracy**, covering most use cases.
> [!NOTE]
> If you would like to provide feedback or suggestions for this project, feel free to start a discussion in the [Discussions section](https://github.com/OleehyO/TexTeller/discussions).
>
> Additionally, if you find this project helpful, please don't forget to give it a star⭐🙏
---
<table>
<tr>
<td>
## 🔖 Table of Contents
- [Change Log](#-change-log)
- [Getting Started](#-getting-started)
- [Web Demo](#-web-demo)
- [Formula Detection](#-formula-detection)
- [API Usage](#-api-usage)
- [Training](#-training)
- [Plans](#-plans)
- [Stargazers over time](#-stargazers-over-time)
- [Contributors](#-contributors)
</td>
<td>
<div align="center">
<figure>
<img src="assets/cover.png" width="800">
<figcaption>
<p>Images that can be recognized by TexTeller</p>
</figcaption>
</figure>
<div>
<p>
Thanks to the
<i>
Super Computing Platform of Beijing University of Posts and Telecommunications
</i>
for supporting this work😘
</p>
<!-- <img src="assets/scss.png" width="200"> -->
</div>
</div>
</td>
</tr>
</table>
## 🔄 Change Log
* 📮[2024-05-02] Support for mixed Chinese-English formula recognition (Beta).
* 📮[2024-04-12] Trained a **formula detection model**, enhancing the ability to detect and recognize formulas in entire documents (whole-image inference)!
* 📮[2024-03-25] TexTeller 2.0 released! The training data for TexTeller 2.0 was increased to 7.5M (about **15 times more** than TexTeller 1.0, with improved data quality). TexTeller 2.0 demonstrated **superior performance** on the test set, especially in recognizing rare symbols, complex multi-line formulas, and matrices.
> [There](./assets/test.pdf) are more test images and a horizontal comparison of recognition models from different companies.
- 📮[2024-06-06] **TexTeller3.0 released!** The training data has been increased to **80M** (**10x more than** TexTeller2.0, with improved data diversity). TexTeller3.0's new features:
  - Supports scanned images, handwritten formulas, and mixed English and Chinese formulas.
  - OCR abilities in both Chinese and English for printed images.
- 📮[2024-05-02] Support for **paragraph recognition**.
- 📮[2024-04-12] **Formula detection model** released!
- 📮[2024-03-25] TexTeller2.0 released! The training data for TexTeller2.0 was increased to 7.5M (15x more than TexTeller1.0, with improved data quality). TexTeller2.0 demonstrated **superior performance** on the test set, especially in recognizing rare symbols, complex multi-line formulas, and matrices.
> [Here](./assets/test.pdf) are more test images and a horizontal comparison of various recognition models.
## 🚀 Getting Started
@@ -46,24 +122,45 @@ TexTeller was trained with 7.5M image-formula pairs (dataset available [here](ht
pip install texteller
```
3. Enter the `TexTeller/src` directory and run the following command in the terminal to start inference:
3. Enter the `src/` directory and run the following command in the terminal to start inference:
```bash
python inference.py -img "/path/to/image.{jpg,png}"
# use --inference-mode option to enable GPU(cuda or mps) inference
#+e.g. python inference.py -img "img.jpg" --inference-mode cuda
# use -mix option to enable mixed text and formula recognition
#+e.g. python inference.py -img "img.jpg" -mix
```
> The first time you run it, the required checkpoints will be downloaded from Hugging Face
> The first time you run it, the required checkpoints will be downloaded from Hugging Face.
> [!IMPORTANT]
> If you are using mixed text and formula recognition, you must first [download the formula detection model weights](https://github.com/OleehyO/TexTeller?tab=readme-ov-file#download-weights).
### Paragraph Recognition
As demonstrated in the video, TexTeller is also capable of recognizing entire text paragraphs. Although TexTeller has general text OCR capabilities, we still recommend using paragraph recognition for better results:
1. [Download the weights](https://huggingface.co/TonyLee1256/texteller_det/resolve/main/rtdetr_r50vd_6x_coco.onnx?download=true) of the formula detection model to the `src/models/det_model/model/` directory
2. Run `inference.py` in the `src/` directory with the `-mix` option; the results will be output in Markdown format.
```bash
python inference.py -img "/path/to/image.{jpg,png}" -mix
```
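For example, assuming the results are printed to stdout, you can save the Markdown output by redirecting it:

```bash
# Save the mixed text/formula recognition result as a Markdown file
python inference.py -img "/path/to/image.png" -mix > result.md
```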
TexTeller uses lightweight [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) models by default for recognizing both Chinese and English text. You can switch to larger models for better recognition results:
| Checkpoints | Model Description | Size |
|-------------|-------------------| ---- |
| [ch_PP-OCRv4_det.onnx](https://huggingface.co/OleehyO/paddleocrv4.onnx/resolve/main/ch_PP-OCRv4_det.onnx?download=true) | **Default detection model**, supports Chinese-English text detection | 4.70M |
| [ch_PP-OCRv4_server_det.onnx](https://huggingface.co/OleehyO/paddleocrv4.onnx/resolve/main/ch_PP-OCRv4_server_det.onnx?download=true) | High accuracy model, supports Chinese-English text detection | 115M |
| [ch_PP-OCRv4_rec.onnx](https://huggingface.co/OleehyO/paddleocrv4.onnx/resolve/main/ch_PP-OCRv4_rec.onnx?download=true) | **Default recognition model**, supports Chinese-English text recognition | 10.80M |
| [ch_PP-OCRv4_server_rec.onnx](https://huggingface.co/OleehyO/paddleocrv4.onnx/resolve/main/ch_PP-OCRv4_server_rec.onnx?download=true) | High accuracy model, supports Chinese-English text recognition | 90.60M |
Place the detection/recognition model weights in the `det/` or `rec/` directory under `src/models/third_party/paddleocr/checkpoints/`, and rename the file to `default_model.onnx`.
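For example, a sketch of switching both models to the high-accuracy versions, using the download links from the table above:

```bash
# Fetch the server-grade models and install them as the default weights
cd src/models/third_party/paddleocr/checkpoints/
mkdir -p det rec
wget -O det/default_model.onnx "https://huggingface.co/OleehyO/paddleocrv4.onnx/resolve/main/ch_PP-OCRv4_server_det.onnx?download=true"
wget -O rec/default_model.onnx "https://huggingface.co/OleehyO/paddleocrv4.onnx/resolve/main/ch_PP-OCRv4_server_rec.onnx?download=true"
```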
> [!NOTE]
> Paragraph recognition cannot restore the structure of a document; it can only recognize its content.
## 🌐 Web Demo
Go to the `TexTeller/src` directory and run the following command:
Go to the `src/` directory and run the following command:
```bash
./start_web.sh
@@ -74,43 +171,34 @@ Enter `http://localhost:8501` in a browser to view the web demo.
> [!NOTE]
> If you are a Windows user, please run the `start_web.bat` file instead.
## 🧠 Full Image Inference
## 🔍 Formula Detection
TexTeller also supports **formula detection and recognition** on full images: formulas are detected throughout the image, then recognized in batch.
### Download Weights
Download the model weights from [this link](https://huggingface.co/TonyLee1256/texteller_det/resolve/main/rtdetr_r50vd_6x_coco.onnx?download=true) and place them in `src/models/det_model/model`.
> TexTeller's formula detection model was trained on a total of 11,867 images, consisting of 3,415 images from Chinese textbooks (over 130 layouts) and 8,272 images from the [IBEM dataset](https://zenodo.org/records/4757865).
### Formula Detection
Run the following command in the `TexTeller/src` directory:
```bash
python infer_det.py
```
This detects all formulas in the full image; the results are saved in `TexTeller/src/subimages`.
TexTeller's formula detection model was trained on 3,415 images of Chinese educational materials (over 130 layouts) and 8,272 images from the [IBEM dataset](https://zenodo.org/records/4757865); it supports formula detection across entire images.
<div align="center">
<img src="./assets/det_rec.png" width=400>
<img src="./assets/det_rec.png" width=250>
</div>
### Batch Formula Recognition
1. Download the model weights and place them in `src/models/det_model/model/` [[link](https://huggingface.co/TonyLee1256/texteller_det/resolve/main/rtdetr_r50vd_6x_coco.onnx?download=true)].
After **formula detection**, run the following command in the `TexTeller/src` directory:
2. Run the following command in the `src/` directory; the results will be saved in `src/subimages/`:
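```bash
# Detect all formulas in the image (same command as in the section above)
python infer_det.py
```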
<details>
<summary>Advanced: batch formula recognition</summary>
After **formula detection**, run the following command in the `src/` directory:
```shell
python rec_infer_from_crop_imgs.py
```
This will use the results of the previous formula detection to perform batch recognition on all cropped formulas, saving the recognition results as txt files in `TexTeller/src/results`.
This will use the results of the previous formula detection to perform batch recognition on all cropped formulas, saving the recognition results as txt files in `src/results/`.
</details>
## 📡 API Usage
We use [ray serve](https://github.com/ray-project/ray) to provide an API interface for TexTeller, allowing you to integrate TexTeller into your own projects. To start the server, you first need to enter the `TexTeller/src` directory and then run the following command:
We use [ray serve](https://github.com/ray-project/ray) to provide an API interface for TexTeller, allowing you to integrate TexTeller into your own projects. To start the server, you first need to enter the `src/` directory and then run the following command:
```bash
python server.py
@@ -128,13 +216,13 @@ python server.py
| `--ngpu_per_replica` | The number of GPUs used per service replica, *default is 1*. You can set a value between 0 and 1 to run multiple replicas on one GPU and share it, improving GPU utilization. (Note: if `--num_replicas` is 2 and `--ngpu_per_replica` is 0.7, then 2 GPUs must be available.) |
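For example, to run two replicas that share GPUs, using the flag names listed in the table above:

```bash
# Two replicas, each reserving 0.7 of a GPU; requires 2 GPUs in total
python server.py --num_replicas 2 --ngpu_per_replica 0.7
```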
> [!NOTE]
> A client demo can be found at `TexTeller/client/demo.py`, you can refer to `demo.py` to send requests to the server
> A client demo can be found at `src/client/demo.py`; you can refer to `demo.py` when sending requests to the server.
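For illustration, a hypothetical client request (Ray Serve listens on port 8000 by default; the route and form field below are placeholders, so see `demo.py` for the actual request format):

```bash
# Hypothetical endpoint and field name; check demo.py for the real request format
curl -X POST "http://localhost:8000/predict" -F "img=@/path/to/image.png"
```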
## 🏋️‍♂️ Training
### Dataset
We provide an example dataset in the `TexTeller/src/models/ocr_model/train/dataset` directory, you can place your own images in the `images` directory and annotate each image with its corresponding formula in `formulas.jsonl`.
We provide an example dataset in the `src/models/ocr_model/train/dataset/` directory; you can place your own images in the `images/` directory and annotate each image with its corresponding formula in `formulas.jsonl`.
After preparing your dataset, you need to **change the `DIR_URL` variable to your own dataset's path** in `**/train/dataset/loader.py`.
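For reference, a hypothetical `formulas.jsonl` entry (the key names here are placeholders; follow the example dataset for the actual schema):

```bash
# Append one annotation (hypothetical key names)
echo '{"img_name": "0001.png", "formula": "E = mc^2"}' >> formulas.jsonl
```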
@@ -142,11 +230,11 @@ After preparing your dataset, you need to **change the `DIR_URL` variable to you
If you are using a different dataset, you might need to retrain the tokenizer to obtain a different vocabulary. After configuring your dataset, you can train your own tokenizer with the following command:
1. In `TexTeller/src/models/tokenizer/train.py`, change `new_tokenizer.save_pretrained('./your_dir_name')` to your custom output directory
1. In `src/models/tokenizer/train.py`, change `new_tokenizer.save_pretrained('./your_dir_name')` to your custom output directory
> If you want to use a different vocabulary size (default is 15k tokens), you need to change the `VOCAB_SIZE` variable in `TexTeller/src/models/globals.py`
> If you want to use a different vocabulary size (default 15K), you need to change the `VOCAB_SIZE` variable in `src/models/globals.py`
>
2. **In the `TexTeller/src` directory**, run the following command:
2. **In the `src/` directory**, run the following command:
```bash
python -m models.tokenizer.train
@@ -155,29 +243,25 @@ If you are using a different dataset, you might need to retrain the tokenizer to
### Training the Model
1. Modify `num_processes` in `src/train_config.yaml` to match the number of GPUs available for training (default is 1).
2. In the `TexTeller/src` directory, run the following command:
2. In the `src/` directory, run the following command:
```bash
accelerate launch --config_file ./train_config.yaml -m models.ocr_model.train.train
```
You can set your own tokenizer and checkpoint paths in `TexTeller/src/models/ocr_model/train/train.py` (refer to `train.py` for more information). If you are using the same architecture and vocabulary as TexTeller, you can also fine-tune TexTeller's default weights with your own dataset.
You can set your own tokenizer and checkpoint paths in `src/models/ocr_model/train/train.py` (refer to `train.py` for more information). If you are using the same architecture and vocabulary as TexTeller, you can also fine-tune TexTeller's default weights with your own dataset.
In `TexTeller/src/globals.py` and `TexTeller/src/models/ocr_model/train/train_args.py`, you can change the model's architecture and training hyperparameters.
In `src/globals.py` and `src/models/ocr_model/train/train_args.py`, you can change the model's architecture and training hyperparameters.
> [!NOTE]
> Our training scripts use the [Hugging Face Transformers](https://github.com/huggingface/transformers) library, so you can refer to their [documentation](https://huggingface.co/docs/transformers/v4.32.1/main_classes/trainer#transformers.TrainingArguments) for more details and configurations on training parameters.
## 🚧 Limitations
* Does not support scanned images
* Does not support handwritten formulas
## 📅 Plans
- [X] ~~Train the model with a larger dataset (7.5M samples, coming soon)~~
- [ ] Recognition of scanned images
- [ ] Support for English and Chinese scenarios
- [X] ~~Train the model with a larger dataset~~
- [X] ~~Recognition of scanned images~~
- [X] ~~Support for English and Chinese scenarios~~
- [X] ~~Handwritten formulas support~~
- [ ] PDF document recognition
- [ ] Inference acceleration
- [ ] ...
@@ -186,9 +270,6 @@ In `TexTeller/src/globals.py` and `TexTeller/src/models/ocr_model/train/train_ar
[![Stargazers over time](https://starchart.cc/OleehyO/TexTeller.svg?variant=adaptive)](https://starchart.cc/OleehyO/TexTeller)
## 💖 Acknowledgments
Thanks to [LaTeX-OCR](https://github.com/lukas-blecher/LaTeX-OCR), which brought me a lot of inspiration, and [im2latex-100K](https://zenodo.org/records/56198#.V2px0jXT6eA), which enriched our dataset.
## 👥 Contributors

assets/README_zh.md

@@ -4,31 +4,105 @@
<h1>
<img src="./fire.svg" width=30, height=30>
𝚃𝚎𝚡𝚃𝚎𝚕𝚕𝚎𝚛
<img src="./fire.svg" width=30, height=30>
<img src="./fire.svg" width=30, height=30>
</h1>
<p align="center">
🤗 <a href="https://huggingface.co/OleehyO/TexTeller">Hugging Face</a>
🤗 <a href="https://huggingface.co/OleehyO/TexTeller"> Hugging Face </a>
</p>
<!-- <p align="center">
<img src="./web_demo.gif" alt="TexTeller_demo" width=800>
</p> -->
[![](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/OleehyO/TexTeller/issues)
[![](https://img.shields.io/badge/Data-Texteller1.0-brightgreen.svg)](https://huggingface.co/datasets/OleehyO/latex-formulas)
[![](https://img.shields.io/badge/Weights-Texteller3.0-yellow.svg)](https://huggingface.co/OleehyO/TexTeller)
</div>
https://github.com/OleehyO/TexTeller/assets/56267907/fb17af43-f2a5-47ce-ad1d-101db5fd7fbb
<!-- <p align="center">
TexTeller is an end-to-end formula recognition model based on ViT, capable of converting images into the corresponding LaTeX formulas.
<a href="https://opensource.org/licenses/Apache-2.0">
<img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg" alt="License">
</a>
<a href="https://github.com/OleehyO/TexTeller/issues">
<img src="https://img.shields.io/badge/Maintained%3F-yes-green.svg" alt="Maintenance">
</a>
<a href="https://github.com/OleehyO/TexTeller/pulls">
<img src="https://img.shields.io/badge/Contributions-welcome-brightgreen.svg?style=flat" alt="Contributions welcome">
</a>
<a href="https://huggingface.co/datasets/OleehyO/latex-formulas">
<img src="https://img.shields.io/badge/Data-Texteller1.0-brightgreen.svg" alt="Data">
</a>
<a href="https://huggingface.co/OleehyO/TexTeller">
<img src="https://img.shields.io/badge/Weights-Texteller3.0-yellow.svg" alt="Weights">
</a>
TexTeller was trained on 7.5M image-formula pairs (the dataset is available [here](https://huggingface.co/datasets/OleehyO/latex-formulas)). Compared to [LaTeX-OCR](https://github.com/lukas-blecher/LaTeX-OCR), which used a 100K dataset, TexTeller has **stronger generalization** and **higher accuracy**, covering most use cases (**except scanned images and handwritten formulas**).
</p> -->
> If you find this project helpful, please don't forget to give it a star ⭐
https://github.com/OleehyO/TexTeller/assets/56267907/532d1471-a72e-4960-9677-ec6c19db289f
TexTeller is an end-to-end formula recognition model based on [TrOCR](https://arxiv.org/abs/2109.10282), capable of converting images into the corresponding LaTeX formulas.
TexTeller was trained on **80M** image-formula pairs (the previous dataset is available [here](https://huggingface.co/datasets/OleehyO/latex-formulas)). Compared to [LaTeX-OCR](https://github.com/lukas-blecher/LaTeX-OCR), which used a 100K dataset, TexTeller has **stronger generalization** and **higher accuracy**, covering most use cases.
> [!NOTE]
> If you would like to provide feedback or suggestions for this project, feel free to start a discussion in the [Discussions section](https://github.com/OleehyO/TexTeller/discussions).
>
> Also, if you find this project helpful, please don't forget to give it a star ⭐🙏
---
<table>
<tr>
<td>
## 🔖 Table of Contents
- [Change Log](#-change-log)
- [Getting Started](#-getting-started)
- [FAQ: Cannot Connect to Hugging Face](#-faq-cannot-connect-to-hugging-face)
- [Web Demo](#-web-demo)
- [Formula Detection](#-formula-detection)
- [API Usage](#-api-usage)
- [Training](#-training)
- [Plans](#-plans)
- [Stargazers over time](#-stargazers-over-time)
- [Contributors](#-contributors)
</td>
<td>
<div align="center">
<figure>
<img src="cover.png" width="800">
<figcaption>
<p>Images that can be recognized by TexTeller</p>
</figcaption>
</figure>
<div>
<p>
Thanks to the
<i>
Super Computing Platform of Beijing University of Posts and Telecommunications
</i>
for supporting this work 😘
</p>
</div>
</div>
</td>
</tr>
</table>
## 🔄 Change Log
* 📮[2024-05-02] Support for mixed Chinese-English formula recognition (Beta).
* 📮[2024-04-12] Trained a **formula detection model**, adding formula detection plus recognition over entire documents (whole-image inference)!
* 📮[2024-03-25] TexTeller2.0 released! The training data for TexTeller2.0 was increased to 7.5M (**~15x more** than TexTeller1.0, with improved data quality). TexTeller2.0 demonstrated **superior performance** on the test set, especially in recognizing rare symbols, complex multi-line formulas, and matrices.
- 📮[2024-06-06] **TexTeller3.0** released! The training dataset has been increased to **80M** (**10x more** than TexTeller2.0, with improved data diversity). The new TexTeller has the following features:
  - Supports scanned images, handwritten formulas, and mixed Chinese-English formulas.
  - General Chinese and English recognition ability on printed images.
- 📮[2024-05-02] Support for **paragraph recognition**.
- 📮[2024-04-12] **Formula detection model** released!
- 📮[2024-03-25] TexTeller2.0 released! The training data for TexTeller2.0 was increased to 7.5M (~15x more than TexTeller1.0, with improved data quality). TexTeller2.0 demonstrated superior performance on the test set, especially in recognizing rare symbols, complex multi-line formulas, and matrices.
> [Here](./test.pdf) are more test images and a horizontal comparison of various recognition models.
@@ -46,20 +120,42 @@ TexTeller was trained on 7.5M image-formula pairs (the dataset is available [her
pip install texteller
```
3. Enter the `TexTeller/src` directory and run the following command in the terminal to start inference:
3. Enter the `src/` directory and run the following command in the terminal to start inference:
```bash
python inference.py -img "/path/to/image.{jpg,png}"
# use --inference-mode option to enable GPU(cuda or mps) inference
#+e.g. python inference.py -img "img.jpg" --inference-mode cuda
# use -mix option to enable mixed text and formula recognition
#+e.g. python inference.py -img "img.jpg" -mix
```
> The first time you run it, the required checkpoints will be downloaded from Hugging Face
> [!IMPORTANT]
> If you are using mixed text and formula recognition, you need to [download the formula detection model weights](https://github.com/OleehyO/TexTeller/blob/main/assets/README_zh.md#%E4%B8%8B%E8%BD%BD%E6%9D%83%E9%87%8D)
### Paragraph Recognition
As shown in the demo video, TexTeller can also recognize entire text paragraphs. Although TexTeller has general text OCR capabilities, we still recommend using paragraph recognition for better results:
1. Download the formula detection model weights to the `src/models/det_model/model/` directory [[link](https://huggingface.co/TonyLee1256/texteller_det/resolve/main/rtdetr_r50vd_6x_coco.onnx?download=true)]
2. Run `inference.py` in the `src/` directory with the `-mix` option; the results will be output in Markdown format.
```bash
python inference.py -img "/path/to/image.{jpg,png}" -mix
```
By default, TexTeller uses lightweight [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) models to recognize Chinese and English text; you can try larger models for better recognition results:
| Checkpoints | Description | Size |
|-------------|-------------|------|
| [ch_PP-OCRv4_det.onnx](https://huggingface.co/OleehyO/paddleocrv4.onnx/resolve/main/ch_PP-OCRv4_det.onnx?download=true) | **Default detection model**, supports Chinese and English text detection | 4.70M |
| [ch_PP-OCRv4_server_det.onnx](https://huggingface.co/OleehyO/paddleocrv4.onnx/resolve/main/ch_PP-OCRv4_server_det.onnx?download=true) | High-accuracy model, supports Chinese and English text detection | 115M |
| [ch_PP-OCRv4_rec.onnx](https://huggingface.co/OleehyO/paddleocrv4.onnx/resolve/main/ch_PP-OCRv4_rec.onnx?download=true) | **Default recognition model**, supports Chinese and English text recognition | 10.80M |
| [ch_PP-OCRv4_server_rec.onnx](https://huggingface.co/OleehyO/paddleocrv4.onnx/resolve/main/ch_PP-OCRv4_server_rec.onnx?download=true) | High-accuracy model, supports Chinese and English text recognition | 90.60M |
Place the recognition/detection model weights in the `det/` or `rec/` directory under `src/models/third_party/paddleocr/checkpoints/`, then rename the file to `default_model.onnx`.
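For example, a sketch of switching to the high-accuracy models, using the links from the table above:

```bash
# Fetch the server-grade models and install them as the default weights
cd src/models/third_party/paddleocr/checkpoints/
mkdir -p det rec
wget -O det/default_model.onnx "https://huggingface.co/OleehyO/paddleocrv4.onnx/resolve/main/ch_PP-OCRv4_server_det.onnx?download=true"
wget -O rec/default_model.onnx "https://huggingface.co/OleehyO/paddleocrv4.onnx/resolve/main/ch_PP-OCRv4_server_rec.onnx?download=true"
```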
> [!NOTE]
> Paragraph recognition can only recognize the content of a document; it cannot restore the document's structure.
## ❓ FAQ: Cannot Connect to Hugging Face
@@ -81,7 +177,7 @@ TexTeller was trained on 7.5M image-formula pairs (the dataset is available [her
--local-dir-use-symlinks False
```
3. Upload the directory containing the weights to your remote server, then change `REPO_NAME = 'OleehyO/TexTeller'` in `TexTeller/src/models/ocr_model/model/TexTeller.py` to `REPO_NAME = 'your/dir/path'`
3. Upload the directory containing the weights to your remote server, then change `REPO_NAME = 'OleehyO/TexTeller'` in `src/models/ocr_model/model/TexTeller.py` to `REPO_NAME = 'your/dir/path'`
<!-- If you also want to enable evaluation during training, you need to download the metric script in advance and upload it to the remote server:
@@ -99,7 +195,7 @@ TexTeller was trained on 7.5M image-formula pairs (the dataset is available [her
## 🌐 Web Demo
Enter the `TexTeller/src` directory and run the following command:
Enter the `src/` directory and run the following command:
```bash
./start_web.sh
@@ -108,45 +204,39 @@ TexTeller was trained on 7.5M image-formula pairs (the dataset is available [her
Enter `http://localhost:8501` in your browser to see the web demo.
> [!NOTE]
> For Windows users, please run the `start_web.bat` file.
> For Windows users, please run the `start_web.bat` file
## 🧠 Whole-Image Inference
## 🔍 Formula Detection
TexTeller also supports **formula detection + recognition** on whole images: formulas are detected across the entire image and then recognized in batch
### Download Weights
Download the model weights from [this link](https://huggingface.co/TonyLee1256/texteller_det/resolve/main/rtdetr_r50vd_6x_coco.onnx?download=true) and place them in `src/models/det_model/model`
> TexTeller's formula detection model was trained on a total of 11,867 images: 3,415 images of Chinese textbook material (130+ layouts) and 8,272 images from the [IBEM dataset](https://zenodo.org/records/4757865).
### Formula Detection
Run the following command in the `TexTeller/src` directory:
```bash
python infer_det.py
```
This detects all formulas in the whole image; the results are saved in `TexTeller/src/subimages`
TexTeller's formula detection model was trained on 3,415 images of Chinese textbook material (130+ layouts) and 8,272 images from the [IBEM dataset](https://zenodo.org/records/4757865); it supports **formula detection** across entire images.
<div align="center">
<img src="det_rec.png" width=400>
<img src="det_rec.png" width=250>
</div>
### Batch Formula Recognition
1. Download the formula detection model weights to the `src/models/det_model/model/` directory [[link](https://huggingface.co/TonyLee1256/texteller_det/resolve/main/rtdetr_r50vd_6x_coco.onnx?download=true)]
After **formula detection**, run the following command in the `TexTeller/src` directory:
2. Run the following command in the `src/` directory; the results will be saved in `src/subimages/`:
```bash
python infer_det.py
```
<details>
<summary>Going further: batch formula recognition</summary>
After **formula detection**, run the following command in the `src/` directory:
```shell
python rec_infer_from_crop_imgs.py
```
Based on the results of the previous formula detection step, this performs batch recognition on all the cropped formulas and saves the results as txt files in `TexTeller/src/results`.
Based on the results of the previous formula detection step, this performs batch recognition on all the cropped formulas and saves the results as txt files in `src/results/`.
</details>
## 📡 API Usage
We use [ray serve](https://github.com/ray-project/ray) to provide a TexTeller API that you can use to integrate TexTeller into your own projects. To start the server, first enter the `TexTeller/src` directory and run the following command:
We use [ray serve](https://github.com/ray-project/ray) to provide a TexTeller API that you can use to integrate TexTeller into your own projects. To start the server, first enter the `src/` directory and run the following command:
```bash
python server.py
@@ -170,7 +260,7 @@ python server.py
### Dataset
We provide an example dataset in the `TexTeller/src/models/ocr_model/train/dataset` directory; you can place your own images in the `images` directory and annotate each image with its corresponding formula in `formulas.jsonl`.
We provide an example dataset in the `src/models/ocr_model/train/dataset/` directory; you can place your own images in the `images` directory and annotate each image with its corresponding formula in `formulas.jsonl`.
After preparing your dataset, change the **`DIR_URL` variable to your own dataset's path** in `**/train/dataset/loader.py`.
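For reference, a hypothetical `formulas.jsonl` entry (the key names here are placeholders; follow the example dataset for the actual schema):

```bash
# Append one annotation (hypothetical key names)
echo '{"img_name": "0001.png", "formula": "E = mc^2"}' >> formulas.jsonl
```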
@@ -178,11 +268,11 @@ python server.py
If you use a different dataset, you may need to retrain the tokenizer to get a different vocabulary. After configuring your dataset, you can train your own tokenizer with the following commands:
1. In `TexTeller/src/models/tokenizer/train.py`, change `new_tokenizer.save_pretrained('./your_dir_name')` to your custom output directory
1. In `src/models/tokenizer/train.py`, change `new_tokenizer.save_pretrained('./your_dir_name')` to your custom output directory
> Note: if you want to use a vocabulary of a different size (15K tokens by default), you need to change the `VOCAB_SIZE` variable in `TexTeller/src/models/globals.py`
> Note: if you want to use a vocabulary of a different size (15K tokens by default), you need to change the `VOCAB_SIZE` variable in `src/models/globals.py`
2. **In the `TexTeller/src` directory**, run the following command:
2. **In the `src/` directory**, run the following command:
```bash
python -m models.tokenizer.train
@@ -192,30 +282,23 @@ python server.py
1. Set `num_processes` in `src/train_config.yaml` to the number of GPUs used for training (default is 1)
2. In the `TexTeller/src` directory, run the following command:
2. In the `src/` directory, run the following command:
```bash
accelerate launch --config_file ./train_config.yaml -m models.ocr_model.train.train
```
You can set your own tokenizer and checkpoint paths in `TexTeller/src/models/ocr_model/train/train.py` (see `train.py` for details). If you use the same architecture and vocabulary as TexTeller, you can also fine-tune TexTeller's default weights with your own dataset.
In `TexTeller/src/globals.py` and `TexTeller/src/models/ocr_model/train/train_args.py`, you can change the model's architecture and the training hyperparameters.
You can set your own tokenizer and checkpoint paths in `src/models/ocr_model/train/train.py` (see `train.py` for details). If you use the same architecture and vocabulary as TexTeller, you can also fine-tune TexTeller's default weights with your own dataset.
> [!NOTE]
> Our training scripts use the [Hugging Face Transformers](https://github.com/huggingface/transformers) library, so you can refer to their [documentation](https://huggingface.co/docs/transformers/v4.32.1/main_classes/trainer#transformers.TrainingArguments) for more details and configuration of the training parameters.
## 🚧 Limitations
* Does not support scanned images
* Does not support handwritten formulas
* Does not support PDF document recognition
## 📅 Plans
- [X] ~~Train the model with a larger dataset~~
- [ ] Recognition of scanned images
- [ ] Support for Chinese and English scenarios
- [X] ~~Recognition of scanned images~~
- [X] ~~Support for Chinese and English scenarios~~
- [X] ~~Handwritten formula recognition~~
- [ ] PDF document recognition
- [ ] Inference acceleration
@@ -223,10 +306,6 @@ python server.py
[![Stargazers over time](https://starchart.cc/OleehyO/TexTeller.svg?variant=adaptive)](https://starchart.cc/OleehyO/TexTeller)
## 💖 Acknowledgments
Thanks to [LaTeX-OCR](https://github.com/lukas-blecher/LaTeX-OCR), which brought me a lot of inspiration, and [im2latex-100K](https://zenodo.org/records/56198#.V2px0jXT6eA), which enriched our dataset.
## 👥 Contributors
<a href="https://github.com/OleehyO/TexTeller/graphs/contributors">

Binary file not shown (3.4 MiB before and after).

assets/scss.png (new file, 137 KiB): binary file not shown.