update README

This commit is contained in:
三洋三洋
2024-04-12 06:13:58 +00:00
parent 9e8b15ef3a
commit 78d29d49ef
4 changed files with 107 additions and 82 deletions


@@ -27,7 +27,7 @@ TexTeller was trained with ~~550K~~7.5M image-formula pairs (dataset available [
* 📮[2024-03-25] TexTeller 2.0 released! The training data for TexTeller 2.0 has been increased to 7.5M (about **15 times more** than TexTeller 1.0 and also improved in data quality). The trained TexTeller 2.0 demonstrated **superior performance** in the test set, especially in recognizing rare symbols, complex multi-line formulas, and matrices.
> More test images and a side-by-side comparison of recognition models from different companies are available [here](./assets/test.pdf).
* 📮[2024-04-11] Added whole-image inference; the only extra requirement is the onnxruntime library. We manually annotated formulas in 3,415 Chinese textbook images and used 8,272 formula images from the IBEM English paper detection dataset to train a formula object detection model based on the RT-DETR-R50 architecture, then exported the trained model to the ONNX format. You can now input an image and recognize all of its formulas in one pass.
* 📮[2024-04-12] Trained a **formula detection model**, thereby enhancing the capability to detect and recognize formulas in entire documents (whole-image inference)!
## 🔑 Prerequisites
@@ -82,21 +82,39 @@ Enter `http://localhost:8501` in a browser to view the web demo.
> [!NOTE]
> If you are a Windows user, please run the `start_web.bat` file instead.
## Inference on Whole Images
### Download Weights
The ONNX model trained on 8,272 images of English papers from the IBEM dataset (https://zenodo.org/records/4757865):
https://huggingface.co/TonyLee1256/texteller_det/resolve/main/rtdetr_r50vd_6x_coco_trained_on_IBEM_en_papers.onnx?download=true
## 🧠 Full Image Inference
The ONNX model trained on 2,560 Chinese textbook images (100+ layouts):
https://huggingface.co/TonyLee1256/texteller_det/blob/main/rtdetr_r50vd_6x_coco_trained_on_cn_textbook.onnx
TexTeller also supports **formula detection and recognition** on full images: it first detects every formula in the image, then recognizes the detected formulas in a batch.
### Download Weights
Formula detection for English documents [[link](https://huggingface.co/TonyLee1256/texteller_det/resolve/main/rtdetr_r50vd_6x_coco_trained_on_IBEM_en_papers.onnx?download=true)]: trained on 8,272 images from the [IBEM dataset](https://zenodo.org/records/4757865).
Formula detection for Chinese documents [[link](https://huggingface.co/TonyLee1256/texteller_det/blob/main/rtdetr_r50vd_6x_coco_trained_on_cn_textbook.onnx)]: trained on 2,560 Chinese textbook images (100+ layouts).
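The two weight links above use different Hugging Face URL forms: the first is a `/resolve/` URL (direct file download), while the second is a `/blob/` URL (the file's web page). A minimal sketch, assuming you want to fetch either file from a script, is to normalize both to the `/resolve/` form before downloading. The helper names `to_resolve_url` and `download_weights` are hypothetical and not part of TexTeller.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlretrieve


def to_resolve_url(url: str) -> str:
    """Turn a Hugging Face `/blob/` page URL into a direct `/resolve/` file URL."""
    parts = urlsplit(url)
    # Replace only the first "/blob/" path segment and drop any query string.
    path = parts.path.replace("/blob/", "/resolve/", 1)
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def download_weights(url: str, dest: str) -> str:
    """Download one weights file to `dest` (hypothetical helper; needs network)."""
    local_path, _headers = urlretrieve(to_resolve_url(url), dest)
    return local_path
```

For example, `download_weights(<Chinese textbook link>, "cn_textbook.onnx")` would save the file next to the script; where `infer_det.py` expects the weights to live is not stated here, so check the script before choosing `dest`.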
### Formula Detection
Run infer_det.py in the TexTeller/src directory.
This will detect all formulas in the input image, draw the detection results on the entire image and save it, and crop and save each detected formula as a separate image.
Run the following command in the `TexTeller/src` directory:
```bash
python infer_det.py
```
This detects all formulas in the full image and saves the results in `TexTeller/src/subimages`.
<div align="center">
<img src="./assets/det_rec.png" width=400>
</div>
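The README does not show how `infer_det.py` turns detector output into cropped sub-images, so the following is only an illustrative sketch of that kind of post-processing. It assumes each detection is an `(x1, y1, x2, y2, score)` tuple in pixel coordinates and uses a score threshold of 0.5; the function name, the box format, and the threshold are all assumptions, not TexTeller's actual code.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def boxes_to_crops(
    detections: List[Tuple[float, float, float, float, float]],
    image_size: Tuple[int, int],
    min_score: float = 0.5,  # assumed threshold, not taken from TexTeller
) -> List[Box]:
    """Keep confident detections and clamp them to the image bounds.

    Each detection is (x1, y1, x2, y2, score); image_size is (width, height).
    """
    width, height = image_size
    crops: List[Box] = []
    for x1, y1, x2, y2, score in detections:
        if score < min_score:
            continue  # drop low-confidence boxes
        left, top = max(0, int(x1)), max(0, int(y1))
        right, bottom = min(width, int(x2)), min(height, int(y2))
        if right > left and bottom > top:  # skip boxes that end up empty
            crops.append((left, top, right, bottom))
    # Order crops top-to-bottom, then left-to-right.
    crops.sort(key=lambda box: (box[1], box[0]))
    return crops
```

Each returned box could then be passed to an image library's crop call to save one file per formula.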
### Batch Formula Recognition
Run rec_infer_from_crop_imgs.py.
Based on the formula detection results from the previous step, this script will perform batch recognition on all cropped formula images and save the recognition results as text files.
After **formula detection**, run the following command in the `TexTeller/src` directory:
```shell
python rec_infer_from_crop_imgs.py
```
This will use the results of the previous formula detection to perform batch recognition on all cropped formulas, saving the recognition results as txt files in `TexTeller/src/results`.
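The batch step described above amounts to a loop over the cropped images that writes one text file per formula. A minimal sketch of that loop follows, with the recognizer passed in as a callable so the example stays self-contained; `recognize_crops` and its parameters are hypothetical names, and the real `rec_infer_from_crop_imgs.py` may be organized differently.

```python
from pathlib import Path
from typing import Callable


def recognize_crops(
    crop_dir: Path,
    out_dir: Path,
    recognize: Callable[[Path], str],
) -> int:
    """Run `recognize` on every PNG/JPG in crop_dir; write one .txt per image."""
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for image_path in sorted(crop_dir.iterdir()):
        if image_path.suffix.lower() not in {".png", ".jpg", ".jpeg"}:
            continue  # ignore non-image files
        latex = recognize(image_path)
        (out_dir / f"{image_path.stem}.txt").write_text(latex, encoding="utf-8")
        count += 1
    return count
```

With the directories named in this README, a call might look like `recognize_crops(Path("subimages"), Path("results"), my_model_fn)`, where `my_model_fn` stands in for whatever function wraps the TexTeller recognition model.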
## 📡 API Usage


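@@ -25,10 +25,10 @@ TexTeller was trained with ~~550K~~7.5M image-formula pairs (dataset available [
## 🔄 Change Log
* 📮[2024-03-25] TexTeller 2.0 released! The training data for TexTeller 2.0 has been increased to 7.5M (**~15 times more** than TexTeller 1.0, with improved data quality). The trained TexTeller 2.0 demonstrated **superior performance** on the test set, especially in recognizing rare symbols, complex multi-line formulas, and matrices.
> More test images and a side-by-side comparison of recognition models from different companies are available [here](./test.pdf).
* 📮[2024-04-11] Added whole-image inference; the only extra requirement is the onnxruntime library. We manually annotated formulas in 3,415 Chinese textbook images and used 8,272 formula images from the IBEM English paper formula detection dataset to train a formula object detection model based on RT-DETR-R50, then exported the trained model to the ONNX format. You can now input an image and recognize all of its formulas in one pass.
>
* 📮[2024-04-12] Trained a **formula detection model**, adding formula detection + formula recognition over entire documents (whole-image inference)!
## 🔑 Prerequisites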
@@ -46,15 +46,12 @@ python=3.10
```bash
git clone https://github.com/OleehyO/TexTeller
```
2. [Install PyTorch](https://pytorch.org/get-started/locally/#start-locally)
3. Install the project's dependencies:
```bash
pip install -r requirements.txt
```
4. Enter the `TexTeller/src` directory and run the following command in the terminal to start inference:
```bash
@@ -75,13 +72,11 @@ python=3.10
```bash
pip install -U "huggingface_hub[cli]"
```
2. On a machine that can connect to Hugging Face, download the model weights:
```bash
huggingface-cli download OleehyO/TexTeller --include "*.json" "*.bin" "*.txt" --repo-type model --local-dir "your/dir/path"
```
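3. Upload the directory containing the weights to the remote server, then in `TexTeller/src/models/ocr_model/model/TexTeller.py` change `REPO_NAME = 'OleehyO/TexTeller'` to `REPO_NAME = 'your/dir/path'`
If you also want to enable evaluation while training the model, you need to download the metric script in advance and upload it to the remote server: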
@@ -91,7 +86,6 @@ python=3.10
```bash
huggingface-cli download evaluate-metric/google_bleu --repo-type space --local-dir "your/dir/path"
```
2. Upload this directory to the remote server, and in `TexTeller/src/models/ocr_model/utils/metrics.py` change `evaluate.load('google_bleu')` to `evaluate.load('your/dir/path/google_bleu.py')`
## 🌐 Web Demo
@@ -110,22 +104,39 @@ python=3.10
> [!NOTE]
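> If you are a Windows user, please run the `start_web.bat` file.
## Inference on Whole Images
## 🧠 Full Image Inference
TexTeller also supports **formula detection + formula recognition** on full images: it first detects every formula in the image, then recognizes the detected formulas in a batch.
### Download Weights
The ONNX model trained on 8,272 images from the IBEM dataset (https://zenodo.org/records/4757865):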
https://huggingface.co/TonyLee1256/texteller_det/resolve/main/rtdetr_r50vd_6x_coco_trained_on_IBEM_en_papers.onnx?download=true
The ONNX model trained on 2,560 Chinese textbook images (100+ layouts):
https://huggingface.co/TonyLee1256/texteller_det/blob/main/rtdetr_r50vd_6x_coco_trained_on_cn_textbook.onnx
Formula detection for English documents [[link](https://huggingface.co/TonyLee1256/texteller_det/resolve/main/rtdetr_r50vd_6x_coco_trained_on_IBEM_en_papers.onnx?download=true)]: trained on 8,272 images from the [IBEM dataset](https://zenodo.org/records/4757865)
Formula detection for Chinese documents [[link](https://huggingface.co/TonyLee1256/texteller_det/blob/main/rtdetr_r50vd_6x_coco_trained_on_cn_textbook.onnx)]: trained on 2,560 Chinese textbook images (100+ layouts)
### Formula Detection
cd TexTeller/src
infer_det.py
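This detects all formulas in the whole image, draws the detection results on the whole image and saves it, and crops and saves each detected object separately.
Run the following command in the `TexTeller/src` directory: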
```bash
python infer_det.py
```
This detects all formulas in the whole image and saves the results in `TexTeller/src/subimages`.
<div align="center">
<img src="det_rec.png" width=400>
</div>
### Batch Formula Recognition
After **formula detection**, run the following command in the `TexTeller/src` directory:
```shell
python rec_infer_from_crop_imgs.py
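Based on the formula detection results from the previous step, this performs batch recognition on all cropped formulas and saves the recognition results as txt files.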
```
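This uses the results of the previous formula detection step to perform batch recognition on all cropped formulas, saving the recognition results as txt files in `TexTeller/src/results`.
## 📡 API Usage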
@@ -138,7 +149,7 @@ python server.py # default settings
You can pass the following arguments to `server.py` to change the server's inference settings (e.g. `python server.py --use_gpu` to enable GPU inference):
| Parameter | Description |
| --- | --- |
| ---------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
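| `-ckpt` | Path to the weights file; *defaults to TexTeller's pretrained weights*. |
| `-tknz` | Path to the tokenizer; *defaults to TexTeller's tokenizer*. |
| `-port` | The server's service port; *defaults to 8000*. |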
@@ -164,8 +175,9 @@ python server.py # default settings
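If you use a different dataset, you may need to retrain the tokenizer to obtain a different vocabulary. After configuring your dataset, you can train your own tokenizer with the following command:
1. In `TexTeller/src/models/tokenizer/train.py`, change `new_tokenizer.save_pretrained('./your_dir_name')` to your custom output directory
> Note: If you want a vocabulary of a different size (default: 10K tokens), go to `TexTeller/src/models/globals.py` and change the`VOCAB_SIZE` variable
> Note: If you want a vocabulary of a different size (default: 10K tokens), go to `TexTeller/src/models/globals.py` and change the `VOCAB_SIZE` variable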
>
2. **In the `TexTeller/src` directory**, run the following command:
```bash
@@ -190,19 +202,14 @@ python -m models.ocr_model.train.train
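## 🚧 Limitations
* Scanned images and PDF documents are not supported
* Handwritten formulas are not supported
## 📅 Plans
- [x] ~~Train the model with a larger dataset (7.5M samples, coming soon)~~
- [X] ~~Train the model with a larger dataset (7.5M samples, coming soon)~~
- [ ] Recognition of scanned images
- [ ] PDF document recognition + support for English and Chinese scenarios
- [ ] Inference acceleration
- [ ] ...
## 💖 Acknowledgements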

BIN
assets/det_rec.png Normal file

Binary file not shown. Size: 919 KiB

Binary file not shown. Size: 484 KiB