update README
@@ -27,7 +27,7 @@ TexTeller was trained with ~~550K~~7.5M image-formula pairs (dataset available [

* 📮[2024-03-25] TexTeller 2.0 released! The training data for TexTeller 2.0 has been increased to 7.5M (about **15 times more** than TexTeller 1.0, with improved data quality as well). TexTeller 2.0 demonstrates **superior performance** on the test set, especially when recognizing rare symbols, complex multi-line formulas, and matrices.

> More test images, along with a side-by-side comparison of recognition models from different vendors, are available [here](./assets/test.pdf).

* 📮[2024-04-11] Added whole-image inference; just install the onnxruntime library in addition to enable the new feature! We manually annotated the formulas in 3,415 Chinese textbook images, used 8,272 formula images from the IBEM English paper detection dataset, trained a formula object detection model based on the RT-DETR-R50 architecture, and exported the trained model to ONNX format. This makes it possible to feed in an image and recognize all of its formulas in one pass.

* 📮[2024-04-12] Trained a **formula detection model**, adding the ability to detect and then recognize formulas across entire documents (whole-image inference)!

## 🔑 Prerequisites

@@ -82,21 +82,39 @@ Enter `http://localhost:8501` in a browser to view the web demo.

> [!NOTE]
> If you are a Windows user, please run the `start_web.bat` file instead.

## 🧠 Full Image Inference

TexTeller also supports **formula detection and recognition** on full images: formulas are first detected throughout the image and then recognized in a batch.

### Download Weights

English document formula detection [[link](https://huggingface.co/TonyLee1256/texteller_det/resolve/main/rtdetr_r50vd_6x_coco_trained_on_IBEM_en_papers.onnx?download=true)]: ONNX model trained on 8,272 images from the [IBEM dataset](https://zenodo.org/records/4757865) of English papers.

Chinese document formula detection [[link](https://huggingface.co/TonyLee1256/texteller_det/blob/main/rtdetr_r50vd_6x_coco_trained_on_cn_textbook.onnx)]: ONNX model trained on 2,560 Chinese textbook images (100+ layouts).

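If you prefer to fetch these weights from a script rather than through the links above, a minimal sketch using the `huggingface_hub` Python package (an assumption; any download method works, and the target directory below is arbitrary):

```python
# Sketch: download both detection checkpoints from the Hugging Face Hub.
# Repo and file names are taken from the links above; "./det_weights" is a
# hypothetical local directory.
from huggingface_hub import hf_hub_download

for filename in (
    "rtdetr_r50vd_6x_coco_trained_on_IBEM_en_papers.onnx",  # English papers (IBEM)
    "rtdetr_r50vd_6x_coco_trained_on_cn_textbook.onnx",     # Chinese textbooks
):
    path = hf_hub_download(
        repo_id="TonyLee1256/texteller_det",
        filename=filename,
        local_dir="./det_weights",
    )
    print("saved to", path)
```
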
### Formula Detection

Run the following command in the `TexTeller/src` directory:

```bash
python infer_det.py
```

This detects all formulas in the input image, draws the detection results on the full image and saves it, and crops each detected formula into a separate image; the results are saved in `TexTeller/src/subimages`.

<div align="center">
<img src="./assets/det_rec.png" width=400>
</div>

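To sanity-check a downloaded detection model outside of `infer_det.py`, here is a minimal sketch that loads the ONNX weights with onnxruntime and prints the expected input/output signature (the file name matches the English-paper checkpoint above; pre- and post-processing are handled by `infer_det.py` and are not reproduced here):

```python
# Sketch: open the RT-DETR detection checkpoint with onnxruntime and inspect
# its input/output tensors. Assumes the .onnx file sits in the current directory.
import onnxruntime as ort

session = ort.InferenceSession(
    "rtdetr_r50vd_6x_coco_trained_on_IBEM_en_papers.onnx",
    providers=["CPUExecutionProvider"],  # swap in "CUDAExecutionProvider" for GPU
)

for tensor in session.get_inputs():
    print("input :", tensor.name, tensor.shape, tensor.type)
for tensor in session.get_outputs():
    print("output:", tensor.name, tensor.shape, tensor.type)
```
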
### Batch Formula Recognition

After **formula detection**, run the following command in the `TexTeller/src` directory:

```shell
python rec_infer_from_crop_imgs.py
```

Based on the formula detection results from the previous step, this performs batch recognition on all cropped formula images and saves the recognition results as txt files in `TexTeller/src/results`.

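For scripting, the two steps above can also be chained; the sketch below simply shells out to the documented commands and then lists the produced txt files (paths are the defaults named in this section, and the repository is assumed to be checked out in the working directory):

```python
# Sketch: run detection and batch recognition back to back, then print the results.
import pathlib
import subprocess

src_dir = pathlib.Path("TexTeller/src")  # adjust to your checkout location

subprocess.run(["python", "infer_det.py"], cwd=src_dir, check=True)
subprocess.run(["python", "rec_infer_from_crop_imgs.py"], cwd=src_dir, check=True)

# One txt file per recognized formula is assumed; only the output location is documented.
for txt_file in sorted((src_dir / "results").glob("*.txt")):
    print(txt_file.name, "->", txt_file.read_text(encoding="utf-8").strip())
```
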
## 📡 API Usage
@@ -25,10 +25,10 @@ TexTeller was trained with ~~550K~~7.5M image-formula pairs (dataset available [

## 🔄 Change Log

* 📮[2024-03-25] TexTeller 2.0 released! The training data for TexTeller 2.0 has been increased to 7.5M (**~15 times more** than TexTeller 1.0, with improved data quality as well). The trained TexTeller 2.0 shows **superior performance** on the test set, especially for rare symbols, complex multi-line formulas, and matrices.

> More test images, along with a side-by-side comparison of recognition models from different vendors, are available [here](./test.pdf).

* 📮[2024-04-11] Added whole-image inference; just install the onnxruntime library in addition to enable the new feature! We manually annotated the formulas in 3,415 Chinese textbook images, used 8,272 formula images from the IBEM English paper formula detection dataset, trained a formula object detection model based on RT-DETR-R50, and exported the trained model to ONNX format. This makes it possible to feed in an image and recognize all of its formulas in one pass.

* 📮[2024-04-12] Trained a **formula detection model**, adding formula detection plus recognition for entire documents (whole-image inference)!

## 🔑 Prerequisites

@@ -46,15 +46,12 @@ python=3.10

```bash
git clone https://github.com/OleehyO/TexTeller
```

2. [Install PyTorch](https://pytorch.org/get-started/locally/#start-locally)

3. Install this project's dependencies:

```bash
pip install -r requirements.txt
```

4. Go to the `TexTeller/src` directory and run the following command in a terminal to start inference:

```bash
@@ -75,13 +72,11 @@ python=3.10

```bash
pip install -U "huggingface_hub[cli]"
```

2. On a machine that can reach Hugging Face, download the model weights (a Python alternative is sketched after this list):

```bash
huggingface-cli download OleehyO/TexTeller --include "*.json" "*.bin" "*.txt" --repo-type model --local-dir "your/dir/path"
```

3. Upload the directory containing the weights to the remote server, then change `REPO_NAME = 'OleehyO/TexTeller'` in `TexTeller/src/models/ocr_model/model/TexTeller.py` to `REPO_NAME = 'your/dir/path'`.

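As a Python alternative to the CLI command in step 2, roughly the same filtered download can be done with the `huggingface_hub` package (a sketch; `your/dir/path` is the same placeholder used above, and the resulting directory is what `REPO_NAME` should point to in step 3):

```python
# Sketch: Python equivalent of the huggingface-cli command in step 2.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="OleehyO/TexTeller",
    repo_type="model",
    allow_patterns=["*.json", "*.bin", "*.txt"],  # mirrors the --include filters
    local_dir="your/dir/path",                    # placeholder; upload this directory afterwards
)
print("weights downloaded to", local_dir)
```
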

If you also want to enable evaluation during training, you need to download the metric script in advance and upload it to the remote server:

@@ -91,7 +86,6 @@ python=3.10

```bash
huggingface-cli download evaluate-metric/google_bleu --repo-type space --local-dir "your/dir/path"
```

2. Upload this directory to the remote server, and in `TexTeller/src/models/ocr_model/utils/metrics.py` change `evaluate.load('google_bleu')` to `evaluate.load('your/dir/path/google_bleu.py')` (see the sketch below).

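The one-line edit described in step 2 would look roughly as follows (a sketch; the variable name is illustrative and the surrounding code in `metrics.py` is not shown):

```python
# Sketch of the documented edit in TexTeller/src/models/ocr_model/utils/metrics.py.
import evaluate

# Before (fetches the metric from the Hugging Face Hub):
# google_bleu = evaluate.load('google_bleu')

# After (loads the locally downloaded metric script):
google_bleu = evaluate.load('your/dir/path/google_bleu.py')
```
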
## 🌐 Web Demo

@@ -110,22 +104,39 @@ python=3.10

> [!NOTE]
> For Windows users, please run the `start_web.bat` file instead.

## 🧠 Full Image Inference

TexTeller also supports **formula detection and recognition** on full images: formulas are first detected throughout the image and then recognized in a batch.

### Download Weights

English document formula detection [[link](https://huggingface.co/TonyLee1256/texteller_det/resolve/main/rtdetr_r50vd_6x_coco_trained_on_IBEM_en_papers.onnx?download=true)]: ONNX model trained on 8,272 images from the [IBEM dataset](https://zenodo.org/records/4757865) of English papers.

Chinese document formula detection [[link](https://huggingface.co/TonyLee1256/texteller_det/blob/main/rtdetr_r50vd_6x_coco_trained_on_cn_textbook.onnx)]: ONNX model trained on 2,560 Chinese textbook images (100+ layouts).

### Formula Detection

Run the following command in the `TexTeller/src` directory:

```bash
python infer_det.py
```

This detects all formulas in the whole image, draws the detection results on the image and saves it, and crops each detected formula into a separate image; the results are saved in `TexTeller/src/subimages`.

<div align="center">
<img src="det_rec.png" width=400>
</div>

### Batch Formula Recognition

After **formula detection**, run the following command in the `TexTeller/src` directory:

```shell
python rec_infer_from_crop_imgs.py
```

Based on the formula detection results from the previous step, this performs batch recognition on all cropped formulas and saves the recognition results as txt files in `TexTeller/src/results`.

## 📡 API Usage

@@ -138,7 +149,7 @@ python server.py # default settings

You can pass the following arguments to `server.py` to change the server's inference settings (e.g. `python server.py --use_gpu` to enable GPU inference):

| Argument | Description |
| --- | --- |
| `-ckpt` | Path to the weights file; *defaults to TexTeller's pretrained weights*. |
| `-tknz` | Path to the tokenizer; *defaults to TexTeller's tokenizer*. |
| `-port` | Port the server listens on; *defaults to 8000*. |

@@ -164,8 +175,9 @@ python server.py # default settings

If you use a different dataset, you may need to retrain the tokenizer to obtain a different vocabulary. After configuring the dataset, you can train your own tokenizer as follows:

1. In `TexTeller/src/models/tokenizer/train.py`, change `new_tokenizer.save_pretrained('./your_dir_name')` to your custom output directory.

   > Note: if you want to use a vocabulary of a different size (10K tokens by default), change the `VOCAB_SIZE` variable in `TexTeller/src/models/globals.py`.

2. **In the `TexTeller/src` directory**, run the following command:

```bash
@@ -190,19 +202,14 @@ python -m models.ocr_model.train.train

## 🚧 Limitations

* Scanned images and PDF document recognition are not supported
* Handwritten formulas are not supported

## 📅 Plans

- [X] ~~Train the model with a larger dataset (7.5M samples, coming soon)~~
- [ ] Scanned-image recognition
- [ ] PDF document recognition + support for Chinese and English scenarios
- [ ] Inference acceleration
- [ ] ...

## 💖 Acknowledgements

BIN assets/det_rec.png (new file)
BIN assets/image/README_zh/1712901497354.png (new file)