From 72a60f861114cf90b489804042185367bc122572 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E4=B8=89=E6=B4=8B=E4=B8=89=E6=B4=8B?= <1258009915@qq.com>
Date: Mon, 12 Feb 2024 16:27:58 +0000
Subject: [PATCH] Update README

---
 README.md           | 4 ++--
 assets/README_zh.md | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 8e81cfc..b1bfc27 100644
--- a/README.md
+++ b/README.md
@@ -96,13 +96,13 @@ After the dataset is ready, you should **change the `DIR_URL` variable** in `...
 If you are using a different dataset, you may need to retrain the tokenizer to match your specific vocabulary. After setting up the dataset, you can do this by:
 
-1. Change the line `new_tokenizer.save_pretrained('./your_dir_name')` in `TexTeller/src/models/ocr_model/tokenizer/train.py` to your desired output directory name.
+1. Change the line `new_tokenizer.save_pretrained('./your_dir_name')` in `TexTeller/src/models/tokenizer/train.py` to your desired output directory name.
 
    > To use a different vocabulary size, you should modify the `VOCAB_SIZE` parameter in the `TexTeller/src/models/globals.py`.
 
 2. Running the following command **under `TexTeller/src` directory**:
 
    ```bash
-   python -m models.ocr_model.tokenizer.train
+   python -m models.tokenizer.train
    ```
 
 ### Train the model
diff --git a/assets/README_zh.md b/assets/README_zh.md
index 1a68b64..7768d9d 100644
--- a/assets/README_zh.md
+++ b/assets/README_zh.md
@@ -126,13 +126,13 @@ python serve.py # default settings
 如果你使用了不一样的数据集,你可能需要重新训练tokenizer来得到一个不一样的字典。配置好数据集后,可以通过以下命令来训练自己的tokenizer:
 
-1. 在`TexTeller/src/models/ocr_model/tokenizer/train.py`中,修改`new_tokenizer.save_pretrained('./your_dir_name')`为你自定义的输出目录
+1. 在`TexTeller/src/models/tokenizer/train.py`中,修改`new_tokenizer.save_pretrained('./your_dir_name')`为你自定义的输出目录
 
    > 如果要用一个不一样大小的字典(默认1W个token),你需要在 `TexTeller/src/models/globals.py`中修改`VOCAB_SIZE`变量
 
 2. **在 `TexTeller/src` 目录下**运行以下命令:
 
    ```bash
-   python -m models.ocr_model.tokenizer.train
+   python -m models.tokenizer.train
    ```
 
 ### Train the model