Update README

三洋三洋
2024-02-12 16:27:58 +00:00
parent 3683623925
commit 72a60f8611
2 changed files with 4 additions and 4 deletions


@@ -96,13 +96,13 @@ After the dataset is ready, you should **change the `DIR_URL` variable** in `...
If you are using a different dataset, you may need to retrain the tokenizer to match your specific vocabulary. After setting up the dataset, you can do this by:
- 1. Change the line `new_tokenizer.save_pretrained('./your_dir_name')` in `TexTeller/src/models/ocr_model/tokenizer/train.py` to your desired output directory name.
+ 1. Change the line `new_tokenizer.save_pretrained('./your_dir_name')` in `TexTeller/src/models/tokenizer/train.py` to your desired output directory name.
> To use a different vocabulary size, modify the `VOCAB_SIZE` parameter in `TexTeller/src/models/globals.py`.
2. Run the following command **under the `TexTeller/src` directory**:
```bash
- python -m models.ocr_model.tokenizer.train
+ python -m models.tokenizer.train
```
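The retraining step above can be sketched with the HuggingFace `tokenizers` library. This is a minimal illustration, not TexTeller's actual `train.py`: the tiny in-memory corpus is hypothetical (the real script trains on your dataset), and the `VOCAB_SIZE` here is deliberately small, whereas the project's default in `globals.py` is much larger.

```python
import os
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical tiny LaTeX corpus; the real training data comes from your dataset.
corpus = [r"\frac{a}{b}", r"x ^ 2 + y ^ 2 = z ^ 2", r"\int _ 0 ^ 1 f ( x ) d x"]

# Mirrors the VOCAB_SIZE parameter in globals.py (kept tiny for this sketch).
VOCAB_SIZE = 300

# Train a BPE tokenizer from scratch on the corpus.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=VOCAB_SIZE, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Analogous to new_tokenizer.save_pretrained('./your_dir_name') in train.py.
out_dir = "./your_dir_name"
os.makedirs(out_dir, exist_ok=True)
tokenizer.save(os.path.join(out_dir, "tokenizer.json"))
```

The saved directory name is what step 1 tells you to customize; the model then loads the tokenizer from that path.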
### Train the model


@@ -126,13 +126,13 @@ python serve.py # default settings
If you are using a different dataset, you may need to retrain the tokenizer to obtain a different vocabulary. After setting up the dataset, you can train your own tokenizer as follows:
- 1. In `TexTeller/src/models/ocr_model/tokenizer/train.py`, change `new_tokenizer.save_pretrained('./your_dir_name')` to your custom output directory.
+ 1. In `TexTeller/src/models/tokenizer/train.py`, change `new_tokenizer.save_pretrained('./your_dir_name')` to your custom output directory.
> To use a vocabulary of a different size (10K tokens by default), modify the `VOCAB_SIZE` variable in `TexTeller/src/models/globals.py`.
2. Run the following command **under the `TexTeller/src` directory**:
```bash
- python -m models.ocr_model.tokenizer.train
+ python -m models.tokenizer.train
```
### Train the model