TexTeller/README.md
三洋三洋 74341c7e8a update
2024-03-19 14:43:03 +00:00


𝚃𝚎𝚡𝚃𝚎𝚕𝚕𝚎𝚛

English | 中文

TexTeller_demo

TexTeller is an end-to-end formula recognition model based on ViT, capable of converting images into corresponding LaTeX formulas.

TexTeller was trained on 7.5M image-formula pairs (dataset available here), up from 550K for TexTeller 1.0. Compared with LaTeX-OCR, which used a 100K dataset, TexTeller generalizes better and is more accurate, covering most use cases (except for scanned images and handwritten formulas).

We will soon release a TexTeller checkpoint trained on a 7.5M dataset

🔄 Change Log

  • 📮[2024-03-24] TexTeller 2.0 released! The training data for TexTeller 2.0 has been increased to 7.5M pairs (about 15 times as many as TexTeller 1.0, with improved data quality). TexTeller 2.0 performs better on the test set, especially in recognizing rare symbols, complex multi-line formulas, and matrices.

🔑 Prerequisites

python=3.10

pytorch

Note: Only CUDA versions >= 12.0 have been fully tested, so CUDA >= 12.0 is recommended.

🖼 About Rendering LaTeX as Images

  • Install XeLaTeX and ensure xelatex can be called directly from the command line.

  • To ensure correct rendering of the predicted formulas, include the following packages in your .tex file:

    \usepackage{multirow,multicol,amsmath,amsfonts,amssymb,mathtools,bm,mathrsfs,wasysym,amsbsy,upgreek,mathalfa,stmaryrd,dsfont,amsthm}
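
As a rough illustration of the two steps above, the following Python sketch wraps a predicted formula in a minimal .tex file that loads the listed packages and compiles it with xelatex. The function names, the article document class, and the display-math wrapper are assumptions made for illustration and are not part of TexTeller. The output is a PDF, which would still need converting to an image.

```python
import subprocess
from pathlib import Path

# Package list taken from the README, with repeated entries removed.
PACKAGES = (
    "multirow,multicol,amsmath,amsfonts,amssymb,mathtools,bm,mathrsfs,"
    "wasysym,amsbsy,upgreek,mathalfa,stmaryrd,dsfont,amsthm"
)


def build_tex(formula: str) -> str:
    """Return a minimal .tex document containing `formula` in display math.

    Assumption: the formula is valid inside \\[ ... \\]. Predictions that
    already open their own environment (e.g. align) would need a different
    wrapper.
    """
    return "\n".join([
        r"\documentclass{article}",
        rf"\usepackage{{{PACKAGES}}}",
        r"\pagestyle{empty}",  # no page number on the rendered page
        r"\begin{document}",
        r"\[",
        formula,
        r"\]",
        r"\end{document}",
        "",
    ])


def render_pdf(formula: str, out_dir: Path, name: str = "formula") -> Path:
    """Write the .tex file and compile it; requires xelatex on PATH."""
    out_dir.mkdir(parents=True, exist_ok=True)
    tex_path = out_dir / f"{name}.tex"
    tex_path.write_text(build_tex(formula), encoding="utf-8")
    subprocess.run(
        ["xelatex", "-interaction=nonstopmode",
         f"-output-directory={out_dir}", str(tex_path)],
        check=True,  # raise if xelatex reports an error
    )
    return out_dir / f"{name}.pdf"
```

Calling `render_pdf(r"\frac{a}{b}", Path("out"))` would leave `out/formula.pdf` behind if xelatex and the packages are installed.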
    

🚀 Getting Started

  1. Clone the repository:

    git clone https://github.com/OleehyO/TexTeller
    
  2. After installing pytorch, install the project's dependencies:

    pip install -r requirements.txt
    
  3. Enter the TexTeller/src directory and run the following command in the terminal to start inference:

    python inference.py -img "/path/to/image.{jpg,png}" 
    # add the -cuda option to enable GPU inference,
    # e.g. python inference.py -img "./img.jpg" -cuda
    

    The first time you run it, the required checkpoints will be downloaded from Hugging Face.
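
If you need to process a whole folder, one option is to call the command from step 3 once per image. The sketch below does that from Python; it only uses the -img and -cuda options shown above. The helper names, the default src_dir, and the idea of collecting whatever inference.py prints to stdout are assumptions for illustration. Each call reloads the model, so this is convenient rather than fast.

```python
import subprocess
import sys
from pathlib import Path


def build_command(img_path, use_cuda=False):
    """Build the inference.py command line for a single image."""
    cmd = [sys.executable, "inference.py", "-img", str(img_path)]
    if use_cuda:
        cmd.append("-cuda")
    return cmd


def run_on_directory(img_dir, use_cuda=False, src_dir="TexTeller/src"):
    """Run inference.py on every .jpg/.png in img_dir.

    Returns a dict mapping file name to the raw stdout of each run.
    What exactly inference.py prints is not specified here, so the
    output is returned unparsed.
    """
    results = {}
    for img in sorted(Path(img_dir).iterdir()):
        if img.suffix.lower() not in {".jpg", ".png"}:
            continue
        proc = subprocess.run(
            build_command(img.resolve(), use_cuda),
            cwd=src_dir,            # inference.py is run from TexTeller/src
            capture_output=True,
            text=True,
        )
        results[img.name] = proc.stdout
    return results
```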

🌐 Web Demo

To start the web demo, first enter the TexTeller/src directory, then run the following command:

./start_web.sh

Then open http://localhost:8501 in your browser to see the web demo.

You can change the default configuration in start_web.sh, for example to use the GPU for inference (e.g. USE_CUDA=True) or to increase the number of beams (e.g. NUM_BEAM=3) for higher accuracy.

NOTE: If you want to render the prediction results as images directly in the web demo (for example, to check whether a prediction is correct), make sure xelatex is correctly installed.

📡 API Usage

We use Ray Serve to provide an API for TexTeller, allowing you to integrate TexTeller into your own projects. To start the server, first enter the TexTeller/src directory and then run the following command:

python server.py  # default settings

You can pass the following arguments to server.py to change the server's inference settings (e.g. python server.py --use_gpu to enable GPU inference):

| Parameter | Description |
|---|---|
| -ckpt | The path to the weights file, default is TexTeller's pretrained weights. |
| -tknz | The path to the tokenizer, default is TexTeller's tokenizer. |
| -port | The server's service port, default is 8000. |
| --use_gpu | Whether to use GPU for inference, default is CPU. |
| --num_beams | The number of beams for beam search, default is 1. |
| --num_replicas | The number of service replicas to run on the server, default is 1 replica. You can use more replicas to achieve greater throughput. |
| --ncpu_per_replica | The number of CPU cores used per service replica, default is 1. |
| --ngpu_per_replica | The number of GPUs used per service replica, default is 1. You can set this value between 0 and 1 to run multiple service replicas on one GPU to share the GPU, thereby improving GPU utilization. (Note, if --num_replicas is 2, --ngpu_per_replica is 0.7, then 2 GPUs must be available) |
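
The note on --ngpu_per_replica can be checked with a little arithmetic. The sketch below assumes that a replica is never split across GPUs, so a GPU can host only as many replicas as fit whole into it; that assumption is what makes two replicas at 0.7 need two GPUs rather than 1.4. The function names are hypothetical and not part of server.py.

```python
import math


def replicas_per_gpu(ngpu_per_replica: float) -> int:
    """How many whole replicas fit on one GPU for a share in (0, 1]."""
    if not 0 < ngpu_per_replica <= 1:
        raise ValueError("expected a GPU share in (0, 1]")
    # Small epsilon guards against floating-point results such as 2.9999999.
    return math.floor(1 / ngpu_per_replica + 1e-9)


def gpus_required(num_replicas: int, ngpu_per_replica: float) -> int:
    """GPUs needed when each replica must sit entirely on one GPU."""
    if ngpu_per_replica == 0:
        return 0  # CPU-only replicas
    return math.ceil(num_replicas / replicas_per_gpu(ngpu_per_replica))


print(gpus_required(2, 0.7))  # 2: only one 0.7 share fits per GPU
print(gpus_required(2, 0.5))  # 1: two 0.5 shares fit on one GPU
```

Under this assumption, shares such as 0.5 or 0.25 pack GPUs fully, whereas a share like 0.7 leaves 30% of each GPU unused.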

A client demo can be found at TexTeller/client/demo.py; refer to it for how to send requests to the server.

🏋️‍♂️ Training

Dataset

We provide an example dataset in the TexTeller/src/models/ocr_model/train/dataset directory. You can place your own images in the images directory and annotate each image with its corresponding formula in formulas.jsonl.

After preparing your dataset, you need to change the DIR_URL variable to your own dataset's path in .../dataset/loader.py
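
When building formulas.jsonl by hand it is easy to miss an image. The sketch below writes one JSON object per line and lists images that have no annotation. The record keys used here (`img_name`, `formula`) are placeholders, not the confirmed schema; check the bundled example formulas.jsonl and pass the real keys via the `name_key` and `formula_key` arguments.

```python
import json
from pathlib import Path


def write_annotations(pairs, jsonl_path,
                      name_key="img_name", formula_key="formula"):
    """Write (image name, LaTeX formula) pairs as JSON Lines.

    The default key names are hypothetical placeholders.
    """
    with open(jsonl_path, "w", encoding="utf-8") as f:
        for img_name, formula in pairs:
            record = {name_key: img_name, formula_key: formula}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def unannotated_images(images_dir, jsonl_path, name_key="img_name"):
    """Return sorted names of files in images_dir missing from the JSONL."""
    annotated = set()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                annotated.add(json.loads(line)[name_key])
    return sorted(
        p.name for p in Path(images_dir).iterdir()
        if p.is_file() and p.name not in annotated
    )
```

Running `unannotated_images("images", "formulas.jsonl")` before training gives a quick list of images that still need a formula.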

Retraining the Tokenizer

If you are using a different dataset, you might need to retrain the tokenizer to obtain a different dictionary. After configuring your dataset, you can train your own tokenizer with the following steps:

  1. In TexTeller/src/models/tokenizer/train.py, change new_tokenizer.save_pretrained('./your_dir_name') to your custom output directory

    If you want to use a different dictionary size (default is 10k tokens), you need to change the VOCAB_SIZE variable in TexTeller/src/models/globals.py

  2. In the TexTeller/src directory, run the following command:

    python -m models.tokenizer.train
    

Training the Model

To train the model, you need to run the following command in the TexTeller/src directory:

python -m models.ocr_model.train.train

You can set your own tokenizer and checkpoint paths in TexTeller/src/models/ocr_model/train/train.py (refer to train.py for more information). If you are using the same architecture and dictionary as TexTeller, you can also fine-tune TexTeller's default weights with your own dataset.

In TexTeller/src/models/globals.py and TexTeller/src/models/ocr_model/train/train_args.py, you can change the model's architecture and training hyperparameters.

Our training scripts use the Hugging Face Transformers library, so you can refer to their documentation for more details and configurations on training parameters.

🚧 Limitations

  • Some complex multi-line scenarios are not well handled (e.g., long formulas mixed with matrices)

  • Does not support scanned images and PDF document recognition

  • Does not support handwritten formulas

📅 Plans

  • Train the model with a larger dataset (7.5M samples, coming soon)

  • Recognition of scanned images

  • PDF document recognition + Support for English and Chinese scenarios

  • Inference acceleration

  • ...

💖 Acknowledgments

Thanks to LaTeX-OCR, which has brought me a lot of inspiration, and im2latex-100K, which enriches our dataset.

Stargazers over time
