LaTeXTrOCR 📝➡️📄

LaTeXTrOCR is a Transformer-based OCR (optical character recognition) model that converts images of handwritten and printed mathematical equations directly into LaTeX code. It pairs the model with a tokenizer built for LaTeX syntax, with the aim of making it faster for researchers, educators, and students to digitize and edit mathematical content.

Features

Transformer-Based Architecture: Uses a Transformer model to recognize equations in images.
Custom Tokenizer: Specialized tokenizer tailored for LaTeX syntax and mathematical symbols.
ArXiv Scraper: Automated tools to scrape and preprocess LaTeX documents from arXiv for training.
Flexible Dataset Handling: Supports various image formats and preprocesses them for optimal model performance.
Interactive Training Loop: Incorporates robust training scripts with logging and checkpointing.
Comprehensive Evaluation: Tools to assess model performance with detailed metrics and visualizations.
Easy Integration: Designed to be easily integrated into larger projects or used as a standalone tool.

Demo

Check out our demo video showcasing the model's capabilities in real-time.

Installation

Prerequisites

Python 3.8+
PyTorch 1.8+
CUDA 10.2+ (for GPU support)

Clone the Repository

<|code|>bash
git clone https://github.com/YourUsername/LaTeXTrOCR.git
cd LaTeXTrOCR
<|code|>

Create a Virtual Environment

It's recommended to use a virtual environment to manage dependencies.

<|code|>bash
python -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate
<|code|>

Install Dependencies

<|code|>bash
pip install -r requirements.txt
<|code|>

Additional Requirements

Tesseract OCR: Install Tesseract OCR for preprocessing images.
- Ubuntu:
<|code|>bash
sudo apt-get update
sudo apt-get install tesseract-ocr
<|code|>
- macOS: <|code|>bash brew install tesseract <|code|>
- Windows: Download Installer

Usage

1. Preparing the Dataset

LaTeXTrOCR includes a scraper to download and preprocess LaTeX documents from arXiv.

<|code|>bash
python dataset/arxiv_scraper.py
<|code|>

This will:

Download: Fetch .tar.gz archives of papers based on predefined queries.
Extract: Unpack and extract .tex files from the archives.
Process: Clean and prepare LaTeX content for training.
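
The extract and process steps can be sketched with the standard library alone. This is a minimal illustration, not the code in dataset/arxiv_scraper.py: the function names `extract_tex_sources` and `strip_comments` are hypothetical, and comment stripping is assumed to be one of the cleaning steps.

```python
import io
import re
import tarfile

# Matches an unescaped % and everything after it on the line.
# Limitation: a line break "\\" directly followed by "%" is treated as escaped.
_COMMENT = re.compile(r"(?<!\\)%.*")


def strip_comments(tex: str) -> str:
    """Remove LaTeX comments line by line, keeping escaped percent signs."""
    return "\n".join(_COMMENT.sub("", line) for line in tex.splitlines())


def extract_tex_sources(archive_bytes: bytes) -> dict:
    """Return {member name: cleaned LaTeX text} for every .tex file in a .tar.gz."""
    sources = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar.getmembers():
            # Skip directories, links, and non-LaTeX files such as figures.
            if not (member.isfile() and member.name.endswith(".tex")):
                continue
            raw = tar.extractfile(member).read()
            # Sources are not guaranteed to be UTF-8; replace undecodable bytes.
            text = raw.decode("utf-8", errors="replace")
            sources[member.name] = strip_comments(text)
    return sources
```

Reading the archive into memory keeps the sketch short; for large archives, passing a file path to `tarfile.open` avoids holding the whole download in RAM.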

2. Tokenizing LaTeX Content

Train the custom tokenizer to handle LaTeX syntax effectively.

<|code|>bash
python tokenizer.py --text data/raw_la.tex --vocab_size 1000
<|code|>

This will generate a tokenizer.json file used during training and inference.
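
To illustrate what LaTeX-aware tokenization involves, here is a minimal pre-tokenization sketch. It is an assumption about the general approach, not a description of what tokenizer.py does: the source does not state the algorithm, and the `pretokenize` function and its regular expression are hypothetical. The idea is to keep each LaTeX command as a single unit before any vocabulary is learned.

```python
import re

# Alternatives are tried in order: control words (\frac), control symbols
# (\{ or \\), runs of digits, then any single non-space character.
_LATEX_TOKEN = re.compile(r"\\[A-Za-z]+|\\.|\d+|\S")


def pretokenize(latex: str) -> list:
    """Split a LaTeX string into coarse tokens, dropping whitespace."""
    return _LATEX_TOKEN.findall(latex)


if __name__ == "__main__":
    print(pretokenize(r"\frac{d}{dx}e^{x} = e^{x}"))
```

Keeping `\frac` whole means a learned vocabulary does not have to spend merges rediscovering common commands character by character.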

3. Training the Model

Start training the Transformer-based OCR model.

<|code|>bash
python models/trOCR.py
<|code|>

Training Parameters:

Adjust hyperparameters in config/config.yaml as needed.
Use a GPU for faster training (see the CUDA prerequisite above).

4. Running Inference

Convert an image of a handwritten or printed equation to LaTeX.

<|code|>bash
python inference.py --image path/to/equation.png --model weights/ocr_model.pth
<|code|>

Output: <|code|>latex \frac{d}{dx}e^{x} = e^{x} <|code|>

5. Evaluating the Model

Assess model performance with evaluation scripts.

<|code|>bash
python evaluate.py --model weights/ocr_model.pth --dataset data/test_images/
<|code|>
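
The README does not say which metrics evaluate.py reports. As a hedged sketch, two metrics commonly used for image-to-LaTeX models are exact match and normalized edit distance over tokens; the functions `edit_distance` and `score` below are hypothetical helpers written for illustration, not part of the repository.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference token
                curr[j - 1] + 1,         # insert a hypothesis token
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def score(references, predictions):
    """Exact-match rate and mean normalized edit distance.

    Both arguments are non-empty, equally long lists of token lists.
    """
    if not references or len(references) != len(predictions):
        raise ValueError("need two non-empty lists of equal length")
    pairs = list(zip(references, predictions))
    exact = sum(r == p for r, p in pairs)
    # Normalize by the longer sequence so each distance falls in [0, 1].
    ned = sum(edit_distance(r, p) / max(len(r), len(p), 1) for r, p in pairs)
    return {
        "exact_match": exact / len(pairs),
        "normalized_edit_distance": ned / len(pairs),
    }
```

Token lists can come from a simple `str.split()` on space-separated LaTeX or from the project's own tokenizer. Exact match is strict, so the edit distance gives a more graded view of near misses.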

Project Structure

<|code|>
LaTeXTrOCR/
├── README.md
├── LICENSE
├── requirements.txt
├── setup.py
├── .gitignore
├── data/
│   ├── external/
│   ├── arxiv_papers/
│   └── data.txt
├── notebooks/
│   └── analysis.ipynb
├── models/
│   ├── trOCR.py
│   ├── encoder.py
│   └── infer.py
├── dataset/
│   ├── transforms.py
│   ├── dataset.py
│   ├── arxiv_scraper.py
│   └── extract_latex.py
├── tokenizer.py
├── utils/
│   └── utils.py
├── docker/
│   └── Dockerfile
└── config/
    └── config.yaml
<|code|>

Contributing

Contributions are welcome and greatly appreciated! To contribute to LaTeXTrOCR, please follow these steps:

Fork the repository.
Create a new branch: <|code|>bash git checkout -b feature/YourFeature <|code|>
Make your changes and commit them: <|code|>bash git commit -m "Add your feature" <|code|>
Push to the branch: <|code|>bash git push origin feature/YourFeature <|code|>
Create a Pull Request detailing your changes.

For major changes, please open an issue first to discuss what you would like to change.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

For any inquiries, suggestions, or feedback, feel free to reach out:

Twitter: @Vover163
Email: vovatara123@gmail.com
GitHub Issues: Open an Issue

Acknowledgements

arXiv: For providing access to a vast repository of academic papers.
PyTorch: For the powerful deep learning framework.
Tesseract OCR: For the open-source OCR engine.
tiktoken: For efficient tokenization.
Seaborn: For beautiful statistical data visualization.

Made with ❤️ by Vover
