LaTeXTrOCR is a Transformer-based OCR (optical character recognition) model that converts images of handwritten and printed mathematical equations directly into LaTeX code. It pairs a deep-learning image-to-text model with a tokenizer built for LaTeX syntax, with the goal of making mathematical content easier to digitize and edit for researchers, educators, and students.
- Transformer-Based Architecture: Utilizes state-of-the-art Transformer models for accurate and efficient OCR.
- Custom Tokenizer: Specialized tokenizer tailored for LaTeX syntax and mathematical symbols.
- ArXiv Scraper: Automated tools to scrape and preprocess LaTeX documents from arXiv for training.
- Flexible Dataset Handling: Supports various image formats and preprocesses them for optimal model performance.
- Interactive Training Loop: Incorporates robust training scripts with logging and checkpointing.
- Comprehensive Evaluation: Tools to assess model performance with detailed metrics and visualizations.
- Easy Integration: Designed to be easily integrated into larger projects or used as a standalone tool.
Check out our demo video showcasing the model's capabilities in real time.
- Python 3.8+
- PyTorch 1.8+
- CUDA 10.2+ (for GPU support)
<|code|>bash
git clone https://github.com/YourUsername/LaTeXTrOCR.git
cd LaTeXTrOCR
<|code|>
It's recommended to use a virtual environment to manage dependencies.
<|code|>bash
python -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate
<|code|>
<|code|>bash pip install -r requirements.txt <|code|>
- Tesseract OCR: Install Tesseract OCR for preprocessing images.
- Ubuntu: <|code|>bash
sudo apt-get update
sudo apt-get install tesseract-ocr
<|code|>
- macOS: <|code|>bash brew install tesseract <|code|>
- Windows: Download Installer
LaTeXTrOCR includes a scraper to download and preprocess LaTeX documents from arXiv.
<|code|>bash python dataset/arxiv_scraper.py <|code|>
This will:
- Download: Fetch .tar.gz archives of papers based on predefined queries.
- Extract: Unpack and extract .tex files from the archives.
- Process: Clean and prepare LaTeX content for training.
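The extract and process steps can be illustrated with a minimal sketch that uses only the Python standard library. It is not the code in dataset/arxiv_scraper.py: the function names `extract_tex_sources` and `strip_comments` are hypothetical, and the cleaning shown here (dropping comments and blank lines) is only one plausible form of preprocessing.

```python
import io
import re
import tarfile
from typing import Dict


def extract_tex_sources(archive_bytes: bytes) -> Dict[str, str]:
    """Return {member_name: text} for every .tex file in a .tar.gz archive."""
    sources = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".tex"):
                handle = tar.extractfile(member)
                if handle is not None:
                    # Source files are not guaranteed to be UTF-8,
                    # so undecodable bytes are replaced rather than raising.
                    sources[member.name] = handle.read().decode("utf-8", errors="replace")
    return sources


# A "%" that is not written as "\%", plus everything after it on the line.
_COMMENT = re.compile(r"(?<!\\)%.*")


def strip_comments(tex: str) -> str:
    """Drop LaTeX comments and blank lines.

    Simplification: verbatim environments and "\\\\%" sequences are not special-cased.
    """
    lines = (_COMMENT.sub("", line).rstrip() for line in tex.splitlines())
    return "\n".join(line for line in lines if line)
```

Reading the archive from bytes keeps the sketch independent of where the download step stores its files. The members are read in memory and never written to disk, which also sidesteps path traversal issues with untrusted archives.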
Train the custom tokenizer to handle LaTeX syntax effectively.
<|code|>bash python tokenizer.py --text data/raw_la.tex --vocab_size 1000 <|code|>
This will generate a tokenizer.json file used during training and inference.
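To show why LaTeX benefits from a dedicated tokenizer, here is a small sketch of LaTeX-aware pre-tokenization followed by a frequency-based vocabulary. It is an illustration only and not the algorithm in tokenizer.py: the names `pretokenize` and `build_vocab`, the special tokens, and the word-level vocabulary (rather than learned subword merges) are all assumptions.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List

# Alternatives are tried left to right:
#   \\[A-Za-z]+  control words such as \frac or \alpha
#   \\.          control symbols such as \{ or \\
#   \S           any other single non-whitespace character
_LATEX_TOKEN = re.compile(r"\\[A-Za-z]+|\\.|\S")

# Hypothetical special tokens; the real tokenizer may use different ones.
SPECIALS = ["<pad>", "<bos>", "<eos>", "<unk>"]


def pretokenize(latex: str) -> List[str]:
    """Split LaTeX so that each command stays in one piece; whitespace is dropped."""
    return _LATEX_TOKEN.findall(latex)


def build_vocab(corpus: Iterable[str], vocab_size: int) -> Dict[str, int]:
    """Map the most frequent tokens to ids, reserving the first ids for SPECIALS."""
    counts = Counter(tok for line in corpus for tok in pretokenize(line))
    budget = max(vocab_size - len(SPECIALS), 0)
    frequent = [tok for tok, _ in counts.most_common(budget)]
    return {tok: idx for idx, tok in enumerate(SPECIALS + frequent)}


def encode(latex: str, vocab: Dict[str, int]) -> List[int]:
    """Convert a LaTeX string to ids, falling back to <unk> for unseen tokens."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in pretokenize(latex)]
```

Keeping a command like \frac as one token means the model predicts it in a single step instead of spelling it out character by character, which shortens the target sequences.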
Start training the Transformer-based OCR model.
<|code|>bash python models/trOCR.py <|code|>
Training Parameters:
- Adjust hyperparameters in config/config.yaml as needed.
- Utilize GPU acceleration for faster training.
Convert an image of a handwritten equation to LaTeX.
<|code|>bash python inference.py --image path/to/equation.png --model weights/ocr_model.pth <|code|>
Output: <|code|>latex \frac{d}{dx}e^{x} = e^{x} <|code|>
Assess model performance with evaluation scripts.
<|code|>bash python evaluate.py --model weights/ocr_model.pth --dataset data/test_images/ <|code|>
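This README does not list which metrics evaluate.py reports. As an assumption, two measures that are commonly used for image-to-LaTeX output are the exact-match rate and a length-normalized edit distance; the sketch below computes both for a list of (prediction, reference) pairs. The function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance, computed with a single rolling row."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(latex: str) -> str:
    """Remove all whitespace so 'e^{x} = e^{x}' and 'e^{x}=e^{x}' compare equal.

    This is lossy: it also merges '\\alpha x' into '\\alphax'.
    """
    return "".join(latex.split())


def score(pairs: List[Tuple[str, str]]) -> Dict[str, float]:
    """Return the exact-match rate and the mean normalized edit error."""
    exact = 0
    total_error = 0.0
    for predicted, reference in pairs:
        p, r = normalize(predicted), normalize(reference)
        exact += int(p == r)
        total_error += edit_distance(p, r) / max(len(p), len(r), 1)
    n = max(len(pairs), 1)  # avoid dividing by zero on an empty list
    return {"exact_match": exact / n, "mean_edit_error": total_error / n}
```

Exact match is strict, since a single wrong character fails the whole equation. The normalized edit error gives partial credit and is often more informative while a model is still training.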
<|code|>
LaTeXTrOCR/
├── README.md
├── LICENSE
├── requirements.txt
├── setup.py
├── .gitignore
├── data/
│   ├── external/
│   ├── arxiv_papers/
│   └── data.txt
├── notebooks/
│   └── analysis.ipynb
├── models/
│   ├── trOCR.py
│   ├── encoder.py
│   └── infer.py
├── dataset/
│   ├── transforms.py
│   ├── dataset.py
│   ├── arxiv_scraper.py
│   └── extract_latex.py
├── tokenizer.py
├── utils/
│   └── utils.py
├── docker/
│   └── Dockerfile
└── config/
    └── config.yaml
<|code|>
Contributions are welcome and greatly appreciated! To contribute to LaTeXTrOCR, please follow these steps:
- Fork the repository.
- Create a new branch: <|code|>bash git checkout -b feature/YourFeature <|code|>
- Make your changes and commit them: <|code|>bash git commit -m "Add your feature" <|code|>
- Push to the branch: <|code|>bash git push origin feature/YourFeature <|code|>
- Create a Pull Request detailing your changes.
For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License - see the LICENSE file for details.
For any inquiries, suggestions, or feedback, feel free to reach out:
- Twitter: @Vover163
- Email: vovatara123@gmail.com
- GitHub Issues: Open an Issue
- arXiv: For providing access to a vast repository of academic papers.
- PyTorch: For the powerful deep learning framework.
- Tesseract OCR: For the open-source OCR engine.
- tiktoken: For efficient tokenization.
- Seaborn: For beautiful statistical data visualization.
Made with ❤️ by Vover