Welcome to the Single-Pass Training Method project! This project introduces an innovative approach to training transformer-based models by leveraging a single-pass dataset loading mechanism. By optimizing the training process, our method enhances efficiency, reduces computational overhead, and maintains high performance across various tasks.
Mathematical Derivation for Closed-Form Solution
- General Formulation

In traditional gradient descent, the weight update for each batch i can be expressed as:

    W_{t+1} = W_t − η ∇L_i(W_t)

Where:
- W_t is the weight at step t.
- η is the learning rate.
- L_i is the loss function for batch i.
After one epoch, the cumulative weight update for all batches {i = 1, 2, …, N_batches} can be represented as:

    W_{t+1} = W_t − η Σ_{i=1}^{N_batches} ∇L_i(W_t)
This summation accounts for the cumulative gradient descent updates over the entire epoch.
- Incorporating Batch-Wise Gradient Effects

To derive a closed-form solution, we need to simulate the effect of multiple gradient updates for each batch in a single step. This involves computing an effective gradient that encapsulates the contributions of all batches:

    ∇L_eff(W) = Σ_{i=1}^{N_batches} ∇L_i(W)
This equation is equivalent to computing a "global gradient" that captures the cumulative effect of all individual batch gradients.
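As a minimal sketch of this idea in PyTorch (model, batches, and loss_fn are placeholder names, not objects defined in this repository), the per-batch gradients can be accumulated and then applied as one "global" update:

import torch

def global_gradient_step(model, batches, loss_fn, lr):
    # Clear any existing gradients.
    for p in model.parameters():
        p.grad = None
    # backward() accumulates gradients, so looping over all batches
    # before stepping yields the summed (effective) gradient.
    for inputs, targets in batches:
        loss = loss_fn(model(inputs), targets)
        loss.backward()
    # Apply a single update using the accumulated gradient.
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p -= lr * p.grad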
- Closed-Form Solution Using Kernel Matrix

To extend this concept to our kernel-based model, we use the kernel matrix K that captures the relationships between data points. The optimal weight vector in the kernel space can be calculated using:

    α = (K + λI)^{-1} Y

Where:
- K is the kernel matrix representing the inner products of all training embeddings.
- λI is a regularization term.
- Y represents the target values.
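A minimal sketch of this solve in PyTorch (assuming K and Y are already-built tensors; lam stands in for the regularization strength λ):

import torch

def closed_form_alpha(K, Y, lam=1e-5):
    # Solve (K + λI) α = Y directly rather than forming an explicit inverse.
    n = K.shape[0]
    return torch.linalg.solve(K + lam * torch.eye(n), Y)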
- Single-Pass Training: Streamlined dataset loading for efficient training.
- Transformer Integration: Utilizes state-of-the-art transformer architectures.
- Performance Optimization: Reduced memory usage and faster convergence.
- Modular Design: Easily extendable components for customization.
- Clone the Repository

  git clone https://github.com/peytontolbert/SinglePassTransformer.git
  cd SinglePassTransformer
- Create a Virtual Environment

  python3 -m venv env
  source env/bin/activate
To train the model using the single-pass training method:
python train.py --method single-pass --epochs 50 --batch_size 32 --learning_rate 1e-3
For benchmarking against traditional training methods, use the compare.py script:
python compare.py
This script runs both traditional and single-pass training methods, comparing metrics such as training time and memory usage.
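As a rough illustration of the kind of measurement such a comparison involves (this is a generic sketch, not the actual compare.py code), wall-clock time and peak GPU memory for a training callable can be captured like this:

import time
import torch

def measure_run(train_fn):
    # Return (elapsed seconds, peak GPU memory in MB, or None when running on CPU).
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    train_fn()
    elapsed = time.perf_counter() - start
    peak_mb = torch.cuda.max_memory_allocated() / 1e6 if torch.cuda.is_available() else None
    return elapsed, peak_mb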
The single-pass training method revolutionizes the way datasets are loaded and processed during model training. Unlike traditional methods that may require multiple passes over the data, our approach ensures that each data batch is processed in a single pass, significantly enhancing training efficiency.
- Efficiency: Reduces the number of data loading operations, speeding up the training process.
- Resource Optimization: Lowers memory consumption by handling data more effectively.
- Scalability: Facilitates training on larger datasets without a proportional increase in computational cost.
- Simplified Pipeline: Streamlines the training workflow, making it easier to integrate and manage.
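For illustration, a single-pass loop over a PyTorch DataLoader touches each batch exactly once per pass (a generic sketch, not the project's exact loader):

from torch.utils.data import DataLoader

def single_pass_epoch(model, dataset, loss_fn, optimizer, batch_size=32):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    model.train()
    # Each batch is loaded, used for one update, and never revisited.
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()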
Our single-pass training method is implemented in the train.py script. Key components include:
- Dataset Loader: Utilizes a single-pass data loading mechanism to feed data batches to the model.
- Model Architecture: Employs a simple transformer-based model (SimpleTransformer) designed for binary classification tasks.
- Training Loop: Optimizes the training loop to handle data in a single pass, updating model weights efficiently.
- Evaluation: Provides functions to evaluate model performance on new data.
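The snippet below shows the batch-aware closed-form computation derived in the mathematical section above: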
import torch
import torch.nn as nn
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import numpy as np
from datasets import load_dataset
# Incorporate the batch-wise gradient effects analytically in the solution
def compute_batch_aware_solution(K, y, num_epochs, batch_size, learning_rate, lambda_reg=1e-5):
    n = K.shape[0]
    I = torch.eye(n)
    # Simulate the effect of batch-wise gradient updates:
    # approximate cumulative effect over all batches and epochs
    batch_effect = (n // batch_size) * learning_rate * num_epochs
    # Regularized kernel matrix
    effective_K = K + lambda_reg * I
    # Closed-form solve for the kernel-space weights
    A_inv = torch.inverse(effective_K * batch_effect + lambda_reg * I)
    alpha = torch.mm(A_inv, y)
    return alpha
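A hypothetical usage sketch (the embedding tensors and sizes below are illustrative placeholders, not data or names from the repository): build a linear kernel from training embeddings, solve for alpha, and score new points.

# Placeholder embeddings and binary labels
X_train = torch.randn(128, 64)
y_train = torch.randint(0, 2, (128, 1)).float()
X_test = torch.randn(16, 64)

K_train = X_train @ X_train.T          # kernel of inner products between training embeddings
alpha = compute_batch_aware_solution(K_train, y_train,
                                     num_epochs=50, batch_size=32, learning_rate=1e-3)
K_test = X_test @ X_train.T            # cross-kernel between test and training embeddings
scores = K_test @ alpha                # predicted scores for the new points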
single-pass-training/
├── compare.py
├── traditional.py
├── train.py
└── README.md
- compare.py: Script to compare traditional and single-pass training methods.
- traditional.py: Implements the traditional training approach for baseline comparison.
- train.py: Main script using the single-pass training method.
- README.md: Project documentation.
Contributions are welcome! Please fork the repository and submit a pull request with your enhancements.