Releases: T4ras123/SmolBPE
0.3.2
Full Changelog: 0.0.1...0.3.2
# SmolBPE v0.3.2 Release Notes
## What's New

### Major Changes
- Added support for special tokens (`special_tokens` parameter)
- Improved regex pattern for better tokenization
- Renamed the main tokenizer class from `GPT4Tokenizer` to `Tokenizer`

### New Features
- **Special Tokens Support**: Special tokens are now encoded first and preserved during tokenization
- **Enhanced Pattern Matching**: New default regex pattern that better handles contractions and special characters
- **Simplified Interface**: More intuitive class naming and parameters
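The "encoded first and preserved" behavior can be sketched roughly as follows. This is an illustrative assumption about the approach, not SmolBPE's actual internals; the `SPECIAL_TOKENS` mapping and `split_on_special` helper are hypothetical names:

```python
import re

# Illustrative reserved ids -- in SmolBPE the token strings are
# user-supplied via the `special_tokens` parameter.
SPECIAL_TOKENS = {"<|start|>": 100, "<|end|>": 101}

def split_on_special(text, special_tokens):
    """Split text so special tokens survive as whole chunks.

    Chunks that exactly match a special token can be mapped directly to
    their reserved id; every other chunk would go through ordinary BPE,
    so merges can never split a special token apart.
    """
    # Build an alternation like (<\|start\|>|<\|end\|>); the capturing
    # group makes re.split keep the delimiters in the output.
    pattern = "(" + "|".join(re.escape(t) for t in special_tokens) + ")"
    return [chunk for chunk in re.split(pattern, text) if chunk]

chunks = split_on_special("<|start|>Hello world!<|end|>", SPECIAL_TOKENS)
# chunks == ['<|start|>', 'Hello world!', '<|end|>']
```

Splitting before BPE is what guarantees round-tripping: the special markers come back out of `decode` byte-for-byte.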
### Breaking Changes
- `GPT4Tokenizer` class has been renamed to `Tokenizer`
- Changed the command-line entry point from `gpt4tokenizer` to `tokenizer`
- Modified the default regex pattern
## Installation

```bash
pip install smolbpe==0.3.2
```
## Upgrade Guide

If upgrading from version 0.2.0:

- Update imports:

  ```python
  # Old
  from smolbpe.gpt4Tokenizer import GPT4Tokenizer

  # New
  from smolbpe.tokenizer import Tokenizer
  ```

- Update class initialization:

  ```python
  # Old
  tokenizer = GPT4Tokenizer(output='vocab.json')

  # New
  tokenizer = Tokenizer(
      output='vocab.json',
      special_tokens=['<|start|>', '<|end|>']  # Optional
  )
  ```

- Update CLI commands:

  ```bash
  # Old
  gpt4tokenizer --text input.txt --vocab_size 400

  # New
  tokenizer --text input.txt --vocab_size 400 --special_tokens "<|start|>" "<|end|>"
  ```
## Example Usage

```python
from smolbpe.tokenizer import Tokenizer

# Initialize with special tokens
tokenizer = Tokenizer(
    output='vocab.json',
    special_tokens=['<|start|>', '<|end|>']
)

# Train on your data
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
tokenizer.train(text, vocab_size=400)

# Encode and decode text containing special tokens
text = "<|start|>Hello world!<|end|>"
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded)
print(decoded)  # "<|start|>Hello world!<|end|>"
```
## Bug Fixes
- Fixed empty statistics handling during training
- Improved Unicode character handling
- Better error messages for invalid inputs
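For context on the empty-statistics fix: a typical BPE training loop counts adjacent token-pair frequencies and merges the most frequent pair, so when the remaining input has fewer than two tokens the counts are empty and an unguarded `max()` raises `ValueError`. A minimal sketch of that guard, under general BPE assumptions rather than SmolBPE's actual code:

```python
def get_stats(ids):
    """Count occurrences of each adjacent pair in a token-id list."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def most_common_pair(ids):
    """Return the most frequent adjacent pair, or None if there is none."""
    stats = get_stats(ids)
    if not stats:  # input shorter than two tokens: nothing to merge
        return None
    return max(stats, key=stats.get)

print(most_common_pair([1, 2, 1, 2, 3]))  # (1, 2)
print(most_common_pair([7]))              # None
```

Returning `None` (or breaking out of the merge loop) lets training terminate cleanly on very short inputs instead of crashing.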
## Documentation Updates
- Added examples for special tokens usage
- Updated CLI documentation
- Improved code comments
## Contributors
- @Vover - Core development and maintenance

## Links
For any issues or questions, please open an issue on GitHub.