
Releases: T4ras123/SmolBPE

0.3.2

04 Nov 09:13

Full Changelog: 0.0.1...0.3.2

SmolBPE v0.3.2 Release Notes

What's New

Major Changes

  • Added support for special tokens (`special_tokens` parameter)
  • Improved regex pattern for better tokenization
  • Renamed the main tokenizer class from `GPT4Tokenizer` to `Tokenizer`

New Features

  • Special Tokens Support: Special tokens are now encoded first and preserved during tokenization
  • Enhanced Pattern Matching: A new default regex pattern that better handles contractions and special characters (see the sketch after this list)
  • Simplified Interface: More intuitive class naming and parameters
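
To make the pattern change concrete, the sketch below pre-tokenizes a sentence with a GPT-4-style split regex. The exact pattern is an assumption for illustration (SmolBPE's actual default may differ), and it needs the third-party `regex` module, since the standard `re` lacks `\p{...}` classes and possessive quantifiers:

import regex  # third-party module; the stdlib `re` cannot compile this pattern

# GPT-4-style split pattern (an assumption, not necessarily SmolBPE's default)
GPT4_SPLIT_PATTERN = (
    r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}"""
    r"""| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
)

chunks = regex.findall(GPT4_SPLIT_PATTERN, "We'll tokenize this, won't we?")
print(chunks)
# ["We", "'ll", " tokenize", " this", ",", " won", "'t", " we", "?"]

Note how the contractions "'ll" and "'t" become their own chunks, so BPE merges never cross the apostrophe boundary.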

Breaking Changes

  • The `GPT4Tokenizer` class has been renamed to `Tokenizer`
  • Changed the command-line entry point from `gpt4tokenizer` to `tokenizer`
  • Modified the default regex pattern

Installation

pip install smolbpe==0.3.2

Upgrade Guide

If upgrading from an earlier version:

  1. Update imports:
# Old
from smolbpe.gpt4Tokenizer import GPT4Tokenizer

# New 
from smolbpe.tokenizer import Tokenizer
  2. Update class initialization:
# Old
tokenizer = GPT4Tokenizer(output='vocab.json')

# New
tokenizer = Tokenizer(
    output='vocab.json',
    special_tokens=['<|start|>', '<|end|>']  # Optional
)
  3. Update CLI commands:
# Old
gpt4tokenizer --text input.txt --vocab_size 400

# New
tokenizer --text input.txt --vocab_size 400 --special_tokens "<|start|>" "<|end|>"

Example Usage

from smolbpe.tokenizer import Tokenizer

# Initialize with special tokens
tokenizer = Tokenizer(
    output='vocab.json',
    special_tokens=['<|start|>', '<|end|>']
)

# Train on your data
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
tokenizer.train(text, vocab_size=400)

# Encode text with special tokens
text = "<|start|>Hello world!<|end|>"
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded)
print(decoded)  # "<|start|>Hello world!<|end|>"
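
Under the hood, "encoded first" means the input is split on the special tokens before BPE runs, so each one maps to a single reserved id rather than being merged byte by byte. The following is a simplified sketch with hypothetical names (encode_with_specials, special_to_id, bpe_encode), not SmolBPE's actual internals:

import re

def encode_with_specials(text, special_to_id, bpe_encode):
    # Split on the special tokens; the capturing group keeps them in the output
    pattern = "(" + "|".join(re.escape(s) for s in special_to_id) + ")"
    ids = []
    for piece in re.split(pattern, text):
        if piece in special_to_id:
            ids.append(special_to_id[piece])  # preserved as one reserved id
        elif piece:
            ids.extend(bpe_encode(piece))     # ordinary BPE on plain text
    return ids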

Bug Fixes

  • Fixed empty statistics handling during training (see the note after this list)
  • Improved Unicode character handling
  • Better error messages for invalid inputs
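
On the first fix: a BPE training step selects the most frequent adjacent pair, and Python's max() raises ValueError on an empty mapping once no pairs remain (for example, when the text has collapsed to a single id). A guard along these lines avoids that; the helper name is hypothetical:

from collections import Counter

def most_frequent_pair(ids):
    stats = Counter(zip(ids, ids[1:]))
    if not stats:  # no adjacent pairs left: stop instead of max() on empty stats
        return None
    return max(stats, key=stats.get)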

Documentation Updates

  • Added examples for special tokens usage
  • Updated CLI documentation
  • Improved code comments

Contributors

  • @Vover - Core development and maintenance


For any issues or questions, please open an issue on GitHub.

0.0.1

10 Oct 14:26

First release

  • Basic tokenizer with unrestricted merging (sketched below)
  • GPT4 tokenizer with a splitting pattern
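
Here "unrestricted merging" means byte-level BPE with no regex pre-splitting: every adjacent pair in the raw byte stream is a merge candidate, so merges can cross word and punctuation boundaries. A minimal sketch of the idea (illustrative only, not the package's exact code):

from collections import Counter

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` in `ids` with `new_id`
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_unrestricted(text, vocab_size=300):
    ids = list(text.encode("utf-8"))      # start from raw bytes (ids 0..255)
    merges = {}
    for new_id in range(256, vocab_size):
        stats = Counter(zip(ids, ids[1:]))
        if not stats:                     # nothing left to merge
            break
        pair = max(stats, key=stats.get)  # most frequent pair anywhere in the stream
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges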