
Releases: T4ras123/SmolBPE

0.3.2

04 Nov 09:13

Full Changelog: 0.0.1...0.3.2

SmolBPE v0.3.2 Release Notes

What's New

Major Changes

  • Added support for special tokens (`special_tokens` parameter)
  • Improved regex pattern for better tokenization
  • Renamed the main tokenizer class from `GPT4Tokenizer` to `Tokenizer`

New Features

  • Special Tokens Support: Special tokens are now encoded first and preserved during tokenization
  • Enhanced Pattern Matching: A new default regex pattern that better handles contractions and special characters (see the sketch after this list)
  • Simplified Interface: More intuitive class naming and parameters
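
To make the pattern change concrete, the sketch below pre-tokenizes a sentence with a GPT-4-style split regex. The exact pattern is an assumption for illustration (SmolBPE's actual default may differ), and it needs the third-party `regex` module, since the standard `re` lacks `\p{...}` classes and possessive quantifiers:

import regex  # third-party module; the stdlib `re` cannot compile this pattern

# GPT-4-style split pattern (an assumption, not necessarily SmolBPE's default)
GPT4_SPLIT_PATTERN = (
    r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}"""
    r"""| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
)

chunks = regex.findall(GPT4_SPLIT_PATTERN, "We'll tokenize this, won't we?")
print(chunks)
# ["We", "'ll", " tokenize", " this", ",", " won", "'t", " we", "?"]

Note how the contractions "'ll" and "'t" become their own chunks, so BPE merges never cross the apostrophe boundary.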

Breaking Changes

  • The `GPT4Tokenizer` class has been renamed to `Tokenizer`
  • Changed the command-line entry point from `gpt4tokenizer` to `tokenizer`
  • Modified the default regex pattern

Installation

pip install smolbpe==0.3.2

Upgrade Guide

If upgrading from an earlier version:

  1. Update imports:
# Old
from smolbpe.gpt4Tokenizer import GPT4Tokenizer

# New 
from smolbpe.tokenizer import Tokenizer
  2. Update class initialization:
# Old
tokenizer = GPT4Tokenizer(output='vocab.json')

# New
tokenizer = Tokenizer(
    output='vocab.json',
    special_tokens=['<|start|>', '<|end|>']  # Optional
)
  3. Update CLI commands:
# Old
gpt4tokenizer --text input.txt --vocab_size 400

# New
tokenizer --text input.txt --vocab_size 400 --special_tokens "<|start|>" "<|end|>"

Example Usage

from smolbpe.tokenizer import Tokenizer

# Initialize with special tokens
tokenizer = Tokenizer(
    output='vocab.json',
    special_tokens=['<|start|>', '<|end|>']
)

# Train on your data
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
tokenizer.train(text, vocab_size=400)

# Encode text with special tokens
text = "<|start|>Hello world!<|end|>"
encoded = tokenizer.encode(text)
decoded = tokenizer.decode(encoded)
print(decoded)  # "<|start|>Hello world!<|end|>"
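
Under the hood, "encoded first" means the input is split on the special tokens before BPE runs, so each one maps to a single reserved id rather than being merged byte by byte. The following is a simplified sketch with hypothetical names (encode_with_specials, special_to_id, bpe_encode), not SmolBPE's actual internals:

import re

def encode_with_specials(text, special_to_id, bpe_encode):
    # Split on the special tokens; the capturing group keeps them in the output
    pattern = "(" + "|".join(re.escape(s) for s in special_to_id) + ")"
    ids = []
    for piece in re.split(pattern, text):
        if piece in special_to_id:
            ids.append(special_to_id[piece])  # preserved as one reserved id
        elif piece:
            ids.extend(bpe_encode(piece))     # ordinary BPE on plain text
    return ids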

Bug Fixes

  • Fixed empty statistics handling during training (see the note after this list)
  • Improved Unicode character handling
  • Better error messages for invalid inputs
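
On the first fix: a BPE training step selects the most frequent adjacent pair, and Python's max() raises ValueError on an empty mapping once no pairs remain (for example, when the text has collapsed to a single id). A guard along these lines avoids that; the helper name is hypothetical:

from collections import Counter

def most_frequent_pair(ids):
    stats = Counter(zip(ids, ids[1:]))
    if not stats:  # no adjacent pairs left: stop instead of max() on empty stats
        return None
    return max(stats, key=stats.get)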

Documentation Updates

  • Added examples for special tokens usage
  • Updated CLI documentation
  • Improved code comments

Contributors

  • @Vover - Core development and maintenance


For any issues or questions, please open an issue on GitHub.

0.0.1

10 Oct 14:26

First release

  • Basic tokenizer with unrestricted merging (sketched below)
  • GPT4 tokenizer with a splitting pattern
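
Here "unrestricted merging" means byte-level BPE with no regex pre-splitting: every adjacent pair in the raw byte stream is a merge candidate, so merges can cross word and punctuation boundaries. A minimal sketch of the idea (illustrative only, not the package's exact code):

from collections import Counter

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` in `ids` with `new_id`
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_unrestricted(text, vocab_size=300):
    ids = list(text.encode("utf-8"))      # start from raw bytes (ids 0..255)
    merges = {}
    for new_id in range(256, vocab_size):
        stats = Counter(zip(ids, ids[1:]))
        if not stats:                     # nothing left to merge
            break
        pair = max(stats, key=stats.get)  # most frequent pair anywhere in the stream
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges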