Gregor Purdy (@gnp) is working on a Rust version of minbpe: minbpe-rs. The Rust crate (similar to a package in Python) contains the three tokenizers currently included in the Python version of minbpe: BasicTokenizer, RegexTokenizer and GPT4Tokenizer. Here's an example, similar to the one in the README of this project, but using minbpe-rs:
use std::path::Path;
use minbpe::{BasicTokenizer, Saveable, Tokenizer, Trainable};

fn main() {
    let text = "aaabdaaabac";
    let mut tokenizer = BasicTokenizer::new();
    // 256 byte-level tokens plus 3 merges.
    tokenizer.train(text, 256 + 3, false);
    println!("{:?}", tokenizer.encode(text));
    println!("{:?}", tokenizer.decode(&[258, 100, 258, 97, 99]));
    // Save the trained tokenizer to the current directory under the name "toy".
    tokenizer.save(Path::new("./"), "toy");
}
which on execution prints,
$> cargo run
...
Compiling minbpe-test v0.1.0 (~/minbpe-test)
Finished dev [unoptimized + debuginfo] target(s) in 15.71s
Running `target/debug/minbpe-test`
[258, 100, 258, 97, 99]
"aaabdaaabac"
@gnp is the lead developer, and I (@shubham0204) am working on the docs, examples and README of the project.
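For anyone wondering where the token ids in the output above come from, here is a minimal, standard-library-only sketch of byte-level BPE training on the same string. It is only an illustration of the general technique and not the actual minbpe-rs implementation: the helper names `most_frequent_pair`, `merge` and `toy_bpe` are hypothetical, and breaking ties in favor of the pair seen first is an assumption made to keep the run deterministic.

```rust
// Toy byte-level BPE sketch (std only). The function names below are
// hypothetical and are NOT part of the minbpe-rs API.

/// Returns the most frequent adjacent pair; on ties, the pair seen first wins
/// (an assumption made here for determinism).
fn most_frequent_pair(ids: &[u32]) -> Option<(u32, u32)> {
    // Counts are kept in first-seen order instead of a HashMap so that
    // tie-breaking does not depend on hash iteration order.
    let mut counts: Vec<((u32, u32), usize)> = Vec::new();
    for w in ids.windows(2) {
        let pair = (w[0], w[1]);
        match counts.iter().position(|(p, _)| *p == pair) {
            Some(i) => counts[i].1 += 1,
            None => counts.push((pair, 1)),
        }
    }
    let mut best = None;
    let mut best_count = 0;
    for (pair, n) in counts {
        // Strict `>` keeps the earliest pair when counts are equal.
        if n > best_count {
            best = Some(pair);
            best_count = n;
        }
    }
    best
}

/// Replaces every non-overlapping occurrence of `pair` (left to right) with `new_id`.
fn merge(ids: &[u32], pair: (u32, u32), new_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(ids.len());
    let mut i = 0;
    while i < ids.len() {
        if i + 1 < ids.len() && (ids[i], ids[i + 1]) == pair {
            out.push(new_id);
            i += 2;
        } else {
            out.push(ids[i]);
            i += 1;
        }
    }
    out
}

/// Starts from raw bytes (ids 0..=255) and performs up to `num_merges` merges,
/// assigning new ids 256, 257, ... in order.
fn toy_bpe(text: &str, num_merges: usize) -> Vec<u32> {
    let mut ids: Vec<u32> = text.bytes().map(u32::from).collect();
    for step in 0..num_merges {
        match most_frequent_pair(&ids) {
            Some(pair) => ids = merge(&ids, pair, 256 + step as u32),
            None => break, // fewer than two tokens left, nothing to merge
        }
    }
    ids
}

fn main() {
    // Three merges on the toy string: "aa" -> 256, then (256, 'a') -> 257,
    // then (257, 'b') -> 258.
    println!("{:?}", toy_bpe("aaabdaaabac", 3)); // prints [258, 100, 258, 97, 99]
}
```

With three merges, the sketch reproduces the encoding printed by the example above; `100`, `97` and `99` are the raw bytes for `d`, `a` and `c`.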
minbpe-rs would be a good start on the 2nd point in the todo section of the README: write an even more optimized C or Rust version (think through).
The project also contains a test comparing RegexTokenizer with the GPT-4 tokenizer from tiktoken-rs (the Rust version of tiktoken), similar to the "inference: GPT-4 comparison" section of the README. See the test here.
Currently, the project has a base level of documentation, which can be enriched by adding more docstrings and examples for the tokenizers.
It would be great if minbpe-rs could be added as a community extension in the README of this repository, encouraging more developers to work on this Rust implementation and build more features into it (e.g. Python bindings, multi-threading support, or wrappers for Java/C). We would like the community to review minbpe-rs and provide feedback or contributions.