Vaporetto is a fast and lightweight tokenizer based on pointwise prediction.
use std::fs::File;
use std::io::BufReader;

use vaporetto::{Model, Predictor, Sentence};

// Load a trained model from disk.
let mut f = BufReader::new(File::open("model.raw").unwrap());
let model = Model::read(&mut f).unwrap();
let predictor = Predictor::new(model);

// Wrap the raw text, then predict the word boundaries.
let s = Sentence::from_raw("火星猫の生態").unwrap();
let s = predictor.predict(s);

println!("{:?}", s.to_tokenized_vec().unwrap());
// ["火星", "猫", "の", "生態"]
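To give a rough idea of what "pointwise prediction" means, the sketch below scores every gap between two characters independently and splits wherever the score is positive. It is a toy illustration only, not Vaporetto's implementation: the function names `segment` and `demo_weights`, the feature strings (`L:`, `R:`, `B:`), the weights, and the bias are all hypothetical and hand-picked for this one sentence, whereas a real model learns its weights from annotated text.

```rust
use std::collections::HashMap;

/// Hypothetical hand-picked weights; a real model learns these from data.
fn demo_weights() -> HashMap<String, i32> {
    let mut w = HashMap::new();
    w.insert("R:の".to_string(), 2); // favor a boundary before "の"
    w.insert("L:の".to_string(), 2); // favor a boundary after "の"
    w.insert("B:星猫".to_string(), 2); // favor a boundary between "星" and "猫"
    w
}

/// Splits `text` by scoring each gap between characters on its own.
/// A gap becomes a boundary when `bias` plus its feature weights is positive.
fn segment(text: &str, weights: &HashMap<String, i32>, bias: i32) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut tokens = Vec::new();
    let mut current = String::new();
    for (i, &c) in chars.iter().enumerate() {
        if i > 0 {
            let prev = chars[i - 1];
            // Features of the gap: left character, right character,
            // and the bigram that spans the gap.
            let features = [
                format!("L:{}", prev),
                format!("R:{}", c),
                format!("B:{}{}", prev, c),
            ];
            let score: i32 = bias
                + features
                    .iter()
                    .map(|f| weights.get(f).copied().unwrap_or(0))
                    .sum::<i32>();
            if score > 0 {
                // Close the current token at this gap.
                tokens.push(std::mem::take(&mut current));
            }
        }
        current.push(c);
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}
```

With these toy weights and a bias of -1, `segment("火星猫の生態", &demo_weights(), -1)` yields the same four tokens as the example above. Because each gap is decided without reference to its neighbors, no lattice search is needed.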
The following features are disabled by default:
kytea
- Enables the reader for models generated by KyTea.
train
- Enables the trainer.
portable-simd
- Uses the portable SIMD API instead of our SIMD-conscious data layout. (Nightly Rust is required.)
The following features are enabled by default:
cache-type-score
- Enables caching type scores for faster processing. If disabled, type scores are calculated in a straightforward manner.
fix-weight-length
- Uses fixed-size arrays for storing scores to facilitate optimization. If disabled, vectors are used instead.
tag-prediction
- Enables tag prediction.
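The idea behind caching type scores can be sketched as follows. This is a minimal illustration under stated assumptions, not the crate's actual code: the six character classes, the `TypeModel` layout, and every weight are hypothetical. Because the number of character-type combinations around a gap is small, the summed score for each combination can be computed once and stored in a fixed-size table, so that scoring a gap becomes a single array lookup.

```rust
/// Coarse character classes. This particular set is an assumption of the sketch.
#[derive(Clone, Copy, Debug, PartialEq)]
enum CharType {
    Hiragana,
    Katakana,
    Kanji,
    Digit,
    Roman,
    Other,
}

const NUM_TYPES: usize = 6;
const ALL_TYPES: [CharType; NUM_TYPES] = [
    CharType::Hiragana,
    CharType::Katakana,
    CharType::Kanji,
    CharType::Digit,
    CharType::Roman,
    CharType::Other,
];

/// Classifies a character by Unicode range (simplified).
fn char_type(c: char) -> CharType {
    match c {
        '\u{3041}'..='\u{309F}' => CharType::Hiragana,
        '\u{30A0}'..='\u{30FF}' => CharType::Katakana,
        '\u{4E00}'..='\u{9FFF}' => CharType::Kanji,
        '0'..='9' => CharType::Digit,
        'a'..='z' | 'A'..='Z' => CharType::Roman,
        _ => CharType::Other,
    }
}

/// Hypothetical type-feature weights for the characters on either side of a gap.
struct TypeModel {
    left: [i32; NUM_TYPES],
    right: [i32; NUM_TYPES],
    pair: [[i32; NUM_TYPES]; NUM_TYPES],
}

impl TypeModel {
    /// Straightforward scoring: sum every matching weight on each call.
    fn score_direct(&self, l: CharType, r: CharType) -> i32 {
        self.left[l as usize] + self.right[r as usize] + self.pair[l as usize][r as usize]
    }

    /// Precomputes the summed score for every (left, right) type combination.
    fn build_cache(&self) -> [i32; NUM_TYPES * NUM_TYPES] {
        let mut cache = [0; NUM_TYPES * NUM_TYPES];
        for l in 0..NUM_TYPES {
            for r in 0..NUM_TYPES {
                cache[l * NUM_TYPES + r] = self.left[l] + self.right[r] + self.pair[l][r];
            }
        }
        cache
    }
}

/// Cached scoring: one lookup in the fixed-size table.
fn score_cached(cache: &[i32; NUM_TYPES * NUM_TYPES], l: CharType, r: CharType) -> i32 {
    cache[l as usize * NUM_TYPES + r as usize]
}

/// Toy weights, made up for illustration only.
fn demo_model() -> TypeModel {
    let mut left = [0; NUM_TYPES];
    let mut right = [0; NUM_TYPES];
    let mut pair = [[0; NUM_TYPES]; NUM_TYPES];
    left[CharType::Kanji as usize] = 1;
    right[CharType::Hiragana as usize] = 2;
    pair[CharType::Kanji as usize][CharType::Kanji as usize] = -3;
    TypeModel { left, right, pair }
}
```

Both scoring paths return the same value for every combination; the cached path simply trades a small table built up front for less work per gap.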
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.