Vaporetto

Vaporetto is a fast and lightweight tokenizer based on pointwise prediction.
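As a rough illustration of what "pointwise prediction" means, the sketch below decides each gap between two adjacent characters independently, using a small linear score. It is a toy under stated assumptions, not Vaporetto's implementation or API: `ToyModel`, its weights, and its bigram-only feature set are all hypothetical and exist only for this example.

```rust
use std::collections::HashMap;

/// Hypothetical toy model (not part of the vaporetto crate).
/// Each gap between two adjacent characters is scored on its own.
pub struct ToyModel {
    /// Weight of a (left char, right char) pair; larger means "more likely a boundary".
    bigram_weights: HashMap<(char, char), i32>,
    /// Constant added to every score.
    bias: i32,
}

impl ToyModel {
    pub fn new(weights: &[((char, char), i32)], bias: i32) -> Self {
        Self {
            bigram_weights: weights.iter().copied().collect(),
            bias,
        }
    }

    /// Returns one decision per gap between adjacent characters:
    /// `true` means a word boundary is placed in that gap.
    pub fn predict_boundaries(&self, text: &str) -> Vec<bool> {
        let chars: Vec<char> = text.chars().collect();
        chars
            .windows(2)
            .map(|w| {
                let weight = self.bigram_weights.get(&(w[0], w[1])).copied().unwrap_or(0);
                // A positive score places a boundary; no gap depends on another gap.
                self.bias + weight > 0
            })
            .collect()
    }

    /// Splits `text` into tokens according to the predicted boundaries.
    pub fn tokenize(&self, text: &str) -> Vec<String> {
        let chars: Vec<char> = text.chars().collect();
        let boundaries = self.predict_boundaries(text);
        let mut tokens = Vec::new();
        let mut current = String::new();
        for (i, &c) in chars.iter().enumerate() {
            current.push(c);
            // `boundaries[i]` describes the gap right after `chars[i]`.
            if i + 1 == chars.len() || boundaries[i] {
                tokens.push(std::mem::take(&mut current));
            }
        }
        tokens
    }
}
```

With a bias of -1 and a weight of 2 on the pairs ('星', '猫'), ('猫', 'の'), and ('の', '生'), `tokenize("火星猫の生態")` yields 火星, 猫, の, 生態. Because every gap is decided independently, there is no lattice search over whole-sentence segmentations.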

Examples

use std::fs::File;
use std::io::BufReader;

use vaporetto::{Model, Predictor, Sentence};

// Load a trained model and build a predictor from it.
let mut f = BufReader::new(File::open("model.raw").unwrap());
let model = Model::read(&mut f).unwrap();
let predictor = Predictor::new(model);

// Predict word boundaries for a raw (unsegmented) sentence.
let s = Sentence::from_raw("火星猫の生態").unwrap();
let s = predictor.predict(s);

println!("{:?}", s.to_tokenized_vec().unwrap());
// ["火星", "猫", "の", "生態"]

Feature flags

The following features are disabled by default:

  • kytea - Enables the reader for models generated by KyTea.
  • train - Enables the trainer.
  • portable-simd - Uses the portable SIMD API instead of our SIMD-conscious data layout. (Nightly Rust is required.)

The following features are enabled by default:

  • cache-type-score - Enables caching type scores for faster processing. If disabled, type scores are calculated in a straightforward manner.
  • fix-weight-length - Uses fixed-size arrays for storing scores to facilitate optimization. If disabled, vectors are used instead.
  • tag-prediction - Enables tag prediction.
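As a sketch of how these flags might be selected, the Cargo.toml fragment below uses standard Cargo dependency syntax. The feature names come from the lists above; the version requirement `"*"` is a placeholder assumption and should be replaced with the release you actually depend on.

```toml
[dependencies]
# "*" is a placeholder; pin the version you actually use.
# Keeps the default features and additionally enables the KyTea model reader.
vaporetto = { version = "*", features = ["kytea"] }

# Alternative: drop the defaults and opt back in to selected ones.
# vaporetto = { version = "*", default-features = false, features = ["tag-prediction"] }
```

Setting `default-features = false` turns off every default-enabled feature at once, so any of them that you still want must be listed explicitly in `features`.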

License

Licensed under either of

  • Apache License, Version 2.0
  • MIT license

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.