Add N best results #151

Open
kuroahna opened this issue Apr 21, 2024 · 5 comments

Comments

@kuroahna

Is your feature request related to a problem? Please describe.
MeCab has the -N flag, which outputs the N best results:

mecab --help
...
 -N, --nbest=INT                output N best results (default 1)

However, I couldn't find anything in the docs or the source code on how to do this with vibrato.

Describe the solution you'd like
Add support for returning the N best results.

Describe alternatives you've considered
N/A

Additional context
Vibrato 0.5.1

@kampersanda
Member

Vibrato does not currently support N-best results. This is because there are not many use cases for them.

@kuroahna
Author

kuroahna commented Apr 21, 2024

I see. I was considering using vibrato instead of mecab-rs for tokenization in a Japanese dictionary lookup program I'm planning to build. It would be similar to https://github.com/themoeway/yomitan, which is useful for learning Japanese, but my version would be written in Rust for speed and would support 古文 (classical Japanese) by using the UniDic 中古 dictionary. Yomichan/yomitan does not support 古文 deconjugation, and it would be simpler and more accurate to use mecab/vibrato here than to write custom 活用 (conjugation) rules.

Sometimes a word has multiple readings, such as 昨日 (きのう・さくじつ):

> echo "昨日" | mecab --dicdir="unidic-cwj-202302_full" -N5
昨日    名詞,普通名詞,副詞可能,*,*,*,キノウ,昨日,昨日,キノー,昨日,キノー,和,*,*,*,*,*,*,体,キノウ,キノウ,キノウ,キノウ,"2,0",C2,*,2407389399753216,8758
EOS
昨日    名詞,普通名詞,副詞可能,*,*,*,サクジツ,昨日,昨日,サクジツ,昨日,サクジツ,漢,*,*,*,*,*,*,体,サクジツ,サクジツ,サクジツ,サクジツ,2,C1,*,3851597855728128,14012
EOS
昨      接頭辞,*,*,*,*,*,サク,昨,昨,サク,昨,サク,漢,*,*,*,*,*,*,接頭,サク,サク,サク,サク,*,P2,*,3845000785961472,13988
日      名詞,普通名詞,助数詞可能,*,*,*,ニチ,日,日,ニチ,日,ニチ,漢,*,*,チ促,基本形,*,"B4WB,B4WB9G",体,ニチ,ニチ,ニチ,ニチ,1,C3,*,7799669233164800,28375
EOS
昨      接頭辞,*,*,*,*,*,サク,昨,昨,サク,昨,サク,漢,*,*,*,*,*,*,接頭,サク,サク,サク,サク,*,P2,*,3845000785961472,13988
日      名詞,普通名詞,副詞可能,*,*,*,ヒ,日,日,ヒ,日,ヒ,和,ヒ混合,基本形,*,*,*,*,体,ヒ,ヒ,ヒ,ヒ,"0,1",C3,*,8548161773773312,31098
EOS
昨      接頭辞,*,*,*,*,*,サク,昨,昨,サク,昨,サク,漢,*,*,*,*,*,*,接頭,サク,サク,サク,サク,*,P2,*,3845000785961472,13988
日      名詞,普通名詞,副詞可能,*,*,*,ヒ,日,日,ビ,日,ビ,和,ヒ混合,濁音形,*,*,*,*,体,ビ,ビ,ビ,ヒ,"0,1",C3,*,8548161773781504,31098
EOS

and it would be nice to get the N best results to consider all likely possibilities.
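
As a rough illustration of how a lookup program might consume such output, here is a minimal Rust sketch that splits MeCab-style N-best text into candidates at each `EOS` line and collects the distinct readings of candidates that keep the word as a single token. The names `Token`, `parse_candidates`, and `readings_for` are hypothetical and not part of vibrato or MeCab, and the sketch assumes the reading is the 7th feature field, as in the UniDic output shown above.

```rust
/// One token: surface form plus the raw comma-separated feature string.
#[derive(Debug, Clone, PartialEq)]
struct Token {
    surface: String,
    features: String,
}

/// Split MeCab-style output into candidates, one per `EOS` line.
fn parse_candidates(output: &str) -> Vec<Vec<Token>> {
    let mut candidates = Vec::new();
    let mut current = Vec::new();
    for line in output.lines() {
        let line = line.trim_end();
        if line == "EOS" {
            // `EOS` closes the current candidate.
            candidates.push(std::mem::take(&mut current));
        } else if let Some((surface, features)) = line.split_once(char::is_whitespace) {
            current.push(Token {
                surface: surface.to_string(),
                features: features.trim_start().to_string(),
            });
        }
    }
    candidates
}

/// Distinct readings, in rank order, of candidates that keep `word` as one token.
fn readings_for(word: &str, candidates: &[Vec<Token>]) -> Vec<String> {
    let mut readings: Vec<String> = Vec::new();
    for candidate in candidates {
        if let [token] = candidate.as_slice() {
            if token.surface != word {
                continue;
            }
            // Assumption: feature index 6 holds the reading (true for the UniDic
            // output shown above; other dictionaries may order fields differently).
            if let Some(reading) = token.features.split(',').nth(6) {
                if !readings.iter().any(|r| r.as_str() == reading) {
                    readings.push(reading.to_string());
                }
            }
        }
    }
    readings
}
```

Given the `-N5` output above, `readings_for("昨日", &parse_candidates(output))` should yield キノウ and サクジツ, since the remaining three candidates split the word into 昨 and 日.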

Additionally, a sentence may be tokenized incorrectly in the first result but correctly in the second. I don't have an example on hand at the moment, but I have seen this before.

@kuroahna
Author

kuroahna commented Apr 21, 2024

Here's an example:

「あんな調子でよく毎日暮らしてられるもんだと思うが……」

vibrato/mecab's first tokenization result for 暮らしてられる is

> echo '「あんな調子でよく毎日暮らしてられるもんだと思うが……」' | cargo run --release -p tokenize -- -i system.dic.zst
warning: virtual workspace defaulting to `resolver = "1"` despite one or more workspace members being on edition 2021 which implies `resolver = "2"`
note: to keep the current resolver, specify `workspace.resolver = "1"` in the workspace root's manifest
note: to use the edition 2021 resolver, specify `workspace.resolver = "2"` in the workspace root's manifest
note: for more details see https://doc.rust-lang.org/cargo/reference/resolver.html#resolver-versions
    Finished release [optimized] target(s) in 0.50s
     Running `target/release/tokenize -i system.dic.zst`
Loading the dictionary...
Ready to tokenize
「      補助記号,括弧開,*,*,*,*,*,「,「,*,「,*,記号,*,*,*,*,*,*,補助,*,*,*,*,*,*,*,9079594557952,33
あんな  連体詞,*,*,*,*,*,アンナ,あんな,あんな,アンナ,あんな,アンナ,和,*,*,*,*,*,*,相,アンナ,アンナ,アンナ,アンナ,0,*,*,386495550726656,1406
調子    名詞,普通名詞,一般,*,*,*,チョウシ,調子,調子,チョーシ,調子,チョーシ,漢,チ濁,基本形,*,*,*,*,体,チョウシ,チョウシ,チョウシ,チョウシ,0,C2,*,6599827169354240,24010
で      助詞,格助詞,*,*,*,*,デ,で,で,デ,で,デ,和,*,*,*,*,*,*,格助,デ,デ,デ,デ,*,"動詞%F2@0,名詞%F1",*,7014343053025792,25518
よく    副詞,*,*,*,*,*,ヨク,良く,よく,ヨク,よく,ヨク,和,*,*,*,*,*,*,相,ヨク,ヨク,ヨク,ヨク,1,*,*,10770283363443200,39182
毎日    名詞,普通名詞,副詞可能,*,*,*,マイニチ,毎日,毎日,マイニチ,毎日,マイニチ,漢,*,*,*,*,*,*,体,マイニチ,マイニチ,マイニチ,マイニチ,"1,0",C1,*,9738932866654720,35430
暮らし  動詞,一般,*,*,五段-サ行,連用形-一般,クラス,暮らす,暮らし,クラシ,暮らす,クラス,和,*,*,*,*,*,*,用,クラシ,クラス,クラシ,クラス,0,C2,*,2867535015977601,10432
て      助詞,接続助詞,*,*,*,*,テ,て,て,テ,て,テ,和,*,*,*,*,*,*,接助,テ,テ,テ,テ,*,"動詞%F1,形容詞%F2@-1",*,6837321680953856,24874
られる  助動詞,*,*,*,助動詞-レル,連体形-一般,ラレル,られる,られる,ラレル,られる,ラレル,和,*,*,*,*,*,*,助動,ラレル,ラレル,ラレル,ラレル,*,動詞%F3@2,*,10936575907209921,39787
もん    名詞,普通名詞,一般,*,*,*,モノ,物,もん,モン,もん,モン,和,*,*,*,*,*,*,体,モン,モン,モン,モン,1,C4,*,10411017939067392,37875
だ      助動詞,*,*,*,助動詞-ダ,終止形-一般,ダ,だ,だ,ダ,だ,ダ,和,*,*,*,*,*,*,助動,ダ,ダ,ダ,ダ,*,名詞%F1,*,6299110739157675,22916
と      助詞,格助詞,*,*,*,*,ト,と,と,ト,と,ト,和,*,*,*,*,*,*,格助,ト,ト,ト,ト,*,"名詞%F1,動詞%F1,形容詞%F2@-1",*,7099014038299136,25826
思う    動詞,一般,*,*,五段-ワア行,終止形-一般,オモウ,思う,思う,オモウ,思う,オモウ,和,*,*,*,*,*,*,用,オモウ,オモウ,オモウ,オモウ,2,C1,*,1444492058174123,5255
が      助詞,接続助詞,*,*,*,*,ガ,が,が,ガ,が,ガ,和,*,*,*,*,*,*,接助,ガ,ガ,ガ,ガ,*,"動詞%F2@0,形容詞%F2@-1",*,2168245553603072,7888
…       補助記号,一般,*,*,*,*,*,…,…,*,…,*,記号,*,*,*,*,*,*,補助,*,*,*,*,*,*,*,1657891070464,6
…       補助記号,一般,*,*,*,*,*,…,…,*,…,*,記号,*,*,*,*,*,*,補助,*,*,*,*,*,*,*,1657891070464,6
」      補助記号,括弧閉,*,*,*,*,*,」,」,*,」,*,記号,*,*,*,*,*,*,補助,*,*,*,*,*,*,*,9354472464896,34
EOS

where we have

暮らし  動詞,一般,*,*,五段-サ行,連用形-一般,クラス,暮らす,暮らし,クラシ,暮らす,クラス,和,*,*,*,*,*,*,用,クラシ,クラス,クラシ,クラス,0,C2,*,2867535015977601,10432
て      助詞,接続助詞,*,*,*,*,テ,て,て,テ,て,テ,和,*,*,*,*,*,*,接助,テ,テ,テ,テ,*,"動詞%F1,形容詞%F2@-1",*,6837321680953856,24874
られる  助動詞,*,*,*,助動詞-レル,連体形-一般,ラレル,られる,られる,ラレル,られる,ラレル,和,*,*,*,*,*,*,助動,ラレル,ラレル,ラレル,ラレル,*,動詞%F3@2,*,10936575907209921,39787

This labels て as a 接続助詞 (conjunctive particle), but it should be the 助動詞 (auxiliary verb) てる.

Using mecab with -N2, we get

> echo "あんな調子でよく毎日暮らしてられるもんだと思うが" | mecab --dicdir="unidic-cwj-202302_full" -N2
あんな  連体詞,*,*,*,*,*,アンナ,あんな,あんな,アンナ,あんな,アンナ,和,*,*,*,*,*,*,相,アンナ,アンナ,アンナ,アンナ,0,*,*,386495550726656,1406
調子    名詞,普通名詞,一般,*,*,*,チョウシ,調子,調子,チョーシ,調子,チョーシ,漢,チ濁,基本形,*,*,*,*,体,チョウシ,チョウシ,チョウシ,チョウシ,0,C2,*,6599827169354240,24010
で      助詞,格助詞,*,*,*,*,デ,で,で,デ,で,デ,和,*,*,*,*,*,*,格助,デ,デ,デ,デ,*,"動詞%F2@0,名詞%F1",*,7014343053025792,25518
よく    副詞,*,*,*,*,*,ヨク,良く,よく,ヨク,よく,ヨク,和,*,*,*,*,*,*,相,ヨク,ヨク,ヨク,ヨク,1,*,*,10770283363443200,39182
毎日    名詞,普通名詞,副詞可能,*,*,*,マイニチ,毎日,毎日,マイニチ,毎日,マイニチ,漢,*,*,*,*,*,*,体,マイニチ,マイニチ,マイニチ,マイニチ,"1,0",C1,*,9738932866654720,35430
暮らし  動詞,一般,*,*,五段-サ行,連用形-一般,クラス,暮らす,暮らし,クラシ,暮らす,クラス,和,*,*,*,*,*,*,用,クラシ,クラス,クラシ,クラス,0,C2,*,2867535015977601,10432
て      助詞,接続助詞,*,*,*,*,テ,て,て,テ,て,テ,和,*,*,*,*,*,*,接助,テ,テ,テ,テ,*,"動詞%F1,形容詞%F2@-1",*,6837321680953856,24874
られる  助動詞,*,*,*,助動詞-レル,連体形-一般,ラレル,られる,られる,ラレル,られる,ラレル,和,*,*,*,*,*,*,助動,ラレル,ラレル,ラレル,ラレル,*,動詞%F3@2,*,10936575907209921,39787
もん    名詞,普通名詞,一般,*,*,*,モノ,物,もん,モン,もん,モン,和,*,*,*,*,*,*,体,モン,モン,モン,モン,1,C4,*,10411017939067392,37875
だ      助動詞,*,*,*,助動詞-ダ,終止形-一般,ダ,だ,だ,ダ,だ,ダ,和,*,*,*,*,*,*,助動,ダ,ダ,ダ,ダ,*,名詞%F1,*,6299110739157675,22916
と      助詞,格助詞,*,*,*,*,ト,と,と,ト,と,ト,和,*,*,*,*,*,*,格助,ト,ト,ト,ト,*,"名詞%F1,動詞%F1,形容詞%F2@-1",*,7099014038299136,25826
思う    動詞,一般,*,*,五段-ワア行,終止形-一般,オモウ,思う,思う,オモウ,思う,オモウ,和,*,*,*,*,*,*,用,オモウ,オモウ,オモウ,オモウ,2,C1,*,1444492058174123,5255
が      助詞,接続助詞,*,*,*,*,ガ,が,が,ガ,が,ガ,和,*,*,*,*,*,*,接助,ガ,ガ,ガ,ガ,*,"動詞%F2@0,形容詞%F2@-1",*,2168245553603072,7888
EOS
あんな  連体詞,*,*,*,*,*,アンナ,あんな,あんな,アンナ,あんな,アンナ,和,*,*,*,*,*,*,相,アンナ,アンナ,アンナ,アンナ,0,*,*,386495550726656,1406
調子    名詞,普通名詞,一般,*,*,*,チョウシ,調子,調子,チョーシ,調子,チョーシ,漢,チ濁,基本形,*,*,*,*,体,チョウシ,チョウシ,チョウシ,チョウシ,0,C2,*,6599827169354240,24010
で      助詞,格助詞,*,*,*,*,デ,で,で,デ,で,デ,和,*,*,*,*,*,*,格助,デ,デ,デ,デ,*,"動詞%F2@0,名詞%F1",*,7014343053025792,25518
よく    副詞,*,*,*,*,*,ヨク,良く,よく,ヨク,よく,ヨク,和,*,*,*,*,*,*,相,ヨク,ヨク,ヨク,ヨク,1,*,*,10770283363443200,39182
毎日    名詞,普通名詞,副詞可能,*,*,*,マイニチ,毎日,毎日,マイニチ,毎日,マイニチ,漢,*,*,*,*,*,*,体,マイニチ,マイニチ,マイニチ,マイニチ,"1,0",C1,*,9738932866654720,35430
暮らし  動詞,一般,*,*,五段-サ行,連用形-一般,クラス,暮らす,暮らし,クラシ,暮らす,クラス,和,*,*,*,*,*,*,用,クラシ,クラス,クラシ,クラス,0,C2,*,2867535015977601,10432
て      助動詞,*,*,*,下一段-タ行,連用形-一般,テル,てる,て,テ,てる,テル,和,*,*,*,*,*,*,助動,テ,テル,テ,テル,*,動詞%F1,M4@1,6950846256521857,25287
られる  助動詞,*,*,*,助動詞-レル,連体形-一般,ラレル,られる,られる,ラレル,られる,ラレル,和,*,*,*,*,*,*,助動,ラレル,ラレル,ラレル,ラレル,*,動詞%F3@2,*,10936575907209921,39787
もん    名詞,普通名詞,一般,*,*,*,モノ,物,もん,モン,もん,モン,和,*,*,*,*,*,*,体,モン,モン,モン,モン,1,C4,*,10411017939067392,37875
だ      助動詞,*,*,*,助動詞-ダ,終止形-一般,ダ,だ,だ,ダ,だ,ダ,和,*,*,*,*,*,*,助動,ダ,ダ,ダ,ダ,*,名詞%F1,*,6299110739157675,22916
と      助詞,格助詞,*,*,*,*,ト,と,と,ト,と,ト,和,*,*,*,*,*,*,格助,ト,ト,ト,ト,*,"名詞%F1,動詞%F1,形容詞%F2@-1",*,7099014038299136,25826
思う    動詞,一般,*,*,五段-ワア行,終止形-一般,オモウ,思う,思う,オモウ,思う,オモウ,和,*,*,*,*,*,*,用,オモウ,オモウ,オモウ,オモウ,2,C1,*,1444492058174123,5255
が      助詞,接続助詞,*,*,*,*,ガ,が,が,ガ,が,ガ,和,*,*,*,*,*,*,接助,ガ,ガ,ガ,ガ,*,"動詞%F2@0,形容詞%F2@-1",*,2168245553603072,7888
EOS

The 2nd-best result correctly labels て as the 助動詞 (auxiliary verb) てる.

In other words, 暮らしてられる is a contraction of 暮らしていられる.
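
To make the difference between the two candidates easier to spot programmatically, here is a small Rust sketch that compares two tokenizations with identical segmentation and reports where their part-of-speech labels disagree. The `Tok` alias and `pos_differences` function are hypothetical helpers for illustration, and the sketch assumes each token has already been reduced to a (surface, part of speech) pair.

```rust
/// A token reduced to (surface, part of speech).
type Tok<'a> = (&'a str, &'a str);

/// Positions where two candidates disagree on part of speech.
/// Returns `None` when the two candidates are segmented differently.
fn pos_differences<'a>(
    first: &[Tok<'a>],
    second: &[Tok<'a>],
) -> Option<Vec<(usize, Tok<'a>, Tok<'a>)>> {
    // Only candidates with the same surfaces in the same order are comparable.
    if first.len() != second.len() || first.iter().zip(second).any(|(a, b)| a.0 != b.0) {
        return None;
    }
    Some(
        first
            .iter()
            .zip(second)
            .enumerate()
            .filter(|(_, (a, b))| a.1 != b.1)
            .map(|(i, (a, b))| (i, *a, *b))
            .collect(),
    )
}
```

For the 暮らし / て / られる portion of the two results above, this would report a single difference at the て token (助詞 versus 助動詞). Deciding which label is right still requires knowledge outside the tokenizer.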

@kampersanda
Member

Thank you for the examples! It's interesting. I agree that the N-best option is useful in such applications.

We will consider supporting the N-best option. (I apologize, but it may be difficult for me to add it soon, as I have been busy lately. In the meantime, I recommend using mecab as an alternative.)
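
For a Rust program, one way to follow this suggestion is to shell out to the `mecab` binary. The sketch below assumes `mecab` is on the `PATH` and uses only the `--dicdir` and `--nbest` options that appear earlier in this thread. The function names `mecab_nbest_args` and `run_mecab_nbest` are hypothetical.

```rust
use std::io::{self, Write};
use std::process::{Command, Stdio};

/// Build the command-line arguments for an N-best run.
fn mecab_nbest_args(dicdir: &str, n: usize) -> Vec<String> {
    vec![format!("--dicdir={dicdir}"), format!("--nbest={n}")]
}

/// Run `mecab` on one sentence and return its raw output.
/// Assumption: a `mecab` executable is installed and reachable via PATH.
fn run_mecab_nbest(text: &str, dicdir: &str, n: usize) -> io::Result<String> {
    let mut child = Command::new("mecab")
        .args(mecab_nbest_args(dicdir, n))
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    {
        // Write the sentence, then drop stdin so mecab sees end of input.
        // Fine for a single short sentence; large inputs would need a writer thread.
        let mut stdin = child.stdin.take().expect("stdin was requested as piped");
        writeln!(stdin, "{text}")?;
    }
    let output = child.wait_with_output()?;
    String::from_utf8(output.stdout)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}
```

Calling `run_mecab_nbest("昨日", "unidic-cwj-202302_full", 5)` would correspond to the `-N5` invocation shown earlier. Spawning a process per sentence is slow, so this is only a stopgap until native support exists.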

@kuroahna
Author

Thank you! No rush, take your time! I'm happy that adding support is being considered.
