A field-tested Hebrew tokenizer for dirty texts (Ben-Yehuda Project, Bible, CC100, mC4, OpenSubtitles, OSCAR, Twitter), focused on multi-word expression extraction.
- Nikud and teamim are ignored. For ktiv-male use cases you may want to set sanitize='leave_diacritics' to discard words with nikud or teamim.
- Punctuation is normalized to ASCII (using unidecode).
- Correct usage of final letters (ךםןףץ) is enforced. Final פ and 'צ (with geresh) are allowed.
- Minimal word length is 2 proper letters.
- Same-character repetition (שולטתתתת), a common form of slang writing, is limited to at most max_char_repetition (default=2) consecutive occurrences. At the end of a word, or for a complete word, an equal or stricter limit applies: max_end_of_word_char_repetition (default=2). Use 0 or None for no limit.
- Note that these will throw away a small number of words with legitimate repetitions, most notably 'מממ' as in 'מממשלת' ,'מממש' ,'מממן'.
- allow_mmm (default=True) will specifically allow 'מממ' for the case max_char_repetition==2.
- Other less common legitimate repetitions include: 'תתת' ,'ששש' ,'נננ' ,'ממממ' ,'כככ' ,'ייי' ,'וווו' ,'ווו' ,'ההה' ,'בבב'.
- Words having only one or two distinct characters (חיחיחיחיחי), also a common form of slang writing, are limited to lengths up to max_one_two_char_word_len (default=7).
- Acronyms (צה"ל) and abbreviations ('וכו) are excluded, as well as numerals (42). (TBD)
- MWE refers to multi-word expression candidates, which are tokenized based on hyphen/makaf or surrounding punctuation.
- Hyphen-based MWEs are discarded if they contain more than max_mwe_hyphens hyphens (default=1). Use 0 to disallow hyphens (e.g. for biblical texts) or None for unlimited hyphens.
- Line-opening hyphens, as used in dialogue and enumeration, can be ignored with allow_line_opening_hyphens (default=True).
- Strict mode can enforce the absence of extraneous Hebrew letters in the same "clause" (strict=HebTokenizer.CLAUSE), sentence (strict=HebTokenizer.SENTENCE) or line (strict=HebTokenizer.LINE) of the MWE. Use 0 or None to not be strict (default=None).
- Optionally allow number references with allow_number_refs (default=False).
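
To make the repetition, one-or-two-character and hyphen rules above concrete, here is a minimal standalone sketch in Python. It is not the tokenizer's actual implementation: the function names are hypothetical, only the parameter names and defaults are taken from the list above, and two details are assumptions (the 'מממ' exception is applied only to a run of exactly three that is not at the end of the word, and both the ASCII hyphen and the makaf are counted as hyphens).

```python
from itertools import groupby

MAKAF = "\u05be"  # Hebrew hyphen (maqaf)


def passes_repetition_limits(word, max_char_repetition=2,
                             max_end_of_word_char_repetition=2,
                             allow_mmm=True):
    """Return False if any run of one repeated character exceeds its limit."""
    # Collapse the word into (character, run length) pairs.
    runs = [(ch, len(list(group))) for ch, group in groupby(word)]
    for i, (ch, n) in enumerate(runs):
        at_end = i == len(runs) - 1
        limits = [max_char_repetition]
        if at_end:
            limits.append(max_end_of_word_char_repetition)
        limits = [lim for lim in limits if lim]  # 0 or None means no limit
        if not limits or n <= min(limits):
            continue
        # Assumption: the exception covers exactly 'מממ', not at the word end.
        if (allow_mmm and max_char_repetition == 2
                and ch == "מ" and n == 3 and not at_end):
            continue
        return False
    return True


def passes_one_two_char_limit(word, max_one_two_char_word_len=7):
    """Words built from one or two distinct characters must be short."""
    return len(set(word)) > 2 or len(word) <= max_one_two_char_word_len


def passes_hyphen_limit(mwe, max_mwe_hyphens=1):
    """None means unlimited hyphens; 0 means no hyphens at all."""
    if max_mwe_hyphens is None:
        return True
    return mwe.count("-") + mwe.count(MAKAF) <= max_mwe_hyphens
```

With the defaults, passes_repetition_limits rejects שולטתתתת and accepts מממשלת, passes_one_two_char_limit rejects חיחיחיחיחי, and passes_hyphen_limit accepts an expression with a single hyphen. Note that 0 and None mean the same thing for the repetition limits but different things for max_mwe_hyphens, which is why the hyphen check tests for None explicitly.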