A field-tested Hebrew tokenizer for dirty texts (ben-yehuda project, bible, cc100, mc4, opensubs, oscar, twitter) focused on multi-word expression extraction.

eyaler/hebrew_tokenizer

hebrew_tokenizer

  • Nikud and teamim are ignored. For ktiv-male use cases you may want to set sanitize='leave_diacritics' to discard words with nikud or teamim.
  • Punctuation is normalized to ASCII (using unidecode).
  • Correct usage of final letters (ךםןףץ) is enforced. As exceptions, a word-final פ and a word-final 'צ (with geresh) are allowed.
  • Minimum word length is 2 proper letters.
  • Same-character repetition (שולטתתתת), a common form of slang writing, is limited to at most max_char_repetition (default=2). At the end of a word, or for a complete word, the same or a more restrictive limit applies: max_end_of_word_char_repetition (default=2). Use 0 or None for no limit.
    • Note that these limits discard a small number of words with legitimate repetitions, most notably 'מממ' as in 'מממשלת' ,'מממש' ,'מממן'.
    • allow_mmm (default=True) will specifically allow 'מממ' for the case max_char_repetition==2.
    • Other less common legitimate repetitions include: 'תתת' ,'ששש' ,'נננ' ,'ממממ' ,'כככ' ,'ייי' ,'וווו' ,'ווו' ,'ההה' ,'בבב'.
  • Words having only one or two distinct characters (חיחיחיחיחי), also a common form of slang writing, are limited to lengths up to max_one_two_char_word_len (default=7).
  • Acronyms (צה"ל) and abbreviations ('וכו) are excluded, as well as numerals (42). (TBD)
  • MWE refers to multi-word expression candidates, which are tokenized based on hyphen/makaf or surrounding punctuation.
  • Hyphen-based MWEs are discarded if they contain more than max_mwe_hyphens hyphens (default=1). Use 0 to disallow hyphens (e.g. for biblical texts) or None for unlimited hyphens.
  • Line-opening hyphens, as used in conversation and enumeration, can be ignored with allow_line_opening_hyphens (default=True).
  • Strict mode can enforce the absence of extraneous Hebrew letters in the same "clause" (strict=HebTokenizer.CLAUSE), sentence (strict=HebTokenizer.SENTENCE) or line (strict=HebTokenizer.LINE) as the MWE. Use 0 or None to disable strict mode (default=None).
  • Optionally allow number references with allow_number_refs (default=False).
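
The repetition, distinct-character and final-letter rules above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the helper names (exceeds_char_repetition, has_too_few_distinct_chars, final_letters_ok) are hypothetical, the reading of allow_mmm as "tolerate a run of exactly three מ" is an assumption, and only the parameter names and defaults are taken from the list above. The separate end-of-word limit (max_end_of_word_char_repetition) is not modeled.

```python
import re

# Illustrative helpers only: the names and logic below are assumptions,
# not functions exported by hebrew_tokenizer.

MEM = 'מ'
FINAL_FORMS = 'ךםןףץ'     # forms that may appear only at the end of a word
NON_FINAL_FORMS = 'כמנצ'  # regular forms rejected at the end of a word (פ is exempt)


def exceeds_char_repetition(word, max_char_repetition=2, allow_mmm=True):
    """Return True if some character repeats more than the allowed number of times in a row."""
    if not max_char_repetition:  # 0 or None means no limit
        return False
    if allow_mmm and max_char_repetition == 2:
        # Assumed reading of allow_mmm: collapse a run of exactly three mem
        # letters so that it no longer counts as a violation.
        word = re.sub(rf'(?<!{MEM}){MEM}{{3}}(?!{MEM})', MEM, word)
    # (.)\1{n,} matches a character followed by n or more copies of itself.
    return re.search(r'(.)\1{%d,}' % max_char_repetition, word) is not None


def has_too_few_distinct_chars(word, max_one_two_char_word_len=7):
    """Return True for an over-long word built from only one or two distinct characters."""
    return len(set(word)) <= 2 and len(word) > max_one_two_char_word_len


def final_letters_ok(word):
    """Check final-letter usage; a trailing ASCII apostrophe is treated as a geresh."""
    has_geresh = word.endswith("'")
    letters = word[:-1] if has_geresh else word
    if not letters:
        return False
    # A final form in the middle of a word is an error.
    if any(ch in FINAL_FORMS for ch in letters[:-1]):
        return False
    last = letters[-1]
    if last in NON_FINAL_FORMS:
        # A regular form at the end is accepted only for tsadi followed by geresh.
        return has_geresh and last == 'צ'
    return True
```

With the defaults, exceeds_char_repetition flags שולטתתתת, while מממשלת passes only because allow_mmm is enabled; calling it with allow_mmm=False rejects that word as well, which mirrors the trade-off described in the note on legitimate repetitions.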
