Description
Here, I'm going to raise some issues related to Tesseract's Hebrew support.
Dear participants who are interested in Arabic support: I suggest raising Arabic issues in a separate issue, even where the same problems affect both Arabic/Persian and Hebrew.
Let's start with the nikud issue.
Hebrew has two writing forms:
- Hebrew with nikud
- Hebrew without nikud
Nikud - Diacritical signs used in Hebrew writing.
Modern Hebrew is written (mostly) without nikud.
Children's books are written with nikud. Poetry is also usually written with nikud. Hebrew dictionaries also use nikud. The Hebrew Bible uses nikud. It also uses te'amim (cantillation marks).
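To make the distinction between nikud and te'amim concrete, here is a minimal Python sketch that strips either class of marks from a string. It relies only on the code point ranges of the Unicode Hebrew block. The names `TEAMIM`, `NIKUD` and `strip_marks` are made up for this illustration, and the decision to leave meteg (U+05BD) and rafe (U+05BF) out of `NIKUD` is my own assumption, not something Tesseract does.

```python
# Cantillation marks (te'amim) occupy U+0591..U+05AF in the Hebrew block.
TEAMIM = {chr(c) for c in range(0x0591, 0x05B0)}

# Nikud points: U+05B0..U+05BC (sheva through dagesh), plus shin dot,
# sin dot and qamats qatan. Meteg and rafe are deliberately left out
# here; that grouping is an assumption made for this sketch.
NIKUD = {chr(c) for c in range(0x05B0, 0x05BD)} | {"\u05C1", "\u05C2", "\u05C7"}


def strip_marks(text, marks):
    """Return text with every character found in `marks` removed."""
    return "".join(ch for ch in text if ch not in marks)


# "shalom" with nikud: shin, qamats, shin dot, lamed, vav, holam, final mem
pointed = "\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD"
unpointed = strip_marks(pointed, NIKUD)

# "bereshit" with nikud and one cantillation mark (U+0596)
biblical = "\u05D1\u05B0\u05BC\u05E8\u05B5\u05D0\u05E9\u05B4\u05C1\u0596\u05D9\u05EA"
without_teamim = strip_marks(biblical, TEAMIM)
```

Something like `strip_marks(text, TEAMIM)` could be one way to reuse biblical text as a nikud source without the cantillation marks, if that turns out to be desirable.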
There are some mixed forms:
1) In this form, most of the body text is written without nikud, but nikud is used in a few places.
1a) Some paragraphs/sentences use nikud, for example when quoting the Bible or a poem.
1b) One or a few words in some paragraphs use nikud. This form is used, for example, for foreign names of people and places (like cities). Without nikud many words are ambiguous. Usually a native Hebrew speaker will use context to resolve the ambiguity. Sometimes the ambiguity remains, and then nikud can be used to resolve it.
2) In this form, most (or at least a large percentage) of the words in the text are written with nikud, but on those words the nikud is only partial.
The following part is relevant to both (1b) and (2) above.
When adding nikud to a word, it might be in 'full' or 'partial' form. Sometimes adding just one nikud sign is enough to make the word unambiguous.
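As a rough illustration of how a corpus tool might tell these cases apart, the sketch below counts, for each word, how many Hebrew letters carry at least one nikud mark. The function names and the `NIKUD` set are hypothetical and written only for this example. Note that the ratio cannot by itself separate "full" from "partial" nikud, because even in fully pointed text some letters (final letters, vowel letters) carry no mark, so any threshold on it would be an assumption.

```python
import unicodedata

# Nikud points in the Unicode Hebrew block (an assumed grouping:
# U+05B0..U+05BC plus shin dot, sin dot and qamats qatan).
NIKUD = {chr(c) for c in range(0x05B0, 0x05BD)} | {"\u05C1", "\u05C2", "\u05C7"}


def count_pointed(word):
    """Return (letters, pointed): Hebrew letters in the word, and how
    many of them are followed by at least one nikud mark."""
    letters = pointed = 0
    awaiting = False  # True while the last letter has no nikud yet
    for ch in word:
        if "\u05D0" <= ch <= "\u05EA":  # alef .. tav
            letters += 1
            awaiting = True
        elif ch in NIKUD:
            if awaiting:
                pointed += 1
                awaiting = False
        elif unicodedata.combining(ch) == 0:
            # A non-combining character ends the current letter.
            awaiting = False
    return letters, pointed


def summarize(text):
    """Count words with and without any nikud in whitespace-split text."""
    result = {"pointed": 0, "unpointed": 0}
    for word in text.split():
        letters, pointed = count_pointed(word)
        if letters == 0:
            continue  # not a Hebrew word
        result["pointed" if pointed else "unpointed"] += 1
    return result


plain = "\u05E9\u05DC\u05D5\u05DD"                      # shalom, no nikud
full = "\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD"     # shalom, with nikud
single = "\u05E1\u05B5\u05E4\u05E8"                     # samekh + tsere, pe, resh
```

A text of type (1b) would show a small share of pointed words in `summarize`, while type (2) would show a large share of pointed words whose per-word `pointed / letters` ratio is low.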
Ray, if you only use the web for building the langdata, you won't find many good sources for Hebrew with nikud.
Here is an excellent source which has both Hebrew with nikud (mostly poetry) and without nikud (most of the prose):
http://benyehuda.org/
Project Ben-Yehuda, named after Eliezer Ben-Yehuda, is like the famous Project Gutenberg, but it is just for Hebrew.
Note that some parts are copyrighted. For some other parts the copyright has expired under Israeli law, but they might still be under copyright in the US. For your use case, building a corpus, I don't think the copyright matters, but IANAL.
Do you use the Hebrew Bible as a source (like the one from Wikisource)?
I'm not sure it is a good idea to use it for modern Hebrew.
More information will follow later.