Prepare 0.1.6
zverok committed Oct 17, 2021
1 parent f596c6f commit bf93941
Showing 10 changed files with 543 additions and 39 deletions.
9 changes: 9 additions & 0 deletions CHANGELOG.rst
@@ -1,6 +1,15 @@
Changelog
=========

0.1.6 - 2021-10-17
------------------

* Fix several problems in code and comments pointed out by Daniel Höh (thanks!);
* Fix for single-letter words capitalization (thanks `@vletard <https://github.com/vletard>`_!);
* Change license to MPL. It was `pointed out <https://github.com/wooorm/nspell/issues/11#issuecomment-915802969>`_ to me that Spylls, being an "explanatory rewrite" of Hunspell, can't have a permissive MIT license;
* Bundle ``unmunch.py`` script (now for real);
* Remove Patreon link, as Patreon's admins decided to demote me to "user" from "creator" for not posting my work on Patreon directly and not receiving any donations.

0.1.5 - 2021-05-12
------------------

392 changes: 373 additions & 19 deletions LICENSE

Large diffs are not rendered by default.

5 changes: 4 additions & 1 deletion README.rst
@@ -45,4 +45,7 @@ Project Links
License
-------

MPL 2.0. See the bundled `LICENSE <https://github.com/spylls/spylls/blob/master/LICENSE>`_ file for more details.
Note that, being an "explanatory rewrite", spylls should be considered a derivative work of Hunspell, and so should all of its ports/rewrites.

We are incredibly grateful to Hunspell's original authors and current maintainers for all the hard work they've put into the most used spellchecker in the world!
19 changes: 10 additions & 9 deletions docs/index.rst
@@ -10,15 +10,15 @@ Reasons

Spellchecking is a notoriously hard task that looks easy. The MVP everybody starts from is "just check if the word is in the known list, and if it is not, calculate the Levenshtein distance to find the most similar one and suggest it", but things get complicated very quickly once you start working with real texts, and with languages other than English.
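That naive MVP can be sketched in a few lines (the word list and function names here are illustrative, not Spylls code):

```python
# A minimal sketch of the naive "MVP" approach described above: check the
# word against a known list, and suggest the closest known word by
# Levenshtein distance.
from typing import Optional

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

KNOWN = {"spell", "spells", "speller", "spelling"}

def suggest(word: str) -> Optional[str]:
    if word in KNOWN:
        return None  # the word is correct, nothing to suggest
    return min(KNOWN, key=lambda known: levenshtein(word, known))
```

This works fine for a toy word list, but (as the rest of this page shows) breaks down on real languages with rich morphology, compounding, and capitalization rules.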

There are some modern approaches to spell and grammar checking, that are based on machine learning, can recognize context, and do a lot of other interesting stuff. But "classic", dictionary-based spellcheckers are still the most widespread solution, with **Hunspell** being the most widespread of all. It is embedded into Chrome, Firefox, OpenOffice, Adobe's products, Linux, and macOS distributions; there are Hunspell-compatible dictionaries for most of the human languages.

At the same time, Hunspell is a long-living, complicated, almost undocumented piece of software, and it was our feeling that a significant part of human knowledge is somehow "locked" in the form of a large C++ project. That's how **Spylls** was born: as an attempt to "unlock" it, via a well-structured and well-documented implementation in a high-level language.

Design choices
--------------

* **Spylls** is implemented in Python, as the most widespread high-level language of the 2020s (besides ECMAScript, but I just can't do it... for personal reasons);
* The code is as "vanilla Python" as possible, so it should be reasonably readable for a developer in any modern language; the most Python-specific feature used is methods returning generators (instead of arrays);
* Code is structured in a (reasonably) low amount of classes with (reasonably) large methods, exposing the imperative nature of Hunspell's algorithms; probably "very OO" or "very functional" approach could've made code more appealing for some, but I tried to communicate the algorithms themselves (for possible reimplementations in other languages and architectures), not my views on how to code;
* ...At the same time, it doesn't try to reproduce Hunspell's structure of classes, method names, and calls, but rather express "what it does" in the most simple/straightforward ways.
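As a minimal illustration of the "methods returning generators (instead of arrays)" point above (the function and candidates here are made up, not Spylls' actual API):

```python
# Suggestions are produced lazily, one at a time, so a caller can stop
# after taking as many as it needs, without the full candidate list ever
# being materialized. Names here are hypothetical.
from itertools import islice
from typing import Iterator

def suggestions(word: str) -> Iterator[str]:
    # Each candidate is yielded as soon as it is produced.
    for i in range(len(word)):
        yield word[:i] + word[i + 1:]   # all single-character deletions
    yield word.capitalize()

# The caller takes only as many suggestions as it needs:
first_three = list(islice(suggestions("spylls"), 3))  # → ['pylls', 'sylls', 'splls']
```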

@@ -70,7 +70,7 @@ The current state of the port:
* Of **34** Hunspell's suggest tests, **3 are "pending"** (mostly due to handling of dots, which is related to tokenization)
* spylls is confirmed to at least read successfully all dictionaries available in Firefox and LibreOffice official dictionary repositories

So, it is, like, ~80% theoretically complete and ~95% pragmatically complete.

On the other hand, I haven't used it extensively in a large production project or tried to spellcheck large texts in all supported languages, so there still might be some weird behavior in edge cases, not covered by Hunspell's tests. Also, it should be noted there are a lot of ``TODO:`` and ``FIXME:`` in the code, frequently signifying places where Hunspell's code was more complicated (simplifications not manifesting in failing tests, but probably slightly changing edge case behavior).

@@ -116,18 +116,19 @@ Other ports

Here only "pure" ports of Hunspell to other languages are listed, not wrappers around the original Hunspell (of which there are plenty):

* .NET: `WeCantSpell <https://github.com/aarondandy/WeCantSpell.Hunspell>`_;
* JS: `nspell <https://github.com/wooorm/nspell>`_ (only some directives);
* JS/TS: `espells <https://github.com/Monkatraz/espells>`_: a "post-Spylls" port, derived from Spylls and then enhanced and extended;
* C++: `nuspell <https://github.com/nuspell/nuspell>`_ (weirdly, pretends to be an independent project with no relations to anything, while at the same time seeming to support the same format of aff/dic, and striving to conform to Hunspell's test suite).

Some other approaches to spellchecking
--------------------------------------

* `aspell <https://github.com/GNUAspell/aspell>`_, while being in some sense a "grandparent" of Hunspell, is said to `sometimes provide better suggestions <https://battlepenguin.com/tech/aspell-and-hunspell-a-tale-of-two-spell-checkers/>`_;
* `morphologik <https://github.com/morfologik/morfologik-stemming>`_: stemmer/POS-tagger/spellchecker used by `LanguageTool <https://languagetool.org/>`_; it uses a very interesting technique of encoding dictionaries with FSA, making dictionary lookup much more efficient than Hunspell's;
* `voikko <https://voikko.puimula.org/>`_, developed for Finnish, which Hunspell can't handle too well due to its complicated affixes;
* `SymSpell <https://github.com/wolfgarbe/SymSpell>`_: very fast algorithm (relying on the availability of a full list of all language's words);
* `JamSpell <https://github.com/bakwc/JamSpell>`_: machine learning-based one.

Author
------
140 changes: 140 additions & 0 deletions examples/unmunch.py
@@ -0,0 +1,140 @@
# This is an "unmunching" script for Hunspell dictionaries, based on Spylls (a full Python port of Hunspell):
# https://github.com/zverok/spylls
#
# "Unmunching" (Hunspell's term) is a process of turning affix-compressed dictionary into plain list
# of all language's words. E.g. for English, in the dictionary we have "spell/JSMDRZG" (stem + flags
# declaring what suffixes and prefixes it might have), and we can run this script:
#
# python unmunch.py path/to/en_US spell
#
# Which will produce this list:
#
# spell
# spell's
# spelled
# speller
# spellers
# spelling
# spellings
# spells
#
# Running without the -w option will unmunch the entire dictionary.
#
# WARNINGS!
#
# 1. The script is not extensively tested, just a demo for discussion in https://github.com/zverok/spylls/issues/10
# (see the issue for more discussion)
# 2. It doesn't try to produce all possible words for compounding, because that list is potentially infinite.
#

from optparse import OptionParser

from spylls.hunspell.dictionary import Dictionary

parser = OptionParser()
parser.add_option("-d", "--dictionary", dest="dictionary", metavar='DICTIONARY',
                  help="dictionary path to unmunch (<path>.aff and <path>.dic should be present)")
parser.add_option("-w", "--word", dest="word", default=None, metavar='WORD',
                  help="single word to unmunch (if absent, unmunch the whole dictionary)")
parser.add_option("-i", "--immediate", dest="immediate", default=False, action='store_true',
                  help="output unmunch for each word immediately (more memory-efficient, but not sorted and might contain duplicates)")

(options, args) = parser.parse_args()

if not options.dictionary:
    parser.error("the -d/--dictionary option is required")

dictionary = Dictionary.from_files(options.dictionary)


def unmunch(word, aff):
    result = set()

    if aff.FORBIDDENWORD and aff.FORBIDDENWORD in word.flags:
        return result

    if not (aff.NEEDAFFIX and aff.NEEDAFFIX in word.flags):
        result.add(word.stem)

    suffixes = [
        suffix
        for flag in word.flags
        for suffix in aff.SFX.get(flag, [])
        if suffix.cond_regexp.search(word.stem)
    ]
    prefixes = [
        prefix
        for flag in word.flags
        for prefix in aff.PFX.get(flag, [])
        if prefix.cond_regexp.search(word.stem)
    ]

    for suffix in suffixes:
        root = word.stem[0:-len(suffix.strip)] if suffix.strip else word.stem
        suffixed = root + suffix.add
        if not (aff.NEEDAFFIX and aff.NEEDAFFIX in suffix.flags):
            result.add(suffixed)

        secondary_suffixes = [
            suffix2
            for flag in suffix.flags
            for suffix2 in aff.SFX.get(flag, [])
            if suffix2.cond_regexp.search(suffixed)
        ]
        for suffix2 in secondary_suffixes:
            root = suffixed[0:-len(suffix2.strip)] if suffix2.strip else suffixed
            result.add(root + suffix2.add)

    for prefix in prefixes:
        root = word.stem[len(prefix.strip):]
        prefixed = prefix.add + root
        if not (aff.NEEDAFFIX and aff.NEEDAFFIX in prefix.flags):
            result.add(prefixed)

        if prefix.crossproduct:
            additional_suffixes = [
                suffix
                for flag in prefix.flags
                for suffix in aff.SFX.get(flag, [])
                if suffix.crossproduct and suffix not in suffixes and suffix.cond_regexp.search(prefixed)
            ]
            for suffix in suffixes + additional_suffixes:
                root = prefixed[0:-len(suffix.strip)] if suffix.strip else prefixed
                suffixed = root + suffix.add
                result.add(suffixed)

                secondary_suffixes = [
                    suffix2
                    for flag in suffix.flags
                    for suffix2 in aff.SFX.get(flag, [])
                    if suffix2.crossproduct and suffix2.cond_regexp.search(suffixed)
                ]
                for suffix2 in secondary_suffixes:
                    root = suffixed[0:-len(suffix2.strip)] if suffix2.strip else suffixed
                    result.add(root + suffix2.add)

    return result


result = set()

if options.word:
    lookup = options.word
    print(f"Unmunching only words with stem {lookup}")
else:
    lookup = None
    print("Unmunching the whole dictionary")

print('')

for word in dictionary.dic.words:
    if not lookup or word.stem == lookup:
        if lookup:
            print(f"Unmunching {word}")
        if options.immediate:
            for produced in sorted(unmunch(word, dictionary.aff)):
                print(produced)
        else:
            result.update(unmunch(word, dictionary.aff))

print('')

if not options.immediate:
    for word in sorted(result):
        print(word)
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "spylls"
version = "0.1.5"
version = "0.1.6"
description = ""
authors = ["Victor Shepelev <zverok.offline@gmail.com>"]

2 changes: 1 addition & 1 deletion setup.py
@@ -5,7 +5,7 @@

setuptools.setup(
name="spylls",
version="0.1.5",
version="0.1.6",
author="Victor Shepelev",
author_email="zverok.offline@gmail.com",
description="Hunspell ported to pure Python",
7 changes: 2 additions & 5 deletions spylls/hunspell/algo/lookup.py
@@ -226,12 +226,9 @@ def is_correct(w):
if NUMBER_REGEXP.fullmatch(word):
return True


# ``try_break`` recursively produces all possible lists of word breaking by break patterns
# (like dashes).
# (like dashes). "The whole word" is yielded as a first alternative, so if the whole word is
# correct, the loop will return early.
for parts in self.break_word(word):
# If all parts in this variant of the breaking are correct, the whole word is considered correct.
if all(is_correct(part) for part in parts if part):
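The idea of the ``try_break``/``break_word`` comment above can be illustrated with a toy sketch (this is not Spylls' actual implementation; the function and its signature are made up for illustration):

```python
# Recursively produce all ways of splitting a word by break patterns
# (like dashes), yielding the unbroken word as the first alternative --
# so if the whole word is correct, a caller looping over the results
# can return early.
def try_break(word, patterns=('-',), depth=0):
    yield [word]                      # the whole word is the first alternative
    if depth > 10:                    # guard against pathological inputs
        return
    for pat in patterns:
        pos = word.find(pat, 1)       # don't break at the very start
        while 0 < pos < len(word) - len(pat):
            start = word[:pos]
            for rest in try_break(word[pos + len(pat):], patterns, depth + 1):
                yield [start, *rest]
            pos = word.find(pat, pos + 1)

splits = list(try_break('pre-war'))   # → [['pre-war'], ['pre', 'war']]
```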
2 changes: 1 addition & 1 deletion spylls/hunspell/algo/suggest.py
@@ -428,7 +428,7 @@ def edits(self, word: str) -> Iterator[Union[Suggestion, MultiWordSuggestion]]:
# for space).
#
# ...in this case we should suggest both "<word1> <word2>" as one dictionary entry, and
# "<word1>" "<word2>" as a sequence -- but clarifying this sequence might NOT be joined by "-"
for suggestion in pmt.replchars(word, self.aff.REP):
if isinstance(suggestion, list):
yield Suggestion(' '.join(suggestion), 'replchars')
4 changes: 2 additions & 2 deletions spylls/hunspell/readers/dic.py
@@ -96,8 +96,8 @@ def read_dic(source: BaseReader, *, aff: Aff, context: Context) -> dic.Dic:
# REP Wendsay Wednesday
# hunspell handles it by just `if (captype==INITCAP)`...
if pattern.endswith('*'):
# If it is ``pretty ph:prity*`` -- it means pair ``(prit, prett)`` should be added
# to REP-table (stripping the last character both from word, and alternative)
aff.REP.append(RepPattern(pattern[:-2], word[:-1]))
elif '->' in pattern:
# If it is ``happy ph:hepi->happi`` -- it means pair ``(hepi, happi)`` should be added
