JSymSpell

JSymSpell is a zero-dependency Java 8+ port of SymSpell

Overview

The Symmetric Delete spelling correction algorithm speeds up the process up by orders of magnitude.

It achieves this by generating delete-only candidates in advance from a given lexicon.

Setup

Add the latest JSymSpell dependency to your project

Getting Started

To start, we'll load the data sets of unigrams and bigrams:

Map<Bigram, Long> bigrams = Files.lines(Paths.get("src/test/resources/bigrams.txt"))
                                 .map(line -> line.split(" "))
                                 .collect(Collectors.toMap(tokens -> new Bigram(tokens[0], tokens[1]), tokens -> Long.parseLong(tokens[2])));
Map<String, Long> unigrams = Files.lines(Paths.get("src/test/resources/words.txt"))
                                  .map(line -> line.split(","))
                                  .collect(Collectors.toMap(tokens -> tokens[0], tokens -> Long.parseLong(tokens[1])));

Let's now create an instance of SymSpell by using the builder and load these maps. For this example we'll limit the max edit distance to 2:

SymSpell symSpell = new SymSpellBuilder().setUnigramLexicon(unigrams)
                                         .setBigramLexicon(bigrams)
                                         .setMaxDictionaryEditDistance(2)
                                         .createSymSpell();

And we are ready!

int maxEditDistance = 2;
boolean includeUnknowns = false;
List<SuggestItem> suggestions = symSpell.lookupCompound("Nostalgiais truly one of th greatests human weakneses", maxEditDistance, includeUnknowns);
System.out.println(suggestions.get(0).getSuggestion());
// Output: nostalgia is truly one of the greatest human weaknesses
// ... only second to the neck!

Custom String Distance Algorithms

By default, JSymSpell calculates Damerau-Levenshtein distance. Depending on your use case, you may want to use a different one.

Other algorithms to calculate String Distance that might result of interest are:

Here's an example using Hamming Distance:

SymSpell symSpell = new SymSpellBuilder().setUnigramLexicon(unigrams)
                                         .setStringDistanceAlgorithm((string1, string2, maxDistance) -> {
                                             if (string1.length() != string2.length()){
                                                 return -1;
                                             }
                                             char[] chars1 = string1.toCharArray();
                                             char[] chars2 = string2.toCharArray();
                                             int distance = 0;
                                             for (int i = 0; i < chars1.length; i++) {
                                                 if (chars1[i] != chars2[i]) {
                                                     distance += 1;
                                                 }
                                             }
                                             return distance;
                                         })
                                         .createSymSpell();

Custom character comparison

Let's say you are building a query engine for country names where the input form allows Unicode characters, but the database is all ASCII. You might want searches for Espana to return España entries with distance 0:

CharComparator customCharComparator = new CharComparator() {
    @Override
    public boolean areEqual(char ch1, char ch2) {
        if (ch1 == 'ñ' || ch2 == 'ñ') {
            return ch1 == 'n' || ch2 == 'n';
        }
        return ch1 == ch2;
    }
};
StringDistance damerauLevenshteinOSA = new DamerauLevenshteinOSA(customCharComparator);
SymSpell symSpell = new SymSpellBuilder().setUnigramLexicon(Map.of("España", 10L))
                                         .setStringDistanceAlgorithm(damerauLevenshteinOSA)
                                         .createSymSpell();
List<SuggestItem> suggestions = symSpell.lookup("Espana", Verbosity.ALL);
assertEquals(0, suggestions.get(0).getEditDistance());

Frequency dictionaries in other languages

As in the original SymSpell project, this port contains an English frequency dictionary that you can find at src/test/resources/words.txt If you need a different one, you just need to compute a Map<String, Long> where the key is the word and the value is the frequency in the corpus.

Map<String, Long> unigrams = Arrays.stream("A B A B C A B A C A".split(" "))
                                   .collect(Collectors.groupingBy(String::toLowerCase, Collectors.counting()));
System.out.println(unigrams);
// Output: {a=5, b=3, c=2}

Built With

Maven - Dependency Management

Versioning

We use SemVer for versioning.

License

This project is licensed under the MIT License - see the LICENSE.md file for details

Acknowledgments

Wolf Garbe

Name		Name	Last commit message	Last commit date
Latest commit History 151 Commits
.github/workflows		.github/workflows
src		src
.travis.yml		.travis.yml
LICENSE		LICENSE
README.md		README.md
codecov.yml		codecov.yml
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

JSymSpell

Overview

Setup

Getting Started

Custom String Distance Algorithms

Custom character comparison

Frequency dictionaries in other languages

Built With

Versioning

License

Acknowledgments

About

Releases 2

Packages

Contributors 3

Languages

License

rxp90/jsymspell

Folders and files

Latest commit

History

Repository files navigation

JSymSpell

Overview

Setup

Getting Started

Custom String Distance Algorithms

Custom character comparison

Frequency dictionaries in other languages

Built With

Versioning

License

Acknowledgments

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 2

Packages 0

Contributors 3

Languages

Packages