diff --git a/README.md b/README.md
index ae3fae3..2e01cd3 100644
--- a/README.md
+++ b/README.md
@@ -3,6 +3,14 @@ Akin
 
 Akin is a collection of string comparison algorithms for Elixir. This solution was born of a [Record Linking](https://en.wikipedia.org/wiki/Record_linkage) project. It combines and modifies [The Fuzz](https://github.com/smashedtoatoms/the_fuzz) and [Fuzzy Compare](https://github.com/patrickdet/fuzzy_compare). Algorithms can be called independently or in total to return a map of metrics. This library was built to facilitiate the disambiguation of names but can be used to compare any two binaries.
 
+## New! Notebooks
+
+Disambiguation
+[![Run Disambiguation in Livebook](https://livebook.dev/badge/v1/blue.svg)](https://livebook.dev/run?url=https%3A%2F%2Fgithub.com%2Fvanessaklee%2Fakin%2Fblob%2Fmain%2Fnotebooks%2Fdisambiguation.livemd)
+
+Name Disambiguation
+[![Run Name Disambiguation in Livebook](https://livebook.dev/badge/v1/blue.svg)](https://livebook.dev/run?url=https%3A%2F%2Fgithub.com%2Fvanessaklee%2Fakin%2Fblob%2Fmain%2Fnotebooks%2Fname_disambiguation.livemd)
+
 Table of Contents
diff --git a/notebooks/examples.livemd b/notebooks/disambiguation.livemd
similarity index 99%
rename from notebooks/examples.livemd
rename to notebooks/disambiguation.livemd
index cacad1e..a184fd9 100644
--- a/notebooks/examples.livemd
+++ b/notebooks/disambiguation.livemd
@@ -1,4 +1,4 @@
-# Akin Examples
+# Disambiguation
 
 ## Akin
 
diff --git a/notebooks/name_disambiguation.livemd b/notebooks/name_disambiguation.livemd
new file mode 100644
index 0000000..1363351
--- /dev/null
+++ b/notebooks/name_disambiguation.livemd
@@ -0,0 +1,93 @@
+# Name Disambiguation
+
+## Match
+
+_UNDER DEVELOPMENT_
+
+Identity is the challenge of author name disambiguation (AND). The aim of AND is to match an author's name to that author when the author appears in a list of many authors. Complexity arises from homonymity (many people with the same name) and synonymity (when one person uses different forms/spellings of their name in publications).
+
+Given an author's name divided into given, middle, and family name parts (e.g. "Virginia", nil, "Woolf") and a list of possible matching author names, find and return the matches for the author in the list. If initials exist in the left name, a separate comparison is performed for the initials and the sets of the right string.
+
+If the comparison metrics produce a score greater than or equal to 0.9, the names are considered a match and returned in the list.
+
+We want to find possible matches to the name "Virginia Woolf"
+
+```elixir
+name = "Virginia Woolf"
+```
+
+in a list of other names
+
+```elixir
+other_names = [
+  "V Woolf",
+  "V Woolfe",
+  "Virginia Woolf",
+  "V White",
+  "Viginia Wolverine",
+  "Virginia Woolfe"
+]
+```
+
+The most likely matches are returned.
+
+```elixir
+Akin.match_names(name, other_names)
+```
+
+Use options to require stricter matching.
+
+```elixir
+other_names = [
+  "Victor Woolf",
+  "V Woolf",
+  "V Woolfe",
+  "Virginia Woolf",
+  "V White",
+  "Viginia Wolverine",
+  "Virginia Woolfe"
+]
+```
+
+```elixir
+opts = [match_at: 0.99]
+
+Akin.match_names(name, other_names, opts)
+```
+
+### Initials
+
+The results are good even if we only have an initial for part of the name we are disambiguating.
+
+```elixir
+name = "V. Woolf"
+```
+
+```elixir
+Akin.match_names(name, other_names)
+```
+
+### Not Perfect
+
+The results are imperfect and can lead to unwanted matches. See how "Victor" fares.
+
+```elixir
+other_names = [
+  "Victor Woolf",
+  "V Woolfe",
+  "Virginia Woolf",
+  "V White",
+  "Viginia Wolverine",
+  "Virginia Woolfe"
+]
+```
+
+```elixir
+Akin.match_names(name, other_names)
+```
+
+```elixir
+opts = [match_at: 0.99, algorithms: ["bag_distance", "jaccard", "jaro_winkler"]]
+
+Akin.match_names(name, other_names, opts)
+```
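
Note on the threshold rule in the new notebook: the idea that a candidate is returned when its score meets `match_at` (0.9 unless overridden) can be illustrated with a minimal sketch. This is not Akin's implementation. It assumes a single metric, the standard library's `String.jaro_distance/2`, whereas the notebook describes a combination of metrics plus separate handling of initials. The module name `ThresholdMatchSketch` and its `match/3` function are hypothetical and exist only for illustration.

```elixir
defmodule ThresholdMatchSketch do
  # Hypothetical module for illustration only; not part of Akin.
  @default_match_at 0.9

  # Returns the candidates whose Jaro similarity to `name` is at least
  # `match_at` (an option mirroring the notebook's `match_at`).
  def match(name, candidates, opts \\ []) do
    match_at = Keyword.get(opts, :match_at, @default_match_at)
    left = normalize(name)

    Enum.filter(candidates, fn candidate ->
      String.jaro_distance(left, normalize(candidate)) >= match_at
    end)
  end

  # Lowercase, drop everything except ASCII letters and whitespace, trim.
  defp normalize(string) do
    string
    |> String.downcase()
    |> String.replace(~r/[^a-z\s]/, "")
    |> String.trim()
  end
end

names = ["Virginia Woolf", "Virginia Woolfe", "V White"]

# Default threshold (0.9), then a stricter one.
loose = ThresholdMatchSketch.match("Virginia Woolf", names)
strict = ThresholdMatchSketch.match("Virginia Woolf", names, match_at: 0.99)
```

Raising `match_at` narrows the result toward exact matches. A single-metric filter like this one does not treat initials specially, so "V Woolf" style candidates would need the separate initials comparison the notebook mentions.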