An ontology is a machine-readable knowledge representation framework. It encodes logical relationships between classes and standardises the formal vocabulary of a domain. The metadata contained within an ontology is valuable for research, having been shown to help address the challenges presented by unstructured text.
Unstructured text can be processed, mined, and enriched with NLP tools, yet the majority of these tools are not designed to consider ontologies.
Jabberwocky allows users to easily manipulate ontologies, extracting and enhancing them for NLP tasks. This page provides an explanation, with a working example, of the Jabberwocky toolkit.
See the Jabberwocky repository for code.
Extract metadata from ontology classes based on a list of tags.
Users should use ontologies that are in the OWL RDF/XML syntax (if an ontology is not in this format, users can open it in Protégé and export it in the correct format).
Metadata in ontologies comes in various formats; the list of tags below is an example:
oboInOWL:hasExactSynonym
oboInOWL:hasRelatedSynonym
Words of interest (it is recommended that these match ontology classes):
dragon
water
large
snatch_output.txt will include the ontology classes and corresponding metadata based on the chosen classes and tags.
If users have no words of interest, the output will include all ontology classes, but users will still need to provide a list of tags.
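The extraction step described above can be sketched with the Python standard library. This is a minimal illustration, not Jabberwocky's own code: the function name `extract_metadata`, the toy ontology, and the mapping of the `oboInOWL` prefix to a namespace URI are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map. Mapping the "oboInOWL" prefix to this URI is an assumption.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "oboInOWL": "http://www.geneontology.org/formats/oboInOwl#",
}

# Toy ontology in OWL RDF/XML syntax, invented for illustration.
TOY_OWL = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:oboInOWL="http://www.geneontology.org/formats/oboInOwl#">
  <owl:Class rdf:about="http://example.org/toy#water">
    <rdfs:label>water</rdfs:label>
    <oboInOWL:hasExactSynonym>ocean</oboInOWL:hasExactSynonym>
  </owl:Class>
  <owl:Class rdf:about="http://example.org/toy#large">
    <rdfs:label>large</rdfs:label>
    <oboInOWL:hasRelatedSynonym>big</oboInOWL:hasRelatedSynonym>
  </owl:Class>
  <owl:Class rdf:about="http://example.org/toy#fire">
    <rdfs:label>fire</rdfs:label>
    <oboInOWL:hasExactSynonym>flame</oboInOWL:hasExactSynonym>
  </owl:Class>
</rdf:RDF>
"""


def extract_metadata(owl_xml, tags, words=None):
    """Return {class label: [metadata values]} for the requested tags.

    If `words` is None, every labelled class is included.
    """
    root = ET.fromstring(owl_xml)
    wanted = {w.lower() for w in words} if words else None
    result = {}
    for cls in root.findall("owl:Class", NS):
        label = cls.findtext("rdfs:label", default="", namespaces=NS).strip()
        # Skip unlabelled classes and classes outside the words of interest.
        if not label or (wanted is not None and label.lower() not in wanted):
            continue
        values = []
        for tag in tags:
            values += [el.text.strip() for el in cls.findall(tag, NS) if el.text]
        result[label] = values
    return result


terms = extract_metadata(
    TOY_OWL,
    ["oboInOWL:hasExactSynonym", "oboInOWL:hasRelatedSynonym"],
    words=["dragon", "water", "large"],
)
print(terms)
```

Words that have no matching class in the ontology (here `dragon`) are simply absent from the result, and passing `words=None` returns every labelled class.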
Annotation of a corpus (unstructured text).
Words of interest (the bandersnatch output can be used here):
dragon
water
ocean
large
big
It is important to note that multi-word phrases also work in Jabberwocky.
The corpus should be a txt file with sentences/posts separated by new lines:
This is post 1 in a corpus
This is post 2

This is post 3 - as you can see there is a gap between post 2 and 3
This is post 4 - don't worry about extra whitespace as the code will drop "empty lines"
catch_output.txt will include the posts that were annotated.
Users can choose the output type: grep format or with corresponding tags.
Moreover, users can choose to export the posts that weren't annotated.
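The annotation step can be sketched as follows. This example deliberately uses plain regular expressions as a stand-in for the `PhraseMatcher()` that Jabberwocky relies on, so that it runs with the standard library alone; the function names `load_posts` and `annotate` and the sample corpus are hypothetical.

```python
import re


def load_posts(text):
    """Split a corpus into posts, one per line, dropping empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def annotate(posts, terms):
    """Return (annotated, unannotated).

    `annotated` pairs each matching post with the terms found in it;
    `unannotated` lists the posts in which no term was found.
    """
    # Whole-word, case-insensitive patterns; multi-word phrases work too.
    patterns = {
        term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in terms
    }
    annotated, unannotated = [], []
    for post in posts:
        hits = [term for term, pattern in patterns.items() if pattern.search(post)]
        if hits:
            annotated.append((post, hits))
        else:
            unannotated.append(post)
    return annotated, unannotated


# Invented corpus with blank lines, which load_posts() drops.
corpus = "The dragon flew over the ocean\n\nNothing to see here\n   \nA big dragon appeared\n"
posts = load_posts(corpus)
annotated, unannotated = annotate(posts, ["dragon", "water", "ocean", "large", "big"])

for post, tags in annotated:
    print(post, tags)
print("not annotated:", unannotated)
```

Keeping the unannotated posts as a second return value mirrors the option to export them for later inspection.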
Users can generate a wordcloud figure from the corpus.
Rank all words in a corpus in terms of importance (via the TF-IDF statistical technique).
One valuable parameter adjusts the n-gram range that TF-IDF measures, so users can request more than unigrams: bigrams, trigrams, and more.
Users can provide a list of words to remove from the corpus so that they are not weighted/measured; the bandersnatch output can be used here.
bite_output.tsv is a dataframe of words and their scores.
Scores are the average TF-IDF values across posts, normalised for readability.
Moreover, normalised scores that are 0 are dropped.
Word | Raw score | Normalised score
------- | ----------- | -----------
mega | 0.078 | 1.0
path | 0.06 | 0.719
Users can export a bar plot of the top N ranked terms (default 30).
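To make the ranking concrete, here is a small standard-library sketch of averaged TF-IDF with an adjustable n-gram range. It is an illustration under stated assumptions, not the toolkit's implementation: the function names, the textbook formula `tf * log(N / df)`, the whitespace tokenisation, and the min-max normalisation are all choices made for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n_min, n_max):
    """All n-grams of length n_min..n_max, joined with spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def rank_terms(posts, ngram_range=(1, 1), stop_words=()):
    """Return [(term, raw score, normalised score)], best first."""
    stop = {w.lower() for w in stop_words}
    docs = []
    for post in posts:
        # Naive whitespace tokenisation; punctuation is not handled.
        tokens = [t for t in post.lower().split() if t not in stop]
        docs.append(Counter(ngrams(tokens, *ngram_range)))

    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)

    # Raw score: TF-IDF averaged over all posts.
    raw = Counter()
    for doc in docs:
        total = sum(doc.values())
        for term, count in doc.items():
            tf = count / total
            idf = math.log(n_docs / doc_freq[term])
            raw[term] += tf * idf / n_docs
    if not raw:
        return []

    # Min-max normalisation (an assumption); zero scores are dropped.
    lo, hi = min(raw.values()), max(raw.values())
    span = hi - lo
    ranked = []
    for term, score in raw.items():
        norm = (score - lo) / span if span else 1.0
        if norm > 0:
            ranked.append((term, score, norm))
    return sorted(ranked, key=lambda row: row[2], reverse=True)


posts = ["the mega dragon", "the long path", "the mega water"]
ranked = rank_terms(posts)
for term, raw_score, norm_score in ranked:
    print(f"{term}\t{raw_score:.3f}\t{norm_score:.3f}")

# Unigrams and bigrams together:
with_bigrams = rank_terms(posts, ngram_range=(1, 2))
```

With min-max scaling the lowest-scoring term always normalises to 0 and is dropped, which is one plausible reading of why zero scores are removed from the output; a word such as "the" that appears in every post scores 0 under this formula.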
Updating ontology classes with new metadata.
Users will provide a dataframe with three columns: the annotation, class (exact ontology match), and tag:
annotation | class | tag
------- | ----------- | -----------
sea | water | oboInOWL:hasExactSynonym
mega | large | oboInOWL:hasRelatedSynonym
https://pokemon.fandom.com/wiki/Types | type | oboInOWL:DbXref
This can be derived from the bite output (e.g. synonyms).
[ontology]_updated.owl is the updated ontology.
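The update step can be pictured as appending one annotation element to each matching class. The sketch below uses the standard library's `xml.etree.ElementTree`; the function name `add_annotations`, the toy ontology, and the namespace URI chosen for the `oboInOWL` prefix are assumptions for illustration, and it is not the toolkit's own code.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map. Mapping the "oboInOWL" prefix to this URI is an assumption.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "oboInOWL": "http://www.geneontology.org/formats/oboInOwl#",
}

# Toy ontology in OWL RDF/XML syntax, invented for illustration.
TOY_OWL = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:oboInOWL="http://www.geneontology.org/formats/oboInOwl#">
  <owl:Class rdf:about="http://example.org/toy#water">
    <rdfs:label>water</rdfs:label>
    <oboInOWL:hasExactSynonym>ocean</oboInOWL:hasExactSynonym>
  </owl:Class>
  <owl:Class rdf:about="http://example.org/toy#large">
    <rdfs:label>large</rdfs:label>
  </owl:Class>
</rdf:RDF>
"""


def add_annotations(owl_xml, rows):
    """Add (annotation, class label, tag) rows to matching classes.

    Rows whose class label is not found are skipped. A tag prefix that is
    missing from NS raises KeyError. Returns the updated RDF/XML as a string.
    """
    # Register prefixes so the serialised output keeps readable names.
    for prefix, uri in NS.items():
        ET.register_namespace(prefix, uri)

    root = ET.fromstring(owl_xml)
    by_label = {}
    for cls in root.findall("owl:Class", NS):
        label = cls.findtext("rdfs:label", default="", namespaces=NS).strip()
        if label:
            by_label[label.lower()] = cls

    for annotation, label, tag in rows:
        cls = by_label.get(label.lower())
        if cls is None:
            continue  # no exact class match, nothing to update
        prefix, local = tag.split(":", 1)
        child = ET.SubElement(cls, "{%s}%s" % (NS[prefix], local))
        child.text = annotation

    return ET.tostring(root, encoding="unicode")


rows = [
    ("sea", "water", "oboInOWL:hasExactSynonym"),
    ("mega", "large", "oboInOWL:hasRelatedSynonym"),
    ("unused", "missing class", "oboInOWL:hasExactSynonym"),
]
updated = add_annotations(TOY_OWL, rows)
print(updated)
```

Matching on the class label is exact (ignoring case), in line with the "exact ontology match" requirement for the class column.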
Plot an ontology in web or tree style. By default, superclasses have overlay text; users can choose whether to include it for subclasses as well.
[ontology]_[plottype].png is the exported plot.
You have curated unstructured text: blog posts from a social media platform (with permission, of course; for this example I invented these fake conversations).
Your aim is to text mine the corpus and keep only the posts covering a particular topic (or set of topics). However, although you know some words for this topic, you realise you may be missing related or broader synonyms.
This is where ontologies are useful. An ontology is a controlled vocabulary with annotations.
With your words of interest (ontology classes), you can run bandersnatch to extract all synonyms.
With these classes and corresponding synonyms, you can annotate the corpus using catch: the PhraseMatcher() function tags each post in the corpus.
You've chosen to have two outputs: one with the annotated posts for downstream analysis, and another with the posts that weren't annotated, which you decide to investigate for anything valuable.
You can proceed to use bite to investigate whether there are any "important" terms.
The statistical TF-IDF method is applied and all words are ranked in terms of importance.
With this bite output, you may notice new synonyms...
You can use arise to update your ontology classes with these new synonyms.
Finally, you may want to rerun bandersnatch to extract an updated list of key terms and then rerun catch.
This concludes the NLP workflow: you notice that the second round of catch provides more data, and so a more fruitful downstream analysis.
This work was published in JOSS; you can cite it with:
@article{Pendleton2020,
doi = {10.21105/joss.02168},
url = {https://doi.org/10.21105/joss.02168},
year = {2020},
publisher = {The Open Journal},
volume = {5},
number = {51},
pages = {2168},
author = {Samantha C. Pendleton and Georgios V. Gkoutos},
title = {Jabberwocky: an ontology-aware toolkit for manipulating text},
journal = {Journal of Open Source Software}
}
This repository was inspired by (and the inspiration of) the OcIMIDo project.