The Multilayer Shift-and-Stitch Deep Convolutional Neural Network is a machine learning algorithm developed by the University of Virginia's Bioinformatics Laboratory. More information on this algorithm can be found in its GitHub repo.
To test how efficiently this algorithm can process natural language data, four Python files were written to pre-process text from the Rhetorical Structure Theory data collection. Pre-processing was done with the Natural Language Toolkit (NLTK). The following Python files can be found in the RST\data\RSTtrees-WSJ-main-1.0 directory:
- 0_Dict_Build.py
- 1_EDU_tag.py
- 1_EDU_word.py
- RST_ALL_EDUs.py
All of the text files processed by these scripts are included in the same directory. Additional text files are included elsewhere in the repo.
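
For readers unfamiliar with this kind of pre-processing, the sketch below shows one way to tokenize text segments and build a word-to-index dictionary. It is an illustration only, not the contents of the scripts listed above: the function names `build_vocab` and `encode`, the `<unk>` placeholder, and the sample sentences are all hypothetical. It uses NLTK's `wordpunct_tokenize`, a regex-based tokenizer that needs no extra NLTK data, and falls back to an equivalent regular expression if NLTK is not installed.

```python
import re
from collections import Counter

try:
    # Regex-based NLTK tokenizer; requires no additional data downloads.
    from nltk.tokenize import wordpunct_tokenize
except ImportError:
    # Fallback with the same splitting behaviour if NLTK is unavailable.
    def wordpunct_tokenize(text):
        return re.findall(r"\w+|[^\w\s]+", text)


def build_vocab(segments, min_count=1):
    """Map each lowercased token to an integer id (hypothetical helper)."""
    counts = Counter(
        token.lower()
        for segment in segments
        for token in wordpunct_tokenize(segment)
    )
    vocab = {"<unk>": 0}  # id 0 is reserved for unseen tokens (assumption)
    # Most frequent tokens first; ties are broken alphabetically.
    for token, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_count:
            vocab[token] = len(vocab)
    return vocab


def encode(segment, vocab):
    """Convert one text segment into a list of token ids (hypothetical helper)."""
    unk = vocab["<unk>"]
    return [vocab.get(token.lower(), unk) for token in wordpunct_tokenize(segment)]


# Made-up sample segments, not taken from the data collection.
sample_segments = ["The company said", "it expects profits to rise ."]
vocab = build_vocab(sample_segments)
ids = encode("The company expects losses", vocab)
```

Here `ids` holds one integer per token, with any word missing from the dictionary mapped to the `<unk>` id. Integer sequences of this kind are a common input format for neural networks, though the actual format expected by the algorithm should be checked against the scripts themselves.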