-
Notifications
You must be signed in to change notification settings - Fork 7
cnap/anno-pipeline
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
# This is the pipeline used to annotate Gigaword English v.5 # (Annotated Gigaword, Napoles et al. 2012). # # Courtney Napoles, cdnapoles@gmail.com # 2012-07-03 NOTES See pipeline.sh for the full pipeline and usage of individual steps If you are running a copy of this, you need to modify scripts/ splitta.1.03/sbd.py so that the paths for SVM_LEARN and SVM_CLASSIFY point to your installation. Note that the pipeline uses a parallel environment (8 threads for parsing) so please set your configurations accordingly (qsub -l num_proc=8,mem_free=16G,h_vmem=22G). Be sure to set the environment encoding to UTF-8. USAGE To run: ./pipeline file_to_annotate.xml working_directory [OPTIONS] To just run the annotators: java -Xmx16g -cp bin:lib/stanford-corenlp-2012-05-22.jar:lib/my-xom.jar:lib/stanford-corenlp-2012-05-22-models.jar:lib/joda-time.jar \ edu.jhu.annotation.GigawordAnnotator --in <TESTFILE> If you'd like to annotate a file that contains a single document without any SGML markup, add "--sgml f". However, for annotating a large quantity of files this is unadvisable, because loading the Stanford models takes a couple of minutes. It is more efficient to include several documents in one file (and documents should be formatted like <DOC><TEXT>parses</TEXT></DOC>). FILE FORMAT sample.txt contains a sample file format. If using SGML markup (which is recommended because then multiple documents can be stored in the same file), the following format is assumed: <DOC id="xx"> <TEXT> ... </TEXT> </DOC> <DOC ... Any tags in between <DOC> and <TEXT> are ignored but passed through intact. The only tag allowed in <TEXT> is <P>. All text in the <TEXT> element will be processed and annotated. The pipeline assumes that each line is EITHER sgml markup or text (so do not put a tag on the same line as text. The pipeline does not detect/correct invalid SGML but it will convert SGML to XML (by adding a root element and escaping <, >, and &. DEPENDENCIES Software versions used: Splitta 1.03 Stanford CoreNLP 1.3.2 Requirements: jgrapht.jar joda-time.jar my-xom.jar splitta.1.03 stanford-corenlp-2012-05-22.jar stanford-corenlp-2012-05-22-models.jar svm_light.6.02 umd-parser.jar wsj-6.pml # grammar file
About
tool chain used for Annotated Gigaword
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published