GitHub - cnap/anno-pipeline: tool chain used for Annotated Gigaword

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
scripts		scripts
src/edu/jhu/annotation		src/edu/jhu/annotation
README		README
annotate.sh		annotate.sh
build.xml		build.xml
pipeline.sh		pipeline.sh
sample.txt		sample.txt

Repository files navigation

# This is the pipeline used to annotate Gigaword English v.5
# (Annotated Gigaword, Napoles et al. 2012). 
#
# Courtney Napoles, cdnapoles@gmail.com
# 2012-07-03

NOTES

See pipeline.sh for the full pipeline and usage of individual steps
If you are running a copy of this, you need to modify scripts/
splitta.1.03/sbd.py so that the paths for SVM_LEARN and SVM_CLASSIFY
point to your installation.

Note that the pipeline uses a parallel environment (8 threads for 
parsing) so please set your configurations accordingly 
(qsub -l num_proc=8,mem_free=16G,h_vmem=22G).

Be sure to set the environment encoding to UTF-8.


USAGE

To run:

./pipeline file_to_annotate.xml working_directory [OPTIONS]

To just run the annotators:
java -Xmx16g -cp bin:lib/stanford-corenlp-2012-05-22.jar:lib/my-xom.jar:lib/stanford-corenlp-2012-05-22-models.jar:lib/joda-time.jar \
     edu.jhu.annotation.GigawordAnnotator --in <TESTFILE>

If you'd like to annotate a file that contains a single document 
without any SGML markup, add "--sgml f". However, for annotating a 
large quantity of files this is unadvisable, because loading the 
Stanford models takes a couple of minutes. It is more efficient to
include several documents in one file (and documents should be
formatted like <DOC><TEXT>parses</TEXT></DOC>). 


FILE FORMAT

sample.txt contains a sample file format. If using SGML markup
(which is recommended because then multiple documents can be stored
in the same file), the following format is assumed:

<DOC id="xx">
<TEXT>
...
</TEXT>
</DOC>
<DOC ...

Any tags in between <DOC> and <TEXT> are ignored but passed through
intact. The only tag allowed in <TEXT> is <P>. All text in the
<TEXT> element will be processed and annotated. The pipeline assumes
that each line is EITHER sgml markup or text (so do not put a tag
on the same line as text. The pipeline does not detect/correct 
invalid SGML but it will convert SGML to XML (by adding a root 
element and escaping <, >, and &. 


DEPENDENCIES

Software versions used:
Splitta 1.03
Stanford CoreNLP 1.3.2

Requirements:
jgrapht.jar
joda-time.jar
my-xom.jar
splitta.1.03
stanford-corenlp-2012-05-22.jar
stanford-corenlp-2012-05-22-models.jar
svm_light.6.02
umd-parser.jar
wsj-6.pml	# grammar file