# This is the pipeline used to annotate Gigaword English v.5
# (Annotated Gigaword, Napoles et al. 2012).
#
# Courtney Napoles, cdnapoles@gmail.com
# 2012-07-03
NOTES
See pipeline.sh for the full pipeline and usage of individual steps.
If you are running your own copy, modify
scripts/splitta.1.03/sbd.py so that the paths for SVM_LEARN and
SVM_CLASSIFY point to your svm_light installation.
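As a rough sketch, the edit to sbd.py might look like the excerpt
below. SVM_LEARN and SVM_CLASSIFY are the names mentioned above;
SVM_LIGHT_DIR and its value are placeholders introduced here for
illustration (an assumption, not part of Splitta), so substitute the
directory that holds your svm_learn and svm_classify binaries.

```python
# scripts/splitta.1.03/sbd.py (excerpt; adapt to your machine)
import os

# Hypothetical helper variable: the directory of your svm_light build.
SVM_LIGHT_DIR = '/path/to/svm_light.6.02'

# svm_light ships two binaries, svm_learn and svm_classify.
SVM_LEARN = os.path.join(SVM_LIGHT_DIR, 'svm_learn')
SVM_CLASSIFY = os.path.join(SVM_LIGHT_DIR, 'svm_classify')
```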
Note that the pipeline runs in a parallel environment (8 threads for
parsing), so configure your job accordingly, for example
(qsub -l num_proc=8,mem_free=16G,h_vmem=22G).
Be sure to set the environment encoding to UTF-8.
USAGE
To run:
./pipeline file_to_annotate.xml working_directory [OPTIONS]
To just run the annotators:
java -Xmx16g -cp bin:lib/stanford-corenlp-2012-05-22.jar:lib/my-xom.jar:lib/stanford-corenlp-2012-05-22-models.jar:lib/joda-time.jar \
edu.jhu.annotation.GigawordAnnotator --in <TESTFILE>
To annotate a file that contains a single document without any SGML
markup, add "--sgml f". This is inadvisable for a large number of
files, however, because loading the Stanford models takes a couple
of minutes on every run. It is more efficient to put several
documents in one file, with each document formatted like
<DOC><TEXT>parses</TEXT></DOC>.
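One way to batch many plain-text documents into a single input file
is sketched below. It follows the layout described under FILE FORMAT
(one tag per line, an id attribute on <DOC>). The function name
wrap_documents and the (id, text) input shape are assumptions made
for this example, not part of the pipeline. The text is left
unescaped because the pipeline does its own escaping of <, >, and &.

```python
def wrap_documents(docs):
    """Wrap (doc_id, text) pairs in <DOC>/<TEXT> markup.

    Each tag goes on its own line, so no line mixes markup and text.
    """
    lines = []
    for doc_id, text in docs:
        lines.append('<DOC id="%s">' % doc_id)
        lines.append('<TEXT>')
        lines.extend(text.splitlines())  # text lines, passed through as-is
        lines.append('</TEXT>')
        lines.append('</DOC>')
    return '\n'.join(lines) + '\n'


if __name__ == '__main__':
    batch = wrap_documents([
        ('d1', 'First document.\nIt has two lines.'),
        ('d2', 'Second document.'),
    ])
    print(batch, end='')
```

Write the returned string to a UTF-8 file and pass that file to the
pipeline in place of many single-document files.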
FILE FORMAT
sample.txt contains a sample file format. If using SGML markup
(which is recommended because then multiple documents can be stored
in the same file), the following format is assumed:
<DOC id="xx">
<TEXT>
...
</TEXT>
</DOC>
<DOC ...
Any tags between <DOC> and <TEXT> are ignored but passed through
intact. The only tag allowed in <TEXT> is <P>. All text in the
<TEXT> element will be processed and annotated. The pipeline assumes
that each line is EITHER SGML markup or text (so do not put a tag
on the same line as text). The pipeline does not detect or correct
invalid SGML, but it will convert SGML to XML (by adding a root
element and escaping <, >, and &).
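The conversion described above can be illustrated with a short
sketch. This is not the pipeline's own code: the function name
sgml_to_xml, the root element name ROOT, and the regular expression
used to recognize a markup line are all assumptions made for
illustration. It relies on the one-tag-or-text-per-line rule, so a
line that mixes a tag with text is treated as text and escaped whole.

```python
import re
from xml.sax.saxutils import escape  # escapes &, < and >

# A markup line holds exactly one tag and nothing else (assumed pattern).
TAG_LINE = re.compile(r'^\s*</?[A-Za-z][^<>]*>\s*$')


def sgml_to_xml(sgml, root='ROOT'):
    """Wrap SGML in a root element and escape the text lines."""
    out = ['<%s>' % root]
    for line in sgml.splitlines():
        if TAG_LINE.match(line):
            out.append(line)          # markup line: keep unchanged
        else:
            out.append(escape(line))  # text line: escape &, <, >
    out.append('</%s>' % root)
    return '\n'.join(out) + '\n'


if __name__ == '__main__':
    sample = '<DOC id="d1">\n<TEXT>\nAT&T rose <3% today.\n</TEXT>\n</DOC>\n'
    print(sgml_to_xml(sample), end='')
```

Because invalid SGML is not detected, a sketch like this would pass
an unclosed tag straight through and produce malformed XML, so check
the input beforehand.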
DEPENDENCIES
Software versions used:
Splitta 1.03
Stanford CoreNLP 1.3.2
Requirements:
jgrapht.jar
joda-time.jar
my-xom.jar
splitta.1.03
stanford-corenlp-2012-05-22.jar
stanford-corenlp-2012-05-22-models.jar
svm_light.6.02
umd-parser.jar
wsj-6.pml # grammar file