Please refer to our paper entitled "Paraphrasing for Style" at COLING 2012 for more details.
Please also cite our paper accordingly.
@inproceedings{xu2012paraphrasing,
title={Paraphrasing for Style},
author={Xu, Wei and Ritter, Alan and Dolan, Bill and Grishman, Ralph and Cherry, Colin},
booktitle={COLING},
pages={2899--2914},
year={2012}
}
============================================================================================
python/crawl_{plays|sonnets}.py
crawls the nfs.sparknotes.com website and downloads the HTML pages
python/scrapper_soupparser.py
extracts original and modern english sentences from the HTML
scripts/extract_lines.sh
runs scrapper_soupparser.py on all the HTML files and puts the output into a format suitable for sentence alignment (in data/plays/align)
scrapper2/
scripts and data for scraping parallel texts from www.enotes.com
data/align/plays/ from nfs.sparknotes.com
data/align/plays2/ from www.enotes.com
data/align/plays/notmerged
data/align/plays2/notmerged
Contains the aligned sentences at the level of HTML pages
data/align/plays/merged
data/align/plays2/merged
All sentences merged together into plays.
The bilingual sentence aligner does not work well on very small files (e.g. the individual HTML pages). In principle
it should be possible to do a better job with the smaller files, since they are more tightly aligned, but I don't
think this is the type of data the sentence aligner was designed for.
It works fine on the merged plays: out of 31,718 original lines, the sentence aligner finds 21,079
alignments (about 66%). It may be possible to improve on this.
Also note: in the merged plays the pages are in random order. This probably makes little difference,
but might be worth fixing.
data/align/plays/model_16plays
Moses model trained on 16 plays (except R&J), following the instructions on http://www.statmt.org/moses_steps.html
moses.ini
moses-bin.ini
The two above use the same parameters but produce different outputs (?)
./tuning/
I put R&J here, planning to split it into dev and test sets.
data/test_small
Wei_facebook.txt
A small set of Facebook posts collected by hand from my Facebook.
03_natural_tweets.pl
natural_tweets_10000.txt
An attempt to find 'meaningful' tweets using heuristic rules, e.g. the proportion of words found in a dictionary.
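The dictionary-proportion heuristic can be sketched roughly as follows. This is an illustrative Python sketch, not the logic of 03_natural_tweets.pl: the function names, the tokenization regex, the 0.8 threshold, and the minimum token count are all assumptions made for the example.

```python
import re

# Hypothetical tokenizer: lowercase runs of letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def dictionary_proportion(text, vocabulary):
    """Return the fraction of tokens in `text` that appear in `vocabulary`."""
    tokens = TOKEN_RE.findall(text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for tok in tokens if tok in vocabulary)
    return known / len(tokens)


def is_meaningful(text, vocabulary, threshold=0.8, min_tokens=3):
    """Assumed rule: enough tokens, and most of them are dictionary words."""
    tokens = TOKEN_RE.findall(text.lower())
    if len(tokens) < min_tokens:
        return False
    return dictionary_proportion(text, vocabulary) >= threshold


# Toy vocabulary standing in for a real word list.
vocab = {"the", "cat", "sat", "on", "mat"}
print(is_meaningful("The cat sat on the mat", vocab))
print(is_meaningful("lol omg the cat", vocab))
```

In practice the threshold would need tuning against a sample of tweets judged by hand.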
data/shakespere.dict
A Shakespeare Glossary from http://www.william-shakespeare.info/
Scripts and HTML pages used:
dictionary/*.htm
python/crawl_dictionary.py
python/scrap_dictionary.py
mert/
Contains held-out dev data from Romeo & Juliet for discriminative training.
mert/mert-work/moses.ini contains tuned weights for the 16 plays model
eval/
Files for evaluating BLEU. Below are some results from the 16 plays model with/without MERT:
First note that the 2nd translation (enotes.com) appears to be much closer to the original text than the first (sparknotes):
-bash-4.1$ ~/mt/mosesdecoder/scripts/generic/multi-bleu.perl ascii.romeojuliet_tokenized_lower_original < ascii.romeojuliet_tokenized_lower_modern.1
BLEU = 24.67, 56.5/30.2/18.8/11.5 (BP=1.000, ratio=1.047, hyp_len=4312, ref_len=4120)
-bash-4.1$ ~/mt/mosesdecoder/scripts/generic/multi-bleu.perl ascii.romeojuliet_tokenized_lower_original < ascii.romeojuliet_tokenized_lower_modern.2
BLEU = 52.30, 75.9/57.7/46.0/37.1 (BP=1.000, ratio=1.044, hyp_len=4300, ref_len=4120)
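To read the multi-bleu.perl output lines: the four slash-separated numbers are the 1- to 4-gram precisions (in percent), BP is the brevity penalty, and ratio is hyp_len / ref_len. The standard BLEU score is the brevity penalty times the geometric mean of the precisions. The sketch below recomputes the score from the printed values; the function names are made up for illustration, and because the printed precisions are rounded, the result only approximately matches the reported score.

```python
import math


def brevity_penalty(hyp_len, ref_len):
    """Standard BLEU brevity penalty: 1 unless the hypothesis is shorter."""
    if hyp_len <= 0:
        return 0.0
    if hyp_len > ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


def bleu_from_precisions(precisions_pct, hyp_len, ref_len):
    """Combine n-gram precisions (in percent) into a BLEU score (in percent)."""
    if any(p <= 0 for p in precisions_pct):
        return 0.0
    log_mean = sum(math.log(p / 100.0) for p in precisions_pct) / len(precisions_pct)
    return 100.0 * brevity_penalty(hyp_len, ref_len) * math.exp(log_mean)


# Values copied from the two multi-bleu.perl output lines above.
sparknotes = bleu_from_precisions([56.5, 30.2, 18.8, 11.5], hyp_len=4312, ref_len=4120)
enotes = bleu_from_precisions([75.9, 57.7, 46.0, 37.1], hyp_len=4300, ref_len=4120)
print(round(sparknotes, 2), round(enotes, 2))
```

Both hypotheses are longer than the reference (ratio above 1), which is why BP is 1.000 in every line of the results.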
Current BLEU scores (looks like the larger LM has a big effect):
16and7plays_16LM.1
BLEU = 27.35, 59.8/33.3/21.0/13.4 (BP=1.000, ratio=1.041, hyp_len=4287, ref_len=4120)
16and7plays_16LM_mert.1
BLEU = 27.61, 59.9/33.8/21.4/13.5 (BP=1.000, ratio=1.039, hyp_len=4281, ref_len=4120)
16and7plays_16LM.2
BLEU = 52.68, 77.4/58.6/46.2/36.7 (BP=1.000, ratio=1.041, hyp_len=4289, ref_len=4120)
16and7plays_16LM_mert.2
BLEU = 53.27, 77.5/59.3/46.9/37.4 (BP=1.000, ratio=1.045, hyp_len=4306, ref_len=4120)
16and7plays_37LM.1
BLEU = 30.56, 61.1/36.1/24.2/16.3 (BP=1.000, ratio=1.044, hyp_len=4300, ref_len=4120)
16and7plays_37LM_mert.1
BLEU = 30.54, 60.7/36.0/24.3/16.4 (BP=1.000, ratio=1.048, hyp_len=4319, ref_len=4120)
16and7plays_37LM.2
BLEU = 57.80, 79.3/63.0/52.0/42.9 (BP=1.000, ratio=1.045, hyp_len=4305, ref_len=4120)
16and7plays_37LM_mert.2
BLEU = 57.32, 78.9/62.3/51.5/42.7 (BP=1.000, ratio=1.052, hyp_len=4335, ref_len=4120)
16plays_37LM.1
BLEU = 30.63, 61.1/36.1/24.2/16.5 (BP=1.000, ratio=1.032, hyp_len=4250, ref_len=4120)
16plays_37LM_mert.1
BLEU = 30.53, 60.8/35.9/24.2/16.4 (BP=1.000, ratio=1.030, hyp_len=4244, ref_len=4120)
16plays_37LM.2
BLEU = 55.54, 78.0/61.1/49.6/40.2 (BP=1.000, ratio=1.036, hyp_len=4267, ref_len=4120)
16plays_37LM_mert.2
BLEU = 56.02, 78.0/60.9/50.1/41.4 (BP=1.000, ratio=1.040, hyp_len=4285, ref_len=4120)
singleword1.1
BLEU = 24.39, 56.4/29.8/18.6/11.3 (BP=1.000, ratio=1.047, hyp_len=4312, ref_len=4120)
singleword1.2
BLEU = 51.51, 75.8/56.9/45.1/36.2 (BP=1.000, ratio=1.044, hyp_len=4300, ref_len=4120)
./models/Lexicons/singleword1/
A very simple dictionary-based model (not sure about correctness), which can translate "No way excuse his disadvantages when we do bear" into "No way excuse his foils when we do bear". The phrase table P(o|m) is based on a verbatim dictionary (1520 pairs) and the frequencies of the 'original' words in the 37 Shakespeare plays. E.g.
- The dictionary has two entries for the Shakespearean word 'abrook': 'abrook -> abide' and 'abrook -> brook'. These are extended into a 'modern -> original' phrase table by adding the reverse direction (e.g. 'abide -> abrook'), the identical word pair (e.g. 'abide -> abide'), and suffixed forms (e.g. 'abides -> abrooks', 'abideed -> abrooked').
- Look up the unigrams in the language model built on the 37 original plays (37plays_tokenized_lowercased.original.1gram) and derive P(o|m) from the unigram probabilities, normalized over the candidate original words for each modern word (so the entries for each modern word sum to 1). E.g.
abide ||| abide ||| 0.940088299386192
abide ||| abrook ||| 0.0599117006138077
brook ||| brook ||| 0.95608871056976
brook ||| abrook ||| 0.0439112894302401
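The two steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the script that produced the table: the function names are hypothetical, the suffix list ("s", "ed") is inferred from the examples, the unigram counts are invented toy numbers, and candidates with no unigram mass are simply dropped.

```python
from collections import defaultdict


def expand_dictionary(entries, suffixes=("s", "ed")):
    """Map each modern word to its candidate original words.

    `entries` holds (original, modern) pairs. For every pair, and for every
    naively suffixed variant, add the reverse direction (modern -> original)
    and the identity pair (modern -> modern).
    """
    candidates = defaultdict(set)
    for original, modern in entries:
        for suffix in ("",) + tuple(suffixes):
            o, m = original + suffix, modern + suffix
            candidates[m].add(o)  # reverse direction, e.g. abide -> abrook
            candidates[m].add(m)  # identical pair, e.g. abide -> abide
    return candidates


def phrase_table(candidates, unigram_counts):
    """Derive P(o|m) by normalizing unigram counts over each candidate set."""
    table = {}
    for modern, originals in candidates.items():
        total = sum(unigram_counts.get(o, 0) for o in originals)
        if total == 0:
            continue  # assumption: skip modern words with no observed candidates
        for o in sorted(originals):
            table[(modern, o)] = unigram_counts.get(o, 0) / total
    return table


entries = [("abrook", "abide"), ("abrook", "brook")]
counts = {"abide": 94, "abrook": 6, "brook": 44}  # invented counts, not from the 37 plays
candidates = expand_dictionary(entries)
table = phrase_table(candidates, counts)
for (m, o), p in sorted(table.items()):
    print(f"{m} ||| {o} ||| {p}")
```

The suffixes are concatenated without any morphological handling, which is why forms like 'abideed' appear.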