results
- determinism constraints (no connected constraint)
- same features as the March12 LR parser
- predicted singletons, including predicted tops
- 30 iterations of SVM, with averaging

| Formalism | LR | Alg1 |
|---|---|---|
| PAS | .803 | .810 |
| DM | .740 | .770 |
| PCEDT | .648 | .624 |
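The "30 iterations of SVM, with averaging" above presumably refers to the standard running average of the weight vector used in averaged perceptron / online SVM training. A minimal sketch of that idea, where `update` is a hypothetical stand-in for the real structured update:

```python
import numpy as np

def train_averaged(examples, update, dim, n_iters=30):
    """Online training with weight averaging: accumulate the weight vector
    after every update and return the average instead of the final weights."""
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    n_updates = 0
    for _ in range(n_iters):
        for ex in examples:
            w = update(w, ex)   # hypothetical structured-SVM/perceptron update
            w_sum += w
            n_updates += 1
    return w_sum / n_updates
```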
Logreg Model: BasicFeatures + LinearOrderFeatures + CoarseDependencyFeatures + LinearContextFeatures + DependencyPathv1 + SubcatSequenceFE + UnlabeledDepFE
One "onefull" iteration on sec00-19.

| Formalism | Test LF |
|---|---|
| PAS | .861 |
| DM | .809 |
| PCEDT | .714 |
Medtrain (30 iterations)

| Formalism | Test LF |
|---|---|
| PAS | .831 |
| DM | .763 |
| PCEDT | .646 |
- gold singletons, no lexical bigrams
- including predicted tops
30 iterations. I was frustrated that the regularizer wasn't helping, so I tried a range of regularizer strengths from .1 all the way down to 1E-9, plus 0.0, and picked the best one on dev. Honestly, with regularizers this small, this amounts to picking the best run on dev and reporting its dev accuracy, which is not the right way to do experiments. I should really do multiple runs for each regularizer strength, pick the regularizer using dev, and report accuracy and standard deviation on a devtest set. Maybe that's overkill, but one should at least do multiple runs at the same regularizer strength to see the standard deviation. Judging from the variance across runs whose regularizers are so small they can't be doing anything, the standard deviation for these runs is about .01.

| Formalism | Dev F1 | L2 Reg |
|---|---|---|
| PAS | .794 | 1E-9 |
| DM | .751 | 1E-7 |
| PCEDT | .632 | 1E-7 |
(compared to LogReg parser PAS F1 = .803, DM F1 = .740, PCEDT F1 = .648)
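A minimal sketch of the tuning protocol described above (multiple runs per regularizer strength, select on dev, report devtest mean and standard deviation); `train_and_eval` is a hypothetical stand-in for the real training and evaluation code:

```python
import random
import statistics

def tune_regularizer(train_and_eval,
                     lambdas=(0.1, 1e-3, 1e-5, 1e-7, 1e-9, 0.0),
                     n_runs=5):
    """Run each regularizer strength several times; select on mean dev F1,
    and report mean and standard deviation on a held-out devtest split.
    train_and_eval(lam, seed) -> (dev_f1, devtest_f1) is hypothetical."""
    results = {}
    for lam in lambdas:
        runs = [train_and_eval(lam, seed=random.randrange(10**6))
                for _ in range(n_runs)]
        dev = [d for d, _ in runs]
        devtest = [t for _, t in runs]
        results[lam] = (statistics.mean(dev),
                        statistics.mean(devtest),
                        statistics.stdev(devtest))
    # pick the regularizer by mean dev F1, but report devtest mean +/- stdev
    best = max(results, key=lambda lam: results[lam][0])
    return best, results[best]
```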
- gold singletons, no lexical bigrams
- including predicted tops
- 10 iterations; lambda = 0.0 (best of .1, .01, .001, 0.0)

| Formalism | Dev F1 |
|---|---|
| PAS | .780 |
| DM | .760 |
| PCEDT | .634 |
(compared to PAS F1 = .803, DM F1 = .740, PCEDT F1 = .648 from below)
I compared train vs. test scores for the different formalisms, and I also experimented with removing lexical bigrams. It turns out lexical bigrams don't help at all at our current regularization strength (.5), so we should take them out. To do this experiment right (i.e., compare with and without), we should really tune the regularization strength and the precision/recall trade-off for each formalism without lexical bigrams before comparing. It may actually be better without them.
There are a few interesting things to note:
- On medtrain, we do poorly on PCEDT mainly because we don't generalize well to the test data (train scores are comparable across formalisms, i.e. high 90s, but PCEDT test is in the 60s).
- Recall drops more than precision on test. (This makes sense: tuning the precision/recall trade-off makes the parser conservative.)
(Take the onefull train vs test numbers with a grain of salt because it hasn't converged.)
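One common way to implement the precision/recall tuning mentioned above is to add a constant bias to every candidate edge's score before decoding and sweep that bias on dev; this sketch assumes a hypothetical `decode(sent, bias)` hook and is not necessarily how this parser actually does it:

```python
def tune_edge_bias(dev_sents, gold_edges, decode, biases=None):
    """Sweep a constant added to every candidate edge score and keep the value
    that maximizes labeled F1 on dev. A positive bias predicts more edges
    (higher recall, lower precision); a negative bias is more conservative.

    decode(sent, bias) -> set of (head, dependent, label) triples (hypothetical);
    gold_edges is a list of per-sentence sets of gold triples."""
    if biases is None:
        biases = [x / 10.0 for x in range(-20, 21)]  # -2.0 ... 2.0

    def f1_at(bias):
        tp = fp = fn = 0
        for sent, gold in zip(dev_sents, gold_edges):
            pred = decode(sent, bias)
            tp += len(pred & gold)
            fp += len(pred - gold)
            fn += len(gold - pred)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    return max(biases, key=f1_at)
```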
Medtrain, 30 iterations with lexical bigrams (including tops)

| Formalism | Train P | Train R | Train F1 | Test P | Test R | Test F1 |
|---|---|---|---|---|---|---|
| PAS | 96 | 95 | 95.8 | 83 | 77 | 80.3 |
| DM | 94 | 94 | 94.5 | 77 | 70 | 74.0 |
| PCEDT | | | | | | |
Medtrain, 30 iterations without lexical bigrams (including tops)

| Formalism | Train P | Train R | Train F1 | Test P | Test R | Test F1 |
|---|---|---|---|---|---|---|
| PAS | 94 | 94 | 94.7 | 83 | 77 | 80.3 |
| DM | 90 | 93 | 92.2 | 76 | 72 | 74.0 |
| PCEDT | 96 | 90 | 94.8 | 70 | 59 | 64.8 |
Onefull with lexical bigrams (including tops)

| Formalism | Train P | Train R | Train F1 | Test P | Test R | Test F1 |
|---|---|---|---|---|---|---|
| PAS | 86 | 87.1 | 87 | 81 | 84 | 82.7 |
| DM | | | | | | |
| PCEDT | 79 | 76 | 78.2 | 70 | 69 | 70.2 |
Onefull without lexical bigrams (including tops)

| Formalism | Train P | Train R | Train F1 | Test P | Test R | Test F1 |
|---|---|---|---|---|---|---|
| PAS | 85 | 86 | 85.7 | 81 | 83 | 82.5 |
| DM | 78 | 83 | 81.1 | 74 | 80 | 77.6 |
| PCEDT | 76 | 74 | 75.6 | 70 | 69 | 69.9 |
Only one iteration on sec00-19:
dm_onefull_model.eval.log LF 0.783744
pas_onefull_model.eval.log LF 0.827898
pcedt_onefull_model.eval.log LF 0.702516
DM, sec00-19 (hashing): still going up, slowly, after 30 iterations.
big2.iter 0 .log:LF: 0.783744
big2.iter 1 .log:LF: 0.789072
big2.iter 2 .log:LF: 0.792465
big2.iter 3 .log:LF: 0.794552
big2.iter 4 .log:LF: 0.796020
big2.iter 5 .log:LF: 0.796875
big2.iter 6 .log:LF: 0.797945
big2.iter 7 .log:LF: 0.798727
big2.iter 8 .log:LF: 0.798596
big2.iter 9 .log:LF: 0.798702
big2.iter 10 .log:LF: 0.799353
big2.iter 11 .log:LF: 0.799619
big2.iter 12 .log:LF: 0.799683
big2.iter 13 .log:LF: 0.799777
big2.iter 14 .log:LF: 0.799625
big2.iter 15 .log:LF: 0.799704
big2.iter 16 .log:LF: 0.799834
big2.iter 17 .log:LF: 0.800108
big2.iter 18 .log:LF: 0.800325
big2.iter 19 .log:LF: 0.800347
big2.iter 20 .log:LF: 0.800369
big2.iter 21 .log:LF: 0.800355
big2.iter 22 .log:LF: 0.800369
big2.iter 23 .log:LF: 0.800492
big2.iter 24 .log:LF: 0.800355
big2.iter 25 .log:LF: 0.800471
big2.iter 26 .log:LF: 0.800420
big2.iter 27 .log:LF: 0.800442
big2.iter 28 .log:LF: 0.800515
big2.iter 29 .log:LF: 0.800696
DM, medtrain (no hashing): this is bouncier.
med.iter 0 .log:LF: 0.714132
med.iter 10 .log:LF: 0.737409
med.iter 20 .log:LF: 0.739642
med.iter 30 .log:LF: 0.740195
med.iter 40 .log:LF: 0.740066
med.iter 50 .log:LF: 0.739977
med.iter 60 .log:LF: 0.740080
med.iter 70 .log:LF: 0.739992
med.iter 80 .log:LF: 0.739890
med.iter 90 .log:LF: 0.739985
med.iter 100 .log:LF: 0.739931
med.iter 110 .log:LF: 0.740063
med.iter 120 .log:LF: 0.740153
med.iter 130 .log:LF: 0.740091
med.iter 140 .log:LF: 0.740125
med.iter 150 .log:LF: 0.740067
med.iter 160 .log:LF: 0.740123
med.iter 170 .log:LF: 0.740147
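The per-iteration numbers above look like grep output over the eval logs. A minimal sketch for recovering the learning curve from lines in that `<logfile>:LF: <value>` format and plotting it (the log-line format is assumed from what is pasted above):

```python
import re
import sys
import matplotlib.pyplot as plt

# Assumed input on stdin: lines like "big2.iter 0 .log:LF: 0.783744",
# e.g. from grepping LF out of the per-iteration eval logs.
LINE = re.compile(r"iter[ _.]*(\d+).*?:LF:\s*([0-9.]+)")

points = []
for line in sys.stdin:
    m = LINE.search(line)
    if m:
        points.append((int(m.group(1)), float(m.group(2))))

points.sort()
iters, lf = zip(*points)
plt.plot(iters, lf, marker="o")
plt.xlabel("iteration")
plt.ylabel("labeled F1 (LF)")
plt.show()
```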
Train on sec00-19, test on sec20.
Features: BasicFeatures + LinearOrderFeatures + DependencyPathv1 + SubcatSequenceFE + CoarseDependencyFeatures + logreg top classifier.
Labeled F1 (including tops):

| it. | PAS | DM | PCEDT |
|---|---|---|---|
| 0 | 0.784818 | 0.755264 | 0.674663 |
| 10 | 0.814511 | 0.785971 | 0.704177 |
| 20 | 0.816579 | 0.788184 | 0.705842 |
| 30 | 0.81723 | 0.788877 | 0.706791 |
I'm calling this experiment "feb7" since that's the last checkin for the code I'm running (this one).
Train on sec00-19, test on sec20.
Runtime info
- 15g was enough for PAS and DM but crashed on PCEDT; 18g was enough for PCEDT. cab can't do >18g; stampede was able to do 25g.
- maybe it's time to go to feature hashing (see the sketch after this list)
- 4 hours to train
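On the feature-hashing idea above: the standard hashing trick maps each feature string to an index in a fixed-size weight vector, so memory stops growing with the number of distinct feature strings. A generic sketch; the bucket count and helper names are illustrative, not the parser's actual code:

```python
import hashlib
import numpy as np

NUM_BUCKETS = 1 << 22  # ~4M buckets; purely illustrative

def feature_index(feature_string):
    """Hash a feature string into a fixed index range (the 'hashing trick')."""
    digest = hashlib.md5(feature_string.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % NUM_BUCKETS

weights = np.zeros(NUM_BUCKETS)

def score(feature_strings):
    """Score a candidate edge as the sum of its hashed feature weights."""
    return sum(weights[feature_index(f)] for f in feature_strings)
```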
Look at iteration 30 for final results. The "including tops" version is the one used for the reported baseline on the website, as opposed to "excluding tops".

| form | itr | in_F | in_P | in_R | ex_F | ex_P | ex_R |
|---|---|---|---|---|---|---|---|
| dm | 0 | 0.6259 | 0.5060 | 0.8203 | 0.7624 | 0.7011 | 0.8354 |
| dm | 10 | 0.6429 | 0.5347 | 0.8060 | 0.7981 | 0.7790 | 0.8182 |
| dm | 20 | 0.6437 | 0.5368 | 0.8038 | 0.8001 | 0.7851 | 0.8156 |
| dm | 30 | 0.6442 | 0.5377 | 0.8032 | 0.8008 | 0.7872 | 0.8149 |
| pas | 0 | 0.6343 | 0.5130 | 0.8307 | 0.7896 | 0.7280 | 0.8626 |
| pas | 10 | 0.6496 | 0.5397 | 0.8156 | 0.8237 | 0.8029 | 0.8456 |
| pas | 20 | 0.6503 | 0.5415 | 0.8138 | 0.8255 | 0.8082 | 0.8436 |
| pas | 30 | 0.6508 | 0.5425 | 0.8131 | 0.8265 | 0.8110 | 0.8427 |
| pcedt | 0 | 0.6455 | 0.5988 | 0.7001 | 0.6658 | 0.6369 | 0.6974 |
| pcedt | 10 | 0.6743 | 0.6594 | 0.6900 | 0.6987 | 0.7163 | 0.6821 |
| pcedt | 20 | 0.6759 | 0.6643 | 0.6880 | 0.7014 | 0.7242 | 0.6799 |
| pcedt | 30 | 0.6765 | 0.6660 | 0.6874 | 0.7022 | 0.7270 | 0.6790 |

| Formalism | LF | LP | LR |
|---|---|---|---|
| DM | 54.68 | 83.2 | 40.73 |
| PAS | 50.89 | 88.34 | 35.74 |
| PCEDT | 67.84 | 74.82 | 62.08 |
Other notes
- Relatively speaking, we're much better at PAS and DM than at PCEDT. In fact, our PCEDT looks worse than the baseline.
- 30 iterations doesn't look totally converged yet, though probably there's less than a 0.1% gain left?
(Graph of the above table.)
Predictions
/home/brendano/sem/semeval1/feb7_recsplit.dm.pred
/home/brendano/sem/semeval1/feb7_recsplit.pas.pred
/home/brendano/sem/semeval1/feb7_recsplit.pcedt.pred
These predictions are for the sec20 data, whose gold files are, e.g.:
/cab1/corpora/LDC2013E167/splits/sec20.dm.sdp