results

mar28 Graph parser Alg1 Medtrain (JMF)

determinism constraints (no connected constraint); same features as the mar12 LR parser

predicted singletons, including predicted tops

30 iterations SVM, with averaging
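
(A minimal sketch of what "with averaging" typically means for an online structured learner like this: keep a running average of the weight vector over every update and decode with the averaged weights. The `update` function and loop structure below are placeholders, not the actual trainer.)

```python
import numpy as np

def train_averaged(examples, num_iterations, dim, update):
    """Online training with parameter averaging (as in the averaged perceptron).

    `update(w, example)` stands in for one SVM/MIRA-style weight update;
    the averaged weights are what get used at test time.
    """
    w = np.zeros(dim)        # current weights
    w_sum = np.zeros(dim)    # running sum of weights after every update
    steps = 0
    for _ in range(num_iterations):
        for ex in examples:
            w = update(w, ex)
            w_sum += w
            steps += 1
    return w_sum / steps     # averaged weights
```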

| Formalism | LR | Alg1 |
|-----------|------|------|
| PAS | .803 | .810 |
| DM | .740 | .770 |
| PCEDT | .648 | .624 |

mar25

Logreg Model: BasicFeatures + LinearOrderFeatures + CoarseDependencyFeatures + LinearContextFeatures + DependencyPathv1 + SubcatSequenceFE + UnlabeledDepFE

One "onefull" iteration on sec00-19.

| Formalism | Test LF |
|-----------|---------|
| PAS | .861 |
| DM | .809 |
| PCEDT | .714 |

Medtrain (30 iterations)

| Formalism | Test LF |
|-----------|---------|
| PAS | .831 |
| DM | .763 |
| PCEDT | .646 |

mar18 Graph parser Medtrain again (JMF)

gold singletons, no lexical bigrams

including predicted tops

30 iterations. I was frustrated that the regularizer wasn't helping, so I tried a bunch of regularizer strengths from .1 all the way down to 1E-9, as well as 0.0, and picked the best one on dev. Honestly, with regularizers this small, that amounts to picking the best run on dev and reporting its dev accuracy (which is not the right way to do experiments). I should really do multiple runs for each regularizer strength, pick the regularizer using dev, and report accuracy and standard deviation on a separate devtest set. Maybe that's overkill, but one should at least do multiple runs at the same regularizer strength to see the standard deviation. Based on the variance across runs with regularizers so small they can't be doing anything, I'd put the standard deviation of these runs at about .01.
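
(A minimal sketch of that selection protocol, assuming a hypothetical `train_and_eval(reg, seed)` that trains one model and returns `(dev_f1, devtest_f1)` for that run; the regularizer grid and seed count are illustrative.)

```python
import statistics

# Hypothetical sweep: multiple runs per regularizer strength.
REGS = [0.1, 0.01, 1e-3, 1e-5, 1e-7, 1e-9, 0.0]
SEEDS = [1, 2, 3, 4, 5]

results = {}  # reg -> list of (dev_f1, devtest_f1) over seeds
for reg in REGS:
    results[reg] = [train_and_eval(reg, seed) for seed in SEEDS]

# Pick the regularizer by mean dev F1, not by the single best run.
best_reg = max(REGS, key=lambda r: statistics.mean(dev for dev, _ in results[r]))

# Report mean and standard deviation on devtest for the chosen regularizer.
devtest_scores = [dt for _, dt in results[best_reg]]
print(best_reg, statistics.mean(devtest_scores), statistics.stdev(devtest_scores))
```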

| Formalism | Dev F1 | L2 Reg |
|-----------|--------|--------|
| PAS | .794 | 1E-9 |
| DM | .751 | 1E-7 |
| PCEDT | .632 | 1E-7 |

(compared to the LogReg parser: PAS F1 = .803, DM F1 = .740, PCEDT F1 = .648)

mar15 Graph parser Medtrain (JMF)

gold singletons, no lexical bigrams

including predicted tops

10 iterations. lambda = 0.0 (best from .1, .01, .001, 0.0)

| Formalism | Dev F1 |
|-----------|--------|
| PAS | .780 |
| DM | .760 |
| PCEDT | .634 |

(compared to PAS F1 = .803, DM F1 = .740, PCEDT F1 = .648 from below)

mar12 Removing bigrams experiment (JMF)

I compared train vs. test scores for the different formalisms, and I also experimented with removing lexical bigrams. It turns out lexical bigrams don't help at all at our current regularization strength (.5), so we should take them out. To do this experiment properly (i.e., compare with and without), we should really tune the regularization strength and the precision/recall tradeoff for each formalism without lexical bigrams as well; it may actually be better without them.
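
(For concreteness, a lexical bigram feature here is roughly the conjunction of the head word and dependent word on a candidate edge. The sketch below is purely illustrative; the `edge_features` name and templates are made up, not the parser's actual feature code.)

```python
def edge_features(head_word, dep_word, head_pos, dep_pos, use_lexical_bigrams=True):
    """Toy edge feature extractor; templates are illustrative, not the real feature set."""
    feats = [
        "head_pos=" + head_pos,
        "dep_pos=" + dep_pos,
        "pos_pair=" + head_pos + "|" + dep_pos,
        "head_word=" + head_word,
        "dep_word=" + dep_word,
    ]
    if use_lexical_bigrams:
        # the sparse word-pair template being ablated in this experiment
        feats.append("word_pair=" + head_word + "|" + dep_word)
    return feats
```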

There are a few interesting things to note:

  • On medtrain, we do poorly on PCEDT mainly because we don't generalize well to the test data (train scores are in the high 90s like the other formalisms, but test is in the 60s).
  • Recall drops more than precision on test. (I guess this makes sense: tuning the precision/recall tradeoff makes the parser conservative.)

(Take the onefull train vs. test numbers with a grain of salt because training hasn't converged.)

Medtrain, 30 iterations with lexical bigrams (including tops)

| Formalism | Train P | Train R | Train F1 | Test P | Test R | Test F1 |
|-----------|---------|---------|----------|--------|--------|---------|
| PAS | 96 | 95 | 95.8 | 83 | 77 | 80.3 |
| DM | 94 | 94 | 94.5 | 77 | 70 | 74.0 |
| PCEDT | | | | | | |

Medtrain, 30 iterations without lexical bigrams (including tops)

| Formalism | Train P | Train R | Train F1 | Test P | Test R | Test F1 |
|-----------|---------|---------|----------|--------|--------|---------|
| PAS | 94 | 94 | 94.7 | 83 | 77 | 80.3 |
| DM | 90 | 93 | 92.2 | 76 | 72 | 74.0 |
| PCEDT | 96 | 90 | 94.8 | 70 | 59 | 64.8 |

Onefull with lexical bigrams (including tops)

| Formalism | Train P | Train R | Train F1 | Test P | Test R | Test F1 |
|-----------|---------|---------|----------|--------|--------|---------|
| PAS | 86 | 87.1 | 87 | 81 | 84 | 82.7 |
| DM | | | | | | |
| PCEDT | 79 | 76 | 78.2 | 70 | 69 | 70.2 |

Onefull without lexical bigrams (including tops)

| Formalism | Train P | Train R | Train F1 | Test P | Test R | Test F1 |
|-----------|---------|---------|----------|--------|--------|---------|
| PAS | 85 | 86 | 85.7 | 81 | 83 | 82.5 |
| DM | 78 | 83 | 81.1 | 74 | 80 | 77.6 |
| PCEDT | 76 | 74 | 75.6 | 70 | 69 | 69.9 |

mar12 onefull (BTO)

only one iteration on sec00-19.

| Model eval log | LF |
|----------------|----|
| dm_onefull_model.eval.log | 0.783744 |
| pas_onefull_model.eval.log | 0.827898 |
| pcedt_onefull_model.eval.log | 0.702516 |

mar12 training curves

DM, sec00-19 (hashing): still going up, slowly, after 30 iterations.

big2.iter0.log:LF: 0.783744
big2.iter1.log:LF: 0.789072
big2.iter2.log:LF: 0.792465
big2.iter3.log:LF: 0.794552
big2.iter4.log:LF: 0.796020
big2.iter5.log:LF: 0.796875
big2.iter6.log:LF: 0.797945
big2.iter7.log:LF: 0.798727
big2.iter8.log:LF: 0.798596
big2.iter9.log:LF: 0.798702
big2.iter10.log:LF: 0.799353
big2.iter11.log:LF: 0.799619
big2.iter12.log:LF: 0.799683
big2.iter13.log:LF: 0.799777
big2.iter14.log:LF: 0.799625
big2.iter15.log:LF: 0.799704
big2.iter16.log:LF: 0.799834
big2.iter17.log:LF: 0.800108
big2.iter18.log:LF: 0.800325
big2.iter19.log:LF: 0.800347
big2.iter20.log:LF: 0.800369
big2.iter21.log:LF: 0.800355
big2.iter22.log:LF: 0.800369
big2.iter23.log:LF: 0.800492
big2.iter24.log:LF: 0.800355
big2.iter25.log:LF: 0.800471
big2.iter26.log:LF: 0.800420
big2.iter27.log:LF: 0.800442
big2.iter28.log:LF: 0.800515
big2.iter29.log:LF: 0.800696

DM, medtrain (no hashing): this is bouncier.

med.iter0.log:LF: 0.714132
med.iter10.log:LF: 0.737409
med.iter20.log:LF: 0.739642
med.iter30.log:LF: 0.740195
med.iter40.log:LF: 0.740066
med.iter50.log:LF: 0.739977
med.iter60.log:LF: 0.740080
med.iter70.log:LF: 0.739992
med.iter80.log:LF: 0.739890
med.iter90.log:LF: 0.739985
med.iter100.log:LF: 0.739931
med.iter110.log:LF: 0.740063
med.iter120.log:LF: 0.740153
med.iter130.log:LF: 0.740091
med.iter140.log:LF: 0.740125
med.iter150.log:LF: 0.740067
med.iter160.log:LF: 0.740123
med.iter170.log:LF: 0.740147
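
(To plot these curves rather than eyeball the numbers, a small script along these lines would work. It assumes the grep-style lines above have been saved to a text file; the filename `big2_curve.txt` is hypothetical and matplotlib is assumed available.)

```python
import re
import matplotlib.pyplot as plt

# Parse lines like "big2.iter17.log:LF: 0.800108" into (iteration, LF) pairs.
LINE_RE = re.compile(r"iter(\d+)\.log:LF:\s*([0-9.]+)")

def parse_curve(lines):
    pairs = [(int(m.group(1)), float(m.group(2)))
             for m in (LINE_RE.search(line) for line in lines) if m]
    return sorted(pairs)

with open("big2_curve.txt") as f:   # hypothetical file holding the grep output above
    curve = parse_curve(f)

iters, lfs = zip(*curve)
plt.plot(iters, lfs, marker="o")
plt.xlabel("iteration")
plt.ylabel("labeled F1 (LF)")
plt.title("DM, sec00-19 training curve")
plt.savefig("dm_training_curve.png")
```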

mar3 (Mar 4, DB)

Train on sec00-19, test on sec20.

Features: BasicFeatures + LinearOrderFeatures + DependencyPathv1 + SubcatSequenceFE + CoarseDependencyFeatures + logreg top classifier.

Labeled F1 (including tops):

| it. | PAS | DM | PCEDT |
|-----|-----|----|-------|
| 0 | 0.784818 | 0.755264 | 0.674663 |
| 10 | 0.814511 | 0.785971 | 0.704177 |
| 20 | 0.816579 | 0.788184 | 0.705842 |
| 30 | 0.81723 | 0.788877 | 0.706791 |

"feb7" (Feb 21, BTO)

I'm calling this experiment "feb7" since that's the date of the last checkin for the code I'm running (this one).

Train on sec00-19, test on sec20.

Runtime info

  • 15g was enough for pas and dm but crashed on pcedt; 18g was enough for pcedt. cab can't do >18g; stampede was able to do 25g.
    • Maybe it's time to go to feature hashing (see the sketch after this list).
  • 4 hours to train.
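
(Feature hashing here would mean something like the following sketch: instead of a growing string-to-index feature dictionary, hash each feature name directly into a fixed-size weight vector. The bucket count and function names are illustrative, not the actual implementation.)

```python
import hashlib

NUM_BUCKETS = 2 ** 22  # fixed weight-vector size; memory no longer grows with the vocabulary

def feature_index(feature_name: str) -> int:
    """Map a feature string to a bucket without storing a dictionary (collisions are tolerated)."""
    digest = hashlib.md5(feature_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % NUM_BUCKETS

def score(weights, feature_names):
    """Dot product of hashed binary features against the fixed-size weight vector."""
    return sum(weights[feature_index(f)] for f in feature_names)
```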

Look at iteration=30 for final results.

The "including tops" version is the version used for the reported baseline on the website. As opposed to "excluding tops".

| form | itr | in_F | in_P | in_R | ex_F | ex_P | ex_R |
|------|-----|------|------|------|------|------|------|
| dm | 0 | 0.6259 | 0.5060 | 0.8203 | 0.7624 | 0.7011 | 0.8354 |
| dm | 10 | 0.6429 | 0.5347 | 0.8060 | 0.7981 | 0.7790 | 0.8182 |
| dm | 20 | 0.6437 | 0.5368 | 0.8038 | 0.8001 | 0.7851 | 0.8156 |
| dm | 30 | 0.6442 | 0.5377 | 0.8032 | 0.8008 | 0.7872 | 0.8149 |
| pas | 0 | 0.6343 | 0.5130 | 0.8307 | 0.7896 | 0.7280 | 0.8626 |
| pas | 10 | 0.6496 | 0.5397 | 0.8156 | 0.8237 | 0.8029 | 0.8456 |
| pas | 20 | 0.6503 | 0.5415 | 0.8138 | 0.8255 | 0.8082 | 0.8436 |
| pas | 30 | 0.6508 | 0.5425 | 0.8131 | 0.8265 | 0.8110 | 0.8427 |
| pcedt | 0 | 0.6455 | 0.5988 | 0.7001 | 0.6658 | 0.6369 | 0.6974 |
| pcedt | 10 | 0.6743 | 0.6594 | 0.6900 | 0.6987 | 0.7163 | 0.6821 |
| pcedt | 20 | 0.6759 | 0.6643 | 0.6880 | 0.7014 | 0.7242 | 0.6799 |
| pcedt | 30 | 0.6765 | 0.6660 | 0.6874 | 0.7022 | 0.7270 | 0.6790 |

Their baseline results:

| | LF | LP | LR |
|-------|-------|-------|-------|
| DM | 54.68 | 83.2 | 40.73 |
| PAS | 50.89 | 88.34 | 35.74 |
| PCEDT | 67.84 | 74.82 | 62.08 |
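
(For reference, LF is just the harmonic mean of labeled precision and recall, LF = 2·LP·LR / (LP + LR); e.g. for the DM row, 2 × 83.2 × 40.73 / (83.2 + 40.73) ≈ 54.7, matching the reported 54.68 up to rounding of LP and LR.)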

Other notes

  • Relatively speaking, we're much better at PAS and DM than at PCEDT. In fact, our PCEDT looks worse than the baseline.
  • 30 iterations doesn't look totally converged yet, though probably there's less than a 0.1% gain left?

graph of above table

"Feb7" predicted outputs

Predictions

/home/brendano/sem/semeval1/feb7_recsplit.dm.pred
/home/brendano/sem/semeval1/feb7_recsplit.pas.pred
/home/brendano/sem/semeval1/feb7_recsplit.pcedt.pred

For the sec20 data, whose gold is e.g.:

/cab1/corpora/LDC2013E167/splits/sec20.dm.sdp