python wrapper + speed #77

AlJohri · 2018-07-11T09:43:25Z

hi @JannikStroetgen, I'm trying to wrap heideltime for use from python but I'm running into speed issues. invoking it via the CLI is very slow, on the order of 7-8 seconds per document. the script is rather short, as you can see here:

https://github.com/AlJohri/heideltime-python/blob/master/heideltime.py

two questions:

notice anything wrong in my invocation that would be slowing it down considerably?
do you have a web API version somewhere that powers the online demo? perhaps the majority of the time is just spent starting the JVM repeatedly. hitting an API where everything is already loaded and ready to go may be much faster.

thanks!

kno10 · 2018-07-11T09:53:25Z

It is not just the startup cost of the JVM (but even that would already hurt if you care about performance).
Stanford CoreNLP loads a huge language model. Loading this again and again for every document is likely where most of the time goes. I doubt you will be able to "fix" this - NLP just requires large models. So avoid loading them repeatedly.
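
A minimal sketch of the "load once, reuse" idea in plain Java. `LoadOnce` and `fakeLoadModel` are hypothetical names used only for illustration; they are not part of HeidelTime or CoreNLP, and the "model" here is just a string standing in for an expensive resource.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Caches the result of an expensive loader so that it runs at most once. */
final class LoadOnce<T> implements Supplier<T> {
    private final Supplier<T> loader;
    private volatile T value; // stays null until the first successful load

    LoadOnce(Supplier<T> loader) {
        this.loader = loader;
    }

    @Override
    public T get() {
        T local = value;
        if (local == null) {
            synchronized (this) {
                local = value;
                if (local == null) {
                    // Note: a loader that returns null would be called again next time.
                    local = loader.get();
                    value = local;
                }
            }
        }
        return local;
    }

    // Demo only: counts how often the stand-in "model" is loaded.
    static final AtomicInteger LOADS = new AtomicInteger();

    static String fakeLoadModel() {
        LOADS.incrementAndGet();
        return "model";
    }

    public static void main(String[] args) {
        LoadOnce<String> model = new LoadOnce<>(LoadOnce::fakeLoadModel);
        for (int i = 0; i < 4; i++) {
            model.get(); // only the first call triggers the loader
        }
        System.out.println("loads = " + LOADS.get());
    }
}
```

Whether this helps in practice depends on where the library itself triggers the load; the wrapper only guarantees that code you control does not load twice.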

AlJohri · 2018-07-11T17:54:16Z

makes sense, thanks @kno10.

do you know where the code that powers the online demo lives?

kno10 · 2018-07-13T15:02:08Z

I don't know.

AlJohri · 2018-07-24T03:03:35Z

@JannikStroetgen @kno10 I switched to writing an API in java.

The Stanford models are still getting loaded each time the process method is called.

Here is my heideltime wrapper factory:

package com.washpost.heideltime.heideltimeapi;

import java.util.Date;
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import de.unihd.dbs.heideltime.standalone.*;
import de.unihd.dbs.heideltime.standalone.exceptions.*;
import de.unihd.dbs.uima.annotator.heideltime.resources.Language;

public class HeideltimeFactory {

    // Single shared instance, created once when the class is loaded.
    private static HeidelTimeStandalone ht = new HeidelTimeStandalone(
        Language.ENGLISH,
        DocumentType.NEWS,
        OutputType.XMI, // or OutputType.TIMEML
        "src/main/resources/config.props",
        POSTagger.STANFORDPOSTAGGER); // POSTagger.TREETAGGER, POSTagger.NO;

    public static String process(String text, String dctString) throws DocumentCreationTimeMissingException, ParseException {
        // SimpleDateFormat is not thread-safe, so create one per call.
        DateFormat df = new SimpleDateFormat("yyyy-MM-dd");
        Date dct = df.parse(dctString);

        // Test only: call the shared instance repeatedly to see whether the
        // tagger model is reloaded on every call (four calls in total).
        ht.process(text, dct);
        ht.process(text, dct);
        ht.process(text, dct);

        return ht.process(text, dct);
    }

}

As you can see, it uses the same ht to process the text four times in a row (three throwaway calls plus the returned one) for test purposes.

2018-07-23 22:59:14.548  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : HeidelTimeStandalone initialized with language english
2018-07-23 22:59:14.549  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : trying to read in file src/main/resources/config.props
2018-07-23 22:59:17.367  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : HeidelTime initialized
2018-07-23 22:59:17.481  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : JCas factory initialized
2018-07-23 22:59:17.484  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Processing started
Loading default properties from tagger src/main/resources/english-bidirectional-distsim.tagger
2018-07-23 22:59:22.361  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Processing finished
2018-07-23 22:59:22.505  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Result formatted
2018-07-23 22:59:22.505  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Processing started
Loading default properties from tagger src/main/resources/english-bidirectional-distsim.tagger
2018-07-23 22:59:25.935  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Processing finished
2018-07-23 22:59:25.962  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Result formatted
2018-07-23 22:59:25.962  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Processing started
Loading default properties from tagger src/main/resources/english-bidirectional-distsim.tagger
2018-07-23 22:59:28.753  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Processing finished
2018-07-23 22:59:28.768  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Result formatted
2018-07-23 22:59:28.769  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Processing started
Loading default properties from tagger src/main/resources/english-bidirectional-distsim.tagger
2018-07-23 22:59:31.975  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Processing finished
2018-07-23 22:59:31.995  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Result formatted

The logs show it's taking roughly 3 to 5 seconds to process each document, perhaps because it is re-initializing the StanfordPOSTaggerWrapper each time: the line "Loading default properties from tagger src/main/resources/english-bidirectional-distsim.tagger" appears once per call.
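
For reference, the per-call durations can be read off the "Processing started" / "Processing finished" pairs in the log above. A small sketch that does the arithmetic; `LogTimings` is a hypothetical helper name, and the timestamps are copied from the log.

```java
import java.time.Duration;
import java.time.LocalTime;

final class LogTimings {
    /** Milliseconds between two HH:mm:ss.SSS timestamps on the same day. */
    static long elapsedMillis(String start, String end) {
        return Duration.between(LocalTime.parse(start), LocalTime.parse(end)).toMillis();
    }

    public static void main(String[] args) {
        // {"Processing started", "Processing finished"} for each of the four calls.
        String[][] runs = {
            {"22:59:17.484", "22:59:22.361"},
            {"22:59:22.505", "22:59:25.935"},
            {"22:59:25.962", "22:59:28.753"},
            {"22:59:28.769", "22:59:31.975"},
        };
        for (int i = 0; i < runs.length; i++) {
            System.out.println("call " + (i + 1) + ": "
                + elapsedMillis(runs[i][0], runs[i][1]) + " ms");
        }
        // prints 4877 ms, 3430 ms, 2791 ms and 3206 ms for calls 1 to 4
    }
}
```

The first call is the slowest, but the later ones still take around 3 seconds each, which fits the tagger being loaded on every call.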

Is there any way to prevent reloading the Stanford models?

madimov · 2019-07-03T15:10:55Z

@AlJohri did you have any luck with this?

madimov · 2019-07-03T15:31:50Z

@AlJohri in that case, if you don't mind me asking, did you find a decent alternative?

AlJohri · 2019-07-05T16:43:42Z

@madimov we ended up switching projects so I didn't pursue it much further. I think you can do what @kno10 suggested and try to prevent the models from loading on each iteration by digging into the Java code. The TreeTagger also works quickly enough. Alternatively, you can check out:

https://github.com/cnorthwood/ternip (pure python impl.)
https://github.com/bear/parsedatetime
https://github.com/eadmundo/python-natty
https://github.com/FraBle/python-duckling (this one seemed promising, https://duckling.wit.ai/)
https://github.com/FraBle/python-sutime (https://nlp.stanford.edu/software/sutime.html)

If you're working in python, my colleague found that using jpype is a good alternative to talking to a constantly running JAR if there are issues deploying a REST API (https://github.com/AlJohri/heideltime-api/).

madimov · 2019-07-05T17:30:00Z

@AlJohri thanks a lot for taking the time and for the detailed response. I've actually been looking at the last one you listed, python-sutime, and it seems to be quite good. I'll be sure to check out all the rest as well and get back to you. Much appreciated.

kno10 · 2019-07-09T23:36:56Z

I have been handling the Stanford NLP in my own code for other reasons, and only running HeidelTime on the already annotated document. I've been annotating hundreds of documents per second this way.
I'm not a big fan of nesting libraries too deep exactly because of such issues: when to reload a GB-sized language model, and when to allow it to be garbage collected, is not a decision to "outsource" into a library, but something you need to control in the "driver".
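
A rough sketch of that driver-owned layout, under loud assumptions: `Tagger` is a toy stand-in for a large model, and `countYearCandidates` is a toy stand-in for a downstream stage that only consumes already-annotated input. None of these names come from HeidelTime or Stanford NLP.

```java
import java.util.ArrayList;
import java.util.List;

final class DriverOwnedModel {
    /** Hypothetical stand-in for a large POS-tagger model. */
    static final class Tagger {
        static int instances = 0; // how many times a "model" was built

        Tagger() {
            instances++;
        }

        /** Tags four-digit tokens as CD (number) and everything else as X. */
        List<String> tag(String text) {
            List<String> tagged = new ArrayList<>();
            for (String token : text.trim().split("\\s+")) {
                tagged.add(token + (token.matches("\\d{4}") ? "/CD" : "/X"));
            }
            return tagged;
        }
    }

    /** Downstream stage: works on pre-tagged tokens and never loads a model itself. */
    static int countYearCandidates(List<String> tagged) {
        int n = 0;
        for (String t : tagged) {
            if (t.endsWith("/CD")) {
                n++;
            }
        }
        return n;
    }

    /** The driver builds the tagger once, reuses it for the whole batch, then drops it. */
    static List<Integer> run(List<String> docs) {
        Tagger tagger = new Tagger();
        List<Integer> counts = new ArrayList<>();
        for (String doc : docs) {
            counts.add(countYearCandidates(tagger.tag(doc)));
        }
        return counts; // tagger is unreachable after this and can be garbage collected
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("born in 1999", "no dates here", "1066 and 2018")));
        System.out.println("taggers built: " + Tagger.instances);
    }
}
```

The point of the layout is that the lifetime of the expensive object is visible in one place, the driver, rather than hidden inside a nested library call.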
