python wrapper + speed #77

Open
AlJohri opened this issue Jul 11, 2018 · 9 comments

Comments

AlJohri commented Jul 11, 2018

hi @JannikStroetgen, I'm trying to wrap heideltime for use in python but I'm running into issues with speed. invoking it via the CLI is very slow, on the order of 7-8 seconds per document. it's a rather short script as you can see here:

https://github.com/AlJohri/heideltime-python/blob/master/heideltime.py

two questions:

  1. notice anything wrong in my invocation that would be slowing it down considerably?
  2. do you have a web API version somewhere that powers the online demo? perhaps the majority of the time is just spent starting the JVM repeatedly. hitting an API where everything is already loaded and ready to go may be much faster

thanks!

kno10 commented Jul 11, 2018

It is not just the startup cost of the JVM (but even that would already hurt if you care about performance).
Stanford CoreNLP loads a huge language model. Loading this again and again for every document is likely where most of the time goes. I doubt you will be able to "fix" this - NLP just requires large models. So avoid loading them repeatedly.
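
A minimal sketch of the "load once, reuse" idea, assuming a hypothetical `ExpensiveModel` class as a stand-in for a large tagger model (it is not a HeidelTime or Stanford CoreNLP class). The initialization-on-demand holder idiom defers loading until first use and relies on JVM class initialization to make it happen exactly once per process:

```java
import java.util.concurrent.atomic.AtomicInteger;

final class LoadOnceDemo {
    public static void main(String[] args) {
        String[] docs = {"first document", "second document", "third document"};
        for (String doc : docs) {
            // Every call goes through the same shared instance.
            System.out.println(ModelHolder.get().annotate(doc));
        }
        System.out.println("model loads: " + ExpensiveModel.LOADS.get());
    }
}

// Hypothetical stand-in for a large NLP model; not a real library class.
final class ExpensiveModel {
    static final AtomicInteger LOADS = new AtomicInteger();

    ExpensiveModel() {
        // A real model would read a large file from disk here.
        LOADS.incrementAndGet();
    }

    String annotate(String text) {
        return "[annotated] " + text;
    }
}

// Holder idiom: Lazy.MODEL is created on first access to get(), thread-safely.
final class ModelHolder {
    private ModelHolder() {}

    private static final class Lazy {
        static final ExpensiveModel MODEL = new ExpensiveModel();
    }

    static ExpensiveModel get() {
        return Lazy.MODEL;
    }
}
```

This only helps inside a long-running process; a fresh JVM per document (as with repeated CLI invocations) pays the load cost every time regardless.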

AlJohri commented Jul 11, 2018

makes sense, thanks @kno10.

do you know where the code that powers the online demo lives?

kno10 commented Jul 13, 2018

I don't know.

AlJohri commented Jul 24, 2018

@JannikStroetgen @kno10 I switched to writing an API in Java.

The Stanford models are still getting loaded each time the process method is called.

Here is my heideltime wrapper factory:

package com.washpost.heideltime.heideltimeapi;

import java.util.Date;
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import de.unihd.dbs.heideltime.standalone.*;
import de.unihd.dbs.heideltime.standalone.exceptions.*;
import de.unihd.dbs.uima.annotator.heideltime.resources.Language;

public class HeideltimeFactory {

    private static HeidelTimeStandalone ht = new HeidelTimeStandalone(
        Language.ENGLISH,
        DocumentType.NEWS,
        OutputType.XMI, // or OutputType.TIMEML
        "src/main/resources/config.props",
        POSTagger.STANFORDPOSTAGGER); // POSTagger.TREETAGGER, POSTagger.NO;

    public static String process(String text, String dctString) throws DocumentCreationTimeMissingException, ParseException {
        DateFormat df = new SimpleDateFormat("yyyy-MM-dd");
        Date dct = df.parse(dctString);

        ht.process(text, dct);
        ht.process(text, dct);
        ht.process(text, dct);

        return ht.process(text, dct);
    }

}

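As you can see, it uses the same ht to process the text several times in a row for test purposes (three throwaway calls plus the one whose result is returned, which matches the four "Processing started" lines below).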

2018-07-23 22:59:14.548  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : HeidelTimeStandalone initialized with language english
2018-07-23 22:59:14.549  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : trying to read in file src/main/resources/config.props
2018-07-23 22:59:17.367  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : HeidelTime initialized
2018-07-23 22:59:17.481  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : JCas factory initialized
2018-07-23 22:59:17.484  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Processing started
Loading default properties from tagger src/main/resources/english-bidirectional-distsim.tagger
2018-07-23 22:59:22.361  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Processing finished
2018-07-23 22:59:22.505  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Result formatted
2018-07-23 22:59:22.505  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Processing started
Loading default properties from tagger src/main/resources/english-bidirectional-distsim.tagger
2018-07-23 22:59:25.935  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Processing finished
2018-07-23 22:59:25.962  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Result formatted
2018-07-23 22:59:25.962  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Processing started
Loading default properties from tagger src/main/resources/english-bidirectional-distsim.tagger
2018-07-23 22:59:28.753  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Processing finished
2018-07-23 22:59:28.768  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Result formatted
2018-07-23 22:59:28.769  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Processing started
Loading default properties from tagger src/main/resources/english-bidirectional-distsim.tagger
2018-07-23 22:59:31.975  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Processing finished
2018-07-23 22:59:31.995  INFO 44987 --- [nio-8080-exec-1] HeidelTimeStandalone                     : Result formatted

The logs show it's taking about 3 seconds to process each document, perhaps because it is re-initializing the StanfordPOSTaggerWrapper each time: the "Loading default properties from tagger src/main/resources/english-bidirectional-distsim.tagger" line appears once per call.
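
To put numbers on this, here is a small sketch that computes the elapsed time between each "Processing started" and "Processing finished" pair. The timestamps are copied from the log above; the class and method names are made up for illustration:

```java
import java.time.Duration;
import java.time.LocalTime;

final class LogTimings {
    // Milliseconds between two HH:mm:ss.SSS timestamps on the same day.
    static long elapsedMillis(String start, String end) {
        return Duration.between(LocalTime.parse(start), LocalTime.parse(end)).toMillis();
    }

    public static void main(String[] args) {
        // {"Processing started", "Processing finished"} pairs from the log.
        String[][] runs = {
            {"22:59:17.484", "22:59:22.361"},
            {"22:59:22.505", "22:59:25.935"},
            {"22:59:25.962", "22:59:28.753"},
            {"22:59:28.769", "22:59:31.975"},
        };
        for (int i = 0; i < runs.length; i++) {
            long ms = elapsedMillis(runs[i][0], runs[i][1]);
            System.out.println("call " + (i + 1) + ": " + ms + " ms");
        }
    }
}
```

The four calls come out at 4877 ms, 3430 ms, 2791 ms and 3206 ms, so the later calls are not much cheaper than the first one.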

Is there any way to prevent reloading the Stanford models?

madimov commented Jul 3, 2019

@AlJohri did you have any luck with this?

madimov commented Jul 3, 2019

@AlJohri in that case, if you don't mind me asking, did you find a decent alternative?

AlJohri commented Jul 5, 2019

@madimov we ended up switching projects so I didn't pursue it much further. I think you can do what @kno10 suggested and try to prevent the models from loading on each iteration by digging into the Java code. The TreeTagger also works quickly enough. Alternatively, you can check out:

If you're working in python, my colleague found that using jpype is a good alternative to talking to a constantly running JAR if there are issues deploying a REST API (https://github.com/AlJohri/heideltime-api/).

madimov commented Jul 5, 2019

@AlJohri thanks a lot for taking the time and for the detailed response. I've actually been looking at the last one you listed, python-sutime, and it seems to be quite good. I'll be sure to check out all the rest as well and get back to you. Much appreciated.

kno10 commented Jul 9, 2019

I have been handling the Stanford NLP step in my own code for other reasons, and only running HeidelTime on the already annotated document. I've been annotating hundreds of documents per second this way.
I'm not a big fan of nesting libraries too deep, exactly because of such issues: when to reload a GB-sized language model, and when to allow it to be garbage collected, is not a decision to "outsource" to a library, but something you need to control in the "driver".
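
A rough sketch of that "driver owns the model" structure, under stated assumptions: `SharedModel` and `TemporalStage` are hypothetical stand-ins, not HeidelTime or Stanford NLP classes, and the real annotation hand-off would look different. The point is only that the driver creates the model once, passes pre-annotated input to the downstream stage, and decides when the model is released:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class DriverDemo {
    public static void main(String[] args) {
        List<String> docs = List.of("Doc one.", "Doc two.", "Doc three.");
        // The driver decides when the model is loaded and when it is released.
        try (SharedModel model = new SharedModel()) {
            TemporalStage stage = new TemporalStage();
            for (String doc : docs) {
                System.out.println(stage.process(model.preAnnotate(doc)));
            }
        }
        System.out.println("loads=" + SharedModel.LOADS.get()
            + ", releases=" + SharedModel.RELEASES.get());
    }
}

// Hypothetical stand-in for a GB-sized language model owned by the driver.
final class SharedModel implements AutoCloseable {
    static final AtomicInteger LOADS = new AtomicInteger();
    static final AtomicInteger RELEASES = new AtomicInteger();

    SharedModel() {
        LOADS.incrementAndGet(); // a real model would be read from disk here
    }

    // Pretend linguistic pre-processing: whitespace tokenization.
    List<String> preAnnotate(String text) {
        return List.of(text.trim().split("\\s+"));
    }

    @Override
    public void close() {
        RELEASES.incrementAndGet(); // a real driver would drop its references here
    }
}

// Hypothetical downstream stage: consumes annotated input, never loads a model itself.
final class TemporalStage {
    String process(List<String> tokens) {
        return tokens.size() + " tokens processed";
    }
}
```

Because the downstream stage takes already-annotated input, it has no reason to load anything itself, and the per-document cost no longer includes model loading.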

3 participants