python wrapper + speed #77
It is not just the startup cost of the JVM (but even that would already hurt if you care about performance). |
makes sense, thanks @kno10. do you know where the code that powers the online demo lives? |
I don't know. |
@JannikStroetgen @kno10 I switched to writing an API in java. The Stanford models are still getting loaded each time the
Here is my heideltime wrapper factory:

    package com.washpost.heideltime.heideltimeapi;

    import java.util.Date;
    import java.text.DateFormat;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import de.unihd.dbs.heideltime.standalone.*;
    import de.unihd.dbs.heideltime.standalone.exceptions.*;
    import de.unihd.dbs.uima.annotator.heideltime.resources.Language;

    public class HeideltimeFactory {

        private static HeidelTimeStandalone ht = new HeidelTimeStandalone(
            Language.ENGLISH,
            DocumentType.NEWS,
            OutputType.XMI, // or OutputType.TIMEML
            "src/main/resources/config.props",
            POSTagger.STANFORDPOSTAGGER); // POSTagger.TREETAGGER, POSTagger.NO;

        public static String process(String text, String dctString) throws DocumentCreationTimeMissingException, ParseException {
            DateFormat df = new SimpleDateFormat("yyyy-MM-dd");
            Date dct = df.parse(dctString);
            ht.process(text, dct);
            ht.process(text, dct);
            ht.process(text, dct);
            return ht.process(text, dct);
        }
    }

As you can see, it uses the same
The logs show it's taking about 3 seconds to process each document, perhaps because it is re-initializing the
Is there any way to prevent reloading the Stanford models? |
@AlJohri did you have any luck with this? |
@AlJohri in that case, if you don't mind me asking, did you find a decent alternative? |
@madimov we ended up switching projects so I didn't pursue it much further. I think you can do what @kno10 suggested and try to prevent the models from loading on each iteration by digging into the Java code. The TreeTagger also works quickly enough. Alternatively, you can check out:
If you're working in python, my colleague found that using |
@AlJohri thanks a lot for taking the time and the detailed response. I've actually been looking at the last one you listed, python-sutime, and it seems to be quite good. I'll be sure to check out all the rest as well and get back to you. Much appreciated |
I have been handling the Stanford NLP in my own code for other reasons, and only running HeidelTime on the already annotated document. I've been annotating hundreds of documents per second this way. |
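The load-once idea discussed in this thread can be illustrated with a minimal Python sketch. It is not HeidelTime code: `load_models`, `get_models`, and `annotate` are hypothetical names, and the sleep only simulates an expensive model load. The point is that the slow setup runs on the first call and is reused afterwards, which only helps if the process stays alive between documents.

```python
import functools
import time


def load_models():
    """Hypothetical stand-in for an expensive one-time setup (e.g. loading tagger models)."""
    time.sleep(0.05)  # simulate slow initialization
    return {"ready": True}


@functools.lru_cache(maxsize=None)
def get_models():
    # Cached after the first call, so the slow setup runs only once per process.
    return load_models()


def annotate(text):
    """Placeholder annotation step; a real pipeline would call into the tagger here."""
    models = get_models()
    return {"text": text, "models_ready": models["ready"]}
```

Calling `annotate` repeatedly in the same process pays the setup cost once. Starting a new process per document, as a CLI wrapper does, discards the cache every time.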
hi @JannikStroetgen, I'm trying to wrap heideltime to use in python but I'm running into issues with speed. it seems invoking it via the CLI is very slow, on the order of 7-8 seconds per document. it's a rather short script as you can see here:
https://github.com/AlJohri/heideltime-python/blob/master/heideltime.py
two questions:
thanks!
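One way to see how much of the per-document time is fixed startup cost is to time several consecutive calls. The sketch below is an illustration under assumptions: `time_calls` and `spawn_noop` are hypothetical helpers, and the command being spawned is just a Python interpreter that does nothing, standing in for whatever command the wrapper really runs.

```python
import subprocess
import sys
import time


def time_calls(fn, n=3):
    """Run fn() n times and return the wall-clock duration of each call in seconds."""
    durations = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def spawn_noop():
    # Stand-in command: starts a fresh Python interpreter that does nothing.
    # Swap in the real per-document command to measure its fixed startup cost.
    subprocess.run([sys.executable, "-c", "pass"], check=True)


if __name__ == "__main__":
    for i, seconds in enumerate(time_calls(spawn_noop), start=1):
        print(f"call {i}: {seconds:.3f}s")
```

If every call takes roughly the same time, the cost is being paid on each invocation (process startup, model loading). If only the first call is slow, the initialization is already being reused.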