Two places where tokenization should be standardized #473
Comments
@neubig Is there any consequence we need to take care of when the NLTK tokenizer is removed? I agree with removing it, though (it improves simplicity around data dependencies).
I don't think so. The NLTK tokenizer is only used in that one file; everywhere else we already use different tokenizers that achieve very similar results (e.g. the SacreBLEU tokenizer).
@neubig There are some uses of …
Ah, thanks, I see. It'd be nice to have a wrapper for sentence tokenization too.
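For illustration, such a wrapper could be quite small. The sketch below is hypothetical (the `SimpleSentenceSplitter` name and its interface are invented here, not ExplainaBoard's actual tokenizer API) and shows a dependency-free, regex-based splitter:

```python
import re


class SimpleSentenceSplitter:
    """Hypothetical regex-based sentence splitter (illustration only)."""

    # Split at whitespace that follows sentence-final punctuation.
    _BOUNDARY = re.compile(r"(?<=[.!?])\s+")

    def __call__(self, text: str) -> list[str]:
        # strip() avoids empty pieces from leading/trailing whitespace.
        return [piece for piece in self._BOUNDARY.split(text.strip()) if piece]


splitter = SimpleSentenceSplitter()
print(splitter("Punkt is heavy. A regex may suffice! Right?"))
# ['Punkt is heavy.', 'A regex may suffice!', 'Right?']
```

A heuristic like this misses abbreviations ("e.g.", "Dr.") that punkt handles, which is the trade-off to weigh before dropping the dependency.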
In most places in ExplainaBoard, we use standardized tokenizers to tokenize the source and target texts. However, there are (at least) two places where we use other tokenizers.
First, explainaboard/analysis/sum_attribute.py uses the NLTK punkt tokenizer. It would be nice to remove this because (1) it is non-standard for ExplainaBoard, (2) it is the only place where we use NLTK, so removing it would drop one library dependency, and (3) it downloads the punkt model from the NLTK site, resulting in an extra external data access: https://github.com/neulab/ExplainaBoard/blob/main/explainaboard/analysis/sum_attribute.py#L4
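For reference, the pattern being flagged looks roughly like the following; this is a paraphrase, not a verbatim copy of sum_attribute.py:

```python
import nltk

# The first call fetches the punkt model data from the NLTK servers,
# which is the extra external data access noted above.
nltk.download("punkt")

print(nltk.sent_tokenize("First sentence. Second one."))
# ['First sentence.', 'Second one.']
```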
Second, explainaboard/analysis/feature_funcs.py simply splits on whitespace instead of using the standard tokenizer: https://github.com/neulab/ExplainaBoard/blob/main/explainaboard/analysis/feature_funcs.py#L50
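To see why this matters, compare whitespace splitting with SacreBLEU's 13a tokenizer, which the comments above mention as what is used elsewhere (the example text is made up, and the import path assumes sacrebleu >= 2.0):

```python
from sacrebleu.tokenizers.tokenizer_13a import Tokenizer13a

text = "It's 3.5%, right?"

# Whitespace splitting keeps punctuation glued to words.
print(text.split())
# ["It's", '3.5%,', 'right?']

# 13a pads most punctuation with spaces while keeping '3.5' intact,
# so token counts (and any length-based features) differ.
print(Tokenizer13a()(text).split())
# roughly: ["It's", '3.5', '%', ',', 'right', '?']
```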