Two places where tokenization should be standardized #473

Open
neubig opened this issue Sep 11, 2022 · 4 comments

@neubig
Contributor

neubig commented Sep 11, 2022

In most places in ExplainaBoard, we use standardized tokenizers to tokenize the source and target texts. However, there are (at least) two places where we use other tokenizers.

First, explainaboard/analysis/sum_attribute.py uses the NLTK punkt tokenizer. This would be nice to remove because (1) it is non-standard for ExplainaBoard, (2) it's the only place where we use nltk, so removing it would let us drop one library dependency, and (3) it downloads the punkt tokenizer data from the NLTK site, resulting in an extra external data access: https://github.com/neulab/ExplainaBoard/blob/main/explainaboard/analysis/sum_attribute.py#L4

Second, explainaboard/analysis/feature_funcs.py simply splits on whitespace instead of using the standard tokenizer: https://github.com/neulab/ExplainaBoard/blob/main/explainaboard/analysis/feature_funcs.py#L50
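
For illustration, a rough sketch of the difference this makes; the tokenize() helper below is only a hypothetical stand-in for our standard tokenizer, not the actual API:

```python
# Sketch only: contrasts the current whitespace split in feature_funcs.py with
# a punctuation-aware tokenizer. tokenize() is a hypothetical stand-in for the
# project's standard tokenizer (e.g. a SacreBLEU-style one), not the real API.
import re

def whitespace_tokenize(text: str) -> list[str]:
    # Current behavior: punctuation stays attached to the adjacent word.
    return text.split()

def tokenize(text: str) -> list[str]:
    # Stand-in standard tokenizer: separates punctuation from words.
    return re.findall(r"\w+|[^\w\s]", text)

text = "The cat sat, didn't it?"
print(whitespace_tokenize(text))  # ['The', 'cat', 'sat,', "didn't", 'it?']
print(tokenize(text))             # ['The', 'cat', 'sat', ',', 'didn', "'", 't', 'it', '?']
```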

@odashi
Contributor

odashi commented Sep 12, 2022

@neubig Are there any consequences we need to take care of when the NLTK tokenizer is removed? I agree with removing it, though (it improves simplicity around the data dependency).

@neubig
Contributor Author

neubig commented Sep 12, 2022

I don't think so. The NLTK tokenizer is only used in that one file; everywhere else we already use different tokenizers that achieve very similar results (e.g. the SacreBLEU tokenizer).

@odashi
Contributor

odashi commented Sep 25, 2022

@neubig There are some uses of sent_tokenize, word_tokenize, and ngrams from NLTK. It looks like word_tokenize can be replaced by SacreBleuTokenizer and ngrams can be implemented ourselves, but there appears to be no alternative in this repository for sent_tokenize.
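
For the ngrams part, a minimal pure-Python sketch of what we could add (the function name and placement are just a suggestion), matching the behavior of nltk.util.ngrams:

```python
from collections.abc import Iterable, Iterator

def ngrams(tokens: Iterable[str], n: int) -> Iterator[tuple[str, ...]]:
    """Yield successive n-grams from a token sequence (like nltk.util.ngrams)."""
    tokens = list(tokens)
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i : i + n])

# list(ngrams(["a", "b", "c", "d"], 2)) -> [('a', 'b'), ('b', 'c'), ('c', 'd')]
```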

@neubig
Contributor Author

neubig commented Oct 10, 2022

Ah, thanks, I see. It'd be nice to have a wrapper for sentence tokenization too.
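
A minimal sketch of what such a wrapper's interface could look like; the class name is hypothetical and the regex-based splitting is only a placeholder (it does not handle abbreviations like "e.g." the way punkt does):

```python
import re

class SimpleSentenceTokenizer:
    """Hypothetical sentence tokenizer wrapper.

    Splits on sentence-final punctuation followed by whitespace. Much less
    robust than NLTK punkt, so this only illustrates the interface.
    """

    _BOUNDARY = re.compile(r"(?<=[.!?])\s+")

    def __call__(self, text: str) -> list[str]:
        return [s for s in self._BOUNDARY.split(text.strip()) if s]

# tok = SimpleSentenceTokenizer()
# tok("First sentence. Second one!")  -> ['First sentence.', 'Second one!']
```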
