Two places where tokenization should be standardized #473
Comments
@neubig Is there any consequence we need to take care of when the NLTK tokenizer is removed? I agree with removing it, though (it improves simplicity around data dependencies).
I don't think so. The NLTK tokenizer is only used in that one file; everywhere else we already use different tokenizers that achieve very similar results (e.g. the SacreBLEU tokenizer).
@neubig There are some uses of …
Ah, thanks, I see. It'd be nice to have a wrapper for sentence tokenization too.
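For illustration, such a wrapper could be quite small. The sketch below is hypothetical (the `SimpleSentenceSplitter` name and its interface are invented here, not ExplainaBoard's actual tokenizer API) and shows a dependency-free, regex-based splitter:

```python
import re


class SimpleSentenceSplitter:
    """Hypothetical regex-based sentence splitter (illustration only)."""

    # Split at whitespace that follows sentence-final punctuation.
    _BOUNDARY = re.compile(r"(?<=[.!?])\s+")

    def __call__(self, text: str) -> list[str]:
        # strip() avoids empty pieces from leading/trailing whitespace.
        return [piece for piece in self._BOUNDARY.split(text.strip()) if piece]


splitter = SimpleSentenceSplitter()
print(splitter("Punkt is heavy. A regex may suffice! Right?"))
# ['Punkt is heavy.', 'A regex may suffice!', 'Right?']
```

A heuristic like this misses abbreviations ("e.g.", "Dr.") that punkt handles, which is the trade-off to weigh before dropping the dependency.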
In most places in ExplainaBoard, we use standardized tokenizers to tokenize the source and target texts. However, there are (at least) two places where we use other tokenizers.
First, explainaboard/analysis/sum_attribute.py uses the NLTK punkt tokenizer. It would be nice to remove this because (1) it is non-standard for ExplainaBoard, (2) it is the only place where we use NLTK, so removing it would drop one library dependency, and (3) it downloads the punkt model from the NLTK site, resulting in an extra external data access: https://github.com/neulab/ExplainaBoard/blob/main/explainaboard/analysis/sum_attribute.py#L4
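For reference, the pattern being flagged looks roughly like the following; this is a paraphrase, not a verbatim copy of sum_attribute.py:

```python
import nltk

# The first call fetches the punkt model data from the NLTK servers,
# which is the extra external data access noted above.
nltk.download("punkt")

print(nltk.sent_tokenize("First sentence. Second one."))
# ['First sentence.', 'Second one.']
```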
Second, explainaboard/analysis/feature_funcs.py simply splits on whitespace instead of using the standard tokenizer: https://github.com/neulab/ExplainaBoard/blob/main/explainaboard/analysis/feature_funcs.py#L50
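To see why this matters, compare whitespace splitting with SacreBLEU's 13a tokenizer, which the comments above mention as what is used elsewhere (the example text is made up, and the import path assumes sacrebleu >= 2.0):

```python
from sacrebleu.tokenizers.tokenizer_13a import Tokenizer13a

text = "It's 3.5%, right?"

# Whitespace splitting keeps punctuation glued to words.
print(text.split())
# ["It's", '3.5%,', 'right?']

# 13a pads most punctuation with spaces while keeping '3.5' intact,
# so token counts (and any length-based features) differ.
print(Tokenizer13a()(text).split())
# roughly: ["It's", '3.5', '%', ',', 'right', '?']
```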