This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Tokenizer respects filters when char_level is True #302

Open
wants to merge 8 commits into master

Conversation

@paw-lu commented Jun 9, 2020

Summary

As outlined in #301, this PR makes keras.preprocessing.text.Tokenizer remove the characters listed in the filters argument when char_level=True.

Closes #301.

Behavior before

❯ import keras
❯ tokenizer = keras.preprocessing.text.Tokenizer(char_level=True, filters="e")
❯ tokenizer.fit_on_texts("ae")
❯ tokenizer.word_index
{'a': 1, 'e': 2}  # "e" is tokenized

Behavior after

❯ tokenizer = keras.preprocessing.text.Tokenizer(char_level=True, filters="e")
❯ tokenizer.fit_on_texts("ae")
❯ tokenizer.word_index
{'a': 1}  # "e" is not tokenized
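For illustration only (this is not the diff in this PR), the change amounts to dropping filtered characters before they are counted in the character-level path, mirroring what the word-level path already does through text_to_word_sequence. The helper name filter_chars below is hypothetical:

# Hypothetical sketch of the idea, not the code in this PR:
# remove filter characters before character-level tokenization.
def filter_chars(text, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True):
    if lower:
        text = text.lower()
    # Map every filtered character to None so str.translate drops it.
    translate_map = {ord(char): None for char in filters}
    return list(text.translate(translate_map))

filter_chars("ae", filters="e")  # ['a'] -- "e" is removed, matching the behavior above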

Related Issues

Closes #301

PR Overview

  • This PR requires new unit tests [y/n] (make sure tests are included)
  • This PR requires updates to the documentation [y/n] (make sure the docs are up-to-date)
  • This PR is backwards compatible [y/n]
  • This PR changes the current API [y/n] (all API changes need to be approved by fchollet)

@paw-lu changed the title from "Ignore" to "Tokenizer respects filters when char_level is True" on Jun 9, 2020