This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Tokenizer respects filters when char_level is True #302

Open
wants to merge 8 commits into master

Conversation

@paw-lu commented Jun 9, 2020

Summary

As outlined in #301, this PR makes keras.preprocessing.text.Tokenizer remove the characters listed in the filters argument when char_level=True.

Closes #301.

Behavior before

❯ import keras
❯ tokenizer = keras.preprocessing.text.Tokenizer(char_level=True, filters="e")
❯ tokenizer.fit_on_texts("ae")
❯ tokenizer.word_index
{'a': 1, 'e': 2}  # "e" is tokenized

Behavior after

❯ tokenizer = keras.preprocessing.text.Tokenizer(char_level=True, filters="e")
❯ tokenizer.fit_on_texts("ae")
❯ tokenizer.word_index
{'a': 1}  # "e" is not tokenized
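For illustration only (this is not the diff in this PR), the change amounts to dropping filtered characters before they are counted in the character-level path, mirroring what the word-level path already does through text_to_word_sequence. The helper name filter_chars below is hypothetical:

# Hypothetical sketch of the idea, not the code in this PR:
# remove filter characters before character-level tokenization.
def filter_chars(text, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True):
    if lower:
        text = text.lower()
    # Map every filtered character to None so str.translate drops it.
    translate_map = {ord(char): None for char in filters}
    return list(text.translate(translate_map))

filter_chars("ae", filters="e")  # ['a'] -- "e" is removed, matching the behavior above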

Related Issues

Closes #301

PR Overview

  • This PR requires new unit tests [y/n] (make sure tests are included)
  • This PR requires updates to the documentation [y/n] (make sure the docs are up-to-date)
  • This PR is backwards compatible [y/n]
  • This PR changes the current API [y/n] (all API changes need to be approved by fchollet)

@paw-lu changed the title from "Ignore" to "Tokenizer respects filters when char_level is True" on Jun 9, 2020