char-level Tokenizer.sequences_to_texts() insert additional SPACEs #346

XiYuan68 · 2021-07-29T14:05:02Z

Check that you are up-to-date with the master branch of keras-preprocessing. You can update with:
pip install git+git://github.com/keras-team/keras-preprocessing.git --upgrade --no-deps
Provide a link to a GitHub Gist of a Python script that can reproduce your issue (or just copy the script here if it is short).

Describe the problem.

from tensorflow.keras.preprocessing.text import Tokenizer

text = ['abc def']
tokenizer = Tokenizer(char_level=True, split='')
tokenizer.fit_on_texts(text)
sequence = tokenizer.texts_to_sequences(text)
text_after = tokenizer.sequences_to_texts(sequence)

print(text_after)
# prints: ['a b c   d e f']

Notice that text_after differs from text: additional spaces have been inserted between the characters.
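The extra spaces can be reproduced without Keras. The following is only a minimal sketch of what appears to happen, assuming the char-level tokenizer treats the space as an ordinary token and the tokens are then joined with a hard-coded ' ':

```python
# Pure-Python sketch of the round trip; this is not the Keras implementation.
text = "abc def"

# Char-level tokenization: every character, including the space, is a token.
chars = list(text)

# Joining with a hard-coded ' ' puts a separator between every pair of tokens,
# so the original space ends up surrounded by two extra spaces.
rejoined = " ".join(chars)

print(repr(rejoined))  # 'a b c   d e f'
print(rejoined == text)  # False
```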

Describe the expected behavior.

text_after should be the same as text.

I believe the line that joins the tokens is where the problem is. Replacing:

vect = ' '.join(vect)

with

vect = self.split.join(vect)

fixes the bug in my minimal example.
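To illustrate the proposed change in isolation, here is a hedged sketch with two hypothetical helpers (neither is part of keras-preprocessing). join_with_split mirrors the suggested self.split.join(vect). join_char_aware is an assumed alternative that ignores the separator in char-level mode, since the Tokenizer's default split is ' ' and the suggested change alone would still insert spaces when split is left at its default.

```python
def join_with_split(tokens, split):
    # Mirrors the proposed change: join with the configured separator.
    return split.join(tokens)


def join_char_aware(tokens, split=" ", char_level=False):
    # Assumed alternative: in char-level mode the tokens already contain
    # every character (spaces included), so no separator is needed.
    separator = "" if char_level else split
    return separator.join(tokens)


chars = list("abc def")

print(repr(join_with_split(chars, "")))   # 'abc def'
print(repr(join_with_split(chars, " ")))  # 'a b c   d e f'
print(repr(join_char_aware(chars, " ", char_level=True)))  # 'abc def'
print(repr(join_char_aware(["abc", "def"], " ", char_level=False)))  # 'abc def'
```

With split='' as in the example above, both variants restore the original text. They differ only when char_level=True is combined with a non-empty split.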

XiYuan68 added the text label on Jul 29, 2021
