Fix displaying text and tokens in Python containing invalid UTF-8 encodings for byte BPE models #501

csukuangfj · 2023-12-23T11:05:01Z

Byte BPE models represent some words as UTF-8 encoded bytes, so a predicted token sequence may not form valid UTF-8 characters. Decoding such a sequence causes UTF-8 decoding errors in Python. This PR fixes that by ignoring invalid UTF-8 encodings.
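A minimal Python sketch of the failure mode and of the "ignore invalid bytes" idea follows. It is an illustration only, not the code changed in this PR: the helper name `decode_lenient` and the truncated sample bytes are assumptions made up for the example.

```python
# Hypothetical input: the UTF-8 bytes of "你好" with the last byte missing,
# standing in for a byte BPE token sequence that ends mid-character.
full = "你好".encode("utf-8")  # 6 bytes, 3 per character
truncated = full[:-1]          # the second character is now incomplete


def decode_lenient(data: bytes) -> str:
    """Decode UTF-8, silently dropping bytes that do not form a valid character."""
    return data.decode("utf-8", errors="ignore")


# Strict decoding (the default) raises on the incomplete trailing sequence.
try:
    truncated.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lenient decoding keeps the valid prefix and drops the broken tail.
text = decode_lenient(truncated)
print(strict_failed, text)
```

Dropping the invalid bytes loses the partial character, but it keeps the rest of the result displayable. Passing `errors="replace"` instead would mark the damaged position with U+FFFD.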

csukuangfj added 4 commits December 22, 2023 16:23

Fix byte BPE for Python

8cfda16

small fixes

6f168fb

fix invalid utf8 in python

71212ac

typo fixes

e9339ab

csukuangfj mentioned this pull request Jan 3, 2024

Fix Byte BPE string results for Python. #512

Merged

csukuangfj closed this in #512 Jan 3, 2024

csukuangfj deleted the fix-bbpe branch January 4, 2024 06:19
