improve: Enhance code readability of prompt_tokenizers.py #707
Conversation
-for i in range(0, len(val), self.sequence_len):
-    res[key].append(val[i : i + self.sequence_len])
+for i in range(0, len(val), self.max_length):
+    res[key].append(val[i : i + self.max_length])
There is a difference between sequence_len and max_length for the completion prompts; they aren't the same/interchangeable in this case. The completion prompter tokenizes at a longer length and then splits the text into chunks of the correct length.
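For context, here is a minimal sketch of the behavior described above, assuming a Hugging Face-style tokenizer; the function and parameter names are illustrative, not taken from the actual code. The completion strategy tokenizes at the longer max_length and then slices the result into sequence_len-sized chunks.

def tokenize_completion(tokenizer, prompt, sequence_len, max_length):
    # Tokenize the whole prompt at the longer limit first...
    tokenized = tokenizer(
        prompt,
        truncation=True,
        max_length=max_length,  # deliberately longer than sequence_len
        return_tensors=None,
    )
    res = {key: [] for key in tokenized}
    # ...then split each field into windows of at most sequence_len tokens.
    for key, val in tokenized.items():
        for i in range(0, len(val), sequence_len):
            res[key].append(val[i : i + sequence_len])
    return res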
Perhaps we should also document this (as code comments?), since it would be confusing to others as they read the code.
Could you please provide more details? I'm still having trouble understanding; I will leave a comment here if needed. If the difference exists only in the completion tokenizing strategy, shouldn't it live there instead of in its parent class?
Reverted the change that unified sequence_len and max_length. PTAL
@winglian do you still have a concern or can you review this PR?
result = self.tokenizer(
    prompt,
    truncation=True,
    max_length=self.sequence_len,
Please note that it was sequence_len before, but the merged _tokenize method will use max_length. They were different.
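As an illustration only (not the code in this PR), one hypothetical way a merged _tokenize could keep the two lengths distinct is an optional max_length parameter that falls back to sequence_len:

def _tokenize(self, prompt, max_length=None):
    # Hypothetical sketch: the completion strategy passes its longer
    # max_length explicitly; every other caller falls back to sequence_len.
    effective_len = max_length if max_length is not None else self.sequence_len
    return self.tokenizer(
        prompt,
        truncation=True,
        max_length=effective_len,
        return_tensors=None,
    )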
@seungduk-yanolja looks good! thank you! I'll work on resolving the merge conflicts tomorrow.
@seungduk-yanolja I rebased your PR over current main, so will get this merged once the tests pass. thanks again!
I had to sort out these things first. Sorry, and thank you 🙏
Description
Motivation and Context
As more tokenizer variations are added, the complexity of prompt_tokenizers.py keeps increasing, and it is hard to understand what each code block does.
We should clean up the duplicated code and improve its readability.
How has this been tested?
I don't have CUDA devices, so I cannot run the tests. It would be great if someone could run them for me.
Changes
- Removed Duplication: merged the _tokenize method, which was present twice.
- Simplified Empty Text Handling.
- Code Clarifications: introduced IGNORE_INDEX, replacing varied constants like -100 (see the sketch after this list).
- Handling Unexpected Cases.
- Clean-up.
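As a sketch of the IGNORE_INDEX clean-up (variable names here are illustrative placeholders, not from the PR): -100 is the default ignore_index of PyTorch's CrossEntropyLoss, so naming the constant documents why prompt tokens are masked with it.

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

# Mask the prompt portion so loss is computed only on the completion tokens.
# prompt_token_ids / completion_token_ids are illustrative placeholders.
labels = [IGNORE_INDEX] * len(prompt_token_ids) + list(completion_token_ids)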
Screenshots (if appropriate)
Types of changes
Readability improvement