Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve: Enhance code readability of prompt_tokenizers.py #707

Merged
merged 4 commits into from
Oct 19, 2023

Conversation

seungduk-yanolja
Copy link
Contributor

Description

Motivation and Context

As many tokenzer variations are added, its complexity is increasing and it is hard to understand what each code block means.
We better clean up duplicated code and enhance code readability.

How has this been tested?

I don't have Cuda devices so cannot run tests. It would be great if someone can run tests for me.

Changes

  1. Removed Duplication:

    • Removed the redundant _tokenize method which was present twice.
    • Unified the logic of handling user and assistant roles in tokenization.
  2. Simplified Empty Text Handling:

    • Streamlined the empty text tokenization process.
  3. Code Clarifications:

    • Simplified the checks for adding eos tokens and stripping bos tokens, making it more readable.
    • Unified the constant IGNORE_INDEX replacing varied constants like -100.
  4. Handling Unexpected Cases:

    • Enhanced error handling for unexpected roles.
  5. Clean-up:

    • Removed unnecessary imports.
    • Simplified several parts of the code to improve readability.

Screenshots (if appropriate)

Types of changes

Readability improvement

for i in range(0, len(val), self.sequence_len):
res[key].append(val[i : i + self.sequence_len])
for i in range(0, len(val), self.max_length):
res[key].append(val[i : i + self.max_length])
Copy link
Collaborator

@winglian winglian Oct 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a difference in the completion prompts between the sequence len and the max length as they aren't the same/interchangeable in this case. the completion prompter tokenizes at a longer length and then splits the text into correct length

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps, we should also document this (as comments?) as it would be confusing to others as they read the code.

Copy link
Contributor Author

@seungduk-yanolja seungduk-yanolja Oct 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please provide more details? I'm still having trouble understanding. I will leave a comment here if needed. If the difference exists only in the completion tokenizing strategy, doesn't it have to be here instead of its parent class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted the change that I unified sequence_len and max_length. PTAL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@winglian do you still have a concern or can you review this PR?

result = self.tokenizer(
prompt,
truncation=True,
max_length=self.sequence_len,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note that it was sequence_len before but the merged _tokenize method will use max_length. They were different.

@winglian
Copy link
Collaborator

@seungduk-yanolja looks good! thank you! I'll work on resolving the merge conflicts tomorrow.

@winglian
Copy link
Collaborator

@seungduk-yanolja I rebased your PR over current main, so will get this merged once the tests pass. thanks again!

@seungduk-yanolja
Copy link
Contributor Author

I had to sort out these things before sorry and thank you 🙏

@winglian winglian merged commit 3a99495 into axolotl-ai-cloud:main Oct 19, 2023
mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request Dec 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants