
Show sample batch content #2145

Open · fzyzcjy opened this issue Dec 7, 2024 · 2 comments
Labels: enhancement (New feature or request)

Comments


fzyzcjy commented Dec 7, 2024

⚠️ Please check that this feature request hasn't been suggested before.

  • I searched previous Ideas in Discussions and didn't find any similar feature requests.
  • I searched previous Issues and didn't find any similar feature requests.

🔖 Feature description

Hi, thanks for the library! A common practice in deep learning seems to be to log the exact inputs and labels of (a portion of) a single batch. For example, I personally log the input_ids (and convert them back to text), attention_masks, model outputs, labels, etc.

This can help debug a lot of problems. For example, if someone has a wrong BOS/EOS token, it can be spotted immediately. As another example, if we want to train on completions only but forget to do so, the logged labels can hint at that.
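As a rough illustration, here is a minimal sketch of the kind of dump this could produce, assuming a Hugging Face tokenizer and a collated causal-LM batch where ignored label positions are set to -100 (the dump_sample_batch helper is made up for this example):

```python
def dump_sample_batch(batch, tokenizer, max_examples=1):
    """Hypothetical helper: print a human-readable view of one collated batch."""
    for i in range(min(max_examples, batch["input_ids"].size(0))):
        input_ids = batch["input_ids"][i]
        labels = batch["labels"][i]
        print("input text  :", tokenizer.decode(input_ids, skip_special_tokens=False))
        # Decode only the positions that contribute to the loss; a missing
        # "train on completions only" mask is immediately visible here.
        print("trained text:", tokenizer.decode(input_ids[labels != -100], skip_special_tokens=False))
        print("attention_mask:", batch["attention_mask"][i].tolist())
```

Decoding the full input with special tokens kept also makes a wrong BOS/EOS token easy to spot.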

✔️ Solution

(see above)

❓ Alternatives

No response

📝 Additional Context

No response

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this feature has not been requested yet.
  • I have provided enough information for the maintainers to understand and evaluate this request.
fzyzcjy added the enhancement label on Dec 7, 2024
winglian (Collaborator) commented Dec 8, 2024

This might be feasible using the trainer_callback (see https://github.com/huggingface/transformers/blob/main/src/transformers/trainer_callback.py#L283-L284)

However, since the train_dataloader is passed to callbacks such as on_log or on_step_end, you could in theory get data from it; I'm just not exactly sure how to get the current row at the current step without advancing the iterator on the dataloader and affecting the actual training.
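A rough sketch of that direction, assuming a plain transformers Trainer; the SampleBatchLogger name and the collator-wrapping trick are just one possible way around the iterator problem, not an existing axolotl or transformers API:

```python
from transformers import TrainerCallback


class SampleBatchLogger(TrainerCallback):
    """Stash each batch as it is collated, then log it from on_step_end.

    Callbacks do receive train_dataloader in kwargs, but iterating it there
    would use a separate iterator rather than return the batch of the current
    step, so the batch is captured at collation time instead.
    """

    def __init__(self, tokenizer, log_every=100):
        self.tokenizer = tokenizer
        self.log_every = log_every
        self._last_batch = None

    def wrap_collator(self, collate_fn):
        def collate_and_stash(features):
            batch = collate_fn(features)
            self._last_batch = batch  # the batch about to be fed to the model
            return batch
        return collate_and_stash

    def on_step_end(self, args, state, control, **kwargs):
        if self._last_batch is None or state.global_step % self.log_every:
            return
        text = self.tokenizer.decode(self._last_batch["input_ids"][0])
        print(f"step {state.global_step} sample input: {text!r}")
```

Wiring it up would look roughly like `cb = SampleBatchLogger(tokenizer)`, `trainer.data_collator = cb.wrap_collator(trainer.data_collator)`, `trainer.add_callback(cb)` before `trainer.train()`, though how cleanly that maps onto axolotl's trainer builder is an open question.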

If you're up for tackling this and submitting a PR, we would be happy to help and get it merged in.

fzyzcjy (Author) commented Dec 8, 2024

For my internal code, I am currently writing a Trainer subclass that hacks compute_loss for this; at the same time, I want to use axolotl as a comparison test to reveal potential bugs in my internal code (i.e. axolotl and my code should get the same accuracy). Thus I may not have enough time to submit a PR to axolotl. But if you would like to have a look at how I hacked it as a rough draft, feel free to ping me.
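For reference, a minimal sketch of that kind of compute_loss hack on top of a plain transformers Trainer could look like the following (the class name and the step-0 condition are illustrative, not the actual internal code):

```python
from transformers import Trainer


class BatchLoggingTrainer(Trainer):
    """Dump the first training batch before delegating to the normal loss."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        if self.state.global_step == 0:
            # On newer transformers versions the tokenizer lives on self.processing_class.
            text = self.tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=False)
            print("sample input text:", text)
            print("labels:", inputs["labels"][0].tolist())
        return super().compute_loss(model, inputs, return_outputs=return_outputs, **kwargs)
```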
