Porting missing code filtering rules to dolma repo #86

Merged: @soldni merged 109 commits into main from soldni/missing-starcoder on Nov 27, 2023

@soldni (Member) commented on Nov 22, 2023

  • adds support for taggers that use metadata
  • ports code taggers from allenai/LLM
  • adds new taggers that count repetitions using regexes and tokenizers
  • adds a tagger that counts document length excluding whitespace
  • adds scripts to make plots for dolma papers (scripts/dolma_paper_plots.sh, scripts/wandb_to_plot.py)
  • adds a script to find a document from a tokenizer offset (scripts/find_offset.py)
  • adds tests for the new taggers
  • improves the GitHub Action to cache state
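
For readers unfamiliar with what the repetition and length taggers compute, here is a minimal standalone sketch. It is not the code from this PR and does not use the dolma tagger API. The function names, the `(start, end, label, score)` span layout, and the `min_length` threshold are all assumptions made for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical span layout for illustration: (start, end, label, score).
Span = Tuple[int, int, str, float]

# Shortest unit followed by one or more exact copies of itself.
# Backreference matching can be slow on long inputs; this is only a sketch.
_REPEAT_RE = re.compile(r"(.+?)\1+")


def non_whitespace_length(text: str) -> int:
    """Count the characters in `text` that are not whitespace."""
    return sum(1 for ch in text if not ch.isspace())


def repetition_spans(text: str, min_length: int = 4) -> List[Span]:
    """Return spans covering repeated substrings of at least `min_length` chars.

    The score is the length of the repeated run (an assumed convention).
    """
    spans: List[Span] = []
    for match in _REPEAT_RE.finditer(text):
        run_length = match.end() - match.start()
        if run_length >= min_length:
            spans.append((match.start(), match.end(), "repetition", float(run_length)))
    return spans
```

A filter built on such scores would typically compare the total repeated length against the non-whitespace length of the document. The thresholds actually used in this PR are not stated here.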

@soldni marked this pull request as ready for review November 27, 2023 03:53
@soldni merged commit 38fa168 into main Nov 27, 2023
13 checks passed
@soldni deleted the soldni/missing-starcoder branch November 27, 2023 04:34