Tessa/callibration script #937
base: main
Conversation
lgtm! I kinda hate checking in notebooks but I do think it's better than a script in this case.
Given that this notebook/script is mostly for y'all, has lots of hardcoded stuff, etc., let's note in the README that the calibration scripts are experimental and subject to change at any time.
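Something along the lines of "Note: the calibration scripts in this folder are experimental and subject to change at any time." would be enough.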
@@ -0,0 +1,10 @@
# Callibration
Suggested change:
- # Callibration
+ # Calibration
throughout
A good benchmark is one that clearly shows which models are better and which are worse. We test our benchmark tasks by using a series of progressively more advanced models to see whether the benchmarks effectively differentiate between them, and at which number of shots they perform best.

To run the code:
* Select an independant variable and a series of models that correspond to the settings of that variable
Suggested change:
- * Select an independant variable and a series of models that correspond to the settings of that variable
+ * Select an independent variable and a series of models that correspond to the settings of that variable
throughout
I suggest clearing the cell output before committing the notebook.
Easiest thing might be to just add the pre-commit hook from composer for this:
```yaml
- repo: https://github.com/kynan/nbstripout
  rev: 0.5.0
  hooks:
  - id: nbstripout
    types:
    - "jupyter"
    args:
    # Strip all the metadata that vscode or colab may add to a notebook
    - --strip-empty-cells
    - --extra-keys
    - >
      metadata.colab metadata.interpreter metadata.accelerator
      metadata.kernelspec metadata.language_info.version
      cell.metadata.heading_collapsed metadata.name metadata.nbconvert_exporter
      metadata.version metadata.vscode
```
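If we go this route, a one-time `pre-commit install` at the repo root wires the hook into local commits, and `pre-commit run nbstripout --all-files` can strip output from notebooks that are already checked in (assuming the hook id shown above).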
integrations:
- integration_type: git_repo
  git_repo: mosaicml/llm-foundry
  git_branch: main
should probably be pinned to a release.
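For example, one way to pin it (the tag below is a placeholder; use whichever llm-foundry release is current):

```yaml
integrations:
- integration_type: git_repo
  git_repo: mosaicml/llm-foundry
  git_branch: v0.x.y  # placeholder: pin to a tagged release instead of main
```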
Let's move this to the eval/yamls folder.
Would you mind adding the MCLI name of a test run you launched so I can go back and look at it? Additionally, a screenshot of the resulting notebook would be good, so that when I come back to this later I can confirm that I got the correct results.
Great work! Can you just address Daniel's comments as well as update the description as I requested?
Thx Tessa!!!
Here is code we use to test our benchmark tasks by using a series of progressively more advanced models to see whether the benchmarks effectively differentiate between them, and at which number of shots they perform best.
* Edit base_callibration.yaml to reflect the models you want to see
* Run the analyze_output notebook, which collates the results from wandb
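For illustration, the model entries in base_callibration.yaml might look roughly like the following. This is a hypothetical sketch only, with placeholder names and fields; the actual schema is whatever the YAML in this PR defines:

```yaml
# Hypothetical sketch, not the actual base_callibration.yaml schema.
# One entry per setting of the independent variable (here, model size),
# ordered weakest to strongest so a good benchmark should separate them.
models:
- name: tiny-model      # placeholder name
  size: 125M
- name: medium-model    # placeholder name
  size: 1.3B
- name: large-model     # placeholder name
  size: 7B
```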