scrape and upload 2024 training artifacts #948
Alright, modulo the few missing groups noted above, I believe I have everything scraped and ready to upload. The upload is likely to take many days, so I'd appreciate a sanity check before doing so. The list of files that I will upload can be found on this spreadsheet. They will be uploaded in that exact directory structure into the root of this bucket (which is the same place I uploaded last time I scraped). The total size of what I've scraped is 935 GB. As with last time, I've scraped everything from tasks prefixed with one of the following strings:
Within evaluation we have additional possible subdirectories:
@eu9ene - can you please sanity check the above before I start the upload? (I can also give you access to the instance the data is on if you'd like to poke around there instead.)
@bhearsum the directory structure and your list look good, but we can save a bit of space by not uploading the following files from the model training artifacts: `public/build/model.npz`. They are intermediate checkpoints for different metrics, and we care only about the
I've prepared a new directory with all files with those names excluded; the full list is available in a new tab on the sheet. This removed ~2,500 files compared to the previous round, and took us from 935 GB to 108 GB of data. @eu9ene - let me know how this revised version looks.
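For illustration, the exclusion step could be a small filter applied while building the upload list. A minimal sketch; note that the exclusion set here contains only the one filename quoted in the thread, while the full list of excluded checkpoint names lives in the spreadsheet:

```python
import os

# Assumed exclusion set: the thread quotes only "model.npz"; the complete
# list of intermediate-checkpoint filenames is in the review spreadsheet.
EXCLUDED_BASENAMES = {"model.npz"}

def files_to_upload(root):
    """Yield paths (relative to root) of files to upload, skipping any
    file whose basename is in the exclusion set."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name in EXCLUDED_BASENAMES:
                continue
            yield os.path.relpath(os.path.join(dirpath, name), root)
```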
It looks good now.
The upload has completed. https://console.cloud.google.com/storage/browser/moz-fx-translations-data--303e-prod-translations-data;tab=objects?prefix=&forceOnObjectsSortingFiltering=false should have all of the data from the spreadsheet.
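An upload like this could be scripted around `gsutil rsync`, which mirrors a local tree into the bucket root while preserving the directory layout. A hedged sketch, not the exact command used in this thread; the bucket name comes from the link above, while the local path and dry-run default are placeholders:

```python
import shlex

# Bucket name taken from the thread; the local directory is a placeholder.
BUCKET = "gs://moz-fx-translations-data--303e-prod-translations-data"

def upload_command(local_root, dry_run=True):
    """Build a `gsutil rsync` invocation that mirrors the local tree into
    the bucket root, preserving the directory structure."""
    cmd = ["gsutil", "-m", "rsync", "-r"]  # -m: parallel, -r: recursive
    if dry_run:
        cmd.append("-n")  # gsutil's dry-run flag: list what would be copied
    cmd += [local_root, BUCKET]
    return shlex.join(cmd)

print(upload_command("scrape-output"))
```

Running with `dry_run=True` first is a cheap sanity check before committing to a multi-day transfer.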
We need to scrape and upload the artifacts from the large scale training that we did in 2024 before they expire from Taskcluster. Over Matrix, @eu9ene told me that all task groups listed on the first two sheets of https://docs.google.com/spreadsheets/d/1EzcB-BSfC-_U_lg4NOfaTtBasOGJ9LxkURoqOcHajnM/edit?gid=305193406#gid=305193406 should be processed. That full list is attached: big-scrape-groups.txt.
Of the list above, the following task groups had no tasks that we scrape artifacts from (they were almost entirely dataset/clean/bicleaner/split/merge tasks; some had `train` tasks scheduled, but never ran):

The following groups did not exist at all:
I'm still in the process of scraping and organizing these artifacts for upload; once I have all the files locally I'll dump the directory tree into a spreadsheet for review before I upload.
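A directory-tree dump for review could look like the following. This is a sketch under my own assumptions (CSV output and the column names are mine; the thread actually uses a Google Sheet):

```python
import csv
import os

def dump_file_list(root, out_path):
    """Write one CSV row per file (relative path, size in bytes) and
    return the total size, for pasting into a review spreadsheet."""
    total = 0
    with open(out_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["path", "bytes"])
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                size = os.path.getsize(full)
                total += size
                writer.writerow([os.path.relpath(full, root), size])
    return total
```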