
scrape and upload 2024 training artifacts #948

Open · bhearsum opened this issue Nov 27, 2024 · 5 comments

@bhearsum (Collaborator) commented Nov 27, 2024

We need to scrape and upload the artifacts from the large-scale training that we did in 2024 before they expire from Taskcluster. Over Matrix, @eu9ene told me that all task groups listed on the first two sheets of https://docs.google.com/spreadsheets/d/1EzcB-BSfC-_U_lg4NOfaTtBasOGJ9LxkURoqOcHajnM/edit?gid=305193406#gid=305193406 should be processed. The full list is attached: big-scrape-groups.txt.
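
For context, task groups like these can be enumerated through the Taskcluster Queue REST API. A minimal sketch, assuming the Firefox CI root URL (the actual scraping tooling may differ); a 404 from the list endpoint is how the nonexistent groups below would surface:

```python
import requests

# Assumed root URL for the Firefox CI Taskcluster instance.
QUEUE = "https://firefox-ci-tc.services.mozilla.com/api/queue/v1"

def tasks_in_group(group_id: str):
    """Yield (taskId, task name) for every task in a group, following pagination."""
    params = {}
    while True:
        resp = requests.get(f"{QUEUE}/task-group/{group_id}/list", params=params)
        resp.raise_for_status()  # a group that doesn't exist 404s here
        data = resp.json()
        for entry in data["tasks"]:
            yield entry["status"]["taskId"], entry["task"]["metadata"]["name"]
        token = data.get("continuationToken")
        if not token:
            break
        params = {"continuationToken": token}

def artifact_names(task_id: str):
    """Return artifact names from the task's latest run (pagination omitted)."""
    resp = requests.get(f"{QUEUE}/task/{task_id}/artifacts")
    resp.raise_for_status()
    return [a["name"] for a in resp.json()["artifacts"]]
```

Individual artifacts can then be fetched from `{QUEUE}/task/{taskId}/artifacts/{name}`.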

Of the list above, the following task groups had no tasks whose artifacts we scrape (they were almost entirely dataset/clean/bicleaner/split/merge tasks; some had train tasks scheduled that never ran):

BxN54ej5Q8K4nBaBNdcZsQ
CJcaa0GLT9e2i5lNBwWazQ
D4JxfCWPTc22cU0du4rW_A
DbzZmDJZSLS9XNpmHX8OMA
E1yWYiDtRjeQs1khuAAZnw
FDIq7ZEmQB6BXWO2K9xrdw
FRubfBm4TLetQ4XlJIUISg
FZ-qmI7HSjyDeCOeJAWJBA
FnZjvwEvT9a0FTgJ_ll66Q
FxLCMgVyTjSuB839EZmTpA
GNZKnSbtQMiHczVfarqHwg
IwwoOph6RX-U1tAxA61l4Q
JQwJ5OITQmCNn0UAJ41T_w
KJ_cva4BSk623AV6wYIZlQ
KRXRJ_lTSWWrv0F20lORCQ
KhkyUfCIRD-ByQbxs310pA
KsSrCXPtRzCie4wkejInsA
Lhwmosd-R3aqMCt96ZugsQ
M8RvDoI7TnOuTk0kiFVIjg
O5nwoBdFSACkjNaOhhIwzw
OLD3_NcGRm-4RpQmXe0ngg
Ot4HVSSNSKqMVuVGthKsyg
Qg9PyeT9RRi_uv50g_f6sQ
QlPQlm85TAyEHL4qr_HXiA
RhuNiAW3SRqiBwoutaEGaQ
RkMIb_7XSEGHlNvNsdXmPQ
SsmDnqoBTyGStdOvvzK5Vg
T1RFo6nVQTy0iy1Bwdz7cQ
Tbkg0bxkROyaTd2C4tBpdg
Uv7EgA9SQdGT54nkWFwACQ
V-OmRM1yS_GwESTVDPl0VQ
V3x_-at2T5K2FU1ISqz1XA
VgR9RS46SIqkfem2mLGKcA
Vn1illbAREaz_zQy4VrbYw
WFoWgKmGRxa44hppCtvwmg
eZKkxqHISTCDwrsylAZZvA
ebpYrNxgQh-b5mRHbdi6bA

The following groups did not exist at all:

DXbS0zreSGSVYloAF8gwJg
EW7qV3U5SBSjegnTxGZHkw
K1iHndFUSxSEDRLg_H9l1A
Tkrf0fGBQEO6kH-gSKp5lg
aY25-4fXTcuJNuMcWXUYtQ
fYJkSp6IRYqnLvFOgwXPaA

I'm still in the process of scraping and organizing these artifacts for upload; once I have all the files locally I'll dump the directory tree into a spreadsheet for review before I upload.

bhearsum self-assigned this Nov 27, 2024
@bhearsum (Collaborator, Author)

Alright, modulo the few missing groups noted above, I believe I have everything scraped and ready to upload. The upload is likely to take many days, so I'd appreciate a sanity check before I start. The list of files that I will upload can be found on this spreadsheet. They will be uploaded in that exact directory structure into the root of this bucket (the same place I uploaded to the last time I scraped).

The total size of what I've scraped is 935GB.

As with last time, I've scraped everything from tasks whose names begin with one of the following prefixes: "train-", "finetune-", "vocab", "export", "evaluate-", "quantize" (as well as the training configs). This results in the following possible subdirectories in individual experiment+group directories (a sketch of this prefix filter follows the two lists below):

backward
evaluation
exported
quantized
student
student-finetuned
teacher0
teacher1
vocab

Within evaluation we have additional possible subdirectories:

backward
speed
student
student-finetuned
teacher-ensemble
teacher0
teacher1
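
To make that selection concrete, here is a minimal sketch of the prefix filter described above; `tasks_in_group` is the hypothetical helper from the earlier sketch:

```python
# Task-name prefixes that mark tasks whose artifacts are kept,
# as listed in the comment above.
PREFIXES = ("train-", "finetune-", "vocab", "export", "evaluate-", "quantize")

def should_scrape(task_name: str) -> bool:
    # str.startswith accepts a tuple and matches any of its prefixes.
    return task_name.startswith(PREFIXES)

# Hypothetical usage with the earlier helper:
# to_fetch = [(tid, name) for tid, name in tasks_in_group(group_id)
#             if should_scrape(name)]
```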

@eu9ene - can you please sanity check the above before I start the upload? (I can also give you access to the instance the data is on if you'd like to poke around there instead.)

@eu9ene (Collaborator) commented Dec 13, 2024

@bhearsum the directory structure and your list look good, but we can save a bit of space by not uploading the following files from the model training artifacts:

public/build/model.npz
public/build/model.npz.best-bleu-detok.npz
public/build/model.npz.best-bleu-detok.npz.decoder.yml
public/build/model.npz.best-ce-mean-words.npz
public/build/model.npz.best-ce-mean-words.npz.decoder.yml
public/build/model.npz.best-chrf.npz
public/build/model.npz.best-chrf.npz.decoder.yml
public/build/model.npz.decoder.yml
public/build/model.npz.optimizer.npz

They are intermediate checkpoints for different metrics, and we care only about the final*.npz model files, which are copies of the best-chrf ones. Also, I don't think we care about the ~3GB .optimizer.npz files. If we ever continue training those models, we'll just fine-tune the final checkpoints, and we don't need .optimizer.npz for that.

@bhearsum (Collaborator, Author)

I've prepared a new directory that excludes all files with those names; the full list is available in a new tab on the sheet. This removed ~2,500 files compared to the previous round and took us from 935GB to 108GB of data.
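
For the record, the exclusion boils down to filtering on basenames. A minimal sketch (not necessarily the exact tooling used here):

```python
from pathlib import Path

# Basenames to drop, per @eu9ene's list above. Everything else is kept,
# including the final*.npz checkpoints.
EXCLUDE = {
    "model.npz",
    "model.npz.best-bleu-detok.npz",
    "model.npz.best-bleu-detok.npz.decoder.yml",
    "model.npz.best-ce-mean-words.npz",
    "model.npz.best-ce-mean-words.npz.decoder.yml",
    "model.npz.best-chrf.npz",
    "model.npz.best-chrf.npz.decoder.yml",
    "model.npz.decoder.yml",
    "model.npz.optimizer.npz",
}

def keep(path: Path) -> bool:
    """True for files that should survive into the upload directory."""
    return path.name not in EXCLUDE
```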

@eu9ene - let me know how this revised version looks.

@eu9ene (Collaborator) commented Dec 13, 2024

It looks good now.
