Data Processing #8

Open
jialianwww opened this issue Nov 25, 2024 · 1 comment

Comments

@jialianwww commented Nov 25, 2024

Dear author,

Thanks for introducing the great ProLong along with the full code!

I downloaded the raw data from AWS S3 and tried to prepare it myself, since I will be using a different tokenizer. I ran into two issues:

  1. I used pack.py to pack the raw data with your provided command, plus --tokenizer llama3 (roughly the invocation sketched after this list). The resulting dataset is smaller than your packed data: for the 64K-length "textbooks" subset, your packed data has 3628 samples while mine has 3562. I also noticed that the first sample in your packed data does not start at the first word of a document; it starts somewhere in the middle, whereas the first sample in my packed data starts at the first word of the dataset. Did you do any extra processing?

  2. Is the pack.py script all we need to convert the raw (tokenized) data into the ready-for-training format? You mentioned that you used datatools to filter and pack the data. I found the pack script in datatools, but not a filtering script. Does "filtering" only refer to discarding documents shorter than the minimum length, i.e., was it done via pack.py by setting a minimum length?
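For reference, my packing invocation looked roughly like this (the paths are placeholders, and every other argument simply follows the command provided in your README; the only thing I added was the tokenizer flag):

```bash
# Placeholder paths; all other options come from the README's packing command.
python pack.py <raw_data_dir> <packed_output_dir> --tokenizer llama3
```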

Thanks for your time; I appreciate your help!

@gaotianyu1350 (Member) commented Nov 27, 2024

Hi,

Sorry for the confusion. We randomly shuffled the data before uploading, which caused the differences. The shuffling is actually not necessary, since our training code shuffles the data again when loading it.

To filter by length, you can check our updated README for example code with datatools. You can use --min_length 65536 to specify the minimum length a document must have.
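In other words, the filtering here is length-based. As a rough mental model of what filtering plus packing amounts to (illustrative only, not the actual datatools code; the function name, arguments, and the fixed pack length are assumptions):

```python
def filter_and_pack(token_docs, min_length=65536, pack_length=65536):
    """Illustrative sketch: drop documents shorter than `min_length` tokens,
    then concatenate the survivors into fixed-length training sequences."""
    buffer = []
    for doc in token_docs:                # each doc is a list of token ids
        if len(doc) < min_length:         # the "filtering" step
            continue
        buffer.extend(doc)                # the "packing" step
        while len(buffer) >= pack_length:
            yield buffer[:pack_length]    # one fixed-length training sample
            buffer = buffer[pack_length:]
```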

As for the difference in the number of samples, unfortunately we could not trace back the cause either. It is likely due to slight changes in the filtering/packing logic over time, but both datasets should be correct and fine for training (do let us know if you observe any abnormality). Thank you!
