Data Processing #8

Open
jialianwww opened this issue Nov 25, 2024 · 1 comment

Comments

@jialianwww commented Nov 25, 2024

Dear author,

Thanks for introducing the great ProLong along with the full code!

I downloaded the raw data from AWS S3 and tried to prepare it myself, since I will be using a different tokenizer. I ran into two issues:

  1. I used pack.py to pack the raw data with your provided command, plus --tokenizer llama3 (roughly the invocation sketched after this list). The resulting dataset is smaller than your packed data: for the 64K-length "textbooks" subset, your packed data has 3628 samples while mine has 3562. I also noticed that the first sample in your packed data does not start at the first word of a document; it starts somewhere in the middle, whereas the first sample in my packed data starts at the first word of the dataset. Did you do any extra processing?

  2. Is the pack.py script all we need to convert the raw (tokenized) data into the ready-for-training format? You mentioned that you used datatools to filter and pack the data. I found the pack script in datatools, but not a filtering script. Does "filtering" only refer to discarding documents shorter than the minimum length, i.e., was it done via pack.py by setting a minimum length?
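For reference, my packing invocation looked roughly like this (the paths are placeholders, and every other argument simply follows the command provided in your README; the only thing I added was the tokenizer flag):

```bash
# Placeholder paths; all other options come from the README's packing command.
python pack.py <raw_data_dir> <packed_output_dir> --tokenizer llama3
```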

Thanks for your time; I appreciate your help!

@gaotianyu1350 (Member) commented Nov 27, 2024

Hi,

Sorry for the confusion. We randomly shuffled the data before uploading, which caused the differences. The shuffling is actually not necessary, since our training code shuffles the data again when loading it.

To filter by length, you can check our updated README for example code with datatools. You can use --min_length 65536 to specify the minimum length a document must have.
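In other words, the filtering here is length-based. As a rough mental model of what filtering plus packing amounts to (illustrative only, not the actual datatools code; the function name, arguments, and the fixed pack length are assumptions):

```python
def filter_and_pack(token_docs, min_length=65536, pack_length=65536):
    """Illustrative sketch: drop documents shorter than `min_length` tokens,
    then concatenate the survivors into fixed-length training sequences."""
    buffer = []
    for doc in token_docs:                # each doc is a list of token ids
        if len(doc) < min_length:         # the "filtering" step
            continue
        buffer.extend(doc)                # the "packing" step
        while len(buffer) >= pack_length:
            yield buffer[:pack_length]    # one fixed-length training sample
            buffer = buffer[pack_length:]
```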

As for the difference in the number of samples, unfortunately we could not trace back the cause either. It is likely due to slight changes in the filtering/packing logic over time, but both datasets should be correct and fine for training (do let us know if you observe any abnormality). Thank you!
