Hi,
I have downloaded the Amazon data (38 files) and ran create_data.py with:
python amazon_qa/create_data.py --file_pattern AmazonQA/* --output_dir AmazonQA/processed/ --runner DirectRunner --temp_location AmazonQA/processed/temp --staging_location AmazonQA/processed/staging --dataset_format JSON
This produces 100 train*.json and 100 test*.json files under the AmazonQA/processed/ folder. After reading all the data, I get 158,974 training samples and 16,763 test samples.
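For reference, this is roughly how I counted — a minimal sketch, assuming the pipeline's JSON output writes one serialized example per line (the paths match my run above):

import glob

def count_examples(pattern):
    # Each output shard holds one JSON-serialized example per line,
    # so counting lines counts examples.
    return sum(sum(1 for _ in open(path)) for path in glob.glob(pattern))

print(count_examples("AmazonQA/processed/train-*.json"))  # 158,974 for me
print(count_examples("AmazonQA/processed/test-*.json"))   # 16,763 for me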
What is the number of samples used in the paper: 3M or 158.9K? I am confused because my count differs from the number listed in the repo.
P.S. I saw that some filtering is done in create_data.py.
For reference, these are the statistics listed in the repo for the conversational dataset:
Input files: 38
Number of QA dictionaries: 1,569,513
Number of tuples: 4,035,625
Number of de-duplicated tuples: 3,689,912
Train set size: 3,316,905
Test set size: 373,007
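(As a sanity check, the train and test sizes add up to the de-duplicated tuple count: 3,316,905 + 373,007 = 3,689,912, i.e. roughly a 90/10 split.)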
Thank you in advance for your kind reply.
Hi Jason,
The training set size should be 3.3M. Maybe double-check that there are indeed 38 input files? For me: TOTAL: 38 objects, 1935927109 bytes (1.8 GiB).
I just re-ran the pipeline (with the Google Cloud DataflowRunner and JSON output) and can confirm these numbers. A quick check: wc -l data/test-00099-of-00100.json gives 3729.
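If the shards are roughly even, that works out to about 3729 × 100 ≈ 373,000 test examples; wc -l data/test-*-of-00100.json | tail -1 should print a total close to 373,007.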
That's strange. I do have 38 files totalling around 1.8G. Could the issue be with using --runner DirectRunner?
When I run wc AmazonQA/processed/test-00099-of-00100.json I get 167 6503 39507 AmazonQA/processed/test-00099-of-00100.json, i.e. only 167 lines instead of ~3729. I also found that my AmazonQA/processed/ folder is only 41M in total.
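(167 lines × 100 shards ≈ 16,700, which matches the 16,763 test samples I counted, so the examples seem to be getting dropped during the pipeline run itself rather than when I read the output.)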