Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Yelp Reviews Datasets #20

Merged
merged 10 commits into from
Dec 7, 2021

Conversation

goodhamgupta
Copy link
Contributor

@goodhamgupta goodhamgupta commented Dec 4, 2021

Hi everyone,

Thanks for this great library! This PR aims to add training and test files for the Yelp Reviews datasets(#11 and #16). Based on the huggingface datasets library, the two datasets added are:

For the Yelp Full Reviews dataset:

  • The train dataset consists of 650k records and a total of 5 labels: 1,2,3,4 and 5
  • The test dataset consists of 50k records with the same number of labels as above.
  • It can be queried as follows:
%{review: inputs, rating: targets} = Scidata.YelpFullReviews.download
%{review: test_inputs, rating: test_targets} = Scidata.YelpFullReviews.download_test

Similarly, for the Yelp Polarity dataset:

  • The train dataset consists of 560k records with a total of 2 labels: 0(negative) and 1(positive)

    • NOTE: The labels in the dataset are present a "1" and "2", with "1" including all reviews with 1 and 2 star rating, and "2" including all reviews with 3 and 4 star rating. (Converted based on huggingface implementation here)
  • The test dataset consists of 38k records with the same number of labels as above.

  • It can be queried as follows:

%{review: inputs, sentiment: targets} = Scidata.YelpPolarityReviews.download
%{review: test_inputs, sentiment: test_targets} = Scidata.YelpPolarityReviews.download_test

Also, for downloading the datasets, I'm currently using the URLs provided by Fast AI. Do let me know if you would like me to change them.

Thanks!

@goodhamgupta goodhamgupta marked this pull request as draft December 4, 2021 07:04
@goodhamgupta goodhamgupta marked this pull request as ready for review December 4, 2021 07:05
@goodhamgupta goodhamgupta marked this pull request as draft December 4, 2021 07:07
@goodhamgupta goodhamgupta marked this pull request as ready for review December 4, 2021 07:07
mix.exs Outdated
@@ -30,7 +30,8 @@ defmodule Scidata.MixProject do

defp deps do
[
{:ex_doc, ">= 0.24.0", only: :dev, runtime: false}
{:ex_doc, ">= 0.24.0", only: :dev, runtime: false},
{:csv, "~> 2.4"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please use nimble_csv for CSV parsing? :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion! I've made the required change.

Comment on lines 54 to 62
records
|> Enum.map(fn x ->
x
|> List.first()
|> case do
"1" -> 0
"2" -> 1
end
end)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
records
|> Enum.map(fn x ->
x
|> List.first()
|> case do
"1" -> 0
"2" -> 1
end
end)
Enum.map(records, fn
["1" | _] -> 0
["2" | _] -> 1
end)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indeed much so much cleaner! I've made this change.

|> elem(1)
|> IO.binstream(:line)
|> CSV.decode!()
|> Enum.to_list()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NimbleCSV supports full text parsing, so you should consider using that instead, as it is more efficient (especially since you already have the whole binary in memory anyway).

If you want to stream, then it is best to do it from file, using File.stream or similar. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks to your suggestion, replacing CSV with Nimble CSV helped me remove this function completely 😄

Yes I had aimed to use File.stream initially, but I wasn't sure if I should be adding a function specific to a single dataset to the Utils file, which was why I adopted this method instead 😅 I've made the change now.

@josevalim
Copy link
Contributor

Thank you for the PR @goodhamgupta, I have dropped some initial comments.

@goodhamgupta
Copy link
Contributor Author

Thank you for the kind review @josevalim! I've made all the requested changes.

Copy link
Contributor

@josevalim josevalim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! @t-rutten, your call. :)

@goodhamgupta goodhamgupta changed the title Yelp Reviews Datasets Add Yelp Reviews Datasets Dec 6, 2021
README.md Outdated Show resolved Hide resolved
Copy link
Contributor

@t-rutten t-rutten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We really appreciate your PR @goodhamgupta! It looks great :) I just suggested tweaks to documentation.

goodhamgupta and others added 2 commits December 7, 2021 09:59
Fix formatting

Co-authored-by: Tom Rutten <[email protected]>
@goodhamgupta
Copy link
Contributor Author

Thanks so much for your kind review @t-rutten! 😄 I've made all the requested changes.

@t-rutten t-rutten merged commit a7d1db2 into elixir-nx:master Dec 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants