Add Yelp Reviews Datasets #20
Conversation
mix.exs
Outdated
@@ -30,7 +30,8 @@ defmodule Scidata.MixProject do

   defp deps do
     [
-      {:ex_doc, ">= 0.24.0", only: :dev, runtime: false}
+      {:ex_doc, ">= 0.24.0", only: :dev, runtime: false},
+      {:csv, "~> 2.4"}
Can you please use nimble_csv for CSV parsing? :)
Thanks for the suggestion! I've made the required change.
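For context, after switching to nimble_csv the deps entry in mix.exs would presumably look something like the line below; the exact version requirement is not visible in this thread, so it is only an assumption:

{:nimble_csv, "~> 1.1"}  # assumed version constraint; the PR's actual requirement may differ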
lib/scidata/yelp_polarity_reviews.ex
Outdated
records
|> Enum.map(fn x ->
  x
  |> List.first()
  |> case do
    "1" -> 0
    "2" -> 1
  end
end)
Suggested change (replacing the pipeline above):

Enum.map(records, fn
  ["1" | _] -> 0
  ["2" | _] -> 1
end)
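(The multi-clause anonymous function pattern-matches directly on the head of each record, so the List.first/1 call and the nested case are no longer needed.)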
This is indeed so much cleaner! I've made this change.
lib/scidata/yelp_polarity_reviews.ex
Outdated
|> elem(1)
|> IO.binstream(:line)
|> CSV.decode!()
|> Enum.to_list()
NimbleCSV supports full text parsing, so you should consider using that instead, as it is more efficient (especially since you already have the whole binary in memory anyway). If you want to stream, then it is best to do it from a file, using File.stream or similar. :)
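To illustrate the full-text route, here is a minimal sketch (not code from this PR) that assumes the default NimbleCSV.RFC4180 parser matches the dataset's CSV dialect; the sample rows are made up:

alias NimbleCSV.RFC4180, as: CSVParser

# Two example "label","review" rows in the style of the Yelp CSV files.
csv_binary = """
"1","Crust is not good."
"2","Loved it!"
"""

# parse_string/2 parses a binary that is already in memory; skip_headers: false
# keeps the first row, since these files have no header line.
CSVParser.parse_string(csv_binary, skip_headers: false)
#=> [["1", "Crust is not good."], ["2", "Loved it!"]]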
Thanks to your suggestion, replacing CSV with NimbleCSV helped me remove this function completely 😄 Yes, I had aimed to use File.stream initially, but I wasn't sure if I should be adding a function specific to a single dataset to the Utils file, which was why I adopted this method instead 😅 I've made the change now.
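For reference, the file-based streaming route mentioned above would look roughly like the sketch below; the extracted file path is hypothetical and the RFC4180 parser is again an assumption:

alias NimbleCSV.RFC4180, as: CSVParser

# Stream the extracted CSV line by line instead of loading it all into memory.
"yelp_review_polarity_csv/train.csv"  # hypothetical local path
|> File.stream!()
|> CSVParser.parse_stream(skip_headers: false)
|> Stream.map(fn
  ["1" | _] -> 0
  ["2" | _] -> 1
end)
|> Enum.take(5)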
Thank you for the PR @goodhamgupta, I have dropped some initial comments.
Thank you for the kind review @josevalim! I've made all the requested changes.
LGTM! @t-rutten, your call. :)
We really appreciate your PR @goodhamgupta! It looks great :) I just suggested tweaks to documentation.
Fix formatting
Co-authored-by: Tom Rutten <[email protected]>
Co-authored-by: Tom Rutten <[email protected]>
Thanks so much for your kind review @t-rutten! 😄 I've made all the requested changes.
Hi everyone,
Thanks for this great library! This PR aims to add training and test files for the Yelp Reviews datasets (#11 and #16). Based on the Hugging Face datasets library, the two datasets added are:
For the Yelp Full Reviews dataset:
Similarly, for the Yelp Polarity dataset:
The train dataset consists of 560k records with a total of 2 labels: 0 (negative) and 1 (positive).
The test dataset consists of 38k records with the same number of labels as above.
It can be queried as follows:
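(The original query snippet is not reproduced here; the calls below are a sketch that assumes the module added in this PR exposes Scidata's usual download/0 and download_test/0 entry points.)

# Assumed entry points, following the Scidata.<Dataset>.download/0 convention
# used elsewhere in the library; the actual API in the PR may differ.
train = Scidata.YelpPolarityReviews.download()
test = Scidata.YelpPolarityReviews.download_test()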
Also, for downloading the datasets, I'm currently using the URLs provided by Fast AI. Do let me know if you would like me to change them.
Thanks!