Add Yelp Reviews Datasets #20

goodhamgupta · 2021-12-04T07:03:33Z

Hi everyone,

Thanks for this great library! This PR adds training and test files for the Yelp Reviews datasets (#11 and #16). Following the Hugging Face datasets library, two datasets are added:

For the Yelp Full Reviews dataset:

The train dataset consists of 650k records with 5 labels: 1, 2, 3, 4, and 5.
The test dataset consists of 50k records with the same labels as above.
It can be queried as follows:

%{review: inputs, rating: targets} = Scidata.YelpFullReviews.download
%{review: test_inputs, rating: test_targets} = Scidata.YelpFullReviews.download_test

Similarly, for the Yelp Polarity dataset:

The train dataset consists of 560k records with 2 labels: 0 (negative) and 1 (positive).
- NOTE: In the raw dataset the labels are present as "1" and "2", with "1" covering all reviews with a 1 or 2 star rating, and "2" covering all reviews with a 3 or 4 star rating. (They are converted to 0 and 1 based on the Hugging Face implementation here.)
The test dataset consists of 38k records with the same labels as above.
It can be queried as follows:

%{review: inputs, sentiment: targets} = Scidata.YelpPolarityReviews.download
%{review: test_inputs, sentiment: test_targets} = Scidata.YelpPolarityReviews.download_test

Also, for downloading the datasets, I'm currently using the URLs provided by Fast AI. Do let me know if you would like me to change them.

Thanks!
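As a rough illustration of consuming the maps described above, the sketch below destructures a map of the same shape and checks the label set. It deliberately avoids the real download: `sample` is a hypothetical hand-written stand-in, and it assumes both values are plain parallel lists, which this thread does not confirm.

```elixir
# Hypothetical stand-in for the map returned by
# Scidata.YelpPolarityReviews.download; the real call fetches the dataset,
# so a tiny hand-written sample is used here instead.
sample = %{
  review: ["Great food and friendly staff.", "Cold fries, slow service.", "Would come back."],
  sentiment: [1, 0, 1]
}

%{review: inputs, sentiment: targets} = sample

# Assumption: reviews and labels are parallel lists of equal length.
true = length(inputs) == length(targets)

# Count how many examples fall under each label.
label_counts = Enum.frequencies(targets)
IO.inspect(label_counts, label: "label counts")

# Polarity targets should only ever contain 0 and 1.
unique_labels = targets |> Enum.uniq() |> Enum.sort()
IO.inspect(unique_labels, label: "unique labels")
```

The same pattern applies to the full reviews dataset by matching on `rating:` instead of `sentiment:`.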

josevalim · 2021-12-04T11:59:24Z

mix.exs

@@ -30,7 +30,8 @@ defmodule Scidata.MixProject do

  defp deps do
    [
-      {:ex_doc, ">= 0.24.0", only: :dev, runtime: false}
+      {:ex_doc, ">= 0.24.0", only: :dev, runtime: false},
+      {:csv, "~> 2.4"}


Can you please use nimble_csv for CSV parsing? :)

Thanks for the suggestion! I've made the required change.

josevalim · 2021-12-04T12:00:20Z

lib/scidata/yelp_polarity_reviews.ex

+    records
+    |> Enum.map(fn x ->
+      x
+      |> List.first()
+      |> case do
+        "1" -> 0
+        "2" -> 1
+      end
+    end)


This is indeed so much cleaner! I've made this change.
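The reviewer's suggested replacement is not preserved in this thread. As a sketch of one way the label mapping in the hunk above might be tightened (not necessarily what was merged), function-head pattern matching can replace the `List.first/1` plus `case` pipeline. The module and function names here are hypothetical.

```elixir
defmodule LabelSketch do
  # Hypothetical helper, not taken from the PR: maps the raw polarity
  # labels "1" and "2" to the integer classes 0 and 1.
  def to_class("1"), do: 0
  def to_class("2"), do: 1

  # Assumes each record is a list whose first element is the raw label,
  # as in the hunk above.
  def targets(records) do
    Enum.map(records, fn [label | _rest] -> to_class(label) end)
  end
end

# Hand-written sample records for illustration only.
records = [["1", "Terrible."], ["2", "Loved it."], ["1", "Never again."]]
result = LabelSketch.targets(records)
IO.inspect(result)
```

As with the original `case`, an unexpected label raises instead of being silently mapped, which keeps malformed rows visible.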

josevalim · 2021-12-04T12:03:11Z

lib/scidata/yelp_polarity_reviews.ex

+    |> elem(1)
+    |> IO.binstream(:line)
+    |> CSV.decode!()
+    |> Enum.to_list()


NimbleCSV supports full text parsing, so you should consider using that instead, as it is more efficient (especially since you already have the whole binary in memory anyway).

If you want to stream, then it is best to do it from a file, using File.stream! or similar. :)

Thanks to your suggestion, replacing CSV with NimbleCSV let me remove this function completely 😄

Yes, I had aimed to use File.stream! initially, but I wasn't sure whether I should add a function specific to a single dataset to the Utils file, which is why I adopted this method instead 😅 I've made the change now.
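To make the two options in this thread concrete, here is a minimal sketch contrasting whole-binary parsing with streaming from a file. It assumes the `nimble_csv` package is installable via `Mix.install/1` (Elixir 1.12 or later) and uses a hypothetical two-row sample with the label in the first column; the second column being the review text is an assumption, and none of this is the PR's actual code.

```elixir
# Assumes network access so Mix.install/1 can fetch nimble_csv.
Mix.install([{:nimble_csv, "~> 1.1"}])

alias NimbleCSV.RFC4180, as: CSV

# Hypothetical sample: raw label first, review text second.
data = "\"1\",\"Cold fries, slow service.\"\n\"2\",\"Great food.\"\n"

# Whole binary already in memory: parse it in a single call.
# skip_headers: false because this sample has no header row.
rows = CSV.parse_string(data, skip_headers: false)
IO.inspect(rows)

# Data on disk: stream it line by line instead of loading it all at once.
path = Path.join(System.tmp_dir!(), "yelp_sketch.csv")
File.write!(path, data)

streamed =
  path
  |> File.stream!()
  |> CSV.parse_stream(skip_headers: false)
  |> Enum.to_list()

File.rm!(path)

# Both approaches should yield the same rows.
true = rows == streamed
```

The quoted comma in the first review stays inside a single field in both cases, which is the main thing a real CSV parser buys over splitting on commas.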

josevalim · 2021-12-04T12:03:27Z

Thank you for the PR @goodhamgupta, I have dropped some initial comments.

goodhamgupta · 2021-12-04T18:02:55Z

Thank you for the kind review @josevalim! I've made all the requested changes.

josevalim

LGTM! @t-rutten, your call. :)

README.md

lib/scidata/yelp_polarity_reviews.ex

t-rutten

We really appreciate your PR @goodhamgupta! It looks great :) I just suggested tweaks to documentation.

goodhamgupta · 2021-12-07T02:04:02Z

Thanks so much for your kind review @t-rutten! 😄 I've made all the requested changes.

goodhamgupta added 4 commits December 4, 2021 12:45

Add Yelp Polarity Reviews dataset

605881e

Add unit test for yelp polarity reviews dataset

564caf1

Add support for yelp full reviews dataset

c12d0bb

Add assertions for unique values in targets

fe29230

goodhamgupta marked this pull request as draft December 4, 2021 07:04

goodhamgupta marked this pull request as ready for review December 4, 2021 07:05

goodhamgupta marked this pull request as draft December 4, 2021 07:07

goodhamgupta marked this pull request as ready for review December 4, 2021 07:07

Update README.md

c82b64e

josevalim reviewed Dec 4, 2021

goodhamgupta added 3 commits December 5, 2021 01:51

Replace CSV with Nimble CSV and address comments

fa23abb

Fix specs

5654824

Remove CSV deps from mix.lock

44c6076

josevalim approved these changes Dec 4, 2021

goodhamgupta mentioned this pull request Dec 6, 2021

Add Kuzushiji MNIST dataset #22

Merged

goodhamgupta changed the title ~~Yelp Reviews Datasets~~ Add Yelp Reviews Datasets Dec 6, 2021

t-rutten reviewed Dec 6, 2021

README.md (outdated)

t-rutten reviewed Dec 6, 2021

lib/scidata/yelp_polarity_reviews.ex (outdated)

t-rutten reviewed Dec 6, 2021

goodhamgupta and others added 2 commits December 7, 2021 09:59

Update README.md

d7b8bdd

Fix formatting Co-authored-by: Tom Rutten <[email protected]>

Update url for Yelp reviews dataset

00823c2

Co-authored-by: Tom Rutten <[email protected]>

t-rutten approved these changes Dec 7, 2021

t-rutten merged commit a7d1db2 into elixir-nx:master Dec 7, 2021
