IMDB Reviews dataset #12

t-rutten · 2021-05-05T22:52:44Z

These changes add training and test sets for the IMDB reviews dataset related to #11. Train and test sets each contain 25000 labeled examples; each example includes a textual review of a movie and a binary label categorizing the example as having positive sentiment (1) or negative sentiment (0).

The dataset can be queried like:

{inputs, targets} = Scidata.IMDBReviews.download
{test_inputs, test_targets} = Scidata.IMDBReviews.download_test

Questions and Discussion

The source dataset also includes 50000 unlabeled examples useful for unsupervised learning. Do we want to add that set to the API as well?
The return tuple for download/1 and download_test/1 is two lists, but we could return raw binaries instead (where inputs might be separated by newline character).
We don't provide a type or a shape for the return values--there's no tensor type associated with the data, and the "shape" of each example input is variable because reviews have variable length. Is there a type that seems sensible to provide in the return tuple? We might specify the shape as {25000, nil} for inputs and {25000} for labels.
We could return labels and datasets as streams for a faster initial load and to avoid keeping everything in memory, but the dataset is not massive, so I'm unsure it's worth it.
It seems some entries contain bytes that aren't valid UTF-8 and are thus not "printable", so they are displayed as raw binaries. This leads to some examples like:

[..., 
   "...The BBC version is so superior it's not even funny and everything about this version is an insult to its memory. In short if you must see it be sure you have read the book first or seen the BBC version other wise you will be lead done the deluded road that this is what it's like, which its not!",
   <<75, 97, 122, 117, 111, 32, 75, 111, 109, 105, 122, 117, 32, 115, 116, 114,
     105, 107, 101, 115, 32, 97, 103, 97, 105, 110, 32, 119, 105, 116, 104, 32,
     34, 69, 110, 116, 114, 97, 105, 108, 115, ...>>,
   "Daniel Percival's \"Dirty War\", a BBC production made for television was shown recently on cable. The film has a documentary style in the way it goes after the people that caused the near holocaust in one of the big metropolis of the world, London...",
...]

We probably don't need to address this, since users can clean and tokenize the bitstrings of each entry as desired, but I wanted to flag it. The experience on TFDS is similar.
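For readers who do want to clean such entries, here is a minimal sketch of one way to do it. It assumes the raw-binary entries contain bytes that are not valid UTF-8; `ReviewCleaner` and `scrub/2` are hypothetical names and not part of Scidata, and the sample binary is made up for illustration.

```elixir
defmodule ReviewCleaner do
  # Hypothetical helper, not part of Scidata.
  # Walks the binary and replaces every byte that does not begin a valid
  # UTF-8 codepoint with `replacement`.
  def scrub(binary, replacement \\ "?") when is_binary(binary) do
    do_scrub(binary, replacement, [])
  end

  # End of input: the accumulator was built in reverse, so flip and join it.
  defp do_scrub(<<>>, _replacement, acc) do
    acc |> Enum.reverse() |> IO.iodata_to_binary()
  end

  # A valid UTF-8 codepoint is kept unchanged.
  defp do_scrub(<<codepoint::utf8, rest::binary>>, replacement, acc) do
    do_scrub(rest, replacement, [<<codepoint::utf8>> | acc])
  end

  # Anything else is a single invalid byte: substitute it and continue.
  defp do_scrub(<<_invalid, rest::binary>>, replacement, acc) do
    do_scrub(rest, replacement, [replacement | acc])
  end
end

# Made-up sample: "caf" followed by a lone Latin-1 byte (0xE9).
raw = <<"caf", 0xE9>>
IO.inspect(String.valid?(raw))
IO.inspect(ReviewCleaner.scrub(raw))
```

`String.valid?/1` can be used first to find which entries need cleaning, so that already-valid reviews are left alone.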

seanmor5 · 2021-05-06T00:42:01Z

@t-rutten I think I addressed most of the discussion points in the issue I just opened. As for the different sequence lengths, we'll have to pad to the max length or truncate to a fixed length in order to batch, since we don't support ragged tensors. I think leaving everything as a list is okay for now.
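As a rough illustration of the pad-or-truncate step mentioned above, here is a minimal sketch that works on already tokenized reviews (lists of integers). The tokenization itself, the module name `SequencePadding`, and the pad value `0` are assumptions for the example and not part of this PR.

```elixir
defmodule SequencePadding do
  # Hypothetical helper, not part of Scidata.
  # Truncates `tokens` to at most `max_len` items, then right-pads with
  # `pad` so every result has exactly `max_len` items.
  def pad_or_truncate(tokens, max_len, pad \\ 0)
      when is_list(tokens) and is_integer(max_len) and max_len >= 0 do
    truncated = Enum.take(tokens, max_len)
    truncated ++ List.duplicate(pad, max_len - length(truncated))
  end
end

# Two made-up token sequences of different lengths.
batch = [[12, 7, 3], [5, 9, 1, 4, 8, 2]]
IO.inspect(Enum.map(batch, &SequencePadding.pad_or_truncate(&1, 4)))
# prints [[12, 7, 3, 0], [5, 9, 1, 4]]
```

Once every sequence has the same length, the nested list has a regular shape and could be turned into a tensor for batching.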

t-rutten · 2021-05-06T02:13:12Z

Great @seanmor5, I'll have a look.

josevalim · 2021-05-09T10:09:59Z

lib/scidata/imdb_reviews.ex

+        file_match?(fname, dataset_type, :neg)
+      end)
+
+    :rand.seed(:exsss, {101, 102, 103})


If we are hardcoding a seed, then shuffle is deterministic. Is that what we really want? Also, maybe we should let the users shuffle?

Also, should we let positive and negative be an argument... or not prefilter by positive and negative? I think we should try to avoid doing extra passes over the data, as that can be expensive.

t-rutten · 2021-05-09

Deterministic shuffling is probably not what we want, and if we shuffle the dataset then ideally the user should pass a seed (seed can be an opt in a future dataset pipeline #13). I like your suggestion to let the user shuffle for now, let's go with that.

If a user wants only labeled examples for supervised learning and we don't filter the files in the positive and negative directories, then they'll load 50000 examples that'll be unused. I think it would be reasonable to let users specify which kinds of examples they want (positive, negative, unlabeled) in an argument--I'll add that.
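If shuffling is left to the user, it could look roughly like the sketch below. It assumes the two-list return shape discussed at this point in the thread; `UserShuffle` is a hypothetical name, and the seed is chosen by the caller rather than hardcoded in the library.

```elixir
defmodule UserShuffle do
  # Hypothetical helper, not part of Scidata.
  # Shuffles inputs and targets together so each review keeps its label.
  # Note: this seeds the calling process's :rand state as a side effect.
  def shuffle(inputs, targets, seed) do
    :rand.seed(:exsss, seed)

    inputs
    |> Enum.zip(targets)
    |> Enum.shuffle()
    |> Enum.unzip()
  end
end

# Made-up stand-ins for downloaded reviews and labels.
inputs = ["review a", "review b", "review c", "review d"]
targets = [1, 0, 1, 0]

{shuffled_inputs, shuffled_targets} = UserShuffle.shuffle(inputs, targets, {1, 2, 3})
IO.inspect(Enum.zip(shuffled_inputs, shuffled_targets))
```

Calling it again with the same seed reproduces the same order, while a different seed gives a different permutation, which keeps reproducibility a user decision.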

josevalim

I dropped one comment on the Elixir code :) Looking great!

t-rutten · 2021-06-08T22:31:40Z

@seanmor5 do you still think the list version of the data set here is sufficient, or should we change to a truncated or padded tensor instead?

josevalim · 2021-08-20T16:41:35Z

According to our discussions on Slack, some datasets may not be suitable for tensors, therefore we should have a different return type. I think this one could be suitable for dataframes, so we could return the data in this format:

%{
  review: [review1, review2, review3, ...],
  sentiment: [1, -1, 0, ...]
}

All datasets should likely live in a flat namespace and we will use ExDoc grouping to categorize them and also use specs to clarify the return type (series, tensors, etc).
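To make the column-oriented shape concrete, here is a small sketch of how a consumer might work with such a map. The sample reviews are made up, `ReviewFrame` is a hypothetical helper, and the labels use 1 and 0 as in the later commit that changes negative examples from -1 to 0.

```elixir
defmodule ReviewFrame do
  # Hypothetical helper, not part of Scidata.
  # Returns the reviews whose sentiment equals `label`. The two columns are
  # zipped into rows, and the pinned pattern skips rows with other labels.
  def reviews_with(%{review: reviews, sentiment: sentiments}, label) do
    for {review, ^label} <- Enum.zip(reviews, sentiments), do: review
  end
end

# Made-up data in the %{review: [...], sentiment: [...]} shape.
data = %{
  review: ["great film", "dull plot", "loved it"],
  sentiment: [1, 0, 1]
}

IO.inspect(ReviewFrame.reviews_with(data, 1))
# prints ["great film", "loved it"]
```

Because both columns are plain lists of equal length, the same map can also be handed to a dataframe library or zipped into rows as shown here.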

t-rutten · 2021-08-23T16:41:56Z

Thanks @josevalim! I will add changes to the returned data.

t-rutten · 2021-08-27T04:01:08Z

@josevalim I've changed the return types and added specs for download/2 and download_test/2. Thanks for reviewing!

t-rutten · 2021-09-01T03:01:26Z

@josevalim let me know if I missed anything!

download and download_test now have specs which show argument and return types without opts
opts spec is more explicit

josevalim · 2021-09-01T13:49:10Z

lib/scidata/imdb_reviews.ex

+  """
+  @spec download([train_sentiment]) :: %{review: [binary(), ...], sentiment: 1 | -1}
+  def download(
+        example_types \\ [:pos, :neg],


Maybe we should make the example types an option to simplify the API? 🤔 For example:

download(example_types: [:pos])

t-rutten · 2021-09-03

Simpler API is good! Something like this?

@spec download(example_types: [test_sentiment]) :: %{review: [binary(), ...], sentiment: 1 | 0}
def download(opts \\ [example_types: [:pos, :neg]]) do
  {example_types, opts} = Keyword.pop(opts, :example_types)
  download_dataset(example_types, :train, opts)
end

josevalim · 2021-09-03

Yes! Although I think you can handle the options inside download_dataset:

download_dataset(:train, opts)

And in there:

example_types = opts[:example_types] || [:pos, :neg]
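Put together, the suggestion above might look like the following sketch. `OptsSketch` is a hypothetical stand-in and not the merged Scidata code: `download_dataset/2` here only returns the resolved arguments instead of fetching anything, so that the option handling can be seen in isolation.

```elixir
defmodule OptsSketch do
  # Hypothetical stand-in for the real module; nothing is downloaded.
  @default_example_types [:pos, :neg]

  def download(opts \\ []), do: download_dataset(:train, opts)
  def download_test(opts \\ []), do: download_dataset(:test, opts)

  # The default is resolved in one place, so it applies whenever the
  # :example_types key is absent, even if other options are passed.
  defp download_dataset(dataset_type, opts) do
    example_types = opts[:example_types] || @default_example_types
    {dataset_type, example_types}
  end
end

IO.inspect(OptsSketch.download())
# prints {:train, [:pos, :neg]}
IO.inspect(OptsSketch.download(example_types: [:pos]))
# prints {:train, [:pos]}
```

One reason to prefer this over a default argument of `[example_types: [:pos, :neg]]` is that a caller who passes any other keyword list would replace that default entirely, leaving `:example_types` unset.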

t-rutten · 2021-09-06

@josevalim good call, I've added those changes.

josevalim

I dropped another round of review comments. Sorry for the back and forth; I should probably have left these the first time. :)

download imdb reviews dataset

b6632a0

josevalim reviewed May 9, 2021

t-rutten added 2 commits May 9, 2021 18:20

remove shuffle, add arg for example types

c92e988

add basic docs

130a710

josevalim reviewed May 10, 2021

josevalim reviewed May 10, 2021

t-rutten and others added 2 commits May 10, 2021 20:49

filter & reduce in comprehension

b97e61b

Co-authored-by: José Valim

remove parenthesis

43229b1

t-rutten marked this pull request as ready for review May 11, 2021 14:05

add download tests

d45c130

change download* return type, add specs

80f8235

& adapt tests to new return type

josevalim reviewed Aug 27, 2021

josevalim reviewed Aug 27, 2021

spec download w/o opts, opts spec more explicit

da9ac9a

josevalim reviewed Sep 1, 2021

josevalim reviewed Sep 1, 2021

josevalim reviewed Sep 1, 2021

josevalim reviewed Sep 1, 2021

t-rutten and others added 4 commits September 2, 2021 18:56

commit regex once

1e1376f

Co-authored-by: José Valim

Convert to string once and use binary matching instead of regex

2d3c556

Co-authored-by: José Valim

update tests

31b8422

-1 -> 0 for negative examples

20e24d2

simplify api

24b488a

josevalim approved these changes Sep 6, 2021

t-rutten added 2 commits September 16, 2021 21:34

bump release

e316434

alphabetize

fec412b

t-rutten merged commit b122ec6 into master Sep 17, 2021

t-rutten mentioned this pull request Jan 16, 2022

Make IMDB Reviews dataset consistent #26

Closed
