Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IMDB Reviews dataset #12

Merged
merged 15 commits into from
Sep 17, 2021
Merged

IMDB Reviews dataset #12

merged 15 commits into from
Sep 17, 2021

Conversation

t-rutten
Copy link
Contributor

@t-rutten t-rutten commented May 5, 2021

These changes add training and test sets for the IMDB reviews dataset related to #11. Train and test sets each contain 25000 labeled examples; each example includes a textual review of a movie and a binary label categorizing the example as having positive sentiment (1) or negative sentiment (0).

The dataset can be queried like:

{inputs, targets} = Scidata.IMDBReviews.download
{test_inputs, test_targets} = Scidata.IMDBReviews.download_test

Questions and Discussion

  • The source dataset also includes 50000 unlabeled examples useful for unsupervised learning. Do we want to add that set to the API as well?
  • The return tuple for download/1 and download_test/1 is two lists, but we could return raw binaries instead (where inputs might be separated by newline character).
  • We don't provide a type or a shape for the return values--there's no tensor type associated with the data, and the "shape" of each example input is variable because reviews have variable length. Is there type that seems sensible to provide in the return tuple? The shape we might specify as {25000, nil} for inputs and {25000} for labels.
  • We could make return labels and datasets as streams for faster initial load and to avoid keeping it in memory, but the dataset is not massive so I'm unsure if it's worth it.
  • It seems some codepoints aren't UTF-8 encoded and thus not "printable". This leads to some examples like
[..., 
   "...The BBC version is so superior it's not even funny and everything about this version is an insult to its memory. In short if you must see it be sure you have read the book first or seen the BBC version other wise you will be lead done the deluded road that this is what it's like, which its not!",
   <<75, 97, 122, 117, 111, 32, 75, 111, 109, 105, 122, 117, 32, 115, 116, 114,
     105, 107, 101, 115, 32, 97, 103, 97, 105, 110, 32, 119, 105, 116, 104, 32,
     34, 69, 110, 116, 114, 97, 105, 108, 115, ...>>,
   "Daniel Percival's \"Dirty War\", a BBC production made for television was shown recently on cable. The film has a documentary style in the way it goes after the people that caused the near holocaust in one of the big metropolis of the world, London...",
...]

We probably don't need to address this—users can clean and tokenize bitstrings of each entry as desired—but I wanted to flag. The experience on TFDS is similar.

@seanmor5
Copy link
Contributor

seanmor5 commented May 6, 2021

@t-rutten I think I addressed most of the discussion points in the issue I just opened. As far as different sequence lengths, we'll have to pad to max length or truncate to a fixed length to batch as we don't support ragged tensors. I think leaving everything as a list is okay for now.

@t-rutten
Copy link
Contributor Author

t-rutten commented May 6, 2021

Great @seanmor5, I'll have a look.

file_match?(fname, dataset_type, :neg)
end)

:rand.seed(:exsss, {101, 102, 103})
Copy link
Contributor

@josevalim josevalim May 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are hardcoding a seed, then shuffle is deterministic. Is that what we really want? Also, maybe we should let the users shuffle?

Also, should we let positive and negative be an argument... or not prefilter by postive and negative? I think we should try to avoid doing passes over the data, as that can be expensive?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deterministic shuffling is probably not what we want, and if we shuffle the dataset then ideally the user should pass a seed (seed can be an opt in a future dataset pipeline #13). I like your suggestion to let the user shuffle for now, let's go with that.

If a user wants only labeled examples for supervised learning and we don't filter the files in the positive and negative directories, then they'll load 50000 examples that'll be unused. I think it would be reasonable to let users specify which kinds of examples they want (positive, negative, unlabeled) in an argument--I'll add that.

Copy link
Contributor

@josevalim josevalim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dropped one comment on the Elixir code :) Looking great!

@t-rutten t-rutten marked this pull request as ready for review May 11, 2021 14:05
@t-rutten
Copy link
Contributor Author

t-rutten commented Jun 8, 2021

@seanmor5 do you still think the list version of the data set here is sufficient, or should we change to a truncated or padded tensor instead?

@josevalim
Copy link
Contributor

According to our discussions on Slack, some datasets may not be suitable for tensors, therefore we should have a different return type. I think this one could be suitable for dataframes, so we could return the data in this format:

%{
  review: [review1, review2, review3, ...]
  sentiment: [1, -1, 0, ...]
}

All datasets should likely live in a flat namespace and we will use ExDoc grouping to categorize them and also use specs to clarify the return type (series, tensors, etc).

@t-rutten
Copy link
Contributor Author

Thanks @josevalim! I will add changes to the returned data.

& adapt tests to new return type
@t-rutten
Copy link
Contributor Author

@josevalim I've changed the return types and added specs for download/2 and download_test/2. Thanks for reviewing!

@t-rutten
Copy link
Contributor Author

t-rutten commented Sep 1, 2021

@josevalim let me know if I missed anything!

  • download and download_test now have specs which show argument and return types without opts
  • opts spec is more explicit

"""
@spec download([train_sentiment]) :: %{review: [binary(), ...], sentiment: 1 | -1}
def download(
example_types \\ [:pos, :neg],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should make the example types an option to simplify the API? 🤔 For example:

download(example_types: [:pos])

Copy link
Contributor Author

@t-rutten t-rutten Sep 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simpler API is good! Something like this?

  @spec download(example_types: [test_sentiment]) :: %{review: [binary(), ...], sentiment: 1 | 0}
  def download(
        opts \\ [example_types: [:pos, :neg]]
      ) do
    {example_types, opts} = Keyword.pop(opts, :example_types)
    download_dataset(example_types, :train, opts)
  end

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! Although I think you can handle the options inside download_dataset:

download_dataset(:train, opts)

And in there:

example_types = opts[:example_types] || [:pos, :neg]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@josevalim good call, I've added those changes.

Copy link
Contributor

@josevalim josevalim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I drop another round of reviews, sorry for the back and forth, but I should have probably dropped this one the first time. :)

@t-rutten t-rutten merged commit b122ec6 into master Sep 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants