-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IMDB Reviews dataset #12
Conversation
@t-rutten I think I addressed most of the discussion points in the issue I just opened. As far as different sequence lengths, we'll have to pad to max length or truncate to a fixed length to batch as we don't support ragged tensors. I think leaving everything as a list is okay for now. |
Great @seanmor5, I'll have a look. |
lib/scidata/imdb_reviews.ex
Outdated
file_match?(fname, dataset_type, :neg) | ||
end) | ||
|
||
:rand.seed(:exsss, {101, 102, 103}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are hardcoding a seed, then shuffle is deterministic. Is that what we really want? Also, maybe we should let the users shuffle?
Also, should we let positive and negative be an argument... or not prefilter by postive and negative? I think we should try to avoid doing passes over the data, as that can be expensive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deterministic shuffling is probably not what we want, and if we shuffle the dataset then ideally the user should pass a seed (seed can be an opt in a future dataset pipeline #13). I like your suggestion to let the user shuffle for now, let's go with that.
If a user wants only labeled examples for supervised learning and we don't filter the files in the positive and negative directories, then they'll load 50000 examples that'll be unused. I think it would be reasonable to let users specify which kinds of examples they want (positive, negative, unlabeled) in an argument--I'll add that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dropped one comment on the Elixir code :) Looking great!
Co-authored-by: José Valim <[email protected]>
@seanmor5 do you still think the list version of the data set here is sufficient, or should we change to a truncated or padded tensor instead? |
According to our discussions on Slack, some datasets may not be suitable for tensors, therefore we should have a different return type. I think this one could be suitable for dataframes, so we could return the data in this format: %{
review: [review1, review2, review3, ...]
sentiment: [1, -1, 0, ...]
} All datasets should likely live in a flat namespace and we will use ExDoc grouping to categorize them and also use specs to clarify the return type (series, tensors, etc). |
Thanks @josevalim! I will add changes to the returned data. |
& adapt tests to new return type
@josevalim I've changed the return types and added specs for |
@josevalim let me know if I missed anything!
|
lib/scidata/imdb_reviews.ex
Outdated
""" | ||
@spec download([train_sentiment]) :: %{review: [binary(), ...], sentiment: 1 | -1} | ||
def download( | ||
example_types \\ [:pos, :neg], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should make the example types an option to simplify the API? 🤔 For example:
download(example_types: [:pos])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simpler API is good! Something like this?
@spec download(example_types: [test_sentiment]) :: %{review: [binary(), ...], sentiment: 1 | 0}
def download(
opts \\ [example_types: [:pos, :neg]]
) do
{example_types, opts} = Keyword.pop(opts, :example_types)
download_dataset(example_types, :train, opts)
end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Although I think you can handle the options inside download_dataset:
download_dataset(:train, opts)
And in there:
example_types = opts[:example_types] || [:pos, :neg]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@josevalim good call, I've added those changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I drop another round of reviews, sorry for the back and forth, but I should have probably dropped this one the first time. :)
Co-authored-by: José Valim <[email protected]>
Co-authored-by: José Valim <[email protected]>
These changes add training and test sets for the IMDB reviews dataset related to #11. Train and test sets each contain 25000 labeled examples; each example includes a textual review of a movie and a binary label categorizing the example as having positive sentiment (1) or negative sentiment (0).
The dataset can be queried like:
Questions and Discussion
download/1
anddownload_test/1
is two lists, but we could return raw binaries instead (where inputs might be separated by newline character).{25000, nil}
for inputs and{25000}
for labels.We probably don't need to address this—users can clean and tokenize bitstrings of each entry as desired—but I wanted to flag. The experience on TFDS is similar.