
Fixing Imbalanced Data #1

Open
ArmandGiraud opened this issue May 25, 2017 · 12 comments


@ArmandGiraud

The NER corpus includes many more 'O' labels than entity labels.
How can we fix this with Keras?
I tried sample_weight to adjust the loss function during training, but it does not appear to fix the problem fully. What would you suggest?
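For reference, this is roughly what I tried (a minimal sketch in Keras-2 style, assuming a padded sequence-labelling setup; the shapes and the assumption that tag id 0 is 'O' are mine, not from the repo):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, TimeDistributed, Dense

vocab_size, n_tags, max_len = 5000, 10, 50
X = np.random.randint(1, vocab_size, size=(32, max_len))  # padded token ids
y = np.random.randint(0, n_tags, size=(32, max_len))      # one tag id per token
y_onehot = np.eye(n_tags)[y]                               # (32, max_len, n_tags)

model = Sequential([
    Embedding(vocab_size, 64, input_length=max_len),
    LSTM(64, return_sequences=True),
    TimeDistributed(Dense(n_tags, activation='softmax')),
])
# 'temporal' makes fit() accept one weight per token instead of one per sentence.
model.compile(optimizer='adam', loss='categorical_crossentropy',
              sample_weight_mode='temporal')

# Down-weight 'O' tokens (here assumed to be tag id 0) so entity errors dominate.
weights = np.where(y == 0, 0.1, 1.0)
model.fit(X, y_onehot, sample_weight=weights, epochs=1)
```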
Thx

@pandeydivesh15
Owner

In the case of the Hindi data, there are certainly many 'O' entries. Fixing this completely is not really possible, as we would have to go through the entire dataset, or create a new one (an extreme task). We can only apply some heuristics, like using only sentences that contain a certain minimum number of named entities, or only sentences with max_len <= threshold, etc.
I don't understand what you mean by fixing this with Keras. Can you explain more?
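Something like this rough sketch is what I mean (here `sentences` is assumed to be a list of (token, tag) pair lists, which is not exactly how the repo stores the data):

```python
MIN_ENTITIES = 2   # keep only sentences with at least this many non-'O' tags
MAX_LEN = 60       # drop very long sentences

def keep(sentence):
    """sentence: a list of (token, tag) pairs."""
    n_entities = sum(1 for _, tag in sentence if tag != 'O')
    return n_entities >= MIN_ENTITIES and len(sentence) <= MAX_LEN

filtered = [s for s in sentences if keep(s)]
```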

@ArmandGiraud
Author

Actually, I was unclear: when I try to train the model on the English CoNLL dataset, the classifier only predicts the 'O' label, which yields a high accuracy (around 97%).
[screenshot: class imbalance evidence]

Maybe I'm just doing something wrong, but I don't see what.
I have already encountered class imbalance in other ML problems, but I'm wondering if there is a preferred solution for NER.
There are many ways of addressing it, such as oversampling, undersampling, or SMOTE, or some options within Keras, such as setting class weights in the loss function.
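For the class-weight route, something like this is what I have in mind (plain numpy, assuming `y` is an integer (n_sentences, max_len) tag-id array; the inverse-frequency formula mirrors sklearn's 'balanced' heuristic):

```python
import numpy as np

def temporal_weights(y, smooth=1.0):
    """One weight per token, inversely proportional to its tag's frequency."""
    counts = np.bincount(y.ravel()) + smooth          # occurrences of each tag id
    class_w = counts.sum() / (len(counts) * counts)   # rare tags -> large weights
    return class_w[y]                                 # same shape as y

# weights = temporal_weights(y)
# model.fit(X, y_onehot, sample_weight=weights)  # needs sample_weight_mode='temporal'
```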

@pandeydivesh15
Owner

That image suggests that you are surely doing something wrong.
What did the output look like while you were training the model with Keras? Was the validation accuracy increasing steadily (at a reasonable rate) and the loss decreasing at a good rate?
For handling class imbalance, you can do something like I described in my previous comment.

@ArmandGiraud
Author

I tried to run the script with the default settings, as found in english_NER.ipynb.
The accuracy (and log loss) is stuck at 97.3% from the first epoch.
I'm trying to figure out what is going wrong.
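One sanity check I'm running (a sketch with sklearn; `y_true`/`y_pred` are flattened tag ids with padding stripped, which may not match the notebook exactly): if the accuracy equals the fraction of 'O' tokens, the model is just predicting the majority class.

```python
import numpy as np
from sklearn.metrics import classification_report

# y_true, y_pred: 1-D arrays of integer tag ids, padding already removed.
# Assumes tag id 0 is 'O'.
print("fraction of 'O' tokens:", np.mean(y_true == 0))   # ~0.97 here
print(classification_report(y_true, y_pred))              # per-tag precision/recall/F1
```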

@pandeydivesh15
Owner

Sorry for the late reply.
Were you also getting a very low loss (on the order of negative powers of 10) and NaN values during training?

@ArmandGiraud
Author

Hello Divesh,
I get a very low loss from the first epoch; I've attached a capture of the training logs:
[screenshot: training logs from ner_with_deep_learning]

The only thing I changed was adding a few parentheses to the print functions, since I'm running your scripts with Python 3. Maybe I'm also using different versions of Keras/TensorFlow: I have keras 2.0.0 and tensorflow 1.0.1 installed on Windows 64-bit. Which versions did you use initially?
Thanks for helping

@pandeydivesh15
Owner

The problem is the version numbers. I should have added a requirements.txt.
I used Keras==1.2.1 and tensorflow-gpu==0.12.1. Though I had TensorFlow with GPU support, you can avoid that by installing just tensorflow==0.12.1. Try this in a new env and let me know.
About using Python 3: some problems can occur while handling unicode, but in our case the chances are low.
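For the record, the pins would look like this (reconstructed from memory, so treat it as an assumption rather than a tested requirements.txt):

```
# requirements.txt
Keras==1.2.1
tensorflow==0.12.1
```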

@jenniferzhu

I got a similar issue where everything is predicted as 'O' for the English dataset, but my case is even worse: the losses are all NaN from the beginning. I will try to match the versions of Keras and TensorFlow. Do you have any other advice on this issue? Thanks.

@jenniferzhu

A follow-up on that: matching the versions of TensorFlow and Keras does not seem to solve my loss: nan issue. I am wondering if this is due to GPU vs CPU?

@jenniferzhu

It turns out that I now get the same result as @ArmandGiraud. @pandeydivesh15, what was your accuracy?

@pandeydivesh15
Owner

I trained one model just now. Output in my case:
[screenshot: training output]

@sayantanbbb

sayantanbbb commented Jan 7, 2022

Still having this issue.
