Manipulating the training and test sets #11
Hi, sreevarsha. You can pass `train_idx`, `test_idx`, `label_idx`, and `unlabel_idx` when initializing the `ToolBox` object if you have your own data split setting. Each of them should be a 2D list of shape `[split_count, n_indexes]`. For example, if you have an independent training set:
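A minimal sketch of that initialization, assuming the `ToolBox` keyword names described above; the iris data and index values are only illustrative:

```python
import numpy as np
from sklearn.datasets import load_iris
from alipy import ToolBox

X, y = load_iris(return_X_y=True)

# One split (split_count = 1); each container is a 2D list of indexes
train_idx = [np.arange(0, 100)]       # your pre-defined training set
test_idx = [np.arange(100, 150)]      # your pre-defined test set
label_idx = [np.arange(0, 10)]        # initially labeled part of the training set
unlabel_idx = [np.arange(10, 100)]    # unlabeled pool inside the training set

alibox = ToolBox(X=X, y=y, query_type='AllLabels',
                 train_idx=train_idx, test_idx=test_idx,
                 label_idx=label_idx, unlabel_idx=unlabel_idx)

# get_split(round) now returns your own split instead of a random one
train, test, lab, unlab = alibox.get_split(round=0)
```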
Hi, thanks so much for your response. A couple of things: (1) I'm not splitting the data set in the code; I already have separate train and test sets, and I merely want to read them in and use them as-is for training and testing. So how do I define that when I call ToolBox? Could you shed some light on this please? Thanks so much!
So you have a labeled training set and an unlabeled test set for querying, and all you want to do is label some instances from the test set. Please try this:
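A minimal sketch of that setup; the random arrays below are stand-ins for your own labeled training data and unlabeled test data:

```python
import numpy as np
from alipy import ToolBox

# Stand-ins for your data (replace with your own arrays)
rng = np.random.RandomState(0)
X_train, y_train = rng.rand(100, 5), rng.randint(0, 2, 100)   # labeled train set
X_test = rng.rand(40, 5)                                      # unlabeled set to query

n_train, n_test = len(X_train), len(X_test)

# Stack both sets so every instance has one global index
X = np.vstack([X_train, X_test])
y = np.concatenate([y_train, np.full(n_test, -1)])  # -1 = unknown-label placeholder

train_idx = [np.arange(n_train + n_test)]
test_idx = [np.arange(n_train, n_train + n_test)]
label_idx = [np.arange(n_train)]                      # the whole train set is labeled
unlabel_idx = [np.arange(n_train, n_train + n_test)]  # query from the test set

alibox = ToolBox(X=X, y=y, query_type='AllLabels',
                 train_idx=train_idx, test_idx=test_idx,
                 label_idx=label_idx, unlabel_idx=unlabel_idx)
```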
Thanks so much, I'll try this :)
Hi, I am having a strange problem. If I update the data after choosing the query from the test set as you have shown, the accuracies I get when testing on my test set are of the order of 65%. But when I do it like this,
Hi, if you have the ground-truth labels of your unlabeled set and pass them to the query strategy (

Then maybe you did not replace the example labeling code. Besides, I think we have different definitions of the test set.

If you are labeling real data and do not have a test set for evaluating the model, there should be an initially labeled set and an unlabeled pool for querying. You can further draw a validation set out of the initially labeled set for testing (and only for testing).
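A small sketch of carving out such a validation set, assuming `label_ind` is the `IndexCollection` of initially labeled indexes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Split the initially labeled indexes; the validation part is used ONLY
# for testing the model, never for querying
lab_idx, val_idx = train_test_split(np.asarray(label_ind.index),
                                    test_size=0.2, random_state=0)
```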
Hi, I think I managed to solve this issue: I created a separate validation set from which queries are drawn, and now it works. How do I write the classifier to a file that can be loaded separately and used on another data set? Thanks!
Hi, have you tried the pickle module? You can save and load objects very easily with `pickle.dump` and `pickle.load`. If your object has some attributes that cannot be pickled, you can override the `__getstate__` and `__setstate__` methods.
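For example (a minimal sketch; `model` and `X_new` stand for your trained classifier and your new data):

```python
import pickle

# Save the trained classifier to disk
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Later, in another script: load it and apply it to a new data set
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)
predictions = model.predict(X_new)
```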
Thanks, I did, and it works. Now I have a new problem. If I choose 5 instances in one query, like this: `select_ind = unc.select(label_ind, unlab_ind, model=None, batch_size=5)`, I get an error message: `File "/home/sreejith/anaconda3/lib/python3.6/site-packages/alipy/oracle/knowledge_repository.py", line 422, in update_query`. I checked the lengths and they seem to correspond, so I am at a loss as to why this happens. Please advise, thanks!
Hi,
Hi, thanks so much! I understand now.
Hi, I am using my own model and routing it through alipy for active learning. I am choosing QueryInstanceUncertainty and specifying the entropy measure. Why, then, should I also choose a model (the default logistic regression) to select the learning examples? When I say uncertainty sampling and give a batch size, shouldn't the algorithm already know how to choose the examples? Also, when I update the training set with the queried elements from the test set, do they automatically get deleted from the test set? If not, how can I remove them? Please advise. Thank you so much!
1. The best option is to use your target model to select instances. But if you are not using a sklearn model and you don't know how to re-encapsulate it, you can use the default model for convenience, at the expense of some performance. See issue 2 and the documentation for more information. Note that uncertainty also has the function `select_by_prediction_mat`, which selects instances from a prediction matrix you supply (see the sketch below).
2. No, they will not be deleted automatically, and we do not recommend deleting them. ALiPy keeps the original data untouched and only records the indexes of the selected data. The labeled and unlabeled data can be obtained by indexing the original feature matrix, as shown below.
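A sketch of both points, reusing `X`, `y`, `label_ind`, `unlab_ind` from the earlier snippets; `my_model` stands for your own probabilistic classifier, and you should check `select_by_prediction_mat` against your ALiPy version:

```python
from alipy.query_strategy import QueryInstanceUncertainty

unc = QueryInstanceUncertainty(X, y, measure='entropy')

# Point 1: if your model already outputs class probabilities, bypass the
# default model and hand the probability matrix to the strategy directly
prob = my_model.predict_proba(X[unlab_ind.index, :])
select_ind = unc.select_by_prediction_mat(unlabel_index=unlab_ind,
                                          predict=prob, batch_size=5)

# Point 2: nothing is deleted; only the index collections change, and the
# current labeled/unlabeled data are obtained by slicing the original matrix
label_ind.update(select_ind)
unlab_ind.difference_update(select_ind)
X_labeled, y_labeled = X[label_ind.index, :], y[label_ind.index]
X_unlabeled = X[unlab_ind.index, :]
```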
Hi, thanks for your reply.
So, if I do it like this, where gc is my model (I do an initial fit before running it through alipy; it is a combination of 4 scikit-learn models and it gives a predict_proba output).
Thanks so much for your reply.
Hi, I'm so sorry to bother you again, but is it possible to add metrics that I define to the saver, instead of just accuracy? I'd like to pass things like efficiency and purity, etc. Thanks!
Hi, I'm glad to answer your question; please feel free to ask.
You can see the documentation for more details.
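The `performance` field of a `State` is not restricted to accuracy, so you can store any scalar you compute yourself. A sketch, assuming `test_idx` holds the test indexes of the current round and `my_purity` is a hypothetical user-defined metric:

```python
# Evaluate your own metric on the test set and record it in the saver
pred = model.predict(X[test_idx, :])
purity = my_purity(y[test_idx], pred)   # hypothetical: replace with your metric

st = alibox.State(select_index=select_ind, performance=purity)
saver.add_state(st)
saver.save()
```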
Thanks so much :) I will try that!
Hi, so my alipy was working perfectly until yesterday. But today, every time I run my code, when I collect the indexes of the unlabelled instances it gives me an error at `self._innercontainer = list(np.unique([i for i in data], axis=0))`. Could you tell me why this happens and what I can do to rectify it please? Thanks so much.
Hi, maybe your numpy version is outdated; the `axis` argument of `np.unique` was only added in NumPy 1.13. We will update the dependency information of ALiPy in the future, sorry for the inconvenience. For now, try upgrading your numpy, e.g. `pip install --upgrade numpy`.
Thanks, that worked :) I am testing a querying strategy where I have 3 distinct sets: a train set, a query set, and a test set. The train set and the query set have the same number of objects, so when I draw a query from the query set, its indexes coincide with indexes in the train set, leading to an error message such as: `RepeatElementWarning: Adding element 129 has already in the collection, skip.` Is there a way to get around this issue? Please advise, thanks so much in advance.
Hi, I guess you separated your dataset into 3 distinct sets and want to manipulate the instances directly, marking your data with 3 arrays.
However, ALiPy keeps the original data untouched and only manipulates the indexes of the data. We assume that each instance has a unique index number, so the RepeatElementWarning should never appear.
To solve this conflict, you can try to understand the following code:
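A minimal sketch of the idea: stack the three sets into one feature matrix so that every instance gets a unique global index (the variable names are illustrative):

```python
import numpy as np

# One global matrix: rows 0..n_tr-1 are the train set, the next n_q rows
# are the query pool, and the remaining rows are the test set
X = np.vstack([X_train, X_query, X_test])
y = np.concatenate([y_train, y_query, y_test])

n_tr, n_q = len(X_train), len(X_query)

train_idx = [np.arange(n_tr + n_q)]             # train set + query pool
test_idx = [np.arange(n_tr + n_q, len(X))]      # held-out test set
label_idx = [np.arange(n_tr)]                   # initially labeled = train set
unlabel_idx = [np.arange(n_tr, n_tr + n_q)]     # query pool, disjoint from train

# With disjoint global indexes, updating the labeled set with queried
# indexes can no longer collide with existing train indexes
```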
Hi, I have a slight data manipulation problem. Instead of splitting the entire data set automatically with split_AL, I want to use a pre-defined training set, with query instances drawn from a pre-defined test set. How might I go about doing this using alipy?
Please advise, thanks!