Application on large scale dataset. #1
Hi, thanks for your feedback. Knowing this, I first tried to reduce the training data set and limit the number of slices by increasing the window size. I was expecting the method to be robust enough to still work with a smaller training set, but I was still facing memory issues. I'm still looking for a better solution, such as mini-batch training or maybe a different training strategy, and I'm open to any suggestion on this! Just for development purposes, on which machine did you run the code? Best
Hi, thanks for replying. Mini-batch is a good idea; I also think it is necessary for large-scale datasets. Batch normalization could also be applied, even though a lot of people say it won't help in decision tree algorithms. I am not a random forest expert, but do you think it can take inputs of different lengths?
Do you have in mind the fact that there is a slicing step, hence more or fewer slices might not make much difference?
Hello, as you can see in the several implementations of gcForest, the issue is always memory space, as the Multi-Grained Scanning implies slicing each input into a lot of inputs (especially on 2D features). This part seems to me indivisible into batches, as you have to check at each node whether you have to split it or not, and this depends on the set associated with that node. Can you explain what I am not getting here? Thank you in advance!
Hello, thanks for your feedback! I must admit that mini-batch training might not be the best choice of words, my bad. The idea of dividing a data set into batches that are then fed to a random forest, using pruning for instance, sounds a bit silly, as these batches will be divided again into subsets for the training. So we'd have subsets of subsets to train and validate the random forests. I would not say that mini-batch training is impossible, but rather that it doesn't sound like an optimal way to deal with large data sets for random forests. I am currently working on a more sensible approach to sequentially train random forests on large data sets, plus a couple of ways to optimize the memory usage in the code. Stay connected!
Thanks for your reply! In this case, I can still see why mini-batch would be difficult to implement... If you start by dividing your dataset into batches, distributing each batch to a sub-forest (that is, a subset of trees of your forest) and then randomly dividing each batch into subsets for the trees, it wouldn't be equivalent to creating the subsets from the initial dataset (in this case examples from different batches would have zero probability of being fed to the same tree). On the other hand, if you create mini-batches of size m from your dataset and then do the bagging as if each batch were a single sample (creating k subsets of M/m randomly chosen mini-batches), the distribution would be biased, because two samples in the same mini-batch would always end up together in any tree containing either of them. In fact, I'm trying to see whether a parallelization of the gcForest algorithm could be feasible, as the issue is storing the examples for the construction of the trees; if the data is too big, storage in "far away caches" could slow down the process due to high communication costs and hugely affect the potential speed-up compared to the sequential version of the algorithm. In any case, thank you for your work and for your contribution to the community!
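A toy sketch of the two sampling schemes discussed in the previous comment; all names and sizes are illustrative assumptions and this is not part of the gcForest code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, batch_size, n_trees = 1000, 50, 10
indices = np.arange(n_samples)
batches = indices.reshape(-1, batch_size)  # contiguous mini-batches of size m

# Scheme 1: classic bagging, each tree gets a bootstrap sample of individual examples.
bags_classic = [rng.choice(indices, size=n_samples, replace=True) for _ in range(n_trees)]

# Scheme 2: bagging over whole mini-batches; two examples from the same
# mini-batch always appear together in any tree that contains one of them.
bags_batched = [
    np.concatenate([batches[i] for i in rng.integers(len(batches), size=len(batches))])
    for _ in range(n_trees)
]

# Samples 0 and 1 share a mini-batch, so they co-occur in every batched bag...
print(all((0 in b) == (1 in b) for b in bags_batched))   # True
# ...but not in the classic bags.
print(all((0 in b) == (1 in b) for b in bags_classic))   # usually False
```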
Hello there, thank you.
Sure, "shape_1X" is the keyword telling the code what is the shape of a single sample from your traning set, it needs to be a list or an array such that the first element is the number of lines and the second element the number of columns. window it the size of the window you want to use for the data slicing. For instance if you are working with a sequence of shape [1,40] and you want slices of size 20 then just set "window=[20]". If your are working with a picture of size [20,20] you can ask the code to look at slices of size 4x4 just setting "window=[4]". The "shape_1X" argument is dictated by your data set (and all samples must have the same shape) while "window" is up to you. Does that answer your question? |
Thank you for your quick and precise answer.
@bis-carbon I haven't run the code on the full MNIST data set due to a lack of computing resources. I'd be very happy to know the performance if anyone does, though. I assume that you ran the code on the scikit-learn handwritten digits data set. Depending on the parameters you use, you will get from 91% to 98% accuracy (based on tests I ran).
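For reference, a hedged sketch of that kind of test on the scikit-learn handwritten digits set; the constructor arguments and the 70/30 split are assumptions, not the exact settings behind the 91%-98% figures above.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from GCForest import gcForest  # module/class names assumed from this repository

X, y = load_digits(return_X_y=True)  # 8x8 images flattened to 64 features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

gcf = gcForest(shape_1X=[8, 8], window=[4])
gcf.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, gcf.predict(X_te)))
```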
Hi @pylablanche, thanks for your implementation! But I didn't understand the following code at lines 277-278:
Can you give some explanation? Thank you!
@aaCherish What I did then is to define [...]. If you want to check it, you can copy-paste this part of the code into an independent file, and with small modifications you should be able to run it on an image. It is probably not the best way to do the image slicing, so feel free to let me know if you have something that works better.
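For comparison, a generic sketch of window-based slicing of a 2D image; this is not the repository's actual slicing code, and the function name and sizes are illustrative assumptions.

```python
import numpy as np

def slice_image(img, window):
    """Return every window x window patch of a 2D image, each flattened to 1D."""
    rows, cols = img.shape
    patches = [
        img[i:i + window, j:j + window].ravel()
        for i in range(rows - window + 1)
        for j in range(cols - window + 1)
    ]
    return np.asarray(patches)

patches = slice_image(np.random.rand(20, 20), window=4)
print(patches.shape)  # (289, 16): 17 x 17 sliding positions, 4 x 4 values each
```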
@pylablanche Thanks very much for your quick reply! I got it. Thanks again!
@chenyangh I've just released version 0.1.4, mainly addressing the memory usage during slicing. I'll still do some optimization in the code in the coming weeks and will keep everyone posted. @aaCherish As a consequence of the update, the lines you mentioned have disappeared; I hope the new names are clearer now.
Thanks for your awesome work! I used it to predict stock index futures, but only got 53% accuracy. Can you improve it? https://www.ricequant.com/community/topic/3221//2
@zhliaoli
I'm sorry, I think it is like this: https://www.ricequant.com/community/topic/3221//2 This is a Chinese quant community; the data comes from the community. Thank you again! :D
@zhliaoli Thanks for the link, it works.
Okay, I'm looking forward to even more awesome work! :D
I think there is a small bug in the function cascade_forest:

    def cascade_forest(self, X, y=None):
        """ Perform (or train if 'y' is not None) a cascade forest estimator.

        :param X: np.array
            Array containing the input samples.
            Must be of shape [n_samples, data] where data is a 1D array.

        :param y: np.array (default=None)
            Target values. If 'None' perform training.

        :return: np.array
            1D array containing the predicted class for each input sample.
        """
        if y is not None:
            setattr(self, 'n_layer', 0)
            test_size = getattr(self, 'cascade_test_size')
            max_layers = getattr(self, 'cascade_layer')
            tol = getattr(self, 'tolerance')

            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)

            self.n_layer += 1
            prf_crf_pred_ref = self._cascade_layer(X_train, y_train)
            accuracy_ref = self._cascade_evaluation(X_test, y_test)
            feat_arr = self._create_feat_arr(X_train, prf_crf_pred_ref)

            self.n_layer += 1
            prf_crf_pred_layer = self._cascade_layer(feat_arr, y_train)
            accuracy_layer = self._cascade_evaluation(X_test, y_test)

            # the code I added
            if accuracy_layer <= (accuracy_ref + tol):
                self.n_layer -= 1

            while accuracy_layer > (accuracy_ref + tol) and self.n_layer <= max_layers:
                accuracy_ref = accuracy_layer
                prf_crf_pred_ref = prf_crf_pred_layer
                feat_arr = self._create_feat_arr(X_train, prf_crf_pred_ref)

                self.n_layer += 1
                prf_crf_pred_layer = self._cascade_layer(feat_arr, y_train)
                accuracy_layer = self._cascade_evaluation(X_test, y_test)
@felixwzh Thanks for your feedback.
@pylablanche In the prediction branch (when y is None) the code is:

    def cascade_forest(self, X, y=None):
        ......
        elif y is None:
            at_layer = 1
            prf_crf_pred_ref = self._cascade_layer(X, layer=at_layer)
            while at_layer < getattr(self, 'n_layer'):
                at_layer += 1
                feat_arr = self._create_feat_arr(X, prf_crf_pred_ref)
                prf_crf_pred_ref = self._cascade_layer(feat_arr, layer=at_layer)

        return prf_crf_pred_ref

After the while loop in the training branch, I would decrease self.n_layer by 1:

    def cascade_forest(self, X, y=None):
        """ Perform (or train if 'y' is not None) a cascade forest estimator.

        :param X: np.array
            Array containing the input samples.
            Must be of shape [n_samples, data] where data is a 1D array.

        :param y: np.array (default=None)
            Target values. If 'None' perform training.

        :return: np.array
            1D array containing the predicted class for each input sample.
        """
        if y is not None:
            setattr(self, 'n_layer', 0)
            test_size = getattr(self, 'cascade_test_size')
            max_layers = getattr(self, 'cascade_layer')
            tol = getattr(self, 'tolerance')

            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)

            self.n_layer += 1
            prf_crf_pred_ref = self._cascade_layer(X_train, y_train)
            accuracy_ref = self._cascade_evaluation(X_test, y_test)
            feat_arr = self._create_feat_arr(X_train, prf_crf_pred_ref)

            self.n_layer += 1
            prf_crf_pred_layer = self._cascade_layer(feat_arr, y_train)
            accuracy_layer = self._cascade_evaluation(X_test, y_test)

            while accuracy_layer > (accuracy_ref + tol) and self.n_layer <= max_layers:
                accuracy_ref = accuracy_layer
                prf_crf_pred_ref = prf_crf_pred_layer
                feat_arr = self._create_feat_arr(X_train, prf_crf_pred_ref)

                self.n_layer += 1
                prf_crf_pred_layer = self._cascade_layer(feat_arr, y_train)
                accuracy_layer = self._cascade_evaluation(X_test, y_test)

            # the code I added
            self.n_layer -= 1

        elif y is None:
            at_layer = 1
            prf_crf_pred_ref = self._cascade_layer(X, layer=at_layer)
            while at_layer < getattr(self, 'n_layer'):
                at_layer += 1
                feat_arr = self._create_feat_arr(X, prf_crf_pred_ref)
                prf_crf_pred_ref = self._cascade_layer(feat_arr, layer=at_layer)

        return prf_crf_pred_ref

I don't know whether I made it clear or not, or maybe I didn't understand your code or gcForest correctly. However, thanks a lot for your great implementation of gcForest!!!
@felixwzh I see exactly what you are talking about, and indeed if the next layer is not performing better it is still kept in memory, which is not great behavior. In my mind I stupidly focused on the idea that accuracy could only increase or stay the same.
Hi, when I run the code on the ADULT data, it sometimes raises ValueError: Input contains NaN, infinity or a value too large for dtype('float32'). But I fill the NAs and change the dtype before I run the code. Do you have any idea where the problem is?
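As an aside, a quick sanity check of this kind can help locate the offending values before fitting; the file name and the pandas-based loading are assumptions, not part of this repository.

```python
import numpy as np
import pandas as pd

# Load the ADULT data (file name assumed) and keep the numeric columns.
df = pd.read_csv("adult.data", header=None)
num = df.select_dtypes(include=[np.number]).astype(np.float32)

arr = num.to_numpy()
print("NaN values:     ", np.isnan(arr).any())
print("Infinite values:", np.isinf(arr).any())
print("Largest finite: ", np.nanmax(np.where(np.isinf(arr), np.nan, arr)))
```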
@chibohe Hi,
@pylablanche ,
@missaaoo ,
Can this code run with RGB data? If yes, what would the code look like? I don't know how to give the values for shape_1X and window.
Hello @SuperAlexander, Anyway, right now the code can't deal with RGB data. Have you looked at the original implementation?
Hi,
Your implementation is really elegant, but I tried your code on the real MNIST dataset and it took up almost 100 GB of memory before I force-stopped it. Do you have any idea about this?