
Variable Importance Plot #228

Open
sushmitavgopalan16 opened this issue Feb 21, 2018 · 3 comments

sushmitavgopalan16 (Collaborator) commented Feb 21, 2018

Hello friends!

I know some of you had questions about the 'variable importance' measures you were asked to obtain in PS6. Variable importance measures how 'informative' a given variable is to the model's predictions. See http://blog.datadive.net/selecting-good-features-part-iii-random-forests/ for an intuitive explanation.

Here's how you would find it using scikit-learn:

Consider the Random Forest part of Dr. Evans's Trees.ipynb

We already have:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Fit a random forest and evaluate it with the out-of-bag predictions
hit_tree4 = RandomForestRegressor(n_estimators=53, max_features='sqrt', bootstrap=True,
                                  oob_score=True, random_state=15)
hit_tree4.fit(X, y)

hit_tree4.score(X, y)  # R-squared on the training data
y_pred4 = hit_tree4.oob_prediction_
MSE4 = mean_squared_error(y, y_pred4)
print('MSE=', MSE4)

First, we find the variable importance measures:

import numpy as np

importances = hit_tree4.feature_importances_
# Standard deviation of each feature's importance across the trees in the forest
std = np.std([tree.feature_importances_ for tree in hit_tree4.estimators_],
             axis=0)
# Feature indices sorted from most to least important
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

features = ['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks',
            'Years', 'CAtBat', 'CHits', 'CHmRun', 'CRuns',
            'CRBI', 'CWalks', 'PutOuts', 'Assists', 'Errors']
# Index features with indices[f] too, so the labels match the sorted importances
for f in range(X.shape[1]):
    print(str(f + 1), ". ", features[indices[f]], ": ", str(importances[indices[f]]))

[screenshot: printed feature ranking output]

Then, we plot them!

import matplotlib.pyplot as plt

# Plot the feature importances of the forest, sorted from most to least important
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
        color="b", align="center")
# Reorder the labels with indices so they match the sorted bars
plt.xticks(range(X.shape[1]), [features[i] for i in indices], rotation=90)
plt.xlim([-1, X.shape[1]])
plt.show()

[screenshot: bar plot of the feature importances]
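Since std was already computed above, you can also add error bars showing the spread across the trees, as in the scikit-learn docs example. A minimal variant of the bar call, assuming importances, indices, and std as defined earlier:

# Same bars, with one-standard-deviation error bars across the trees
plt.bar(range(X.shape[1]), importances[indices],
        color="b", yerr=std[indices], align="center")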

yilundai commented

Part (c) asks us to use the bagging approach, but BaggingRegressor doesn't have a feature_importances_ attribute.


jgdenby commented Feb 24, 2018

To solve that issue, I just took the mean of the feature_importances_ across the estimated trees in estimators_.
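A minimal sketch of that workaround (the bag_tree name and hyperparameters here are illustrative, not from the problem set):

import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Hypothetical bagged-trees model; settings are illustrative
bag_tree = BaggingRegressor(DecisionTreeRegressor(), n_estimators=53,
                            bootstrap=True, oob_score=True, random_state=15)
bag_tree.fit(X, y)

# BaggingRegressor has no feature_importances_ attribute,
# so average the per-tree importances instead
importances = np.mean([tree.feature_importances_
                       for tree in bag_tree.estimators_], axis=0)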


Otamio commented Feb 26, 2018

I tried to look into the scikit-learn documentation
(http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html),
where the ranking loop before the plot looks like this:

for f in range(X.shape[1]):
    print("%d. feature %s (%f)" % (f + 1, features[indices[f]], importances[indices[f]]))

To put the feature labels in the results, we need to index features with indices[f] as well. Since indices is sorted by importance rather than by the original feature order, we cannot pair features[f] with importances[indices[f]]; instead, we should pair features[indices[f]] with importances[indices[f]], as in the tiny example below.
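A tiny illustration of the mapping, with made-up importance values:

import numpy as np

importances = np.array([0.1, 0.7, 0.2])   # made-up values for three features
features = ['AtBat', 'Hits', 'HmRun']
indices = np.argsort(importances)[::-1]   # array([1, 2, 0])

# Rank 1 is features[indices[0]] ('Hits') with importances[indices[0]] (0.7);
# features[0] ('AtBat') would mislabel the top spot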

I am still working on this issue and I appreciate your feedback. Thanks.
