
Variable Importance Plot #228

Open
sushmitavgopalan16 opened this issue Feb 21, 2018 · 3 comments

sushmitavgopalan16 (Collaborator) commented Feb 21, 2018

Hello friends!

I know some of you had questions about the 'variable importance' measures you were asked to obtain in PS6. Variable importance measures how 'informative' a given variable is to the model's predictions. See http://blog.datadive.net/selecting-good-features-part-iii-random-forests/ for an intuitive explanation.

Here's how you would find it using scikit-learn:

Consider the Random Forest part of Dr. Evans's Trees.ipynb

We already have:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Fit a random forest and evaluate it with the out-of-bag predictions
hit_tree4 = RandomForestRegressor(n_estimators=53, max_features='sqrt', bootstrap=True,
                                  oob_score=True, random_state=15)
hit_tree4.fit(X, y)

hit_tree4.score(X, y)  # R-squared on the training data
y_pred4 = hit_tree4.oob_prediction_
MSE4 = mean_squared_error(y, y_pred4)
print('MSE=', MSE4)

First, we find the variable importance measures:

import numpy as np

importances = hit_tree4.feature_importances_
# Standard deviation of each feature's importance across the trees in the forest
std = np.std([tree.feature_importances_ for tree in hit_tree4.estimators_],
             axis=0)
# Feature indices sorted from most to least important
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

features = ['AtBat', 'Hits', 'HmRun', 'Runs', 'RBI', 'Walks',
            'Years', 'CAtBat', 'CHits', 'CHmRun', 'CRuns',
            'CRBI', 'CWalks', 'PutOuts', 'Assists', 'Errors']
# Index features with indices[f] too, so the labels match the sorted importances
for f in range(X.shape[1]):
    print(str(f + 1), ". ", features[indices[f]], ": ", str(importances[indices[f]]))

[screenshot: printed feature ranking output]

Then, we plot them!

import matplotlib.pyplot as plt

# Plot the feature importances of the forest, sorted from most to least important
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
        color="b", align="center")
# Reorder the labels with indices so they match the sorted bars
plt.xticks(range(X.shape[1]), [features[i] for i in indices], rotation=90)
plt.xlim([-1, X.shape[1]])
plt.show()

[screenshot: bar plot of the feature importances]
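Since std was already computed above, you can also add error bars showing the spread across the trees, as in the scikit-learn docs example. A minimal variant of the bar call, assuming importances, indices, and std as defined earlier:

# Same bars, with one-standard-deviation error bars across the trees
plt.bar(range(X.shape[1]), importances[indices],
        color="b", yerr=std[indices], align="center")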

yilundai commented

Part (c) asks us to use the bagging approach, but BaggingRegressor doesn't have a feature_importances_ attribute.


jgdenby commented Feb 24, 2018

To solve that issue, I just took the mean of the feature_importances_ across the estimated trees in estimators_.
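A minimal sketch of that workaround (the bag_tree name and hyperparameters here are illustrative, not from the problem set):

import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Hypothetical bagged-trees model; settings are illustrative
bag_tree = BaggingRegressor(DecisionTreeRegressor(), n_estimators=53,
                            bootstrap=True, oob_score=True, random_state=15)
bag_tree.fit(X, y)

# BaggingRegressor has no feature_importances_ attribute,
# so average the per-tree importances instead
importances = np.mean([tree.feature_importances_
                       for tree in bag_tree.estimators_], axis=0)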


Otamio commented Feb 26, 2018

I tried to look into the scikit-learn documentation
(http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html),
where the ranking loop before the plot looks like this:

for f in range(X.shape[1]):
    print("%d. feature %s (%f)" % (f + 1, features[indices[f]], importances[indices[f]]))

To put the feature labels in the results, we need to index features with indices[f] as well. Since indices is sorted by importance rather than by the original feature order, we cannot pair features[f] with importances[indices[f]]; instead, we should pair features[indices[f]] with importances[indices[f]], as in the tiny example below.
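A tiny illustration of the mapping, with made-up importance values:

import numpy as np

importances = np.array([0.1, 0.7, 0.2])   # made-up values for three features
features = ['AtBat', 'Hits', 'HmRun']
indices = np.argsort(importances)[::-1]   # array([1, 2, 0])

# Rank 1 is features[indices[0]] ('Hits') with importances[indices[0]] (0.7);
# features[0] ('AtBat') would mislabel the top spot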

I am still working on this issue and I appreciate your feedback. Thanks.
