How to Compute Random Forest Feature Importance with Python in 6 Ways

Random Forest is a popular machine learning algorithm used for both classification and regression. It is an ensemble method that combines the predictions of many decision trees. One practical advantage of Random Forest is that it supports several measures of feature importance, i.e., how much each feature contributes to the model's predictions. In this blog post, we will walk through six ways to compute Random Forest feature importance with Python. Feel free to try out the code with free Jupyter notebooks on Saturn Cloud.
- Mean Decrease Impurity
The Mean Decrease Impurity (MDI) method is the most common way to compute feature importance in Random Forest, and it is what scikit-learn exposes as `feature_importances_`. Every time a tree splits on a feature, the split reduces node impurity (measured with the Gini index or entropy); MDI credits that reduction to the feature and averages the totals over all the trees in the forest. The higher a feature's average impurity reduction, the more important it is. Keep in mind that MDI is computed on the training data and is known to be biased toward high-cardinality features.
Here is an example code snippet that demonstrates how to compute feature importance using the Mean Decrease Impurity method:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# MDI scores are computed during fit and exposed directly by scikit-learn
importance = rf.feature_importances_
print(importance)
```
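On its own, `feature_importances_` is just an array ordered by column index. A small follow-up sketch (the `feature_{i}` names below are placeholders, since `make_classification` returns an unnamed NumPy array) ranks the top features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Placeholder names; substitute your real column names here
names = [f"feature_{i}" for i in range(X.shape[1])]
for i in np.argsort(rf.feature_importances_)[::-1][:5]:
    print(f"{names[i]}: {rf.feature_importances_[i]:.3f}")
```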
- Mean Decrease Accuracy
The Mean Decrease Accuracy (MDA) method measures importance with accuracy rather than impurity: the values of a feature are randomly shuffled, which breaks its relationship with the target, and the resulting drop in accuracy is recorded. In Breiman's original formulation, the drop is measured on each tree's out-of-bag samples and averaged over all the trees in the forest. The larger the accuracy decrease, the more important the feature.
scikit-learn does not expose this per-tree out-of-bag computation directly, so the snippet below approximates Mean Decrease Accuracy by measuring the accuracy drop on a held-out test set with `sklearn.inspection.permutation_importance`:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Mean accuracy decrease over repeated shuffles of each feature
result = permutation_importance(rf, X_test, y_test, scoring="accuracy", n_repeats=10, random_state=0)
print(result.importances_mean)
```
- Permutation Importance
The Permutation Importance method is a model-agnostic way to compute feature importance: it works with any fitted model, not just Random Forest. It randomly permutes one feature at a time and measures the decrease in a performance metric such as accuracy or F1 score, averaging over several permutations. The larger the decrease in performance, the more important the feature. (The idea is the same as Mean Decrease Accuracy; in practice it is usually computed on the full training set or a held-out set rather than on out-of-bag samples.)
Here is an example code snippet that demonstrates how to compute feature importance using the Permutation Importance method:
```python
from eli5.sklearn import PermutationImportance
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Shuffle each feature several times and average the resulting score drop
perm = PermutationImportance(rf, random_state=0).fit(X, y)
importance = perm.feature_importances_
print(importance)
```
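If you are working in a Jupyter notebook (for example on Saturn Cloud), eli5 can also render the permutation scores as an HTML table, one row per feature with the mean score drop and its spread. A minimal sketch; eli5 falls back to generic names like x0, x1, ... when no feature names are passed:

```python
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
perm = PermutationImportance(rf, random_state=0).fit(X, y)

# In a notebook, this displays a "weight +/- std" table per feature
eli5.show_weights(perm)
```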
- Drop Column Importance
The Drop Column Importance method is conceptually the simplest: train a baseline model on the full dataset, then, for each feature, retrain the model with that column removed and measure the decrease in performance (accuracy, F1 score, and so on) relative to the baseline. The larger the decrease, the more important the feature. It is also the most expensive method here, since it requires a full retraining per feature.
Here is an example code snippet that demonstrates the Drop Column Importance method. It uses cross-validated accuracy, because a Random Forest fits its training data almost perfectly, so training-set accuracy would hide the differences between features:
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)

# Baseline cross-validated accuracy with all features
baseline = cross_val_score(rf, X, y, cv=5).mean()

importance = []
for col in range(X.shape[1]):
    # Retrain from scratch with this column removed
    X_dropped = np.delete(X, col, axis=1)
    dropped = cross_val_score(rf, X_dropped, y, cv=5).mean()
    importance.append(baseline - dropped)
print(importance)
```
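Two caveats are worth keeping in mind. With p features and k cross-validation folds, this loop fits the model k × (p + 1) times, so it scales poorly to wide datasets. And when two features are strongly correlated, dropping either one barely hurts performance, so both can look unimportant even if the information they carry is essential.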
- SHAP Values
The SHAP (SHapley Additive exPlanations) method computes the contribution of each feature to the prediction for each instance in the dataset, using the Shapley value, a concept from cooperative game theory. SHAP in general is model-agnostic, but tree ensembles such as Random Forest get a fast, exact implementation through `TreeExplainer`. Averaging the absolute SHAP values over all instances yields a global importance score: the higher the average contribution, the more important the feature.
Here is an example code snippet that demonstrates how to compute feature importance using the SHAP Values method:
```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

explainer = shap.TreeExplainer(rf)
# Note: for classifiers, older shap versions return one
# (n_samples, n_features) array per class (assumed here), while newer
# versions return a single (n_samples, n_features, n_classes) array,
# in which case average over axes 0 and 2 instead
shap_values = explainer.shap_values(X)

# Mean |SHAP| per feature, averaged over classes and samples
importance = np.abs(np.asarray(shap_values)).mean(axis=(0, 1))
print(importance)
```
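shap also ships plotting helpers. For a quick bar chart of the same mean-|SHAP| ranking, something like the following should work (it opens a matplotlib figure; the exact behavior varies a little across shap versions):

```python
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

shap_values = shap.TreeExplainer(rf).shap_values(X)
# Bar chart of mean |SHAP| per feature
shap.summary_plot(shap_values, X, plot_type="bar")
```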
- Tree Interpreter
The Tree Interpreter method is specific to tree-based models. It decomposes each prediction into a bias term (the average prediction at the root) plus one contribution per feature, where a feature's contribution is the change in the predicted value at every split on that feature along the decision path. Averaging the absolute contributions over all instances gives a global importance score: the higher the average contribution, the more important the feature.
Here is an example code snippet that demonstrates how to compute feature importance using the Tree Interpreter method:
```python
import numpy as np
from treeinterpreter import treeinterpreter as ti
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# For a classifier, contributions has shape (n_samples, n_features, n_classes)
prediction, bias, contributions = ti.predict(rf, X)

# Mean |contribution| per feature, averaged over samples and classes
importance = np.abs(contributions).mean(axis=(0, 2))
print(importance)
```
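A convenient property of this decomposition, and a quick sanity check, is that the bias and the per-feature contributions add back up to the model's predicted probabilities exactly:

```python
import numpy as np
from treeinterpreter import treeinterpreter as ti
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

prediction, bias, contributions = ti.predict(rf, X)
# prediction == bias + sum of per-feature contributions, row by row
print(np.allclose(prediction, bias + contributions.sum(axis=1)))  # True
```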
In conclusion, Random Forest is a powerful machine learning algorithm that offers several complementary views of feature importance. We covered six ways to compute it with Python: Mean Decrease Impurity, Mean Decrease Accuracy, Permutation Importance, Drop Column Importance, SHAP Values, and Tree Interpreter. Each method has its trade-offs: MDI comes for free but is biased toward high-cardinality features, permutation-based methods are more reliable but need extra predictions, drop-column is the most direct but requires retraining per feature, and SHAP and Tree Interpreter additionally explain individual predictions. The right choice depends on the problem at hand. By computing feature importance, we can gain insight into the underlying patterns in the data and improve the performance of our machine learning models.