Recursive Feature Elimination (RFE)
RFE is a wrapper method for feature selection. It recursively removes the least important features based on their impact on the model's performance. The goal is to rank features by importance and select the best subset.
How it works
- Train a model on the full set of features.
- Rank features based on their importance:
  - Coefficients in linear models.
  - Feature importances in tree-based models.
- Remove the least important feature(s).
- Repeat the process until the desired number of features is reached.
- The remaining features are the selected subset (a minimal sketch follows below).
Pros of RFE
- It works well when you have a predefined idea of how many features you want.
Cons of RFE
- Prone to overfitting when the dataset is small, since the feature ranking is computed on the training data alone and never validated on held-out subsets.
What if we don't know the optimal number of features and want a model that is robust rather than prone to overfitting?
In this case, we can use RFECV.
Recursive Feature Elimination with Cross-Validation (RFECV) is an enhanced version of RFE. It extends RFE by incorporating cross-validation to automatically determine the optimal number of features, improving robustness.
How it works
- Similar to RFE, it starts with all features and iteratively removes the least important ones.
- At each step, it evaluates the model performance using cross-validation.
- Tracks the model performance for each subset of features.
- Selects the subset with the best cross-validated performance (see the sketch after this list).
Pros of RFECV
- Automatically selects the optimal number of features based on model performance.
- Reduces overfitting by validating on multiple folds of data.
- Saves time compared to manually experimenting with different feature counts.
Cons of RFECV
- More computationally expensive than RFE due to cross-validation at each iteration.
- May need more data to ensure stable cross-validation results.
Summary
Use RFE if you already know the exact number of features to retain, when computational resources are limited and speed is a priority, or for quick exploratory work. Keep in mind it can overfit, since it evaluates feature importance on the training data alone.
Otherwise, use RFECV when you haven't decided how many features to keep and you want a robust result over a fast one: the extra computational cost buys cross-validated evaluation on held-out folds, which yields more reliable feature subsets and reduces overfitting.