Feature selection is one of the most important parts of machine learning. Real-world datasets often contain many features, but not all of them are necessary for a given machine learning algorithm. Using too many unnecessary features causes several problems. The first is computation cost: an unnecessarily large dataset takes an unnecessarily long time to run through the algorithm. It can also lead to overfitting, which we definitely want to avoid.
There are several feature selection methods out there. I will demonstrate four popular feature selection methods in Python here. Performing them from scratch would be a long and time-consuming process, but luckily Python has great functionality that makes it easy to perform feature selection using only a few lines of code.
Let’s first import the dataset:
import pandas as pd
import numpy as np
df = pd.read_csv('Feature_Selection.csv')
The dataset is too big to show in a screenshot here: it has 108 columns and 11,933 rows. Here are some of the columns:
df.columns
Output:
Index(['x.aidtst3', 'employ1', 'income2', 'weight2', 'height3', 'children','veteran3', 'blind', 'renthom1', 'sex1',
...
'x.denvst3', 'x.prace1', 'x.mrace1', 'x.exteth3', 'x.asthms1', 'x.michd', 'x.ltasth1', 'x.casthm1', 'x.state', 'havarth3'], dtype='object', length=108)
Assume that we want to predict the ‘havarth3’ variable using a machine learning algorithm. So, here are X and y:
X = df.drop(columns=["havarth3"])
y = df['havarth3']
Now we will find out, using different methods, which features are the best for predicting the ‘havarth3’ variable.
Univariate Feature Selection
This method selects the best features based on univariate statistical tests. The function that will be used for this is SelectKBest from the scikit-learn library. It removes all the features except the top k specified features. This dataset has 107 features, and a k value of 10 was used to keep only 10 of them. The score function f_classif is chosen, which uses the ANOVA F-value to rank the features. There are two other options available for classification, chi2 and mutual_info_classif; please feel free to check them out yourself.
This is the process.
Import the methods from the sklearn library:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
Pass the score function f_classif mentioned before and the number of features you want to keep to SelectKBest, and fit it with X and y:
uni = SelectKBest(score_func = f_classif, k = 10)
fit = uni.fit(X, y)
So, these are the 10 features selected by the function:
X.columns[fit.get_support(indices=True)].tolist()
Output:
['employ1',
'rmvteth4',
'genhlth',
'x.age.g',
'x.age80',
'x.ageg5yr',
'x.age65yr',
'x.rfhlth',
'x.phys14d',
'x.hcvu651']
The feature selection is done!
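If you also want the reduced dataset itself, not just the column names, the fitted selector can transform X directly. Here is a minimal sketch (the selected_cols and X_reduced names are mine; swapping score_func to mutual_info_classif works the same way, and chi2 additionally requires non-negative features):
# Keep only the 10 selected columns; transform() returns a NumPy array,
# so rebuild a labeled DataFrame using the selected column names.
selected_cols = X.columns[fit.get_support(indices=True)]
X_reduced = pd.DataFrame(fit.transform(X), columns=selected_cols)
print(X_reduced.shape)  # expected: (11933, 10)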
Feature Selection Using Correlation Matrix
This process calculates the correlation of every feature with the target feature. Based on those correlation values, features are chosen. For this project, a threshold of 0.2 was chosen: if the absolute correlation of a feature with the target is over 0.2, that feature is selected for the classification.
cor = df.corr()
cor_target = abs(cor["havarth3"])
relevant_features = cor_target[cor_target > 0.2]
relevant_features.index
Output:
Index(['employ1', 'genhlth', 'x.age.g', 'x.age80', 'x.ageg5yr', 'x.age65yr', 'x.hcvu651', 'havarth3'],
dtype='object')
Please notice that the ‘havarth3’ variable is also selected, because it has a perfect correlation with itself. So, please remember to remove it before you run the machine learning algorithm.
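For example, here is a minimal sketch of building the feature matrix from this result (the selected_cols and X_corr names are mine):
# Drop the target itself from the selected columns before modeling
selected_cols = relevant_features.index.drop('havarth3')
X_corr = df[selected_cols]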
Wrapper Method
This is an interesting method. In this method, a machine learning model itself is used to find the right features, based on p-values. Here I will use the Ordinary Least Squares (OLS) model from the statsmodels library. I chose statsmodels because it provides the p-values as part of the fitted model and I find it easy to use.
import statsmodels.api as sm
X_new = sm.add_constant(X)
model = sm.OLS(y, X_new).fit()
model.pvalues
Output:
const 2.132756e-01
x.aidtst3 6.269686e-01
employ1 4.025786e-20
income2 3.931291e-04
weight2 2.122768e-01
...
x.asthms1 3.445036e-01
x.michd 3.478433e-01
x.ltasth1 3.081917e-03
x.casthm1 9.802652e-01
x.state 6.724318e-01
Length: 108, dtype: float64
These are the p-values of all the features. Now, based on the p-values, we will remove the features one by one. We will keep refitting the model from the statsmodels library, and in each iteration we will find the feature with the highest p-value. If that highest p-value is greater than 0.05, we will remove that feature. The same process is repeated until the highest p-value is no longer greater than 0.05.
selected_features = list(X.columns)
pmax = 1
while (len(selected_features) > 0):
    X_new = X[selected_features]
    X_new = sm.add_constant(X_new)
    model = sm.OLS(y, X_new).fit()
    p = pd.Series(model.pvalues.values[1:], index=selected_features)
    pmax = max(p)
    feature_pmax = p.idxmax()
    if (pmax > 0.05):
        selected_features.remove(feature_pmax)
    else:
        break
selected_features
Output:
['employ1',
'income2',
'blind',
'sex1',
'pneuvac4',
'diffwalk',
'diffdres',
'smoke100',
'rmvteth4',
'physhlth',
'menthlth',
'hlthpln1',
'genhlth',
'persdoc2',
'checkup1',
'addepev2',
'chcscncr',
'asthma3',
'qstlang',
'x.metstat',
'htin4',
'wtkg3',
'x.age.g',
'x.ageg5yr',
'x.age65yr',
'x.chldcnt',
'x.incomg',
'x.rfseat3',
'x.rfsmok3',
'x.urbstat',
'x.llcpwt2',
'x.rfhlth',
'x.imprace',
'x.wt2rake',
'x.strwt',
'x.phys14d',
'x.hcvu651',
'x.denvst3',
'x.prace1',
'x.mrace1',
'x.ltasth1']
So, these are the selected features.
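To train a final model on just these features, you can simply slice X with the list. A minimal sketch, assuming you split the data with scikit-learn’s train_test_split (the X_wrapper name is mine):
from sklearn.model_selection import train_test_split

# Keep only the features that survived the backward elimination
X_wrapper = X[selected_features]
X_train, X_test, y_train, y_test = train_test_split(X_wrapper, y, test_size=0.25, random_state=42)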
Conclusion
I used all four methods in one of my projects, and I got the best results with the wrapper method. But that may not be the case in every project. It is a good idea to try at least a few feature selection methods when you have a lot of features, as in this project. Different feature selection methods may give you totally different sets of features; in that case, using another method to confirm the result is a good idea. In machine learning, selecting the right features is half the battle.
#DataScience #MachineLearning #ArtificialIntelligence #Python