In Python, the SimpleImputer class from the scikit-learn library is a tool for handling missing values in a dataset, typically before that dataset is passed to a predictive model. It replaces NaN (or any other marker for missing data) with a value computed from each column or with a constant that we choose. To use it, we import the class from the sklearn.impute module and create an instance of it in our program.
Syntax for the SimpleImputer class:
To create a SimpleImputer object in a Python application, use the following syntax (only the most commonly used parameters are shown, with their default values):
SimpleImputer(missing_values=nan, strategy='mean', fill_value=None)
Parameters:
The following parameters are the ones most often set when creating a SimpleImputer object:
- missing_values: The placeholder that marks an entry as missing. Every occurrence of this value is imputed. By default, it is NaN (numpy.nan).
- strategy: The method used to compute the replacement for the missing values. By default, it is 'mean'. The accepted values are 'mean', 'median', 'most_frequent' (the mode) and 'constant'. The first three are measures of central tendency computed separately for each column.
- fill_value: This parameter is used only when strategy is set to 'constant'. It defines the constant value that replaces every missing entry in the dataset.
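As a minimal sketch of how these parameters interact, the snippet below applies each strategy to a small made-up array. The array values and the variable names (data, results) are illustrative assumptions, not part of the example used later in this tutorial. Note that the keyword names expected by scikit-learn are missing_values, strategy and fill_value.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing entry in each column
data = np.array([[1.0, 2.0],
                 [np.nan, 2.0],
                 [7.0, np.nan],
                 [1.0, 8.0]])

results = {}

# The three statistical strategies compute one replacement value per column
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    results[strategy] = imputer.fit_transform(data)

# The 'constant' strategy ignores the column contents and uses fill_value
constant_imputer = SimpleImputer(missing_values=np.nan,
                                 strategy="constant", fill_value=0)
results["constant"] = constant_imputer.fit_transform(data)

for name, filled in results.items():
    print(name, filled.tolist())
```

With this data, the first column's missing entry becomes 3.0 under 'mean' and 1.0 under 'median' and 'most_frequent', while 'constant' fills both gaps with 0.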
The SimpleImputer class belongs to the sklearn.impute module of the scikit-learn library (imported in code as sklearn). To use this class, scikit-learn must be installed on your system, so install it first if it is not already available.
Installation of the scikit-learn library
To install scikit-learn, run the following command in the terminal or command prompt of your operating system:
pip install scikit-learn
After you press the Enter key, pip downloads and installs scikit-learn together with its dependencies.
Once the installation has finished, the library can be imported under the name sklearn, and we can proceed to use the SimpleImputer class.
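One quick way to confirm the installation, offered here as a small sketch rather than a required step, is to import the package and print its version. SimpleImputer lives in sklearn.impute and was introduced in scikit-learn 0.20, so an ImportError on the second line usually points to a very old installation.

```python
import sklearn
from sklearn.impute import SimpleImputer  # fails on versions older than 0.20

# Print the installed version to confirm the package is importable
print("scikit-learn version:", sklearn.__version__)
```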
Handling NaN values in the dataset with the SimpleImputer class
In this section, we use the SimpleImputer class in a Python program to handle the missing values in a dataset. We define a small dataset that contains some missing entries and then apply a SimpleImputer object, configured through its parameters, to fill them in.
Python Example to Handle NaN values in the dataset with the SimpleImputer class
Consider the following Python program, which works on a dataset containing NaN values:
# Import numpy module as nmp
import numpy as nmp
# Importing SimpleImputer class from sklearn impute module
from sklearn.impute import SimpleImputer
# Setting up imputer function variable
imputerFunc = SimpleImputer(missing_values = nmp.nan, strategy ='mean')
# Defining a dataset
dataSet = [[32, nmp.nan, 34, 47], [17, nmp.nan, 71, 53], [19, 29, nmp.nan, 79], [nmp.nan, 31, 23, 37], [19, nmp.nan, 79, 53]]
# Print original dataset
print("The Original Dataset we defined in the program: \n", dataSet)
# Imputing dataset by replacing missing values
imputerFunc = imputerFunc.fit(dataSet)
dataSet2 = imputerFunc.transform(dataSet)
# Printing imputed dataset
print("The imputed dataset after replacing missing values from it: \n", dataSet2)
Output:
The Original Dataset we defined in the program:
[[32, nan, 34, 47], [17, nan, 71, 53], [19, 29, nan, 79], [nan, 31, 23, 37], [19, nan, 79, 53]]
The imputed dataset after replacing missing values from it:
[[32.   30.   34.   47.  ]
 [17.   30.   71.   53.  ]
 [19.   29.   51.75 79.  ]
 [21.75 31.   23.   37.  ]
 [19.   30.   79.   53.  ]]
Explanation:
First, we imported the numpy library (for its nan constant) and the SimpleImputer class from the sklearn.impute module. We then created the imputer object, choosing the 'mean' strategy so that each missing entry is replaced by the mean of its column. Next, we defined the dataset as a nested Python list, using nmp.nan to mark the missing entries, and printed it. After that, we called fit() on the imputer so that it could compute the mean of every column from the available values, and then called transform() to produce a copy of the dataset with the missing entries filled in. Finally, we printed the imputed dataset.
As the output shows, each missing entry has been replaced by the mean of the known values in its column. For example, the second column contains only 29 and 31, so its three missing entries become 30, and the missing entry in the first column becomes (32 + 17 + 19 + 19) / 4 = 21.75. This is how the SimpleImputer class can be used to handle NaN values in a dataset.
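The imputed values can also be checked directly. The following sketch, which uses different variable names from the program above purely for illustration, combines the two steps with fit_transform() and reads the fitted column means from the imputer's statistics_ attribute, comparing them with numpy's nanmean.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same values as the dataset in the example program
data_set = [[32, np.nan, 34, 47], [17, np.nan, 71, 53], [19, 29, np.nan, 79],
            [np.nan, 31, 23, 37], [19, np.nan, 79, 53]]

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform() learns the column means and fills the gaps in one call
imputed = imputer.fit_transform(data_set)

# statistics_ holds the value computed for each column during fitting
print("Per-column fill values:", imputer.statistics_)

# Cross-check against column means computed while ignoring NaN entries
manual_means = np.nanmean(np.array(data_set, dtype=float), axis=0)
print("Matches numpy.nanmean:", np.allclose(imputer.statistics_, manual_means))
```

The values stored in statistics_ are the ones reused whenever transform() is called later, so the same imputer can be applied to new rows with the means learned here.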
Conclusion
In this tutorial, we explored the SimpleImputer class and how it is used to handle NaN values in a dataset. We looked at the strategy parameter, which determines how the replacement for the NaN values is computed, and at the related missing_values and fill_value parameters. We also covered how to install the scikit-learn library. Finally, we worked through an example that uses the SimpleImputer class to impute a dataset with the 'mean' strategy.