Handling missing data in non-numerical columns is an essential step in data preprocessing for machine learning tasks. When dealing with non-numerical data, such as categorical or text data, there are two main options for handling missing values: imputation and deletion. In this answer, we will explore these options in detail and provide examples to illustrate their application.
1. Imputation:
Imputation refers to the process of filling in missing values with estimated or imputed values. This approach aims to retain as much information as possible while dealing with missing data. There are various techniques for imputing missing values in non-numerical columns:
a. Mode Imputation: In this method, the most frequent value in a column is used to fill in missing values. This is suitable for categorical variables where the mode represents the most common category.
Example:
Consider a dataset with a column representing the color of a car, where the missing values are denoted by "NA". The mode of the color column is "blue". Using mode imputation, we would replace the missing values with "blue".
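The car-color example can be sketched in a few lines of plain Python. This is a minimal illustration, not a definitive implementation: the `colors` list, the use of `None` as the missing-value marker, and the helper name `impute_mode` are assumptions made for this sketch.

```python
from collections import Counter

# Hypothetical car-color column; None stands in for the "NA" marker in the text.
colors = ["blue", "red", None, "blue", "green", None, "blue"]


def impute_mode(values, missing=None):
    """Replace missing entries with the most frequent observed value."""
    observed = [v for v in values if v != missing]
    if not observed:
        # Nothing to learn the mode from, so return the data unchanged.
        return list(values)
    mode_value = Counter(observed).most_common(1)[0][0]
    return [mode_value if v == missing else v for v in values]


imputed_colors = impute_mode(colors)
print(imputed_colors)
```

If two categories tie for most frequent, `Counter.most_common` returns the one it encountered first, so a tie-breaking rule may be worth choosing explicitly for real data.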
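b. Regression Imputation: This technique trains a predictive model on the rows where the value is present and uses it to predict the missing values from other variables in the dataset. For a non-numerical column, the model would be a classifier (for example, logistic regression) rather than a numeric regression. It is particularly useful when the variable with missing values is related to other variables.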
Example:
Suppose we have a dataset containing information about houses, including the number of rooms and the price. If the number of rooms is missing for some houses, we can use regression imputation by training a regression model on the available data to predict the number of rooms based on the price.
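The house example can be sketched as follows. It is a rough illustration under stated assumptions: the `houses` data is invented, the helper `fit_line` is a hand-written ordinary least squares fit for a single predictor, and predictions are rounded because room counts are whole numbers. Note that the number of rooms is numeric; for a categorical column the fitted line would be swapped for a classifier.

```python
# Hypothetical (price, rooms) pairs; None marks a missing room count.
houses = [
    (100_000, 2),
    (150_000, 3),
    (200_000, 4),
    (250_000, 5),
    (210_000, None),
]


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Train only on rows where the room count is known.
complete = [(price, rooms) for price, rooms in houses if rooms is not None]
slope, intercept = fit_line(
    [price for price, _ in complete],
    [rooms for _, rooms in complete],
)

# Fill each missing room count with the rounded model prediction.
imputed_houses = [
    (price, rooms if rooms is not None else round(slope * price + intercept))
    for price, rooms in houses
]
print(imputed_houses)
```

Imputed values produced this way sit exactly on the fitted line, so they carry less variability than genuinely observed values would.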
2. Deletion:
Deletion refers to the removal of rows or columns with missing values from the dataset. This approach is straightforward but can result in a loss of valuable information, especially if the missing values are not randomly distributed.
a. Listwise Deletion: Also known as complete case analysis, listwise deletion involves removing entire rows from the dataset if any of the values in those rows are missing. This method can be problematic if the missingness is related to the target variable or other important variables.
Example:
Consider a dataset with information about students, including their grades and extracurricular activities. If any of the variables are missing for a particular student, listwise deletion would remove the entire row of that student's data.
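As a small sketch of the student example, listwise deletion amounts to keeping only the rows with no missing field. The `students` records, the `None` marker, and the helper name `listwise_delete` are assumptions made for illustration.

```python
# Hypothetical student records; None marks a missing value.
students = [
    {"name": "Ana", "grade": "A", "activity": "chess"},
    {"name": "Ben", "grade": None, "activity": "soccer"},
    {"name": "Cho", "grade": "B", "activity": None},
    {"name": "Dev", "grade": "C", "activity": "band"},
]


def listwise_delete(rows):
    """Keep only rows in which every field is present (complete cases)."""
    return [row for row in rows if all(v is not None for v in row.values())]


complete_cases = listwise_delete(students)
print([row["name"] for row in complete_cases])
```

Here half of the rows are dropped even though each is missing only one field, which shows how quickly this method can shrink a dataset.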
b. Pairwise Deletion: In this approach, missing values are ignored when computing statistics or performing calculations. Pairwise deletion allows for the use of available data for each specific analysis, but it can lead to biased estimates if the missingness is not random.
Example:
Suppose we have a dataset with variables representing the height, weight, and age of individuals. If the weight is missing for some individuals, pairwise deletion would only exclude those individuals when computing statistics involving weight, but still include them for height and age calculations.
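The height, weight, and age example might look like this in plain Python. The `people` records and the helper names `available_mean` and `paired_rows` are hypothetical; the point is only that each statistic uses whichever rows have the values it needs.

```python
# Hypothetical records; None marks a missing weight.
people = [
    {"height": 170, "weight": 65, "age": 30},
    {"height": 180, "weight": None, "age": 40},
    {"height": 160, "weight": 55, "age": 20},
]


def available_mean(rows, column):
    """Mean of one column, using every row where that column is present."""
    values = [row[column] for row in rows if row[column] is not None]
    return sum(values) / len(values) if values else None


def paired_rows(rows, col_a, col_b):
    """Rows usable for a two-variable statistic: both columns present."""
    return [
        row for row in rows
        if row[col_a] is not None and row[col_b] is not None
    ]


mean_height = available_mean(people, "height")  # uses all three rows
mean_weight = available_mean(people, "weight")  # skips the row without weight
height_age_rows = paired_rows(people, "height", "age")
height_weight_rows = paired_rows(people, "height", "weight")
print(mean_height, mean_weight, len(height_age_rows), len(height_weight_rows))
```

Because each statistic is based on a different subset of rows, results computed this way are not always directly comparable with one another.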
When handling missing data in non-numerical columns, the two main options are imputation and deletion. Imputation involves filling in missing values using various techniques, such as mode imputation or regression imputation. On the other hand, deletion involves removing rows or columns with missing values, either completely (listwise deletion) or selectively (pairwise deletion). The choice between these options depends on the specific dataset and the nature of the missingness.

