Chapter:1-Label Encoder vs One Hot Encoder in Machine Learning
These two encoders are parts of the SciKit Learn library in Python, and they are used to convert categorical data, or text data, into numbers, which are predictive models can better understand.
Label Encoding:
Data format
In this example, the first column is the country column, which is all text. As you might know by now, we can’t have text in our data if we’re going to run any kind of model on it. So before we can run a model, we need to make this data ready for the model.
And to convert this kind of categorical text data into model-understandable numerical data, we use the Label Encoder class. So all we have to do, to label encode the first column, is import the LabelEncoder class from the sklearn library, fit and transform the first column of the data, and then replace the existing text data with the new encoded data. Let’s have a look at the code.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd#import datasetdataset=pd.read_csv('Data.csv')
x= dataset.iloc[:,:-1].values# Taking care of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])#encoding categorical data
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
x[:, 0] = labelencoder.fit_transform(x[:, 0])
convert missing data into mean or any other mathematics on NaN value.
We’ve assumed that the data is in a variable called ‘x’. After running this piece of code, if you check the value of x, you’ll see that the three countries in the first column have been replaced by the numbers 0, 1, and 2.
That’s all label encoding is about. But depending on the data, label encoding introduces a new problem. For example, we have encoded a set of country names into numerical data. This is actually categorical data and there is no relation, of any kind, between the rows.
The problem here is since there are different numbers in the same column, the model will misunderstand the data to be in some kind of order, 0 < 1 < 2. But this isn’t the case at all. To overcome this problem, we use One Hot Encoder.
One Hot Encoder
If you’re interested in checking out the documentation, you can find it here. Now, as we already discussed, depending on the data we have, we might run into situations where, after label encoding, we might confuse our model into thinking that a column has data with some kind of order or hierarchy when we clearly don’t have it. To avoid this, we ‘OneHotEncode’ that column.
What one hot encoding does is, it takes a column which has categorical data, which has been label encoded and then splits the column into multiple columns. The numbers are replaced by 1s and 0s, depending on which column has what value. In our example, we’ll get three new columns, one for each country — France, Germany, and Spain.
For rows which have the first column value as France, the ‘France’ column will have a ‘1’ and the other two columns will have ‘0’s. Similarly, for rows which have the first column value as Germany, the ‘Germany’ column will have a ‘1’ and the other two columns will have ‘0’s.
The Python code for one hot encoding is also pretty simple:
from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder(categorical_features = [0])
x = onehotencoder.fit_transform(x).toarray()
As you can see in the constructor, we specify which column has to be one hot encoded, [0] in this case. Then we fit and transform the array ‘x’ with the onehotencoder object we just created. And that’s it, we now have three new columns in our dataset:
As you can see, we have three new columns with 1s and 0s, depending on the country that the rows represent.