Linear Regression with sklearn using categorical variables

As data scientists and software engineers, we often use linear regression to model the relationship between a dependent variable and one or more independent variables. However, when dealing with categorical variables, we need to take some additional steps to ensure that our model is accurate and reliable. In this article, we will explore how to use sklearn to build a linear regression model with categorical variables.

Table of Contents

  1. Introduction

  2. One-hot encoding

  3. Label encoding

  4. Binary encoding

  5. Building a linear regression model with categorical variables

  6. Interpreting Model Performance

  7. Error Handling and Considerations

  8. Conclusion

What is linear regression?

Linear regression is a statistical method that allows us to model the relationship between a dependent variable and one or more independent variables. The goal of linear regression is to find a linear relationship between the dependent variable and the independent variables that can be used to make predictions.
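In its simplest form, the model expresses the dependent variable as a weighted sum of the independent variables plus an error term:

y = b0 + b1*x1 + b2*x2 + ... + bn*xn + e

where b0 is the intercept, b1 through bn are the coefficients learned from the data, and e captures the noise the model cannot explain.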

What are categorical variables?

Categorical variables are variables that take on a limited number of values. For example, gender is a categorical variable that can take on the values “male” or “female”. Other examples of categorical variables include race, education level, and occupation.

Why do we need to use special techniques for categorical variables?

When we use linear regression to model the relationship between a dependent variable and one or more independent variables, the model requires numeric inputs. Categorical variables stored as text do not meet this requirement, and simply assigning arbitrary numbers to categories can lead to inaccurate and unreliable results.

To use categorical variables in a linear regression model, we need to convert them into numerical variables that can be used in the model. There are several techniques for doing this, including one-hot encoding, label encoding, and binary encoding.

One-hot encoding

One-hot encoding is a technique for converting categorical variables into numerical variables that can be used in a linear regression model. With one-hot encoding, we create a new binary variable for each possible value of the categorical variable. For example, if we have a categorical variable “color” that can take on the values “red”, “green”, and “blue”, we would create three new binary variables: “color_red”, “color_green”, and “color_blue”.
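For instance, the three colors map onto the new columns like this, with exactly one 1 per row:

color    color_red  color_green  color_blue
red          1          0           0
green        0          1           0
blue         0          0           1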

To perform one-hot encoding in sklearn, we use the OneHotEncoder class. Here’s an example:

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_squared_error

# Create a synthetic dataset
data = {
    'color': ['red', 'green', 'blue', 'red', 'green'],
    'size': [3, 5, 2, 4, 1],
    'price': [10, 20, 15, 25, 5]
}

df = pd.DataFrame(data)

# Separate features (X) and target variable (y)
X = df[['color', 'size']]
y = df['price']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# One-hot encoding: fit the encoder on the training colors only
encoder_one_hot = OneHotEncoder()
# fit_transform returns a sparse matrix, which LinearRegression accepts;
# for simplicity this model uses only the encoded 'color' feature
X_train_one_hot = encoder_one_hot.fit_transform(X_train[['color']])

# Build linear regression model
model_one_hot = LinearRegression().fit(X_train_one_hot, y_train)

# Evaluate model on the test set
X_test_one_hot = encoder_one_hot.transform(X_test[['color']])
y_pred_one_hot = model_one_hot.predict(X_test_one_hot)
mse_one_hot = mean_squared_error(y_test, y_pred_one_hot)
print(f"One-Hot Encoding Model - Mean Squared Error: {mse_one_hot}")

Output:

One-Hot Encoding Model - Mean Squared Error: 225.0

In this example, we fit the OneHotEncoder on the 'color' column of the training set only. Its fit_transform() method learns the categories and returns a matrix of binary indicator columns, one per color, and the fitted encoder is then reused to transform the test set. Note that, for simplicity, this model uses only the encoded 'color' feature and ignores 'size'.
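To see what the encoder produced, you can inspect the generated column names and the encoded rows (get_feature_names_out is available in sklearn 1.0 and later):

# Show the generated indicator column names and the encoded rows as a dense array
print(encoder_one_hot.get_feature_names_out(['color']))
print(X_train_one_hot.toarray())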

Pros

  • Preserves all information about the categories.
  • Works well when the number of unique categories is not too large.
  • Imposes no artificial ordering on the categories, giving each category its own coefficient in a linear regression.

Cons

  • Can lead to a high-dimensional feature space, especially with many unique categories.
  • May introduce multicollinearity issues (see the sketch below).
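One common way to address the multicollinearity point is to drop one indicator column per variable: with an intercept in the model, k categories only need k - 1 columns, and the dropped category becomes the baseline. A minimal sketch using OneHotEncoder's drop parameter:

from sklearn.preprocessing import OneHotEncoder

# Drop the first category of each feature; it becomes the baseline
# absorbed by the intercept, removing the redundant column
encoder = OneHotEncoder(drop='first')
X_train_encoded = encoder.fit_transform(X_train[['color']])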

Label encoding

Label encoding is another technique for converting categorical variables into numerical variables that can be used in a linear regression model. With label encoding, we assign an integer to each possible value of the categorical variable. For example, if we have a categorical variable “color” that can take on the values “red”, “green”, and “blue”, label encoding would assign the values 0, 1, and 2. Note that sklearn’s LabelEncoder assigns codes in alphabetical order, so “blue” becomes 0, “green” becomes 1, and “red” becomes 2.

To perform label encoding in sklearn, we use the LabelEncoder class. Here’s an example:

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error

# Create a synthetic dataset
data = {
    'color': ['red', 'green', 'blue', 'red', 'green'],
    'size': [3, 5, 2, 4, 1],
    'price': [10, 20, 15, 25, 5]
}

df = pd.DataFrame(data)

# Separate features (X) and target variable (y)
X = df[['color', 'size']]
y = df['price']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Label Encoding:
label_encoder = LabelEncoder()
X_train_label = X_train.copy()
X_train_label['color'] = label_encoder.fit_transform(X_train_label['color'])

# Build linear regression model
model_label = LinearRegression().fit(X_train_label, y_train)

# Evaluate model on the test set
X_test_label = X_test.copy()
X_test_label['color'] = label_encoder.transform(X_test_label['color'])
y_pred_label = model_label.predict(X_test_label)
mse_label = mean_squared_error(y_test, y_pred_label)
print(f"Label Encoding Model - Mean Squared Error: {mse_label}")

Output:

Label Encoding Model - Mean Squared Error: 224.99999999999957

In this example, we fit the LabelEncoder on the 'color' column of the training set. Its fit_transform() method learns the categories and replaces each color with its integer code, and the same fitted encoder is then applied to the test column. Unlike the one-hot example above, this model uses both the encoded 'color' column and the numeric 'size' column.
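To inspect the mapping the encoder learned, look at its classes_ attribute, which lists the categories in the sorted order of their integer codes:

# Each category is encoded by its index in classes_
print(dict(zip(label_encoder.classes_, range(len(label_encoder.classes_)))))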

Pros

  • Reduces dimensionality compared to one-hot encoding.
  • Can preserve ordinal relationships, but only if the integer assignment matches a real ordering (sklearn’s alphabetical assignment usually does not).

Cons

  • Assumes an ordinal relationship that may not exist in some categorical variables.
  • May not be suitable for linear regression if ordinality is not meaningful; see the note on OrdinalEncoder below.
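One caveat worth knowing: sklearn’s documentation describes LabelEncoder as intended for encoding target labels. For input features, the analogous class is OrdinalEncoder, which operates on 2D arrays and lets you choose a policy for categories unseen during fitting. A minimal sketch:

from sklearn.preprocessing import OrdinalEncoder

# OrdinalEncoder works column-wise on 2D input; unseen categories in
# the test set are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
X_train_ord = encoder.fit_transform(X_train[['color']])
X_test_ord = encoder.transform(X_test[['color']])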

Binary encoding

Binary encoding is a technique for converting categorical variables into numerical variables that can be used in a linear regression model. With binary encoding, each category is first mapped to an integer, and that integer is then written out in binary, with one new column per binary digit. A variable with k categories therefore needs only about log2(k) columns, instead of the k columns that one-hot encoding would create.
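For example, three colors might be mapped to the integers 1, 2, and 3 and then spread across two binary-digit columns (the exact integer assigned to each color depends on the order in which the encoder first encounters it):

color    integer  color_0  color_1
red         1        0        1
green       2        1        0
blue        3        1        1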

To perform binary encoding in sklearn, we can use the category_encoders package. Here’s an example:

pip install category_encoders

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import category_encoders as ce
from sklearn.metrics import mean_squared_error

# Create a synthetic dataset
data = {
    'color': ['red', 'green', 'blue', 'red', 'green'],
    'size': [3, 5, 2, 4, 1],
    'price': [10, 20, 15, 25, 5]
}

df = pd.DataFrame(data)

# Separate features (X) and target variable (y)
X = df[['color', 'size']]
y = df['price']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Binary Encoding:
binary_encoder = ce.BinaryEncoder(cols=['color'])
X_train_binary = binary_encoder.fit_transform(X_train[['color']])

# Build linear regression model
model_binary = LinearRegression().fit(X_train_binary, y_train)

# Evaluate model on the test set
X_test_binary = binary_encoder.transform(X_test[['color']])
y_pred_binary = model_binary.predict(X_test_binary)
mse_binary = mean_squared_error(y_test, y_pred_binary)
print(f"Binary Encoding Model - Mean Squared Error: {mse_binary}")

Output:

Binary Encoding Model - Mean Squared Error: 225.0

In this example, we create a BinaryEncoder from the category_encoders package, passing the name of the column to encode via the cols parameter. Calling fit_transform() on the training data learns the category-to-integer mapping and returns a DataFrame with the binary digit columns; the fitted encoder is then reused to transform the test data.

Pros

  • Reduces dimensionality more effectively than one-hot encoding.
  • Maintains some ordinal information.

Cons

  • The encoded columns are hard to interpret, since each category is spread across several binary digits and no single regression coefficient corresponds to a single category.
  • Introduces additional complexity.

Building a linear regression model with categorical variables

Now that we have encoded our categorical variables using one of the techniques above, we can use them in a linear regression model. Reusing the binary-encoding example from the previous section, the complete train-and-evaluate workflow looks like this:

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import category_encoders as ce
from sklearn.metrics import mean_squared_error

# Create a synthetic dataset
data = {
    'color': ['red', 'green', 'blue', 'red', 'green'],
    'size': [3, 5, 2, 4, 1],
    'price': [10, 20, 15, 25, 5]
}

df = pd.DataFrame(data)

# Separate features (X) and target variable (y)
X = df[['color', 'size']]
y = df['price']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Binary Encoding:
binary_encoder = ce.BinaryEncoder(cols=['color'])
X_train_binary = binary_encoder.fit_transform(X_train[['color']])

# Build linear regression model
model_binary = LinearRegression().fit(X_train_binary, y_train)

# Evaluate model on the test set
X_test_binary = binary_encoder.transform(X_test[['color']])
y_pred_binary = model_binary.predict(X_test_binary)
mse_binary = mean_squared_error(y_test, y_pred_binary)
print(f"Binary Encoding Model - Mean Squared Error: {mse_binary}")

Output:

Binary Encoding Model - Mean Squared Error: 225.0

In this example, X_train_binary is the matrix of encoded independent variables and y_train is the vector containing the dependent variable. We create a new instance of the LinearRegression class from the sklearn.linear_model module and call its fit() method to train the model; the encoder fitted on the training data is then reused to transform the test set before prediction.
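In a real project you will usually want the regression to use the numeric features alongside the encoded categorical ones, and to guarantee that the test set receives exactly the same preprocessing as the training set. A minimal sketch of this using sklearn's ColumnTransformer and Pipeline, one-hot encoding 'color' while passing the numeric 'size' column through unchanged:

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Encode 'color'; remainder='passthrough' keeps the numeric 'size' column
preprocessor = ColumnTransformer(
    transformers=[
        ('color', OneHotEncoder(handle_unknown='ignore'), ['color']),
    ],
    remainder='passthrough',
)

# Chaining preprocessing and model ensures the test set is transformed
# with the encoder fitted on the training set
pipeline = Pipeline([
    ('preprocess', preprocessor),
    ('regression', LinearRegression()),
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)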

Interpreting Model Performance

After applying various encoding techniques to incorporate categorical variables into our linear regression model, it’s essential to interpret the performance of each method. We evaluated the models using the Mean Squared Error (MSE), a metric that quantifies the average squared difference between predicted and actual values. Let’s delve into the interpretation of our results.

One-Hot Encoding Model

The One-Hot Encoding model exhibited an MSE of 225.0, which corresponds to a root mean squared error of 15. Given that prices in our dataset only range from 5 to 25, this level of error is high, which is unsurprising with only four training samples.

Label Encoding Model

The Label Encoding model showed essentially the same performance, with an MSE of approximately 225.0. On this tiny dataset, the integer codes assigned to the colors produced predictions with a similar level of error to the One-Hot Encoding method.

Binary Encoding Model

The Binary Encoding model also yielded an MSE of 225.0, aligning its performance closely with the other encoding techniques on this small test set.

It’s crucial to note that the interpretation of MSE values depends on the context of the problem and the range of the target variable. Lower MSE values are generally preferred, but what constitutes a “good” performance varies based on the specific use case. Additional considerations, such as interpretability, computational efficiency, and the nature of the categorical variables, should guide the selection of the most suitable encoding method for a given scenario.

In summary, our models, regardless of the encoding technique used, demonstrated similar levels of prediction accuracy. When choosing an encoding method, it’s essential to weigh various factors and consider the specific requirements of the problem at hand.

Error Handling and Considerations

  1. Handling Missing Values:
  • Prior to encoding, address and impute missing values in the categorical variables appropriately (see the sketch after this list).
  2. Handling New Categories in Test Data:
  • Ensure that the encoding method used in training is applied consistently to the test data.
  • Handle new or unseen categories in test data, either by imputing or encoding them appropriately.
  3. Handling Ordinal Information:
  • When using label encoding or binary encoding, validate that the encoding captures meaningful ordinal relationships.
  • Check for consistency in ordinal encoding across training and test datasets.
  4. Addressing Multicollinearity:
  • Be aware of multicollinearity issues that may arise, especially with one-hot encoding.
  • Consider regularization techniques or feature selection to mitigate multicollinearity.
  5. Choosing the Right Encoding Method:
  • Select the encoding method based on the nature of the categorical variable and its relationship with the target variable.
  • Experiment with different encoding methods and assess their impact on model performance.
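As a concrete illustration of the first two points, missing values and unseen categories can both be handled inside the preprocessing step. A minimal sketch, assuming one-hot encoding is the chosen method:

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Fill missing categories with the most frequent training value, then
# encode; handle_unknown='ignore' maps categories never seen during
# fitting to an all-zeros row instead of raising an error
categorical_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])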

Conclusion

In this article, we have explored how to use sklearn to build a linear regression model with categorical variables. We have seen that we need to use special techniques to convert categorical variables into numerical variables that can be used in a linear regression model, and we have looked at three techniques: one-hot encoding, label encoding, and binary encoding. By using these techniques, we can ensure that our linear regression model is accurate and reliable, even when dealing with categorical variables.


About Saturn Cloud

Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.