How to Use Logistic Regression predict_proba Method in scikit-learn

As a data scientist, you may often come across situations where you need to predict the probability of an event occurring. Logistic regression is a popular algorithm used for this purpose. scikit-learn, a popular machine learning library in Python, provides a predict_proba method to predict the probability of an event using logistic regression.

In this article, we will discuss how to use the predict_proba method in scikit-learn to predict the probability of an event using logistic regression.

Table of Contents

  1. What is Logistic Regression?
  2. What is the predict_proba Method in scikit-learn?
  3. How to Use the predict_proba Method in scikit-learn?
  4. Handling Multi-class Scenarios
  5. Conclusion

What is Logistic Regression?

Logistic regression is a statistical method used to analyze the relationship between a dependent variable and one or more independent variables. It is commonly used to predict the probability of a binary outcome, such as whether a customer will buy a product or not.

The logistic regression model is based on the logistic function, which is a sigmoid curve that maps any real-valued number into a value between 0 and 1. The logistic function is defined as:

$$f(x) = \frac{1}{1+e^{-x}}$$

where $x$ is the input to the function.

The logistic regression model estimates the parameters of the logistic function using maximum likelihood estimation. Once the parameters are estimated, the model can be used to predict the probability of an event occurring.
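To build intuition, the logistic function above can be sketched in a few lines of plain Python using only the standard library:

```python
import math

def sigmoid(x):
    # Logistic function: squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0))   # 0.5, the midpoint of the curve
print(sigmoid(4))   # close to 1
print(sigmoid(-4))  # close to 0; note sigmoid(-x) == 1 - sigmoid(x)
```

This is the same function logistic regression applies to a linear combination of the input features; the parameters estimated by maximum likelihood are the coefficients of that linear combination.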

What is the predict_proba Method in scikit-learn?

scikit-learn provides a predict_proba method for logistic regression, which returns the predicted probability of each class. In binary classification, predict_proba returns a 2-dimensional array of shape (n_samples, 2), where n_samples is the number of samples; the first column holds the probability of the negative class (label 0) and the second the probability of the positive class (label 1), matching the order of the classes_ attribute of the fitted model.
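As a quick sanity check, the shape of the array and the column ordering can be verified against the classes_ attribute. The sketch below uses a small synthetic dataset generated with make_classification (max_iter is raised only as a precaution against convergence warnings):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic binary dataset
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)
print(proba.shape)   # (100, 2): one row per sample, one column per class
print(clf.classes_)  # columns follow this order, here [0 1]

# Each row is a probability distribution, so it sums to 1
assert np.allclose(proba.sum(axis=1), 1.0)
```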

How to Use the predict_proba Method in scikit-learn?

To use the predict_proba method in scikit-learn, we first need to train a logistic regression model using the LogisticRegression class. The LogisticRegression class provides several parameters that can be tuned to improve the performance of the model.

Here is an example code snippet that trains a logistic regression model on a synthetic binary dataset generated with make_classification:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Generate a binary dataset
X, y = make_classification(n_samples=100, n_features=5, n_classes=2, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model
clf = LogisticRegression(random_state=42)
clf.fit(X_train, y_train)

Once the model is trained, we can use the predict_proba method to predict the probabilities of the test set:

# Predict the probabilities of the test set
proba = clf.predict_proba(X_test)

# Print the predicted probabilities
print(proba)

This will output something like:

[[8.46257648e-01 1.53742352e-01]
 [7.62136223e-03 9.92378638e-01]
 [2.87286831e-03 9.97127132e-01]
 [3.21162679e-02 9.67883732e-01]
 [1.84407648e-02 9.81559235e-01]
 [1.00737925e-03 9.98992621e-01]
 [3.37339288e-02 9.66266071e-01]
 [9.88364490e-01 1.16355104e-02]
 [2.48766220e-02 9.75123378e-01]
 [8.17053982e-01 1.82946018e-01]
 [9.35066986e-01 6.49330136e-02]
 [9.95837620e-01 4.16237981e-03]
 [9.64052233e-01 3.59477669e-02]
 [2.10579042e-01 7.89420958e-01]
 [9.99844362e-01 1.55638038e-04]
 [1.45450963e-04 9.99854549e-01]
 [1.35817845e-02 9.86418215e-01]
 [5.10531933e-02 9.48946807e-01]
 [5.39144538e-02 9.46085546e-01]
 [2.39720894e-03 9.97602791e-01]]

The output shows the result of predict_proba on the binary test set. Each row corresponds to one test sample, and the two columns hold the predicted probabilities of the negative and positive class, which sum to 1. In the first row, for example, the model assigns roughly 84.6% probability to the negative class and 15.4% to the positive class, so it would predict the negative label for that sample. Reading down the rows in the same way shows how confident the model is about each individual prediction.
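When hard class labels are needed rather than probabilities, predict is equivalent to taking the higher-probability column; predict_proba also makes it possible to apply a custom decision threshold. A minimal sketch, re-using the same synthetic dataset as above (the 0.3 threshold is purely illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, n_features=5, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = LogisticRegression(random_state=42).fit(X_train, y_train)

proba = clf.predict_proba(X_test)

# predict() is equivalent to choosing the higher-probability column
labels = clf.classes_[np.argmax(proba, axis=1)]
assert (labels == clf.predict(X_test)).all()

# A custom threshold trades precision for recall, e.g. flag positives above 0.3
custom = (proba[:, 1] >= 0.3).astype(int)
print(custom)
```

Lowering the threshold below 0.5 marks more samples as positive, which can be useful when missing a positive case is costlier than a false alarm.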

Handling Multi-class Scenarios

In this section, we move from binary classification to a multi-class setting using the Iris dataset, which has three classes. The workflow is the same: train a logistic regression model and call predict_proba on the test set. With K classes, the method returns an array of shape (n_samples, K), one column per class in the order given by clf.classes_. Each row is a probability distribution over the three classes, conveying the model's confidence in each possible label and showing how logistic regression extends beyond binary outcomes.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the iris dataset
iris = load_iris()

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Train a logistic regression model
clf = LogisticRegression(random_state=42)
clf.fit(X_train, y_train)

# Predict the probabilities of the test set
proba = clf.predict_proba(X_test)

# Print the predicted probabilities
print(proba)

Output:

[[3.78147862e-03 8.27214504e-01 1.69004018e-01]
 [9.46740893e-01 5.32589067e-02 1.99904361e-07]
 [8.70510521e-09 1.55660319e-03 9.98443388e-01]
 [6.42856249e-03 7.92141402e-01 2.01430035e-01]
 [1.43974610e-03 7.74275957e-01 2.24284297e-01]
 [9.55805352e-01 4.41944712e-02 1.76765833e-07]
 [7.75927457e-02 9.08099007e-01 1.43082472e-02]
 [1.61216501e-04 1.55692354e-01 8.44146430e-01]
 [2.20696580e-03 7.62640689e-01 2.35152345e-01]
 [2.83084164e-02 9.45795458e-01 2.58961261e-02]
 [4.39141123e-04 2.43364553e-01 7.56196306e-01]
 [9.68342654e-01 3.16572678e-02 7.80251731e-08]
 [9.72958886e-01 2.70410810e-02 3.33176722e-08]
 [9.62131478e-01 3.78684113e-02 1.10870046e-07]
 [9.79276337e-01 2.07235981e-02 6.47056469e-08]
 [4.53672614e-03 7.12721598e-01 2.82741676e-01]
 [7.21582631e-06 2.42162012e-02 9.75776583e-01]
 [2.73262531e-02 9.47683002e-01 2.49907449e-02]
 [8.22517364e-03 8.31144892e-01 1.60629934e-01]
 [1.41757641e-05 3.59481509e-02 9.64037673e-01]
 [9.64392624e-01 3.56071832e-02 1.92775020e-07]
 [1.31235293e-03 3.99179236e-01 5.99508411e-01]
 [9.61649434e-01 3.83503054e-02 2.61037286e-07]
 [1.85177264e-05 4.58727522e-02 9.54108730e-01]
 [1.63455879e-06 2.58916898e-02 9.74106676e-01]
 [9.31482222e-05 1.05072299e-01 8.94834552e-01]
 [8.67587969e-06 5.83363810e-02 9.41654943e-01]
 [4.29268161e-06 1.88640280e-02 9.81131679e-01]
 [9.66875997e-01 3.31238670e-02 1.35722265e-07]
 [9.56339017e-01 4.36607511e-02 2.32316615e-07]]

In each row, the three columns give the predicted probabilities for the three Iris classes (setosa, versicolor, and virginica). In the first row, for instance, the model assigns approximately 0.4% to setosa, 82.7% to versicolor, and 16.9% to virginica, so versicolor is the predicted class. The highest probability in each row marks the predicted class for that instance, and the spread of the remaining probability mass indicates how certain the model is about that call, illustrating how logistic regression handles multi-class classification.
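The per-row probabilities can be mapped back to species names by taking the most probable column and indexing into target_names. A minimal sketch (max_iter is raised here only as a precaution against convergence warnings; the article's snippet uses the default):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)
clf = LogisticRegression(max_iter=1000, random_state=42).fit(X_train, y_train)

proba = clf.predict_proba(X_test)

# Index of the most probable class per row, mapped to the species name
best = np.argmax(proba, axis=1)
names = iris.target_names[best]
print(names[:5])
```

Because classes_ for the Iris dataset is [0, 1, 2], the argmax index can be used directly to look up the species name.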

Conclusion

In summary, this article has delved into the practical application of logistic regression, shedding light on its ability to predict the probability of binary outcomes and navigate multi-class scenarios. A pivotal aspect of our exploration has been the predict_proba method in scikit-learn, offering a robust means to gauge the likelihood of different classes.

Our journey through logistic regression, illustrated with examples from both binary and multi-class datasets, has underscored the algorithm’s adaptability across a spectrum of classification tasks. Whether forecasting customer actions or categorizing diverse species, logistic regression, anchored in its sigmoidal logistic function, emerges as a versatile tool for data scientists.

Appreciating the intricacies of predict_proba and the dynamic relationship between probabilities and class predictions equips practitioners with the tools to refine predictive models and make judicious decisions. For those embarking on their data science journey, we encourage hands-on experimentation with the provided code snippets, exploration of diverse datasets, and consultation of the scikit-learn documentation for deeper insights into the world of logistic regression and its multifaceted applications.


About Saturn Cloud

Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.