# How to Find Percentile Stats of a Given Column Using Pandas

As a data scientist or software engineer, you might come across a situation where you need to analyze the distribution of a dataset and find the percentile statistics of a specific column. In such cases, Pandas is the go-to library for data manipulation and analysis in Python. In this post, we will discuss how to find percentile statistics of a given column using Pandas.

## Table of Contents

- What are Percentile Statistics?
- Step-by-Step Guide to Finding Percentile Statistics Using Pandas
- Common Errors
- Best Practices
- Conclusion

## What are Percentile Statistics?

Percentiles describe where a value falls within the distribution of a column. The p-th percentile is the value at or below which roughly p percent of the observations lie. For example, the 50th percentile (also known as the median) splits the data into two equal halves, and the 25th, 50th, and 75th percentiles (the first, second, and third quartiles) together split it into four equal parts. Percentile statistics are useful in understanding the distribution of a dataset and identifying outliers.
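As a quick illustration of that definition, the sketch below uses a small made-up Series (the numbers are illustrative, not from any real dataset) and checks what share of the values sits at or below the median and the first quartile.

```python
import pandas as pd

# Illustrative values only; any numeric Series would work.
values = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

median = values.quantile(0.50)          # 50th percentile
first_quartile = values.quantile(0.25)  # 25th percentile

# Share of observations at or below each percentile value.
share_below_median = (values <= median).mean()
share_below_q1 = (values <= first_quartile).mean()

print(median, share_below_median)      # 45.0 0.5
print(first_quartile, share_below_q1)  # 27.5 0.25
```

With only a handful of data points the shares will not always land exactly on the requested percentage, which is why the definition says "roughly".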

Let’s consider the following DataFrame:

```
       name  age  salary
0     Alice   25   95767
1       Bob   30   50967
2   Charlie   52   52042
3     David   46   98117
4       Eva   46   96719
5     Frank   51   86764
6     Grace   50   62443
7     Henry   46   58686
8       Ivy   30   95121
9      Jack   58   59271
10    Katie   38   70260
11     Liam   47   97618
12      Mia   48   68332
13   Nathan   47   54634
14   Olivia   37   89439
15     Paul   28   88806
16    Quinn   51   69256
17   Rachel   31   64053
18      Sam   52   85306
19    Tyler   59   68671
```
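If you do not have this data in a file, one way to follow along is to build the same DataFrame in memory. The sketch below copies the values from the table above; the list names are just illustrative choices.

```python
import pandas as pd

# Values copied from the sample table above.
names = ["Alice", "Bob", "Charlie", "David", "Eva", "Frank", "Grace",
         "Henry", "Ivy", "Jack", "Katie", "Liam", "Mia", "Nathan",
         "Olivia", "Paul", "Quinn", "Rachel", "Sam", "Tyler"]
ages = [25, 30, 52, 46, 46, 51, 50, 46, 30, 58,
        38, 47, 48, 47, 37, 28, 51, 31, 52, 59]
salaries = [95767, 50967, 52042, 98117, 96719, 86764, 62443, 58686,
            95121, 59271, 70260, 97618, 68332, 54634, 89439, 88806,
            69256, 64053, 85306, 68671]

df = pd.DataFrame({"name": names, "age": ages, "salary": salaries})
print(df.shape)  # (20, 3)
```

To use this data with a later `pd.read_csv('data.csv')` call, one option is to save it first with `df.to_csv('data.csv', index=False)`.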

## Step-by-Step Guide to Finding Percentile Statistics Using Pandas

To find percentile statistics of a given column using Pandas, we will follow these steps:

- Import the Pandas library and read the dataset into a Pandas DataFrame.
- Identify the column for which you want to find percentile statistics.
- Use the `quantile()` function to find the percentile statistics.

Let’s dive into each step in detail.

### Step 1: Import the Pandas Library and Read the Dataset into a Pandas DataFrame

To use Pandas, we first need to import the library. We can do this using the following code:

```
import pandas as pd
```

Next, we need to read the dataset into a Pandas DataFrame. We can use the `read_csv()` function to read a CSV file into a DataFrame. For example, if our dataset is stored in a file called `data.csv`, we can read it into a DataFrame using the following code:

```
df = pd.read_csv('data.csv')
```
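After loading, it is worth confirming that the columns you plan to analyze were parsed as numbers. The following sketch reads a few rows from an in-memory string instead of a file (the `csv_text` variable is a stand-in for the contents of `data.csv`) and inspects the resulting dtypes.

```python
import io

import pandas as pd

# Stand-in for the first rows of data.csv; a real file would be
# read with pd.read_csv('data.csv') instead.
csv_text = """name,age,salary
Alice,25,95767
Bob,30,50967
Charlie,52,52042
"""

df = pd.read_csv(io.StringIO(csv_text))

# Numeric columns should show an integer or float dtype here.
print(df.dtypes)
```

If a numeric column shows up as `object`, it usually contains stray text and needs cleaning before percentiles can be computed.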

### Step 2: Identify the Column for Which You Want to Find Percentile Statistics

Once we have the dataset loaded into a DataFrame, we need to identify the column for which we want to find percentile statistics. We can do this by referring to the column name. For example, if we want to find percentile statistics for the `age` column, we can use the following code:

```
column_name = 'age'
```
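A typo in the column name or a non-numeric column is a common source of confusing errors later on. One possible guard is sketched below; `check_numeric_column` is a hypothetical helper written for this post, not part of Pandas, and the small DataFrame is illustrative.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Small illustrative frame; in practice this would be the loaded dataset.
df = pd.DataFrame({"name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 52]})
column_name = "age"


def check_numeric_column(frame, name):
    """Hypothetical helper: return the column if it exists and is numeric."""
    if name not in frame.columns:
        raise KeyError(f"Column {name!r} not found; available: {list(frame.columns)}")
    if not is_numeric_dtype(frame[name]):
        raise TypeError(f"Column {name!r} is not numeric")
    return frame[name]


ages = check_numeric_column(df, column_name)
print(ages.tolist())  # [25, 30, 52]
```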

### Step 3: Find the Percentile Statistics

#### Method 1: Using the `quantile()` Function

The `quantile()` function is used to find the percentile statistics of a given column in a Pandas DataFrame. We can use this function to find any percentile, such as the median (50th percentile), first quartile (25th percentile), third quartile (75th percentile), etc.

The `quantile()` function takes the percentile as a fraction between 0 and 1, either as a single value or as a list of values. When the requested percentile falls between two observations, it interpolates linearly between them by default. For example, to find the median (50th percentile), we can use the following code:

```
median = df[column_name].quantile(0.5)
print(median)
```

Output:

```
46.5
```

Similarly, to find the first quartile (25th percentile) and third quartile (75th percentile), we can use the following code:

```
q1 = df[column_name].quantile(0.25)
q3 = df[column_name].quantile(0.75)
```
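The quartiles are often combined into the interquartile range (IQR) to flag outliers. The sketch below applies the common 1.5 × IQR rule of thumb to the ages from the sample table; the 1.5 multiplier is a convention, not something Pandas prescribes.

```python
import pandas as pd

# Ages copied from the sample table above.
ages = pd.Series([25, 30, 52, 46, 46, 51, 50, 46, 30, 58,
                  38, 47, 48, 47, 37, 28, 51, 31, 52, 59])

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Conventional fences: values outside them are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(q1, q3, iqr)    # 35.5 51.0 15.5
print(len(outliers))  # 0
```

For this sample no age falls outside the fences, so the rule flags nothing.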

We can also find any other percentile by specifying the percentile value as a decimal. For example, to find the 90th percentile, we can use the following code:

```
p90 = df[column_name].quantile(0.9)
```
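Instead of calling `quantile()` once per percentile, you can pass a list and get all values back in one Series. The sketch below does this for the sample ages and also shows `describe()`, which accepts a `percentiles` argument for a quick summary.

```python
import pandas as pd

# Ages copied from the sample table above.
ages = pd.Series([25, 30, 52, 46, 46, 51, 50, 46, 30, 58,
                  38, 47, 48, 47, 37, 28, 51, 31, 52, 59])

# A list of fractions returns a Series indexed by those fractions.
quantiles = ages.quantile([0.25, 0.5, 0.75, 0.9])
print(quantiles)

# describe() reports count, mean, std, min, max and the requested percentiles.
summary = ages.describe(percentiles=[0.1, 0.9])
print(summary)
```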

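#### Method 2: Using `numpy.percentile`

NumPy's `percentile()` function computes the same statistics, but it expects percentiles on a 0 to 100 scale rather than as fractions between 0 and 1: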

```
import numpy as np
import pandas as pd
# Load the employee data CSV file into a Pandas DataFrame
df = pd.read_csv('data.csv')
# Extract the salary column for analysis
salary_data = df['salary']
# Define the desired percentiles
percentiles = [25, 50, 75]
# Calculate percentiles using numpy.percentile
percentile_values = np.percentile(salary_data, percentiles)
print(f"Salary Percentiles {percentiles}: {percentile_values}")
```

Output:

```
Salary Percentiles [25, 50, 75]: [61650.  69758.  90859.5]
```
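Both methods use linear interpolation by default, so they should agree once the scales are aligned. The sketch below checks this on the sample ages and also shows the `interpolation` argument of `quantile()`, which controls what happens when a percentile falls between two observations.

```python
import numpy as np
import pandas as pd

# Ages copied from the sample table above.
ages = pd.Series([25, 30, 52, 46, 46, 51, 50, 46, 30, 58,
                  38, 47, 48, 47, 37, 28, 51, 31, 52, 59])

fractions = [0.25, 0.5, 0.75]        # Pandas scale: 0 to 1
percents = [25, 50, 75]              # NumPy scale: 0 to 100

pandas_values = ages.quantile(fractions).to_numpy()
numpy_values = np.percentile(ages, percents)
print(np.allclose(pandas_values, numpy_values))  # True

# With an even number of rows the median lies between two observations.
# 'lower' and 'higher' pick an actual data point instead of interpolating.
median_lower = ages.quantile(0.5, interpolation="lower")
median_higher = ages.quantile(0.5, interpolation="higher")
print(median_lower, median_higher)
```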

## Common Errors

### Error 1: Missing Data

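Missing values affect the two methods differently: Pandas' `quantile` skips `NaN` values by default, while `numpy.percentile` returns `nan` if the input contains any. Handle missing data appropriately using methods like `dropna` or imputation, especially if your dataset contains missing salary values.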
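The difference in how the two libraries treat missing values is easy to see on a small example. The sketch below uses made-up salary figures with one missing entry.

```python
import numpy as np
import pandas as pd

# Illustrative salaries with one missing value.
salaries = pd.Series([50000.0, 60000.0, np.nan, 80000.0])

# Pandas ignores the NaN when computing the quantile.
pandas_median = salaries.quantile(0.5)

# NumPy propagates the NaN unless it is removed or skipped explicitly.
numpy_median = np.percentile(salaries.to_numpy(), 50)
numpy_median_clean = np.percentile(salaries.dropna(), 50)
numpy_median_skipna = np.nanpercentile(salaries.to_numpy(), 50)

print(pandas_median)        # 60000.0
print(numpy_median)         # nan
print(numpy_median_clean)   # 60000.0
print(numpy_median_skipna)  # 60000.0
```

Dropping rows is only one option; whether to drop or impute depends on why the values are missing.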

### Error 2: Incorrect Percentile Value

Ensure that the specified percentile values are within the valid range (0 to 100 for `numpy.percentile` and 0 to 1 for Pandas' `quantile`).
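Mixing up the two scales typically surfaces as a `ValueError`. The sketch below triggers that error on purpose and shows one way to avoid it with a small wrapper; `percentile_of` is a hypothetical helper for this post, and the ages are illustrative.

```python
import pandas as pd

# Illustrative ages.
ages = pd.Series([25, 30, 52, 46, 46])

# Passing 50 instead of 0.5 to quantile() is out of range and raises.
try:
    ages.quantile(50)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)


def percentile_of(series, percent):
    """Hypothetical helper: accept a 0-100 percent and convert it for quantile()."""
    if not 0 <= percent <= 100:
        raise ValueError(f"percent must be between 0 and 100, got {percent}")
    return series.quantile(percent / 100)


print(percentile_of(ages, 50))  # 46.0
```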

## Best Practices

- Handle missing data appropriately using methods like `dropna` or imputation.
- Verify column names and ensure they match your DataFrame structure.
- Choose the method that best suits your needs: `numpy.percentile` for more flexibility or Pandas' `quantile` for simplicity.

## Conclusion

In this post, we discussed how to find percentile statistics of a given column using Pandas. We learned that percentile statistics are useful in understanding the distribution of a dataset and identifying outliers. We also went through a step-by-step guide to finding percentile statistics using Pandas. By following these steps, you can easily find the percentile statistics of any column in a Pandas DataFrame.
