Downloading a CSV from a URL and Converting it to a DataFrame using Python Pandas

In data science, Python’s Pandas library is a standard tool for data manipulation and analysis. A common task is downloading a CSV file from a URL and converting it into a DataFrame for further processing. This post walks through that process step by step, using the Requests library for the download and Pandas for the parsing.

Table of Contents

  1. Prerequisites
  2. Step-by-Step: Downloading a CSV from a URL
  3. Pros and Cons of This Method
  4. Common Errors and How to Handle Them
  5. Conclusion

Prerequisites

Before we start, make sure you have the following installed on your system:

  • Python 3.6 or later
  • Pandas library
  • Requests library

If you haven’t installed Pandas and Requests yet, you can do so using pip:

pip install pandas requests

Step-by-Step: Downloading a CSV from a URL

Step 1: Importing the Required Libraries

The first step is to import the necessary libraries. We will need the pandas library for creating the DataFrame and the requests library for downloading the CSV file.

import pandas as pd
import requests

Step 2: Downloading the CSV File

Next, we will download the CSV file from the URL. We will use the requests library’s get method to do this. The get method sends a GET request to the specified URL and returns the response.

url = "https://raw.githubusercontent.com/datasets/covid-19/main/data/countries-aggregated.csv"
response = requests.get(url)

In this example, we will use a real-world dataset related to COVID-19, specifically country-wise aggregated data.
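
Before parsing, it can help to look at what the request returned. The sketch below is only an illustration: summarize_response is a hypothetical helper name, not part of Requests, and it reads only standard attributes of the Response object (ok, status_code, headers, content).

```python
import requests


def summarize_response(response: requests.Response) -> dict:
    """Collect a few fields worth checking before parsing the body as CSV.

    Hypothetical helper for illustration; it only reads standard
    attributes of a requests.Response object.
    """
    return {
        "ok": response.ok,  # True when the status code is below 400
        "status_code": response.status_code,
        "content_type": response.headers.get("Content-Type", ""),
        "size_bytes": len(response.content),
    }
```

Calling summarize_response(response) on the response from Step 2 gives a quick view of the status code and the size of the body before any parsing is attempted.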

Step 3: Converting the CSV File to a DataFrame

After downloading the CSV file, we can convert it into a DataFrame using the pandas library’s read_csv function, which reads CSV data from a file path or file-like object and returns a DataFrame.

response = requests.get(url, timeout=30)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Save the content of the response to a local CSV file
    with open("downloaded_data.csv", "wb") as f:
        f.write(response.content)
    print("CSV file downloaded successfully")

    # Read the CSV file into a Pandas DataFrame.
    # This runs only after a successful download, so it never reads a missing or stale file.
    df = pd.read_csv("downloaded_data.csv")
else:
    print("Failed to download CSV file. Status code:", response.status_code)

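The response content is written to a local file, downloaded_data.csv, and read_csv then reads that file from disk. If you would rather not write a file, StringIO from the io module can wrap response.text in a file-like object that read_csv accepts directly.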
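
As a variation that skips the local file, the sketch below parses the response text in memory with StringIO. The function names csv_text_to_dataframe and download_csv_as_dataframe are hypothetical, and the 30-second timeout is an arbitrary assumption.

```python
from io import StringIO

import pandas as pd
import requests


def csv_text_to_dataframe(csv_text: str) -> pd.DataFrame:
    """Parse CSV text that is already in memory into a DataFrame."""
    # StringIO wraps the string in a file-like object that read_csv can consume.
    return pd.read_csv(StringIO(csv_text))


def download_csv_as_dataframe(url: str, timeout: float = 30.0) -> pd.DataFrame:
    """Download a CSV over HTTP and parse it without writing to disk."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return csv_text_to_dataframe(response.text)
```

With the url variable from Step 2, df = download_csv_as_dataframe(url) would produce the same DataFrame as the file-based version. read_csv also accepts an HTTP URL string directly, which is shorter still but gives less control over timeouts and error handling.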

Step 4: Exploring the DataFrame

Now that we have our DataFrame, we can start exploring it. Here are a few methods you can use:

  • df.head(): This method returns the first 5 rows of the DataFrame.
  • df.describe(): This method provides a statistical summary of the numeric columns in the DataFrame.
  • df.info(): This method prints a concise summary of the DataFrame, including the number of non-null entries in each column. It returns None, which is why wrapping it in print() adds a trailing None to the output.

print("\n--- HEAD ---")
print(df.head())
print("\n--- DESCRIBE ---")
print(df.describe())
print("\n--- INFO ---")
print(df.info())

Output:


--- HEAD ---
         Date      Country  Confirmed  Recovered  Deaths
0  2020-01-22  Afghanistan          0          0       0
1  2020-01-23  Afghanistan          0          0       0
2  2020-01-24  Afghanistan          0          0       0
3  2020-01-25  Afghanistan          0          0       0
4  2020-01-26  Afghanistan          0          0       0

--- DESCRIBE ---
          Confirmed     Recovered         Deaths
count  1.615680e+05  1.615680e+05  161568.000000
mean   7.361569e+05  1.453967e+05   13999.436089
std    3.578884e+06  9.748275e+05   59113.581271
min    0.000000e+00  0.000000e+00       0.000000
25%    1.220000e+03  0.000000e+00      17.000000
50%    2.369200e+04  1.260000e+02     365.000000
75%    2.558420e+05  1.797225e+04    4509.000000
max    8.062512e+07  3.097475e+07  988609.000000

--- INFO ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161568 entries, 0 to 161567
Data columns (total 5 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   Date       161568 non-null  object
 1   Country    161568 non-null  object
 2   Confirmed  161568 non-null  int64 
 3   Recovered  161568 non-null  int64 
 4   Deaths     161568 non-null  int64 
dtypes: int64(3), object(2)
memory usage: 6.2+ MB
None
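
The info() output shows the Date column with dtype object, meaning the dates were read as plain strings. A minimal sketch of one way to get real datetimes is the parse_dates parameter of read_csv. The three rows below are copied from the head() output above, and df_sample is a name used only for this illustration.

```python
from io import StringIO

import pandas as pd

# A few rows in the same shape as the dataset (copied from the head() output).
sample_csv = """Date,Country,Confirmed,Recovered,Deaths
2020-01-22,Afghanistan,0,0,0
2020-01-23,Afghanistan,0,0,0
2020-01-24,Afghanistan,0,0,0
"""

# parse_dates asks read_csv to convert the named column to datetime64.
df_sample = pd.read_csv(StringIO(sample_csv), parse_dates=["Date"])
print(df_sample.dtypes)

# With real datetimes, rows can be filtered by comparing against a Timestamp.
after_first_day = df_sample[df_sample["Date"] > pd.Timestamp("2020-01-22")]
print(len(after_first_day))
```

The same parse_dates=["Date"] argument can be passed when reading downloaded_data.csv.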

Pros and Cons of This Method

Pros:

  • Simple and straightforward implementation.
  • Suitable for smaller datasets.
  • No need for additional dependencies beyond Pandas and Requests.

Cons:

  • Not optimal for large datasets, because the entire response is held in memory before being written to disk and the whole file is then loaded into a single DataFrame.
  • Dependency on internet connectivity for downloading the file.
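
For larger files, one possible mitigation is to stream the download to disk and then process the CSV in pieces. The sketch below assumes that approach. stream_to_file and sum_column_in_chunks are hypothetical helper names, and the chunk sizes and timeout are arbitrary. It relies on the stream=True and iter_content features of Requests and the chunksize parameter of read_csv.

```python
import pandas as pd
import requests


def stream_to_file(url: str, dest_path: str, chunk_bytes: int = 1 << 20) -> None:
    """Download url to dest_path without holding the whole body in memory."""
    with requests.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()
        with open(dest_path, "wb") as f:
            # iter_content yields the body piece by piece as it arrives.
            for chunk in response.iter_content(chunk_size=chunk_bytes):
                f.write(chunk)


def sum_column_in_chunks(csv_source, column: str, chunksize: int = 50_000):
    """Sum one column while reading at most chunksize rows at a time."""
    total = 0
    # With chunksize set, read_csv returns an iterator of DataFrames.
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        total += chunk[column].sum()
    return total
```

For example, stream_to_file(url, "downloaded_data.csv") followed by sum_column_in_chunks("downloaded_data.csv", "Deaths") would total a column without ever loading the full file into one DataFrame.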

Common Errors and How to Handle Them

Error 1: ConnectionError

try:
    # A timeout is needed for the Timeout handler below to ever trigger
    response = requests.get(url, timeout=10)
    # Raise an HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
except requests.exceptions.HTTPError as errh:
    print("HTTP Error:", errh)
except requests.exceptions.ConnectionError as errc:
    print("Error Connecting:", errc)
except requests.exceptions.Timeout as errt:
    print("Timeout Error:", errt)
except requests.exceptions.RequestException as err:
    print("Error:", err)

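This code snippet handles the main errors that can occur during the download. raise_for_status() turns 4xx and 5xx responses into an HTTPError, and the specific exceptions are caught before the general RequestException, which is the base class for all of them. Note that Requests applies no timeout by default, so pass a timeout argument to get a Timeout error instead of waiting indefinitely.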

Error 2: File Not Found

try:
    df = pd.read_csv("downloaded_data.csv")
except FileNotFoundError:
    print("The specified CSV file was not found.")

This snippet covers the case where the local file is missing, for example because the download step failed and nothing was written to disk.
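
A related failure is a file that exists but is empty or malformed, for instance when the server returned something other than CSV. As a sketch, the hypothetical helper safe_read_csv below catches the two pandas exceptions that usually signal this, pandas.errors.EmptyDataError and pandas.errors.ParserError. Returning None on failure is just one possible design choice.

```python
from io import StringIO

import pandas as pd


def safe_read_csv(csv_source):
    """Return a DataFrame, or None if the input is empty or malformed."""
    try:
        return pd.read_csv(csv_source)
    except pd.errors.EmptyDataError:
        # Raised when there is no header or data to parse at all.
        print("The CSV input is empty.")
    except pd.errors.ParserError as err:
        # Raised, for example, when a row has more fields than the header.
        print("The CSV input could not be parsed:", err)
    return None
```

safe_read_csv accepts the same path or file-like object that read_csv does, so safe_read_csv("downloaded_data.csv") can stand in for the plain read_csv call.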

Conclusion

In this guide, we covered the process of downloading a CSV file from a URL and converting it into a Pandas DataFrame using Python. We discussed the pros and cons of this method, looked at common errors, and showed how to handle them. Incorporate these steps into your data analysis projects to work with remote datasets more reliably.

