How to Scrape an HTML Table with Beautiful Soup into Pandas

As a data scientist or software engineer, you may often encounter the need to extract data from an HTML table on a website. This task can seem daunting at first, especially if you are not familiar with the necessary tools and techniques. Fortunately, with the help of Python and the Beautiful Soup library, extracting data from an HTML table is a relatively straightforward process.

In this article, we will walk through the steps of scraping an HTML table using Beautiful Soup and then importing the data into a Pandas DataFrame. By the end of this article, you will have a solid understanding of how to extract data from an HTML table and use it in your data science or software engineering projects.

Table of Contents

  1. What is Beautiful Soup?
  2. Scraping an HTML table with Beautiful Soup
  3. Conclusion

What is Beautiful Soup?

Beautiful Soup is a Python library designed for web scraping purposes. It allows you to parse HTML and XML documents, extract data, and navigate the parse tree with ease. Beautiful Soup provides a simple interface for working with HTML and XML files, making it an ideal tool for web scraping.
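
As a quick illustration, the snippet below parses a small HTML string and pulls an element out of it:

from bs4 import BeautifulSoup

# Parse a tiny HTML snippet and navigate the resulting tree
html = "<html><body><p class='intro'>Hello, soup!</p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

print(soup.p.text)                      # Hello, soup!
print(soup.find('p', class_='intro'))   # <p class="intro">Hello, soup!</p>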

Scraping an HTML table with Beautiful Soup

To scrape an HTML table using Beautiful Soup, you will need to follow these steps:

  1. Install Beautiful Soup

Before you can start using Beautiful Soup, you will need to install it. You can install Beautiful Soup using pip, a package manager for Python:

pip install beautifulsoup4

  2. Import the necessary libraries

After installing Beautiful Soup, you will need to import the necessary libraries into your Python script:

from bs4 import BeautifulSoup
import requests
import pandas as pd

The requests library is used to make HTTP requests to the website from which you want to scrape the HTML table. The pandas library is used to create a DataFrame from the scraped data.

  3. Make an HTTP request to the website

Next, you will need to make an HTTP request to the website from which you want to scrape the HTML table. You can do this using the requests.get() method:

url = 'https://gcoins.net/en/catalog/view/45518'
response = requests.get(url)

Replace https://gcoins.net/en/catalog/view/45518 with the URL of the website from which you want to scrape the HTML table. In this tutorial, we will use a table that shows some old coin prices.
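
In practice, requests can fail, and some sites reject clients that do not send a browser-like User-Agent header. A small defensive variant of the request above (the header value here is just an illustrative example):

headers = {'User-Agent': 'Mozilla/5.0'}  # illustrative value; some sites block the default one
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # raise an exception on 4xx/5xx responses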

  4. Parse the HTML document

After making the HTTP request, you will need to parse the HTML document using Beautiful Soup. You can do this by passing the response.text attribute to the BeautifulSoup() constructor:

soup = BeautifulSoup(response.text, 'html.parser')

The html.parser argument tells Beautiful Soup to use the built-in HTML parser to parse the HTML document.
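
If you have the third-party lxml package installed, you can pass 'lxml' instead for faster and more lenient parsing; html.parser has the advantage of requiring no extra dependency.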

  5. Find the HTML table

Once the HTML document has been parsed, you can find the HTML table by inspecting the HTML code of the website. You will need to find the HTML table using its tag name, ID, class, or other attributes.

For example, if the HTML table has a class of subs noBorders evenRows, you can find it using the soup.find() method:

table = soup.find('table', attrs={'class':'subs noBorders evenRows'})

Replace subs noBorders evenRows with the class (or other identifying attribute) of the HTML table you want to scrape.
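
Note that soup.find() returns None when no matching element exists, so it is worth checking the result before you operate on it:

if table is None:
    raise ValueError('Could not find the expected table on the page')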

  6. Extract the data from the HTML table

After finding the HTML table, you can extract the data from it using Beautiful Soup. You will need to loop through the rows and columns of the HTML table and extract the text from each cell.

For example, you can extract the data from the HTML table as follows:

data = []
for row in table.find_all('tr'):       # iterate over every row in the table
    row_data = []
    for cell in row.find_all('td'):    # iterate over the data cells in this row
        row_data.append(cell.text)     # keep the cell's text content
    data.append(row_data)

This code loops through each row of the HTML table, extracts the text from each cell, and appends the row's data to a list of lists.
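
Header rows in HTML tables are usually marked up with th cells rather than td, which is why the first row in the output below comes out empty. Assuming this table keeps its headers in th cells, one way to collect them is:

headers = [th.text.strip() for th in table.find_all('th')]

You could then pass them to pd.DataFrame(data, columns=headers), provided the header count matches the number of data columns.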

  7. Convert the data to a Pandas DataFrame

After extracting the data from the HTML table, you can convert it to a Pandas DataFrame using the pd.DataFrame() constructor:

df = pd.DataFrame(data)
print(df)

Output:

       0        1     2     3          4     5               6
0   None     None  None  None       None  None            None
1                  1882          108,000   UNC               —
2                  1883          786,000   UNC         ~ $3.20
3          \n\n\n  1884        4,604,000   UNC   ~ $1.67–$5.77
4                  1885        1,314,000   UNC         ~ $2.56
5                  1886          444,000   UNC               —
6                  1888          413,000   UNC         ~ $2.31
7                  1889          568,000   UNC         ~ $2.05
8          \n\n\n  1890        2,137,000   UNC   ~ $1.03–$6.41
9                  1891          605,000   UNC               —
10                 1892          205,000   UNC         ~ $3.59
11         \n\n\n  1893          754,000   UNC  ~ $3.84–$11.28
12                 1894          532,000   UNC         ~ $2.56
13                 1895          423,000   UNC         ~ $1.92
14                 1896          174,000   UNC               —
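
As the output shows, the raw scrape contains stray newlines, blank cells, and an empty first row (the th-only header row). A minimal cleanup sketch, assuming every column holds strings:

# Strip whitespace from every cell; whitespace-only cells become empty strings
df = df.apply(lambda col: col.str.strip())
# Treat empty cells as missing values, then drop rows and columns that are entirely empty
df = df.replace('', pd.NA)
df = df.dropna(how='all').dropna(axis=1, how='all')
print(df)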

Conclusion

In this article, we have walked through the steps of scraping an HTML table using Beautiful Soup and then importing the data into a Pandas DataFrame. We have also shown a quick way to clean up the raw output.

Scraping HTML tables is a common task in data science and software engineering, and Beautiful Soup provides a simple and effective way to accomplish this task. By following the steps outlined in this article, you should now be able to scrape HTML tables with ease and use the extracted data in your projects.


About Saturn Cloud

Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.