WEB SCRAPING PROJECT

Wikipedia Data Scraper

An automated Python script that extracts, processes, and exports data about the largest US companies directly from Wikipedia.

Python BeautifulSoup Pandas Automation

Project Overview

This project uses web scraping to collect Wikipedia's list of the largest companies in the United States by revenue. BeautifulSoup parses the page's HTML and Pandas handles the data manipulation, so the script turns an HTML table into a clean, analysis-ready dataset saved as a CSV file.

The Scraping Process

1. Connect & Parse

First, we use the requests library to fetch the HTML content from Wikipedia, and then create a BeautifulSoup object to parse the raw HTML structure.

main.py
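from bs4 import BeautifulSoup
import requests

url = 'https://en.wikipedia.org/wiki/...'

# Fail fast on network problems and HTTP errors instead of parsing an error page
page = requests.get(url, timeout=10)
page.raise_for_status()

# Name the parser explicitly; 'html.parser' ships with Python
soup = BeautifulSoup(page.text, 'html.parser')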
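
To see what the parse step produces without touching the network, the sketch below feeds a small hand-written HTML string to BeautifulSoup. The markup in SAMPLE_HTML is a hypothetical stand-in, not the real Wikipedia page, and the parser is assumed to be Python's built-in html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page; the real Wikipedia markup is far larger.
SAMPLE_HTML = """
<html><head><title>Sample page</title></head>
<body>
  <table><tr><th>Note</th></tr><tr><td>Not the data we want</td></tr></table>
  <table><tr><th>Rank</th><th>Name</th></tr><tr><td>1</td><td>Walmart</td></tr></table>
</body></html>
"""

# Parsing turns the raw string into a tree that can be searched by tag name.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

page_title = soup.title.get_text(strip=True)
table_count = len(soup.find_all("table"))
print(page_title, table_count)
```

Counting the tables this way is a quick check before committing to a hard-coded index in the next step.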

2. Target Data

The script selects the second table on the page (index 1, since Python lists are zero-based), which contains the company data, and reads its column headers to initialize the dataset.

main.py
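import pandas as pd

# find_all() returns a list; index 1 is the second table on the page
table = soup.find_all('table')[1]

# Header cells become the DataFrame's column names
world_titles = table.find_all('th')
titles = [t.text.strip() for t in world_titles]

df = pd.DataFrame(columns=titles)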
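
A hard-coded table index stops pointing at the right table if the page gains or loses a table. One possible alternative, sketched here under assumptions, is to pick the table by the headers it contains. The helper find_table_with_headers and the SAMPLE_HTML string are hypothetical and exist only for this illustration; they are not part of the original script.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: two tables, only one with the data.
SAMPLE_HTML = """
<table><tr><th>Note</th></tr><tr><td>Unrelated table</td></tr></table>
<table>
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><td>1</td><td>Walmart</td></tr>
</table>
"""


def find_table_with_headers(soup, required):
    """Return the first <table> whose header cells include every name in `required`."""
    wanted = set(required)
    for candidate in soup.find_all("table"):
        headers = {th.get_text(strip=True) for th in candidate.find_all("th")}
        if wanted <= headers:
            return candidate
    return None  # no table matched


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = find_table_with_headers(soup, ["Rank", "Name"])
titles = [th.get_text(strip=True) for th in table.find_all("th")]
print(titles)
```

The trade-off is a few extra lines in exchange for a selection rule that depends on the table's content rather than its position.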

3. Extract & Export

We loop through every row after the header, strip whitespace from each cell, append the row to the Pandas DataFrame, and finally export the result to a CSV file.

main.py
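column_data = table.find_all('tr')

# Skip the header row, which contains <th> cells rather than <td>
for row in column_data[1:]:
    row_data = row.find_all('td')
    data = [d.text.strip() for d in row_data]
    # Skip rows whose cell count does not match the header (assigning them would raise)
    if len(data) == len(titles):
        df.loc[len(df)] = data

df.to_csv('companies.csv', index=False)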
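
Appending to a DataFrame one row at a time works for a table of this size, but a common alternative is to collect the rows in a plain list and build the DataFrame once. The sketch below shows that variant on a hypothetical three-column sample (SAMPLE_HTML is invented for illustration), including a row whose cell count does not match the header.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample table; the middle data row has the wrong number of cells.
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Name</th><th>Industry</th></tr>
  <tr><td>1</td><td> Walmart </td><td>Retail</td></tr>
  <tr><td colspan="3">Footnote row with a single cell</td></tr>
  <tr><td>2</td><td>Amazon</td><td>Retail / Cloud</td></tr>
</table>
"""

table = BeautifulSoup(SAMPLE_HTML, "html.parser").find("table")
titles = [th.get_text(strip=True) for th in table.find_all("th")]

rows = []
for tr in table.find_all("tr")[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) == len(titles):  # keep only rows that fit the header
        rows.append(cells)

# Build the DataFrame in one step instead of growing it row by row.
df = pd.DataFrame(rows, columns=titles)
print(df)
```

From here, df.to_csv('companies.csv', index=False) would export the result exactly as in the original script.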

Output Data Preview

Below is a sample of the cleaned data extracted by the script, showing the top companies by revenue.

Rank  Name          Industry        Revenue (USD millions)  Growth  Employees
1     Walmart       Retail          611,289                 6.7%    2,100,000
2     Amazon        Retail / Cloud  513,983                 9.4%    1,540,000
3     ExxonMobil    Petroleum       413,680                 44.8%   62,000
4     Apple         Electronics     394,328                 7.8%    164,000
5     UnitedHealth  Healthcare      324,162                 12.7%   400,000
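
Every scraped value arrives as text, so figures such as 611,289 or 6.7% need converting before any arithmetic is possible. The sketch below shows one way to do that with Pandas. It uses a hand-built DataFrame holding three rows from the preview, with column names shortened for illustration; it is a possible follow-up step, not part of the original script.

```python
import pandas as pd

# Three preview rows, stored as strings the way the scraper produces them.
df = pd.DataFrame({
    "Name": ["Walmart", "Amazon", "ExxonMobil"],
    "Revenue": ["611,289", "513,983", "413,680"],
    "Growth": ["6.7%", "9.4%", "44.8%"],
    "Employees": ["2,100,000", "1,540,000", "62,000"],
})

# Remove thousands separators, then convert to integers.
for col in ["Revenue", "Employees"]:
    df[col] = df[col].str.replace(",", "", regex=False).astype(int)

# Drop the trailing percent sign and convert to floats.
df["Growth"] = df["Growth"].str.rstrip("%").astype(float)

print(df.dtypes)
```

With numeric columns in place, operations such as sorting by revenue or summing employees behave as expected instead of comparing strings.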

Outcomes & Final Thoughts

This project shows how Python can be used for data acquisition: requests fetches the page, BeautifulSoup isolates the target table, and Pandas exports the cleaned rows to a CSV file.

© 2026 Sai Suraj Matta. All rights reserved.