WEB SCRAPING PROJECT
Wikipedia Data Scraper
An automated Python script that extracts, processes, and exports data about the largest US companies directly from Wikipedia.
Project Overview
This project demonstrates the power of web scraping by automatically collecting a list of the largest companies in the United States by revenue. By utilizing the BeautifulSoup library for HTML parsing and Pandas for data manipulation, the script transforms unstructured web data into a clean, analytical dataset saved as a CSV file.
The Scraping Process
1. Connect & Parse
First, we use the requests library to fetch the HTML content from Wikipedia, and
then create a BeautifulSoup object to parse the raw HTML structure.
from bs4 import BeautifulSoup
import requests
url = 'https://en.wikipedia.org/wiki/...'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
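The same connect-and-parse step can be exercised without a network call. The snippet below is a minimal sketch using an invented HTML string in place of the live Wikipedia page; it shows how the parsed soup object exposes the page's tables.

```python
from bs4 import BeautifulSoup

# Invented HTML standing in for the fetched Wikipedia page.
# Like the real page, it has more than one table, so indexing matters.
html = """
<html><body>
  <table><tr><th>Legend</th></tr></table>
  <table>
    <tr><th>Rank</th><th>Name</th></tr>
    <tr><td>1</td><td>Walmart</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table')
print(len(tables))                # 2: the snippet contains two tables
print(tables[1].find('td').text)  # '1': first cell of the data table
```

Passing `'html.parser'` selects Python's built-in parser, so the script has no extra parser dependency.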
2. Target Data
The script selects the table containing the company data, which is the second table on the page (index 1), then extracts its column headers to initialize the dataset.
table = soup.find_all('table')[1]
world_titles = table.find_all('th')
titles = [t.text.strip() for t in world_titles]
import pandas as pd
df = pd.DataFrame(columns=titles)
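The header-extraction step can be checked in isolation. This is a sketch using a small invented table rather than the real Wikipedia markup:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical table standing in for the Wikipedia one.
snippet = """
<table>
  <tr><th>Rank</th><th>Name</th><th>Revenue</th></tr>
  <tr><td>1</td><td>Walmart</td><td>611,289</td></tr>
</table>
"""

table = BeautifulSoup(snippet, 'html.parser').find('table')
titles = [th.text.strip() for th in table.find_all('th')]
df = pd.DataFrame(columns=titles)

print(titles)    # ['Rank', 'Name', 'Revenue']
print(df.shape)  # (0, 3): no rows yet, three named columns
```

Stripping whitespace with `.strip()` matters here because table cells scraped from real pages often carry trailing newlines.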
3. Extract & Export
We loop through every row, clean the text, append it to the Pandas DataFrame, and finally export the clean data to a CSV file.
rows = table.find_all('tr')
for row in rows[1:]:  # skip the header row
    row_data = row.find_all('td')
    data = [d.text.strip() for d in row_data]
    df.loc[len(df)] = data
df.to_csv('companies.csv', index=False)
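One caveat worth noting: the values scraped this way are still strings ('611,289', '6.7%'), so numeric analysis needs a cleaning pass. A minimal sketch, assuming hypothetical column names 'Revenue' and 'Growth' and invented sample values:

```python
import pandas as pd

# Invented sample mirroring the scraped string values; in the real
# script this DataFrame comes from the scraping loop above.
df = pd.DataFrame({
    'Name': ['Walmart', 'Amazon'],
    'Revenue': ['611,289', '513,983'],
    'Growth': ['6.7%', '9.4%'],
})

# Strip thousands separators and percent signs, then convert types.
df['Revenue'] = df['Revenue'].str.replace(',', '').astype(int)
df['Growth'] = df['Growth'].str.rstrip('%').astype(float)

print(df['Revenue'].tolist())  # [611289, 513983]
print(df['Growth'].tolist())   # [6.7, 9.4]
```

Doing this conversion before `to_csv` means the exported file can be loaded directly into numeric analysis without re-cleaning.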
Output Data Preview
Below is a sample of the cleaned data extracted by the script, showing the top companies by revenue.
| Rank | Name | Industry | Revenue (USD millions) | Growth | Employees |
|---|---|---|---|---|---|
| 1 | Walmart | Retail | 611,289 | 6.7% | 2,100,000 |
| 2 | Amazon | Retail / Cloud | 513,983 | 9.4% | 1,540,000 |
| 3 | ExxonMobil | Petroleum | 413,680 | 44.8% | 62,000 |
| 4 | Apple | Electronics | 394,328 | 7.8% | 164,000 |
| 5 | UnitedHealth | Healthcare | 324,162 | 12.7% | 400,000 |
Outcomes & Final Thoughts
This project successfully demonstrates the utility of Python for data acquisition. Key takeaways include:
- Automated Data Collection: Bypassed manual data entry by programmatically scraping web tables.
- HTML Parsing: Leveraged BeautifulSoup to navigate complex DOM structures and target specific HTML tags.
- Data Structuring: Used Pandas to transform raw list data into a structured DataFrame suitable for analysis.
- File Persistence: Exported the processed data to a persistent CSV format for reuse in later analysis.