WEB SCRAPING PROJECT
Wikipedia Data Scraper
An automated Python script that extracts, processes, and exports data about the largest US companies directly from Wikipedia.
Project Overview
This project demonstrates the power of web scraping by automatically collecting a list of the largest companies in the United States by revenue. By utilizing the BeautifulSoup library for HTML parsing and Pandas for data manipulation, the script transforms unstructured web data into a clean, analytical dataset saved as a CSV file.
The Scraping Process
1. Connect & Parse
First, we use the requests library to fetch the HTML content from Wikipedia, and
then create a BeautifulSoup object to parse the raw HTML structure.
from bs4 import BeautifulSoup
import requests
url = 'https://en.wikipedia.org/wiki/...'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
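The same connect-and-parse step can be exercised without a network call. The snippet below is a minimal sketch using an invented HTML string in place of the live Wikipedia page; it shows how the parsed soup object exposes the page's tables.

```python
from bs4 import BeautifulSoup

# Invented HTML standing in for the fetched Wikipedia page.
# Like the real page, it has more than one table, so indexing matters.
html = """
<html><body>
  <table><tr><th>Legend</th></tr></table>
  <table>
    <tr><th>Rank</th><th>Name</th></tr>
    <tr><td>1</td><td>Walmart</td></tr>
  </table>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table')
print(len(tables))                # 2: the snippet contains two tables
print(tables[1].find('td').text)  # '1': first cell of the data table
```

Passing `'html.parser'` selects Python's built-in parser, so the script has no extra parser dependency.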
2. Target Data
The script selects the table containing the company data, which is the second table on the page (index 1), then extracts its column headers to initialize the dataset.
table = soup.find_all('table')[1]
world_titles = table.find_all('th')
titles = [t.text.strip() for t in world_titles]
import pandas as pd
df = pd.DataFrame(columns=titles)
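The header-extraction step can be checked in isolation. This is a sketch using a small invented table rather than the real Wikipedia markup:

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical table standing in for the Wikipedia one.
snippet = """
<table>
  <tr><th>Rank</th><th>Name</th><th>Revenue</th></tr>
  <tr><td>1</td><td>Walmart</td><td>611,289</td></tr>
</table>
"""

table = BeautifulSoup(snippet, 'html.parser').find('table')
titles = [th.text.strip() for th in table.find_all('th')]
df = pd.DataFrame(columns=titles)

print(titles)    # ['Rank', 'Name', 'Revenue']
print(df.shape)  # (0, 3): no rows yet, three named columns
```

Stripping whitespace with `.strip()` matters here because table cells scraped from real pages often carry trailing newlines.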
3. Extract & Export
We loop through every row, clean the text, append it to the Pandas DataFrame, and finally export the clean data to a CSV file.
rows = table.find_all('tr')
for row in rows[1:]:  # skip the header row
    row_data = row.find_all('td')
    data = [d.text.strip() for d in row_data]
    df.loc[len(df)] = data
df.to_csv('companies.csv', index=False)
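One caveat worth noting: the values scraped this way are still strings ('611,289', '6.7%'), so numeric analysis needs a cleaning pass. A minimal sketch, assuming hypothetical column names 'Revenue' and 'Growth' and invented sample values:

```python
import pandas as pd

# Invented sample mirroring the scraped string values; in the real
# script this DataFrame comes from the scraping loop above.
df = pd.DataFrame({
    'Name': ['Walmart', 'Amazon'],
    'Revenue': ['611,289', '513,983'],
    'Growth': ['6.7%', '9.4%'],
})

# Strip thousands separators and percent signs, then convert types.
df['Revenue'] = df['Revenue'].str.replace(',', '').astype(int)
df['Growth'] = df['Growth'].str.rstrip('%').astype(float)

print(df['Revenue'].tolist())  # [611289, 513983]
print(df['Growth'].tolist())   # [6.7, 9.4]
```

Doing this conversion before `to_csv` means the exported file can be loaded directly into numeric analysis without re-cleaning.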
Output Data Preview
Below is a sample of the cleaned data extracted by the script, showing the top companies by revenue.
| Rank | Name | Industry | Revenue (USD millions) | Growth | Employees |
|---|---|---|---|---|---|
| 1 | Walmart | Retail | 611,289 | 6.7% | 2,100,000 |
| 2 | Amazon | Retail / Cloud | 513,983 | 9.4% | 1,540,000 |
| 3 | ExxonMobil | Petroleum | 413,680 | 44.8% | 62,000 |
| 4 | Apple | Electronics | 394,328 | 7.8% | 164,000 |
| 5 | UnitedHealth | Healthcare | 324,162 | 12.7% | 400,000 |
Outcomes & Final Thoughts
This project successfully demonstrates the utility of Python for data acquisition. Key takeaways include:
- Automated Data Collection: Bypassed manual data entry by programmatically scraping web tables.
- HTML Parsing: Leveraged BeautifulSoup to navigate complex DOM structures and target specific HTML tags.
- Data Structuring: Used Pandas to transform raw list data into a structured DataFrame suitable for analysis.
- File Persistence: Exported the processed data to a persistent CSV format for reuse in later analysis.