WEB SCRAPING PROJECT
An automated Python script that extracts, processes, and exports data about the largest US companies directly from Wikipedia.
This project demonstrates end-to-end web scraping by automatically collecting the list of the largest companies in the United States by revenue. Using the BeautifulSoup library for HTML parsing and Pandas for data manipulation, the script transforms the raw web page into a clean, analytical dataset saved as a CSV file.
First, we use the requests library to fetch the HTML content from Wikipedia, and
then create a BeautifulSoup object to parse the raw HTML structure.
```python
from bs4 import BeautifulSoup
import requests

url = 'https://en.wikipedia.org/wiki/...'

# Fetch the page and parse the raw HTML with the built-in parser
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
```
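One small safeguard, not part of the original script: requests returns error pages just as readily as successful ones, so it can help to fail fast on a bad status before parsing. A minimal sketch:

```python
# Optional: abort early if Wikipedia returns an error response
# (raise_for_status() raises requests.HTTPError on 4xx/5xx codes)
page = requests.get(url, timeout=10)
page.raise_for_status()
soup = BeautifulSoup(page.text, 'html.parser')
```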
The script then selects the table containing the company data (the second table on the page, hence index 1) and extracts its column headers to initialize the DataFrame.
```python
import pandas as pd

# The company table is the second <table> on the page (index 1);
# its header cells (<th>) become the DataFrame's column names
table = soup.find_all('table')[1]
world_titles = table.find_all('th')
titles = [t.text.strip() for t in world_titles]
df = pd.DataFrame(columns=titles)
```
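A fixed index breaks if the article gains or loses a table. As a more defensive sketch (the class name and the assumption that the ranking table is the first such table are mine, not the original script's), the lookup can key off Wikipedia's "wikitable" CSS class and fall back to the index:

```python
# Assumes the ranking table is the first "wikitable" on the page,
# which is not guaranteed; fall back to the index-based lookup if needed.
table = soup.find('table', class_='wikitable')
if table is None:
    table = soup.find_all('table')[1]
```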
We loop through every data row, strip the whitespace from each cell, append the row to the Pandas DataFrame, and finally export the clean data to a CSV file.
```python
# All table rows (<tr>); the first row holds the headers, so skip it
rows = table.find_all('tr')
for row in rows[1:]:
    row_data = row.find_all('td')
    data = [d.text.strip() for d in row_data]
    df.loc[len(df)] = data  # append the cleaned row to the DataFrame

# Export the finished dataset
df.to_csv('companies.csv', index=False)
```
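The scraped values are all strings at this point. If you plan to analyze the data further, here is a short post-processing sketch (the column names are taken from the sample table below and may differ slightly from the live article's headers):

```python
# Convert the numeric columns; names assumed from the sample table below
df['Revenue (USD millions)'] = (
    df['Revenue (USD millions)'].str.replace(',', '', regex=False).astype(int)
)
df['Employees'] = df['Employees'].str.replace(',', '', regex=False).astype(int)
```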
Below is a sample of the cleaned data extracted by the script, showing the top companies by revenue.
| Rank | Name | Industry | Revenue (USD millions) | Revenue growth | Employees |
|---|---|---|---|---|---|
| 1 | Walmart | Retail | 611,289 | 6.7% | 2,100,000 |
| 2 | Amazon | Retail / Cloud | 513,983 | 9.4% | 1,540,000 |
| 3 | ExxonMobil | Petroleum | 413,680 | 44.8% | 62,000 |
| 4 | Apple | Electronics | 394,328 | 7.8% | 164,000 |
| 5 | UnitedHealth | Healthcare | 324,162 | 12.7% | 400,000 |
This project successfully demonstrates the utility of Python for data acquisition. Key takeaways include: