DATA CLEANING PROJECT
Nashville Housing Data Cleaning
A technical SQL project focused on transforming raw housing data into a usable format through standardization, parsing, and deduplication.
Project Overview
This project addresses common data quality issues found in the Nashville Housing dataset. Real-world data is often messy, containing inconsistencies in date formats, missing values, and compound fields that make analysis difficult.
By using advanced SQL techniques including PARSENAME, SUBSTRING, and
ROW_NUMBER(), this project transforms the raw data into a clean, normalized schema
ready for analytical processing.
Usage Instructions
- Clone the repository via Git.
- Import the Nashville Housing dataset into your SQL Server environment.
- Execute the cleaning scripts sequentially to transform the dataset.
Key Cleaning Steps & SQL Queries
1. Standardize Date Format
Converts the 'SaleDate' column from a generic timestamp to a standardized Date format to ensure consistency.
ALTER TABLE NashvilleHousing
ADD SaleDateConverted Date;
UPDATE NashvilleHousing
SET SaleDateConverted = CONVERT(Date, SaleDate)
SELECT SaleDateConverted FROM NashvilleHousing
2. Populate Missing Property Addresses
Uses a Self Join to populate null Property Addresses. If a ParcelID exists with an address elsewhere, it populates the missing value.
UPDATE a
SET PropertyAddress = ISNULL(a.PropertyAddress, b.PropertyAddress)
FROM NashvilleDataCleaning..NashvilleHousing a
JOIN NashvilleDataCleaning..NashvilleHousing b
ON a.ParcelID = b.ParcelID
AND a.UniqueID <> b.UniqueID
WHERE a.PropertyAddress IS NULL
3. Breaking Out Property Address (Substring)
Splits the single 'PropertyAddress' column into individual 'Address' and 'City' columns using Substrings and Character Indexes.
SELECT
SUBSTRING(PropertyAddress, 1, CHARINDEX(',', PropertyAddress) -1) AS Address,
SUBSTRING(PropertyAddress, CHARINDEX(',', PropertyAddress) +1, LEN(PropertyAddress)) AS City
FROM NashvilleDataCleaning..NashvilleHousing
-- (Update statements omitted for brevity, see repo for full code)
4. Breaking Out Owner Address (Parsename)
A more efficient way to split delimited
strings. Uses PARSENAME to split 'OwnerAddress' into Address, City, and State.
SELECT
PARSENAME(REPLACE(OwnerAddress, ',', '.'), 3),
PARSENAME(REPLACE(OwnerAddress, ',', '.'), 2),
PARSENAME(REPLACE(OwnerAddress, ',', '.'), 1)
FROM NashvilleDataCleaning..NashvilleHousing
5. Standardize 'SoldAsVacant' Field
Harmonizes the 'SoldAsVacant' field by changing 'Y' and 'N' values to 'Yes' and 'No' for better readability and consistency.
UPDATE NashvilleHousing
SET SoldAsVacant = CASE
WHEN SoldAsVacant = 'Y' THEN 'Yes'
WHEN SoldAsVacant = 'N' THEN 'No'
ELSE SoldAsVacant
END
6. Remove Duplicates
Identifies and removes duplicate rows based on unique identifiers (ParcelID, Address, SalePrice, etc.) using a CTE and Window Functions.
WITH RowNumCTE AS(
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY ParcelID, PropertyAddress, SalePrice, SaleDate, LegalReference
ORDER BY UniqueID
) row_num
FROM NashvilleDataCleaning.dbo.NashvilleHousing
)
DELETE FROM RowNumCTE
WHERE row_num > 1
7. Delete Unused Columns
Removes raw columns that have been split or converted to keep the dataset clean and lightweight.
ALTER TABLE NashvilleDataCleaning.dbo.NashvilleHousing
DROP COLUMN OwnerAddress, TaxDistrict, PropertyAddress, SaleDate
Outcomes & Final Thoughts
This project successfully transformed a raw, messy dataset into a clean, structured resource suitable for analysis. Key improvements include:
Data Integrity
Populated missing property address data using reference points within the dataset itself.
Usability
Split complex address strings into atomic columns (Address, City, State) for more granular geographic analysis.
Consistency
Standardized boolean fields ('Yes'/'No') and date formats across the entire table.
Optimization
Removed duplicate records and unnecessary columns to improve query performance and data relevance.
The techniques demonstrated here (Self Joins, CTEs, String Parsing) are essential for any data analyst working with real-world data pipelines.