Merging and Data Cleaning
When Inbuilt Functions Are Impossible
Cleaning data is always more of an undertaking than it feels like it should be. A common saying in most statistics classes is that 80% of your time is spent getting the data ready for analysis, which is very true in our experience. Often, students lack experience cleaning data before their dissertation because the datasets used in most classes are pre-cleaned.
We recently had a client who needed two large Excel spreadsheets of client data merged. However, each sheet contained different information, meaning no single spreadsheet could be considered more ‘updated’ than the other. To add to the complexity, there were duplicate clients on each sheet, so the data needed to be cleaned carefully to avoid duplicates and prevent data loss.
To make matters worse, each sheet contained over 10,000 rows, so we needed to automate as much of the cleaning process as possible. Our first task was to remove duplicate entries. However, we couldn’t use Excel’s built-in ‘delete duplicates’ function because it would delete the duplicate clients.
This is how we solved the problem:
Cleaned and sorted the data.
Made the duplicate values more visually distinct
Consolidated the information into one entry
When cleaning the data, we checked for spacing inconsistencies and found that some entries had double spacing. Once we fixed the issue with find-and-replace, we sorted all entries alphabetically to make it easier to see if something went wrong. To make the duplicates more visually distinct, we used conditional formatting based on duplicate status. From there, we deleted the duplicate information and consolidated separated entries into a single row.
All of these measures saved time and effort, allowing us to complete the merging process and deliver the Excel sheet within a single 8-hour workday.
For dissertations, clean and comprehensible data is vital to persuading others to support your conclusions. We specialize in quick and thorough data work while still being creative in our approaches and avoiding getting stuck with inbuilt functions.