Data Clean-up: One of the worst formats to get raw data is the system reports. These are reports you generate from business automation software. Export to Excel / CSV option dumps the report into a file instead of a printer. But the report is originally designed for printing. Now you waste too much time removing unwanted headers and footers.
This article shows a quick method, and more importantly, a concept which can be used in various clean-up scenarios as well.
Typical report format
Consider this output from a business report – exported to Excel. The areas in pink are page headers and footers. Usually we need the tabular data. We don’t need the page headers and footers.
The wrong way
Most of us either do this manually, deleting few rows at a time – repeatedly. This is inefficient.
Recording a macro is also inefficient because in effect you are automating inefficiency.
Don’t do it.
Slightly better way
Sort or auto-filter also works. Sort is bad because in many cases the original order of the transactions or data is important. Besides, while sorting if the correct range is not included, it can lead to confusion.
Filter may work in some cases. But filter only removes unwanted rows. In some cases, you may have unwanted columns.
In case of very large data, filter as well as sort can take up a very long time.
Both methods do not remove the repeated tabular header.
The right way: Using GoTo Special
The objective is to keep the needed rows (which contain tabular data) and remove unwanted rows (those which are part of header / footer / subtotals / etc.).
In this case, the text file was imported into Excel, however this concept will work with data which was directly copy pasted from system reports (web pages) as well.
Here are the steps. First of all, you must scroll the data and eyeball it (view the pattern).
Now try to find a column which is useful to us. Useful means what? If that column has an empty cell – it should identify unwanted row. If there is a value in that column cell, it should identify a row we want to keep. In short, the column is our keep / remove row decision flag.
In this case, Column D seems to be suitable. Scroll and check if your assumption is right. There may be many columns which are useful. But as long we find one such column, there is no need to search for additional columns. This is the column you would have used in AutoFilter.
Select the entire Column D. Now we want to select only the blank cells in it so that we can delete those rows. This is done by using GoTo -Special – Blanks.
Press Ctrl G (or F5) and choose Special – Blanks – Ok.
Now all blanks are selected in column D.
Right click carefully within any of the selected cells and choose Delete… Delete Entire Row. Now all unwanted rows are gone in one stroke. For large data, this operation may take significant amount of time. So make sure all other applications are closed and this file gets maximum memory resources.
Deleting unwanted columns
This can be done using the same method. Find a ROW which contains blank cells in unwanted columns. GoTo Special – Blanks – Delete Entire Column.
Removing unwanted headers
The header for columns will repeat on each page. This can be removed using Find feature.
Choose any column. In this case I am choosing the the first column where the header contains the title “Vendor”.
Press Ctrl F to open Find dialog and choose Find All.
Now you see the find results below. Click inside the results and press Ctrl A to select all of them. We want to preserve the first row header. So search for the first row and press CTRL Click (Press Ctrl key on keyboard, click on that search result). This will Unselect only that row from the find results.
Now the Find dialog has done its job – of identifying (selecting) the unwanted header rows.
Carefully right click in one of the selected cells and choose Delete – Delete Entire Row.
Check the data again to ensure that there are no mistakes.
Now our job of cleaning data is done!
This approach can be used as a best practice for most report cleaning tasks. Of course, you will need to refine the approach depending upon patterns specific to your data. But this concept will help you in finding the most optimal method.
- Understand the pattern of data
- Find a column which contains empty cells for unwanted rows.
- Remove those rows using GoTo Special Blanks
- Find a row where empty cells indicate unwanted columns
- Remove those columns using GoTo Special Banks
- Select any column and search on header caption
- Select all captions except the first one and remove those rows