We continue our exploration of good data and bad data in this article.
Please read these articles to get a better understanding of the concepts for Data clean-up.
- Big Data – Small Problem! Is the data good?
- Good Data vs. Bad Data: How to identify?
- Bad Data to Good Data: Pivot Copy Paste
We will cover the following clean-up scenarios here for Data clean-up.
The Data clean-up paradox
We spend too much time cleaning up data and get too little time left to do the real work. Data clean-up is not a very enjoyable activity. It is actually a frustrating and non-value-adding activity which wastes our time.
Often the feeling comes that you are helping Excel instead of Excel helping you.
Whenever this happens, it is usually an indication of the that your method is inefficient. Your method may work in the end. But if it is increasing your manual work, it is not a good idea.
Even though clean-up is a boring task, it can be done efficiently. This is done by using Excel to your advantage.
Given below are more specific examples.
Tabular data with gaps left over intentionally to speed up data entry
The data looks similar to what we had left over from Pivot copy paste. Only difference is that the data was manually entered in this format.
This could also be a situation where it could be a copy pasted pivot table where you don’t have the original pivot with you.
Now we need to fill the gaps manually. Which is obviously a very time consuming process if the data is large. Most of us would do it using either dragging repeatedly or copy pasting repeatedly. Repetition means inefficiency. It means we are helping Excel – Excel is not helping us.
Here is one more precaution. Notice that we do NOT want to fill ALL the gaps. Gaps in the data are to be left alone because they have a meaning. For example, FY 11 – Feb – Life value is missing or not known or zero. This is a specific meaning.
Whereas the gaps below FY 11 are actually not gaps. We know the value there – FY 11. Only for convenience the cells have been left blank.
So the green gaps are Convenience Gaps and the red ones are not gaps at all. They are information.
We want to fill only the convenience gaps. Let us see how…
Filling gaps in all selected columns
Select the entire area within which gaps have to be filled.
Now we want Excel to only select the empty cells.
That is done by choosing Go To (Ctrl G), Special – Blanks.
Now notice that all convenience gaps are selected in one quick action. Also notice that the active cell still allows us to type a formula.
We will use a simple formula which simply asks Excel to put the value above the current cell in the current cell.
DON’T PRESS ENTER. Wait.
Now we want this formula to be copied across the selection.
Press Ctrl Enter to finalize the formula in the current cell as well as copy it across the selection.
Not done yet. Copy – Paste Special as Values
The formulas have filled the gaps quickly. But those formulas are still alive. If you sort the data, the formulas will get confused and the data will be mixed up.
To avoid that, select the same area again and Copy – Paste special as values (ideally Values and Number formatting)
Yes, You can record a macro to do this. Do the selection first. Then record the macro.
The problem is that, in this process, we have to select again, in order to paste as values. And selection should not be part of the macro. Because that gets hard coded. So DO NOT include the last step of pasting special as values in the macro.
Interpreting Empty Data Cells
An empty cell can mean that the value is zero or the value is unknown.
This interpretation is important for some kind of calculations.
For example, a SUM formula result is not affected by this decision.
However an AVERAGE formula calculation IS affected depending upon the interpretation of the empty cell. By default the empty cell will not be counted in the calculation. So the average will be 4.
From a business point of view, if the empty cell meant that the value was zero, then the average should have been different (16/5 = 3.2)
This is something Excel cannot decide. Therefore, in any formula which has empty cells in the input, you will see the green error mark. Click there to open the dropdown.
You can either correct the problem or ignore the error, depending upon the context.
Unfortunately, this type of error is NOT checked by default – probably because MS wanted to avoid showing too many error marks!
But ideally, you should enable this feature by going to File – Excel Options – Formulas dialog.
Select the Formula refers to Empty Cells option and click OK.
Handling large data
Although this technique is fairly simple, when you handle large amount of data, it can take significant amount of time to finish the operation. Especially the step where you have to press Ctrl ENTER to copy formulas will take a long time.
In order to improve performance, you should run such operations on a fast machine with lots of RAM. Remember that Office is available in 64 bit version as well.
For most organizations, it is not feasible (or rather there is no budget available) to provide a high performance PC / Laptop to every user.
Therefore, I usually recommend that some shared high-end machines be kept in every important department and / or location. All users can utilize these machines only to perform intensive tasks and then use their regular PC for rest of the work.
This way, you can handle big data with a reasonable cost and also save enormous amount of time for every user – a Win Win situation!
In the next article we will continue the exploration of more data clean-up techniques.