This article shows you simple and immediately usable methods of understanding whether the data you have is good or bad.
This article has intentionally been kept short and to the point so that it is practically usable as a set of best practices.
A more detailed version of this article is being written by an expert in the field, and it will also be published as a guest post soon.
What is bad data?
That is easier to answer. Data which is not good is bad.
What is good data?
Good data is data from which you can create an Excel Table directly, without any manual clean-up.
This data should ideally look like a table. That sounds simple enough, but in reality data rarely arrives in such a clean format, unless you demand it (and you have the power to demand it).
Here is a sample of good data.
This may sound very simple, but this simple structure is rarely the norm.
Good data checklist
Here is a simple checklist to run through with any data you receive. If all items are true, it is good data. If even one of them fails, it is bad data.
- Each column must have a heading
- No blank headings
- No duplicate headings
- No formulas in headings
- No merged cells
- Only one meaning in each column
- Same data type in a column
- Blank cells where data is not available are OK
- No grand totals, no sub totals
- No rows or columns which are entirely empty
- The data type should be what is needed (e.g., dates should be actually dates, not text)
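Several of these checks can be automated. Here is a minimal sketch in Python using pandas; the function name and the exact set of rules are my own illustration, not a standard library feature (the `"Unnamed"` test catches pandas' placeholder name for blank CSV headings):

```python
import pandas as pd

def looks_like_good_data(df: pd.DataFrame) -> list:
    """Return a list of checklist violations found in a DataFrame."""
    problems = []
    # Each column must have a non-blank heading
    # (pandas names blank CSV headings "Unnamed: 0", "Unnamed: 1", ...)
    if any(str(c).strip() == "" or str(c).startswith("Unnamed") for c in df.columns):
        problems.append("blank or missing column headings")
    # No duplicate headings
    if len(set(df.columns)) != len(df.columns):
        problems.append("duplicate column headings")
    # No rows or columns which are entirely empty
    if df.isna().all(axis=1).any():
        problems.append("entirely empty rows")
    if df.isna().all(axis=0).any():
        problems.append("entirely empty columns")
    # Same data type in each column (blanks are allowed)
    for col in df.columns:
        if len(df[col].dropna().map(type).unique()) > 1:
            problems.append(f"mixed data types in column {col!r}")
    return problems

# Example: blanks are fine, but "n/a" mixed with numbers is not
df = pd.DataFrame({"Region": ["North", "South", None],
                   "Sales": [100, "n/a", 200]})
print(looks_like_good_data(df))  # reports the mixed-type Sales column
```

An empty list means the data passed these automated checks; items like "one meaning per column" still need a human eye.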
Bad data can come in many formats
Human imagination is the only limit to the number of bad data formats. But at least you now know what good data looks like, so the first step is to check whether the data you have is good or bad.
If it is bad, DO NOT continue to use it for analysis or any other interpretative work. Working directly with bad data is usually far more complex, error-prone, and time-consuming.
Bad data must somehow be transformed into Good data before you do any useful work.
Easiest way to get good data
If you find bad data, do not immediately jump into finding innovative methods of cleaning it and converting it to good data. There is an easier method available.
The method is simple: ask the source to provide data in the desired (good) format. If they don’t understand what you mean, ask them to read this post.
In many cases, the people responsible for providing the data deliver it in a bad format simply because nobody objected before. Many things in most organizations just happen the way they used to happen. So ask for good data. Maybe the source can easily provide it.
In a few cases you do not have control over the source data:
- You are buying off the shelf data from a provider. No customization is possible.
- Your IT team does not have the source code of the report generation application.
- The data is being captured in the wrong format (this usually happens with manually entered data)
If none of these situations applies to you, ask for good data. Most probably you will get it.
Here’s a video that I created on Good Data vs Bad Data
Skip the next, developer-oriented section if you are not a technical person.
This section is for developers
As discussed in an earlier article, reports are usually written by developers. If you are a developer, provide the denormalized query output as the input for further analytics by users. Most analytics tools, including Excel, prefer this type of data. Of course, if you have a choice, provide the data in Atom feed format.
Do not feed the formatted Report Writer output into the Export to Excel option.
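To make the point concrete, here is a small sketch of the idea in Python with sqlite3: instead of exporting a formatted report, the developer exports one flat, denormalized row per fact. The table and column names are invented for illustration only.

```python
import sqlite3
import csv

# Illustrative schema: a normalized pair of tables
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE regions (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (id INTEGER PRIMARY KEY, region_id INTEGER,
                        sale_date TEXT, amount REAL);
    INSERT INTO regions VALUES (1, 'North'), (2, 'South');
    INSERT INTO sales VALUES (1, 1, '2024-01-05', 120.0),
                             (2, 2, '2024-01-06', 80.0);
""")

# One denormalized row per sale: plain headings, one meaning per
# column, no subtotals. This is what the analytics tool should see.
rows = conn.execute("""
    SELECT r.name AS region, s.sale_date, s.amount
    FROM sales s JOIN regions r ON r.id = s.region_id
    ORDER BY s.id
""").fetchall()

with open("sales_export.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["region", "sale_date", "amount"])
    writer.writerows(rows)
```

The resulting CSV opens in Excel as a ready-made Table, with no merged cells, headers, or totals to strip out.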
Clean the data yourself if the source cannot provide it
This is sometimes inevitable, but at least this time you tried to get the data in a good format first.
How do you clean bad data and convert it to good data? That is actually a bad question, because every kind of bad data requires its own specific approach and conversion method.
So I am going to cover various types of common bad formats and the recommended conversion techniques in the upcoming articles.
The bad formats covered will be:
- Tabular data with gaps left over after copy pasting as values from Pivots
- Tabular data with gaps left over intentionally to speed up data entry
- Cross-tab data
- Tabular data with unwanted page headers and footers (Report printed to Excel!)
- Tabular data with multiple meanings in one column
- Tabular data with various types of date abnormalities
- Tabular data where leading zeros are being lost in the import process
- Tabular data where formatting has some meaning
- Tabular data in multiple sheets or files
- Non tabular data
- Data with lots of errors (green marks indicating numbers stored as text)
- Tabular data with unwanted formatting and subtotals
- Tabular data which is too large to handle in Excel
Finally we will also discuss how to check data quality and handle missing data.
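As a small taste of what such cleanup looks like, here is a sketch in Python with pandas of two of the fixes listed above: filling the gaps left when a pivot is copy-pasted as values, and converting text dates into real dates. The sample data is invented for illustration.

```python
import pandas as pd

# Typical bad data: the Region column has gaps left over from
# copy-pasting a pivot table as values, and the dates arrived as text.
df = pd.DataFrame({
    "Region": ["North", None, None, "South", None],
    "Date":   ["2024-01-05", "2024-01-06", "2024-01-07",
               "2024-01-05", "2024-01-06"],
    "Sales":  [100, 120, 90, 80, 110],
})

df["Region"] = df["Region"].ffill()      # fill each gap from the row above
df["Date"] = pd.to_datetime(df["Date"])  # text dates -> actual dates
print(df)
```

After these two lines every row carries its own region and the Date column has a proper date type, so the table satisfies the good-data checklist above.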
If you have a specific type of bad data and want to know the best way of cleaning it, please post a sample in the comments. Make sure the data is not confidential and does not violate any copyrights.