Big Data – Small Problem! Is the data good?

Those of us who have been in the industry long enough know that IT is a hype-driven field. New fads and fashions come and go. Some residual benefit does remain, but it is soon overridden by the next hype. There is nothing wrong with this per se.

However, there is a disturbing and predictable pattern. Every hype is presented as the next new thing. Everyone is supposed to jump on the bandwagon immediately, get enamored with the cutting edge, reap immediate benefits and gain the “first mover” advantage. It sounds nice, but in most cases it does not deliver the promised value!

The problem is that just because something is new, genuinely useful and well marketed, it does not mean we can lose track of what is currently happening. Everyone wants you to be at the cutting edge, but there is a gap between where you are and your hype-driven aspirational destination.

In the process, what happens is really sad for both the technology vendors and the customers. It is a lose-lose situation. Most people assume that data is already available in a usable, easy-to-analyze format, which is absolutely wrong. Usually it is available, but in the worst possible format, and people spend most of their time cleaning it. What is the point?

This article is the beginning of a new series about data. It will cover data concepts, data gathering, data clean-up, effective analytics and, finally, the psychology of analytics.

Right now, we are in the midst of such a situation with the Big Data hype. Therefore, I thought of writing this article to put things in a simple yet realistic perspective, so that you can assimilate the benefits of the “cutting edge” and move smoothly from where you are to where you want to be.

This article is not for data science geeks. They are an elite few who have access to sophisticated applications and resources. This article is for the billions of users who spend their lives working on data and generating some output, just because the boss wanted that report! Even if you are the boss, the story does not change. You DEPEND on people to feed you reports on a periodic basis. Again, a lose-lose situation.

Things have not changed

In Feb 2003, I wrote an article titled Business Intelligence without the hype. I just read through it again. The issues I had discussed a decade back were:

  1. Lack of user awareness
  2. Poor database design
  3. Underutilization of existing reports
  4. Excessive features lead to user inundation—not empowerment
  5. Statutory reporting as a habit
  6. Slow reports with large amounts of data
  7. Slow response to report enhancement requests by IT/vendor
  8. Geographically scattered information
  9. Proactive rather than reactive utility of information
  10. Manual/semi-automated heterogeneous data processing
  11. BI Initiatives do not progress beyond a department
  12. Vendor / IT dependence

Practically all of these issues still apply today. Of course, hardware and software capabilities and bandwidth have increased significantly, but the core problems continue.

The same thing is likely to happen to this new phenomenon called Big Data. It is immensely useful, but it will be misused and underused until the fashion season is over.

Call it Lost Opportunity, Low Return on Investment, or the Potential / Accrual Gap. Many names, one recurring problem!

This article is not related to the WSJ article

Big Data’s Big Problem was an article by Ben Rooney, published in The Wall Street Journal in Feb 2012. It primarily discusses the shortage of skilled resources in the field of data science or Big Data. Over the last two years, I am sure more resources have become available. I am not discussing the skills shortage problem in this article.

Technical perception: Input >> Output

The divide between technical people and business people will remain. In fact, I see that it is growing rather than shrinking. But that is a subject for another article.

The point is that, from a technical point of view, any kind of data analysis looks like Input giving rise to Output after some processing.
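As a rough illustration of this Input >> Processing >> Output view, here is a minimal Python sketch. The records, field names and the `total_by_region` function are all made up for this example; they do not come from any real application.

```python
from collections import defaultdict

# INPUT: hypothetical raw transaction records (invented for illustration).
transactions = [
    {"region": "North", "amount": 120.0},
    {"region": "South", "amount": 80.0},
    {"region": "North", "amount": 50.0},
]


def total_by_region(rows):
    """PROCESSING: add up the amount for each region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


# OUTPUT: the "report" that the business side asked for.
report = total_by_region(transactions)
print(report)
```

Seen this way, the report is only as good as the question that was asked when the processing step was written.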


The output format is typically decided by the business side, and the processing is usually done programmatically by IT teams. So far, fair enough. IT analyzes data or generates reports based on what users want to see. And when business people see what they want to see, they can interpret it and take some action.

A good division of labor. But an ineffective one. Why so? Because this model defeats the basic objective of “analysis” in the true sense of the word.

Data = Past. Cannot change anything

Unless it is randomly generated or forecast information, most of the “data” we talk about is a record or log of something that has already happened.

Data may originate from different sources, such as these:

  1. Data entry
  2. Dumps from business applications
  3. The “Export to Excel” option in reports
  4. Copy-paste from business reports rendered in the browser
  5. Direct connection with some kind of database or data feed
  6. Purchase from a third-party agency

Irrespective of the source, there is nothing you can do to change the occurrence, the transaction, the log or the event. It is simply the past.

So what is the point of creating some output from something that is not under your control? Why waste time and money working on an unchangeable reality? Why not put that effort into something more under your control?

The answer is simple…

Past >> Future

Agreed, data is usually PAST data. So why spend time on understanding it? Because you can learn something from the past and then use that knowledge to improve the future in one or more ways.


If what you learnt from the past is not in any way useful for the future, then it only means you have not thought about it enough!

Of course, what you learn could be good news, bad news, a violation of some business rule, or something that clashes with your gut feel.

Whatever it is, it is always helpful in a proactive way.

Here is what happens when we produce reports and analysis in a typical business situation. Consider the finance department, for example. A lot of data is captured through the ERP or financial accounting application directly at source. Most of the transactional part is now fully automated, so that part is efficient.

Now, when it comes to reports and MIS, each finance department has a fixed set of reports that are churned out periodically, either through programming or manually.

By now, you would think that most of this reporting has been automated. But you will be surprised to know that it is not so.

Let us say 30 reports are generated every month by the team. Fair enough. These 30 reports are essentially different ways of looking at what happened, so that we can learn from it and improve the future. So far so good.

But think again. Is anyone trying to find the 31st useful thing from the same data? Usually not. Why not? Because:

  1. Nobody is asking anybody to make a 31st report
  2. Few people, if any, know how to explore the available raw information to draw meaningful conclusions
  3. Everyone is getting paid for creating and interpreting those 30 reports anyway. Creating new insights is nobody’s job!
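To make the idea of a "31st report" concrete, here is a small Python sketch. It assumes a hypothetical set of invoice records (the customers, months and amounts are invented). The same raw data that feeds an existing monthly totals report is regrouped to answer a question nobody asked: how much of the revenue depends on a single customer?

```python
from collections import Counter

# Hypothetical raw invoice data (invented for illustration).
invoices = [
    {"month": "Jan", "customer": "Acme", "amount": 400},
    {"month": "Jan", "customer": "Birla", "amount": 100},
    {"month": "Feb", "customer": "Acme", "amount": 300},
    {"month": "Feb", "customer": "Chen", "amount": 200},
]


def totals_by(rows, key):
    """Sum the invoice amount for each distinct value of `key`."""
    totals = Counter()
    for row in rows:
        totals[row[key]] += row["amount"]
    return dict(totals)


# One of the existing "30 reports": revenue by month.
monthly = totals_by(invoices, "month")

# A possible "31st report": revenue concentration by customer.
by_customer = totals_by(invoices, "customer")
top_customer, top_amount = max(by_customer.items(), key=lambda kv: kv[1])
top_share = top_amount / sum(by_customer.values())

print(monthly)
print(f"{top_customer} accounts for {top_share:.0%} of revenue")
```

The monthly report looks flat and reassuring, while the second view shows how dependent the business is on one customer. No new data was needed, only a new question.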

So, with this dismal situation in mind, let us adopt a simple, exciting and motivating definition of Data Analysis.

Learning every possible useful thing from the past, so that we can improve the future, is called analysis.

Now that we have the concept clear, we will explore the world of data in greater detail in the next article. We will discuss how to get data in a format that eliminates the need for manual clean-up. That type of data is called “Good Data”, a term I coined. It helps you focus on what matters, without getting carried away by the jargon.


