As a self-proclaimed lover of data, it should be no surprise that I’m a huge fan of Big Data. But, like many other data professionals, I am sometimes frustrated by the misconceptions and misrepresentations that come along with the latest industry buzzwords and marketing campaigns. That’s why, when I recently read this blog post by Stephen Few (b | t), a world-renowned data visualization expert, it stirred up some thoughts that I wanted to organize and get down into words.
Note: when looking for Stephen’s Twitter handle, I stumbled upon @FakeStephenFew…which appears to be a spoof account…but very funny nonetheless for those familiar with Stephen’s personality and writing style.
Size Doesn’t Matter
One of the biggest, ugliest, and most pervasive misconceptions about Big Data is that it’s all about size. Thankfully, that line of thinking is slowly starting to die…and in its place we are starting to see more and more articles using the 3 V’s model – wherein Big Data is classified as data exhibiting 1 (or more) of the following characteristics: volume, variety, and velocity. But that’s also a misconception of sorts. The reality is that we’ve always had data that could be described as having high volume, high velocity, and/or high variety. Those are just relative and highly subjective descriptions of data.
A Map to Buried Treasure
Another misconception (or perhaps overhype) is that Big Data will be your pirate map to buried treasures and untold fortunes. The sales pitch looks something like this:
- Build Hadoop Cluster
- Load Data
- Hire Data Scientist
That’s simply not the case. And that’s sort of the point of the Stephen Few blog post I referenced in the intro…so go read that now 😉
Big Data is a Paradigm Shift
I prefer to think of Big Data as a paradigm shift…driven by the emergence of new tools and technologies (ex. Hadoop, MapReduce, HBase, Hive, etc) that expand our capabilities for interacting with data – primarily from the perspectives of storage and processing. Ultimately these tools allow us to tackle additional sources of data (which were previously forsaken due to the cost of storage, cost of processing, or some other similar reason) with the end-goal of making more informed decisions.
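To make the MapReduce model mentioned above a bit more concrete, here is a minimal, in-memory sketch of its canonical example – word count. This is purely illustrative: a real Hadoop job would distribute these same phases across many machines, and the function names here are my own, not part of any framework API.

```python
from collections import defaultdict

# A minimal, in-memory sketch of the MapReduce pattern (word count).
# Illustrates the programming model only -- a real Hadoop job would
# run these phases in parallel across a cluster.

def map_phase(documents):
    """Emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Group the emitted values by key (word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data is a paradigm shift", "big data is not just big"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])  # -> 3
```

The appeal of the model is that the developer only writes the map and reduce logic; the framework handles distribution, shuffling, and fault tolerance.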
Cost of Latency
One of the biggest benefits of these new tools and technologies is the corresponding movement in the cost-of-latency factor.
Latency, in this context, refers to the length of time it takes to turn data into information. In general, there’s a fairly tight inverse correlation between latency and cost. Processing data quickly (lower latency) typically requires better, more expensive hardware with higher I/O throughput and processing capacity…ex. a 600 GB FusionIO card vs a 1 TB 5400 RPM spinning disk.
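A quick back-of-the-envelope calculation shows why that hardware choice matters so much. The throughput figures below are rough assumptions for illustration only, not benchmarks of any specific device:

```python
# Back-of-the-envelope: time to sequentially scan a 500 GB dataset
# on two storage tiers. Throughput figures are rough illustrative
# assumptions, not measured benchmarks.
DATASET_GB = 500

throughput_mb_per_sec = {
    "5400 RPM spinning disk": 60,       # assumed sequential read rate
    "PCIe flash (e.g. FusionIO)": 700,  # assumed sequential read rate
}

for device, rate in throughput_mb_per_sec.items():
    seconds = DATASET_GB * 1024 / rate
    print(f"{device}: ~{seconds / 60:.0f} minutes")
```

Under these assumptions the flash device finishes the scan roughly an order of magnitude faster – and costs correspondingly more per GB, which is exactly the latency/cost trade-off in play.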
Below is a graph of the existing cost-of-latency…prior to Big Data:
In reality, the correlation between latency and cost isn’t a straight line…it’s actually jagged, with the low points representing the sweet spots where the right technology has been matched with the right problem (ex. using local SSDs for SSAS multidimensional databases with highly random access patterns instead of allocating space on the corporate SAN).
It’s akin to the old saying about using the right tool for the job – consider using a cordless drill vs a screwdriver when building a deck…I know which tool I’d prefer to use.
Extrapolating this logic out…as the number of tools and technologies increases, so too will the number of sweet spots that fall underneath the current/existing cost-of-latency curve…
Note: the space between the original line and the one below represents gains derived from the new Big Data tools and technologies.
This opens up the conversation and options for solving business problems (from an IT architecture perspective) more efficiently. And that is exactly what Jeremiah Peschka (b | t) did in this excellent case-study. By using some of the new Big Data tools/technologies, Jeremiah was able to help his client more efficiently allocate IT spend while still meeting their requirements. Who doesn’t want that?
Thought Process Evolution
Another benefit, although slightly more intangible, is how the thought process of decision-makers is evolving. Instead of using “gut feel” or “dead reckoning”, more and more decision-makers are starting to use data to drive the decisions they make.
This isn’t exactly a new trend that started with the rise of Big Data. Personally, I think it started way back when with operational reporting and then picked up a lot of steam over the past decade with the rise and dominance of BI. Regardless of the history, the point is this: as long as data is at the forefront of popular business and IT magazines…it’s going to continue to drive this evolution…which I think is a positive, not only for businesses, but also for society.
Don’t forget to tip your Waiter
So who do we have to thank for all these new tools and technologies? The answer is companies like Google, Facebook, and Yahoo (or perhaps more accurately – smaller companies acquired by larger companies like Google, Facebook, Yahoo).
Yep, that’s right…the majority of these “Big Data” tools/technologies (ex. Hadoop, MapReduce, etc) were created by teams of brilliant developers (working at the aforementioned companies) to help them solve their own challenging data-problems more efficiently…pure innovation!
For example, Hadoop was originally developed at Yahoo to address some of their search engine scalability pains (which you can read about in The history of Hadoop).
And, like a gift from the gods, these companies open-sourced the tools/technologies – making them available to everyone else.
Now that you know what Big Data means to me, I’ll step down from my soapbox and open it up to the readers…what does Big Data mean to you?