The term “Big Data” has gained popularity over the past 12-24 months as a) amounts of data available to companies continually increase and b) technologies have emerged to more effectively manage this data. Of course, large volumes of data have been around for a long time. For example, I worked in the telecommunications industry for many years analyzing customer behavior. This required analyzing call records. The problem was that the technology (particularly the infrastructure) couldn’t necessarily support this kind of compute intensive analysis, so we often analyzed billing records rather than streams of calls detail records, or sampled the records instead.
Now companies are looking to analyze everything from the genome to Radio Frequency ID (RFID) tags to business event streams. And, newer technologies have emerged to handle massive (TB and PB) quantities of data more effectively. Often this processing takes place on clusters of computers meaning that processing is occurring across machines. The advent of cloud computing and the elastic nature of the cloud has furthered this movement.
A number of frameworks have also emerged to deal with large-scale data processing and support large-scale distributed computing. These include MapReduce and Hadoop:
-MapReduce is a software framework introduced by Google to support distributed computing on large sets of data. It is designed to take advantage of cloud resources. This computing is done across large numbers of computer clusters. Each cluster is referred to as a node. MapReduce can deal with both structured and unstructured data. Users specify a map function that processes a key/value pair to generate a set of intermediate pairs and a reduction function that merges these pairs
-Apache Hadoop is an open source distributed computing platform that is written in Java and inspired by MapReduce. Data is stored over many machines in blocks that are replicated to other servers. It uses a hash algorithm to cluster data elements that are similar. Hadoop can cerate a map function of organized key/value pairs that can be output to a table, to memory, or to a temporary file to be analyzed.
But what about tools to actually analyze this massive amount of data?
I recently had a very interesting conversation with the folks at Datameer. Datameer formed in 2009 to provide business users with a way to analyze massive amounts of data. The idea is straightforward: provide a platform to collect and read different kinds of large data stores, put it into a Hadoop framework, and then provide tools for analysis of this data. In other words, hide the complexity of Hadoop and provide analysis tools on top of it. The folks at Datameer believe their solution is particularly useful for data greater than 10 TB, where a company may have hit a cost wall using traditional technologies but where a business user might want to analyze some kind of behavior. So website activity, CRM systems, phone records, POS data might all be candidates for analysis. Datameer provides 164 functions (i.e. group, average, median, etc) for business users with APIs to target more specific requirements.
For example, suppose you’re in marketing at a wireless service provider and you offered a “free minutes” promotion. You want to analyze the call detail records of those customers who made use of the program to get a feel for how customers would use cell service if given unlimited minutes. The chart below shows the call detail records from one particular day of the promotion – July 11th. The chart shows the call number (MDN) as well as the time the call started and stopped and the duration of the call in milliseconds. Note that the data appear under the “analytics” tab. The “Data” tab provides tools to read different data sources into Hadoop.
This is just a snapshot – there may be TB of data from that day. So, what about analyzing this data? The chart below illustrates a simple analysis of the longest calls and the phone numbers those calls came from. It also illustrates basic statistics about all of the calls on that day – the average, median, and maximum call duration.
From this brief example, you can start to visualize the kind of analysis that is possible with Datameer.
Note too that since Datameer runs on top of Hadoop, it can deal with unstructured as well as structured data. The company has some solutions in the unstructured realm (such as basic analysis of twitter feeds), and is working to provide more sophisticated tools. Datameer offers its software either on either a SaaS license or on premises.
In the Cloud?
Not surprisingly, early adopters of the technology are using it in a private cloud model. This makes sense since some companies often want to keep control of their own data. Some of these companies already have Hadoop clusters in place and are looking for analytics capabilities for business use. Others are dealing with big data, but have not yet adopted Hadoop. They are looking at a complete “big data BI” type solution.
So, will there come a day when business users can analyze massive amounts of data without having to drag IT entirely into the picture? Utilizing BI adoption as a model, the folks from Datameer hope so. I’m interested in any thoughts readers might have on this topic!