The Hadoop distributed computing framework is broadening from its initial role in Internet search engines. Further expansion seems likely this year as more developers build on the software.
Hadoop, an Apache Software Foundation project, first took root at Yahoo and has since spread to other marquee customers such as Facebook and Twitter. The open-source software specializes in crunching very large data sets — the “big data” problem. To manage tasks of that sort, Hadoop dispatches processing chores across multiple computers. The software is inherently parallel, and applications may be designed to exploit that parallelism.
The core of the Hadoop framework consists of MapReduce, which distributes data-processing tasks across a compute cluster and aggregates the results, and the Hadoop Distributed File System (HDFS), a storage component for Hadoop applications. A Hadoop FAQ states that dual-core machines scale best for Hadoop; however, quad-core and hex-core deployments are emerging.
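The distribute-then-aggregate pattern that MapReduce follows can be sketched in a few lines of Python. This is a single-process illustration only, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, run_job) and the sample records are assumptions made up for the example, and a real cluster would run the map and reduce steps on many machines at once.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Aggregate all values that share a key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence (hypothetical local driver)."""
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: each string stands in for a record stored on a different node.
counts = run_job(["big data big clusters", "data everywhere"])
print(counts)
```

Because each call to map_phase depends only on its own record, the map step is the part that can be spread across machines; the shuffle and reduce steps then bring the partial results back together.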
A Bigger Base
Since Hadoop’s origins in search engine development six years ago, the technology has found its way into other applications such as clickstream analysis. The relatively recent availability of Hadoop from distribution vendors has further widened the array of use cases. Cloudera Inc. began selling its Hadoop distribution in 2009. Other Hadoop distribution vendors include Hortonworks Inc. and MapR Technologies Inc., which both launched software in 2011.
Charles Zedlewski, vice president of products at Cloudera, says most people purchase Hadoop in distribution form, which provides an integrated stack of components, as opposed to a raw system. The additional functionality lets customers “solve a broader range of business problems,” he says.
Cloudera’s Distribution including Apache Hadoop (CDH), for example, provides a number of Apache Hadoop–related systems such as Pig, which lets developers write programs that lend themselves to parallelization; Sqoop, which integrates Hadoop with relational databases; HBase, a Hadoop database; and Flume, a system for aggregating streaming data.
CDH components such as HBase and Flume support real-time analysis, taking the technology into use cases not possible with core Hadoop, says Zedlewski.
“Flume allows users to stream data into Hadoop in real time so the lag between data generation and analysis is only a few seconds,” he notes.
Those real-time capabilities enable mobile services and systems for IT operations.
“Many popular mobile services that people use every day are backed by real-time Hadoop/HBase systems,” says Zedlewski. “We see several examples of people using Hadoop/HBase as a real-time operational data store for systems management at scale.”
Distributions from Hortonworks and MapR, meanwhile, cover a similar swath of Apache tools. They also provide management technology, which vendors believe will smooth Hadoop’s path into more enterprises and, presumably, a larger set of applications.
Hortonworks Data Platform, a distribution initially released as a technology preview program in November, includes a management system, Apache Ambari. Cloudera and MapR, meanwhile, offer subscription versions of their free distributions that include tools for managing Hadoop clusters.
“At this point, we are focusing on things like making Hadoop really easy to consume … and monitor and support,” says Arun Murthy, founder and architect at Hortonworks.
The next wave of Hadoop expansion may flow from the ecosystem of ISVs now forming around the technology.
RainStor Inc., which develops big data management software, is one such firm. The company has partnerships in place with Cloudera, Hortonworks and MapR.
“Hadoop is an operating system for big data — a storage and processing mechanism,” says John Bantleman, chief executive officer at RainStor. “But you need applications and capabilities on top of that. I think Hadoop is a bit like Linux. It’s a platform. You need a product sitting on the platform to make it valuable.”
In January, the company rolled out its RainStor Big Data Analytics on Hadoop. The enterprise database uses data compression and partition filtering to speed up queries, notes Bantleman. The latter feature results in greater productivity through more efficient use of a Hadoop cluster, according to RainStor.
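As a general illustration of why partition filtering can speed up queries, the Python sketch below stores rows in partitions keyed by month and skips every partition that cannot match the query. It does not represent RainStor's implementation, which the article does not describe; the partition layout, the query_total function, and the sample figures are all assumptions invented for the example.

```python
# Hypothetical data set: transaction rows (date, amount) partitioned by month.
partitions = {
    "2012-01": [("2012-01-03", 120.0), ("2012-01-19", 80.0)],
    "2012-02": [("2012-02-07", 40.0), ("2012-02-21", 60.0)],
    "2012-03": [("2012-03-11", 10.0)],
}


def query_total(partitions, month):
    """Sum amounts for one month, reading only the partition that can match."""
    rows_scanned = 0
    total = 0.0
    for key, rows in partitions.items():
        if key != month:
            continue  # partition filter: skip data that cannot satisfy the query
        rows_scanned += len(rows)
        total += sum(amount for _, amount in rows)
    return total, rows_scanned


total, rows_scanned = query_total(partitions, "2012-02")
print(total, rows_scanned)
```

In this toy data set the query reads two of the five stored rows. The saving grows with the number of partitions that can be ruled out before any data is read.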
At the moment, financial services along with telco service providers stand out among RainStor’s top industry customers. Electronic trading and credit card transactions via smartphones contribute to the growth of big data in those sectors, says Bantleman.
Bantleman also points to the airline industry as a potential market: Aircraft such as Boeing’s 787 will generate masses of data from sensors monitoring engines and other aircraft systems. “We really believe that machine-generated data is a key driver,” he says.
Help For Developers
Hadoop includes components that let developers take advantage of its parallelism. Apache Pig, and specifically the Pig Latin language, is one key tool.
The Pig Latin data flow language lets programmers create scripts that generate a series of MapReduce jobs. Developers who use the high-level language don’t necessarily have to be up to speed on Hadoop’s parallelism to get results.
“The nice part is you don’t have to understand its parallel nature, but you get the benefit of the parallelism of MapReduce and HDFS,” says Murthy.
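The data flow style Murthy describes can be approximated in Python to show the shape of such a script. Pig Latin scripts typically chain operators such as LOAD, FILTER, GROUP, and FOREACH ... GENERATE; the sketch below mirrors those steps on an in-memory list. It is not Pig Latin and does not run on Hadoop, and the click records and field layout are assumptions made up for the example.

```python
from itertools import groupby
from operator import itemgetter

# LOAD: hypothetical clickstream records of (user, page, http_status).
clicks = [
    ("alice", "/home", 200),
    ("bob", "/cart", 500),
    ("alice", "/cart", 200),
    ("bob", "/home", 200),
]

# FILTER: keep only successful requests.
successful = [click for click in clicks if click[2] == 200]

# GROUP: collect records by user (groupby needs its input sorted by the key).
by_user = groupby(sorted(successful, key=itemgetter(0)), key=itemgetter(0))

# FOREACH ... GENERATE: emit one (user, count) result per group.
hits_per_user = {user: sum(1 for _ in rows) for user, rows in by_user}
print(hits_per_user)
```

The script states what should happen to the data at each step and says nothing about how the work is divided. In Pig, that division into MapReduce jobs is left to the framework, which is the benefit the quote points to.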
Work is underway to make Hadoop distributions more developer-friendly. Zedlewski says Cloudera limits updates to once a quarter to provide a predictable development target. In addition, the company aims to simplify its distribution from a development standpoint. Zedlewski notes that it is currently much more difficult to build a product against Hadoop than it is to build one against JBoss.
“We definitely want to lower the bar,” he says. “That’s a work in progress.”
Distribution providers are also developing partner programs to back developers.
Cloudera offers its Connect Partner Program for ISVs, independent hardware vendors, systems integrators, value-added resellers, and training organizations. Hortonworks’ Technology Partner Program supports ISVs, OEMs, and service providers.