Hadoop: Getting Big Answers from Big Data

Intel’s Boyd Davis on how Hadoop can help enterprises gain better access to the information within their big data.

Boyd Davis, vice president and general manager of Intel's Datacenter Software division.

"Our Hadoop distribution takes advantage of the fact that we have the most intimate knowledge of the underlying hardware," said Boyd Davis, vice president and general manager of Intel's Datacenter Software division.

The market for big data is growing fast: it is forecast to top $18 billion in 2013 and $47 billion by 2017, according to Wikibon. Lurking behind those big numbers is the reality that organizations often struggle to access and make use of the information within big data. Apache Hadoop, the open-source software framework, has emerged as an important technology for managing huge volumes of data.

Intel introduced its Hadoop distribution just one day after EMC unveiled its own distribution. Boyd Davis, vice president and general manager of Intel’s Datacenter Software division, took a moment recently to discuss big data and the Intel distribution of Apache Hadoop.

There are other providers in this space, including EMC, Cloudera and Hortonworks. What differentiates the Intel Hadoop distribution?

Hadoop is a framework for managing big data. It’s got three primary components. First, it’s a way of storing data on a large scale. Second, it’s a way to organize that data so that it can be accessible via a variety of different tools. And third, it is a set of tools that allows you to gain insights from the data.

Also, it’s open source, so there’s a community of programmers around the world contributing to it, as does Intel. So the Hadoop framework is not a single product or project. And because it’s so versatile we believe it has the potential to be a transformative technology.
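To make that storage / organization / analysis split concrete, here is the canonical word-count job written against the standard Apache Hadoop Java MapReduce API. It is a generic illustration of the framework, not code from the Intel distribution; the input and output paths are simply whatever HDFS directories you point it at.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Storage and organization: input splits are read straight off HDFS;
  // each mapper emits (word, 1) for every token it sees.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Insight: the reducer aggregates the per-word counts across the cluster.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, a job like this is launched with the stock "hadoop jar" command, passing input and output directories that live on HDFS.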

But, while a lot of organizations like the idea of downloading and using open source Hadoop code — because it’s free — once they go into production with an application or service, they want somebody who can back them up. That’s where we come in.

Our Hadoop distribution takes advantage of the fact that we have the most intimate knowledge of the underlying hardware, like our Xeon processors. And we are getting substantial performance gains because of that.

Apache Hadoop Elephant

The Hadoop framework was created by Doug Cutting and Michael J. Cafarella. Cutting, who worked at Yahoo at the time, named it after his son's toy elephant.

We have an example of the gains that can be made when you’re sorting, say, a terabyte of data. Using a standard benchmark, on the previous-generation Xeon platforms using hard-disk drives and 1 gigabit Ethernet connections, and just the standard Hadoop distribution, it would take about 4 hours to sort that data.

Now, we add in the newest-generation Xeon and we can cut that in half. Then you add solid-state drives, and that drops it down another 80 percent. You go from 1 Gig Ethernet to our new, faster 10 Gig Ethernet connection, and you drop it another 50 percent. And then if you use the Intel Hadoop distribution, you drop it another 40 percent.

Suddenly, from more than 4 hours, you're down to just 7 minutes to run that workload. So yeah, there's this tight link between the Intel hardware, which delivers the optimizations, and the software framework that takes advantage of that hardware, and all of a sudden you're delivering a lot of value to customers.
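Taken at face value, those percentages compound multiplicatively. A quick back-of-the-envelope check, using only the 240-minute starting point and the reduction factors quoted above:

```java
// Back-of-the-envelope check of the quoted speedups; each factor is the
// fraction of runtime that remains after the corresponding upgrade.
public class SortRuntime {
  public static void main(String[] args) {
    double minutes = 240.0;   // ~4 hours on the baseline configuration
    minutes *= 0.5;           // newest-generation Xeon: cut in half        -> 120 min
    minutes *= 0.2;           // solid-state drives: another 80 percent off ->  24 min
    minutes *= 0.5;           // 1 GbE to 10 GbE: another 50 percent off    ->  12 min
    minutes *= 0.6;           // Intel Hadoop distribution: 40 percent off  -> 7.2 min
    System.out.printf("Final runtime: about %.1f minutes%n", minutes);
  }
}
```

That works out to roughly 7.2 minutes, which lines up with the "just 7 minutes" figure quoted above.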

What’s the scale of the datasets that require Hadoop to manage them?

Let’s take a petabyte of data. That is 1 million gigabytes, or, in bytes, a 1 followed by 15 zeroes. To put that in perspective, it would take an average person 13 years to watch a petabyte of HD video. Well, the Internet generates a petabyte every 11 seconds, all day long, each and every day. That scale of data creation is something that traditional tools, like relational databases and conventional storage systems, just can’t handle.

So that’s big data: data whose volume, variety of different formats, and velocity, in terms of how quickly it gets created, is greater than what traditional tools can manage.
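The 13-year figure is consistent with a rough estimate, if you assume HD video streams at roughly 20 megabits per second; that bitrate is an assumption for illustration, not a number from the interview:

```java
// Rough sanity check of the "13 years of HD video" comparison.
public class PetabyteScale {
  public static void main(String[] args) {
    double petabyteBytes = 1e15;      // 1 PB = 1 million GB = 10^15 bytes
    double hdBytesPerSecond = 2.5e6;  // assumed ~20 Mbit/s HD stream, about 2.5 MB/s
    double seconds = petabyteBytes / hdBytesPerSecond;
    double years = seconds / (365.0 * 24 * 3600);
    System.out.printf("Watching 1 PB of HD video would take about %.0f years%n", years);
    // Prints roughly 13 years with the assumed bitrate.
  }
}
```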

What’s a real-world example of the challenge with managing big data?

In the cell phone business, one of the things providers want to do is manage call data records, so people can keep track of their cellphone minutes and who they called.

That was a challenge faced by a very large mobile operator in China we’ve been working with. They simply wanted to put the call data records into a storage environment and make the data available to their consumers online. But they had literally hundreds of millions of users. And when they put all that data into traditional databases, the databases simply broke from the scale and volume of the data. It was billions of records — the databases just couldn’t scale to handle it.

But with the Intel distribution of Apache Hadoop we are now making broadly available, we were able to get to the point where that data is accessible to customers in just a second or two. They type in the query, hit “enter,” and it pops up in what we call “human real time.”

But that’s just the beginning. Now that the mobile operator has all the call data in its Hadoop framework, it can do things like ask, “Which of our smartphones are the most profitable in terms of data plans and usage?” Or, “What’s the most popular smartphone at a given time of the year, so we can direct manufacturing to reflect that?”
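The interview doesn't spell out which pieces of the Hadoop stack the operator used, but the access pattern described above (billions of records, per-subscriber lookups in a second or two) is the kind of workload typically served by HBase tables stored on HDFS. The sketch below shows that pattern with the standard HBase Java client; the table name, row-key layout, and subscriber number are hypothetical, made up purely for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CallRecordLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table cdr = conn.getTable(TableName.valueOf("call_records"))) {
      // Hypothetical row-key layout: <subscriber>#<timestamp>. Because HBase keeps
      // rows sorted by key, one subscriber's records sit next to each other, so a
      // bounded scan returns them quickly no matter how many billions of rows the
      // table holds overall.
      String subscriber = "13800000000";
      Scan scan = new Scan()
          .withStartRow(Bytes.toBytes(subscriber + "#"))
          .withStopRow(Bytes.toBytes(subscriber + "$")); // '$' sorts just after '#'
      try (ResultScanner results = cdr.getScanner(scan)) {
        for (Result row : results) {
          System.out.println(Bytes.toString(row.getRow()));
        }
      }
    }
  }
}
```

The design choice doing the work here is the row key: putting the subscriber identifier first turns "show me my call records" into a short, contiguous range scan rather than a query across the whole dataset.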

Speaking of value, how big is the market for big data solutions?

Well, the datacenter market worldwide is about $100 billion, growing at low single digits. That’s traditional data management hardware and software.

We see the big data sector of that market getting to about $30 billion, but with 20 percent growth rates just in the next 3 or 4 years. So it’s a market that is growing dramatically.

What’s the biggest misconception about big data?

Well, I think a lot of people understand that we are generating enormous amounts of data as a society. And they understand that the vast majority of that data is wasted, that we basically throw it away.

What we do is to help people all over the world get more access to the information that’s in their data. I heard a speaker recently who was involved in a national political campaign that took advantage of big data. And he said, “I hate the term ‘big data.’ What we as an industry should be thinking of is not ‘big data,’ but ‘big answers.’ That’s the art of it.”
