Yahoo Frees ‘Most Stable and Tested’ Hadoop

Shelton Shugar, Yahoo’s senior vice president of cloud computing
Photo: David Needle

SANTA CLARA, Calif. — Yahoo had some good news for the 700-plus developers at its second annual Hadoop Summit here today: It’s making available to developers the same distribution of Hadoop that it uses internally to help power its Yahoo search, home page and other properties.

The download is based on the latest 0.20 version of the Apache Hadoop code.

Hadoop, an open source framework for distributed storage and processing, has been widely adopted by leading Web companies, including Yahoo (NASDAQ: YHOO), Google (NASDAQ: GOOG), Facebook and Amazon (NASDAQ: AMZN), because it is particularly well suited to those firms’ need to process massive amounts of data, typically measured in terabytes and petabytes.

Yahoo, for example, said Hadoop is key to the system that displays relevant ads to consumers visiting its sites, based on their interests.

“I think Hadoop has become the de facto platform for scalable data processing,” said Shelton Shugar, Yahoo’s senior vice president of cloud computing. “It’s proven ready for business.”

It’s certainly had a good test at Yahoo, which claims more than 500 million users worldwide across its various properties.

“The Yahoo distribution of Hadoop is running on the largest compute clusters in the universe and now we’re putting it out there on the Web,” said Eric Baldeschwieler, Yahoo’s vice president of Hadoop software development. “Yahoo has much more experience than any other organization, a lot more users running Hadoop, and this is the most stable and tested version.”

All the code in Yahoo’s implementation is contributed to the Apache Software Foundation, but this public release includes notes and other information about patches that Baldeschwieler said will help developers.

Shugar said Hadoop saves Yahoo money by letting it use commodity servers more efficiently, but that’s a secondary benefit.

“Hadoop takes very large problems and cuts them into small bits and lets you run all those small bits in parallel and get a solution,” he said. The result is that it can process large tasks that could take weeks and months using other tools, completing them instead within days, if not hours.
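The idea Shugar describes can be illustrated with a minimal sketch. This is not Hadoop's API or Yahoo's code; it is a small Python stand-in in which a thread pool plays the role of the cluster, and the function names (`split_into_chunks`, `count_words`, `merge_counts`, `parallel_word_count`) are hypothetical, chosen only for this example. It cuts a word-counting job into small pieces, solves each piece independently, then merges the partial answers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(lines, chunk_size):
    """Cut the full input into small, independent pieces."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def count_words(chunk):
    """Map step: solve the problem for one small piece."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial answers into one."""
    return left + right


def parallel_word_count(lines, chunk_size=2, workers=4):
    """Split the input, count each chunk in a worker, then merge the results.

    Assumption: threads stand in for servers here. A real cluster would
    spread the chunks across many machines instead.
    """
    chunks = split_into_chunks(lines, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, chunks))
    return reduce(merge_counts, partials, Counter())
```

In this sketch the caller gets the same answer whether the work is spread over one worker or many, since the per-chunk function never needs to know how many workers exist.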

“It’s inevitable that enterprises will adopt Hadoop because of the ever-increasing amounts of data — structured and unstructured — they have to deal with, because Hadoop lets you extract the value more quickly.”

Executives from Amazon, IBM (NYSE: IBM), Sun (NASDAQ: JAVA) and Cloudera followed Yahoo’s keynote to discuss their experience and the advantages of using Hadoop. IBM’s vice president of emerging technologies, Rod Smith, said he sees great potential for companies to use Hadoop to, among other things, develop their own business analytics that would otherwise require specialized IT resources.

“It’s a wonderfully simple programming model because you don’t have to think about what’s under the covers. That’s critical to adoption,” Smith said.

Shugar said Hadoop is the quintessential platform for cloud computing because it abstracts scale. “As a developer, you don’t have to code for how many servers you’re going to address or pre-provision,” he said.

Benefits of a Hadoop ecosystem

While Yahoo and others have reaped benefits from Hadoop, Shugar said the decision to invest heavily in its further development was a no-brainer, even though that work will also aid other companies, competitors included.

“Others help us by fixing bugs and enhancing Hadoop — that’s valuable to us,” he said.

In addition, Shugar said Yahoo’s Hadoop support is already proving to be a valuable recruiting and retention tool. “Developers like working with open source and there are already people doing graduate work using Hadoop, so it’s a plus to be able to attract new talent experienced in what is a key infrastructure function for us.”
