Gartner may have predicted that the government will have the largest private clouds, but some private-sector clouds are getting up there. The databases behind them contain hundreds of terabytes (TB) or even petabytes (PB) of data about visitor behavior on the Web’s largest sites.
eBay (NASDAQ: EBAY) has two databases totaling 8.5 PB, one based on Greenplum software at 6.5 PB and one based on Teradata holding 2 PB.
The data can be messy. Fox Interactive has a 200 TB database based on Greenplum that has a trillion rows and no primary key “and it’s doing fine,” said Brian Dolan, Fox Interactive director of research analytics, in a video produced by database analytics vendor Greenplum.
Enterprises are building these databases because they can, and the technology is only getting better. Greenplum today is announcing its Enterprise Data Cloud initiative comprising a cloud-friendly platform, proactive methodology and an ecosystem to ensure that enterprises can utilize existing investments as they grow their private clouds. It is supported by an upgrade of the core product to Greenplum Database version 3.3. Customers such as eBay and Fox Interactive have been using the technology for a year, the company said.
“You get accustomed to this kind of performance,” said Dolan. “You’re getting annoyed that it’s taking 3 minutes to do a sum on 100 million rows and then you realize that 3 minutes would have been a dream a year ago.”
With these capabilities, companies are storing and using data that they never would have bothered to capture before. “We’re storing, here at Fox, literally every ad impression served on MySpace.com,” Dolan said.
eBay’s 6.5 PB Greenplum database stores every click on every page of the site.
It’s made possible in part by the steadily declining price of commodity hardware. “A 200TB database is unheard of, at least at the price we’ve implemented here,” said Mark Dunlap, co-founder of Evergreen Technologies and consultant to Fox Interactive Media, in the video.
It’s also made possible by the achievement, Greenplum claims, of lossless parallelism, where adding an extra server delivers a linear improvement in performance instead of diminishing returns.
“Our database uses parallelism at its core. The database is able to make full use of all machines provisioned, whether it’s 10, 50, or more,” Ben Werther, Greenplum director of product management, told InternetNews.com.
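To make the difference between linear scaling and diminishing returns concrete, here is a minimal sketch in Python. It is not Greenplum's model or measurement: the diminishing-returns curve uses Amdahl's law as a stand-in, and the function names and the 5 percent serial fraction are assumptions chosen purely for illustration.

```python
def linear_speedup(servers: int) -> float:
    """Ideal case: every added server contributes one full unit of speedup."""
    return float(servers)


def amdahl_speedup(servers: int, serial_fraction: float) -> float:
    """Diminishing returns per Amdahl's law.

    serial_fraction is the share of the work that cannot be parallelized
    (a hypothetical value here, not a figure from any vendor).
    """
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / servers)


if __name__ == "__main__":
    # Compare the two curves at the cluster sizes mentioned in the quote.
    for n in (10, 50, 100):
        print(n, linear_speedup(n), round(amdahl_speedup(n, 0.05), 1))
```

Under these assumptions, even a small serial share caps the benefit of adding machines, which is why a claim of linear scaling is notable.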
A key technology is MapReduce, developed and published by Google (NASDAQ: GOOG) in 2004.
It’s taken time to get it to work, said Werther. “This is a multi-year exercise that only a handful of people in the world can perform. We have a great team and we work with hardware vendors like Sun, HP, IBM, and Dell.”
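For readers unfamiliar with the MapReduce pattern, the following is a toy, single-machine sketch in Python of counting clicks per page. It is not Greenplum's or Google's implementation; the record layout, the sample data and the function names are all hypothetical, and a real system would run the map and reduce steps in parallel across many servers.

```python
from collections import defaultdict
from itertools import chain


def map_clicks(log_chunk):
    """Map step: emit a (page, 1) pair for each click record in one chunk."""
    return [(record["page"], 1) for record in log_chunk]


def shuffle(pairs):
    """Shuffle step: group the emitted values by key (the page)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values for each key to get clicks per page."""
    return {page: sum(values) for page, values in grouped.items()}


# Hypothetical click log, split into chunks as if spread across two servers.
chunks = [
    [{"page": "/home"}, {"page": "/search"}],
    [{"page": "/home"}, {"page": "/item"}, {"page": "/home"}],
]

# Each chunk is mapped independently, then results are grouped and reduced.
mapped = chain.from_iterable(map_clicks(chunk) for chunk in chunks)
click_counts = reduce_counts(shuffle(mapped))
print(click_counts)
```

Because each chunk is mapped without reference to the others, the map step can in principle be handed to as many machines as there are chunks.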
One result of the new technology is that data doesn’t need to be analyzed before it’s stored. Companies can just store everything. “We don’t worry about optimizing the data that’s in there,” said Fox Interactive’s Dolan. “We create tables that are horribly inefficient.”
Members of the ecosystem are drooling. Networking vendors such as Arista Networks are eager to sell large enterprises the networking hardware they’ll need as they move more data than they’ve ever moved before.
“Processing terabytes and petabytes of data requires immense I/O and network bandwidth, and Arista’s 10GigE switches provide the scalability and performance for even the largest systems,” said Mansour Karam, Arista’s director of business development, in a statement.
“We need great partners,” said Greenplum’s Werther. “We believe this is not just a Greenplum initiative.”
Pricing was not disclosed. According to Greenplum, the software runs on SUSE Linux Enterprise Server 10.2 (64-bit), Red Hat Enterprise Linux 5.x (64-bit), CentOS Linux 5.x (64-bit) and Sun Solaris 10U5+ (64-bit). Greenplum Database 3.3 is supported on server hardware from a range of vendors, including HP, Dell, Sun and IBM.
In non-production environments, it can run on Mac OS X 10.5, Red Hat Enterprise Linux 5.2 or higher (32-bit) and CentOS Linux 5.2 or higher (32-bit), the company added.
Planned improvements include a dashboard that will be part of the next release, due in the fourth quarter. Further down the road map are connections to public clouds and other features.