The Apache Hadoop project promises to manage and process large volumes of data quickly. It’s used by Web titans like Yahoo (NASDAQ: YHOO), Google (NASDAQ: GOOG) and Facebook, and it’s about to get a critical new tool to help it ingest even more data.
Cloudera, a commercial company that develops a Hadoop-based distribution, is in the process of contributing a new database tool for Hadoop called SQOOP, which enables users to import tables from a wide range of databases directly into Hadoop.
The new tool is something that Cloudera will be talking about, along with its Hadoop development efforts in general, as part of a Yahoo-sponsored Hadoop conference kicking off Wednesday in Santa Clara.
“The tool is called SQOOP, and we got the name from thinking about moving data from SQL to Hadoop,” said Cloudera co-founder Christophe Bisciglia.
Just because SQOOP has been contributed to Apache doesn’t mean it’s actually in Apache Hadoop just yet. Bisciglia noted that when you contribute code to Apache, it might take three to five months until it shows up in an official release. So Cloudera is making SQOOP immediately available in its own distribution of Hadoop.
Cloudera offers its own packaged version of Hadoop, which is intended to make it easier for enterprises to get Hadoop up and running.
“SQOOP is a tool that enterprise customers were demanding,” Bisciglia said. “Enterprises have lots of data in existing databases, and if you can’t give them a way to interact with that data, Hadoop isn’t as useful as it could be.”
Which database?
Bisciglia said SQOOP will work with any database that has a JDBC (Java Database Connectivity) driver. The first thing that SQOOP does is to inspect the database table over JDBC. SQOOP understands the column names and field types, and then it generates all the code that Hadoop needs to work with the records.
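As a rough illustration of that first step, here is a minimal Java sketch of inspecting a table over JDBC; the connection URL, credentials and the orders table are hypothetical, and this is a simplification of what SQOOP actually generates:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.Statement;

    public class InspectTable {
        public static void main(String[] args) throws Exception {
            // Hypothetical connection details; any database whose JDBC
            // driver is on the classpath would work the same way.
            String url = "jdbc:mysql://dbhost:3306/sales";
            try (Connection conn = DriverManager.getConnection(url, "user", "secret");
                 Statement stmt = conn.createStatement();
                 // Fetch no rows; we only want the result-set metadata.
                 ResultSet rs = stmt.executeQuery("SELECT * FROM orders WHERE 1 = 0")) {
                ResultSetMetaData meta = rs.getMetaData();
                // The column names and SQL types are what a code generator
                // would map to fields of a generated record class.
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    System.out.printf("%s -> %s%n",
                            meta.getColumnName(i), meta.getColumnTypeName(i));
                }
            }
        }
    }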
“SQOOP then pulls all the data out of the database over JDBC and stuffs it into the container that is generated after inspecting the table, and then it imports the data into HDFS,” Bisciglia said.
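A simplified sketch of that pull-and-import step might look like the following, which reads rows over JDBC and writes them as delimited lines to HDFS through Hadoop’s FileSystem API; the paths, query and comma format are assumptions for illustration, since SQOOP actually uses the typed container class it generated rather than raw strings:

    import java.io.BufferedWriter;
    import java.io.OutputStreamWriter;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class JdbcToHdfs {
        public static void main(String[] args) throws Exception {
            // Hypothetical source table and HDFS destination.
            FileSystem fs = FileSystem.get(new Configuration());
            Path out = new Path("/user/demo/orders/part-00000");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:mysql://dbhost:3306/sales", "user", "secret");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT id, customer, total FROM orders");
                 BufferedWriter w = new BufferedWriter(
                         new OutputStreamWriter(fs.create(out, true)))) {
                // One comma-delimited line per row; a real importer would
                // use the generated record container instead of raw strings.
                while (rs.next()) {
                    w.write(rs.getLong("id") + "," + rs.getString("customer")
                            + "," + rs.getBigDecimal("total"));
                    w.newLine();
                }
            }
        }
    }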
HDFS (the Hadoop Distributed File System) is the clustered file system at the core of Hadoop. Bisciglia said a user could automatically import the database data into Hive, Hadoop’s data warehouse that speaks SQL.
“This gives you the ability to take data directly out of an existing database and import it into Hadoop in a way where you can still issue SQL queries,” Bisciglia said.
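To show what “still issue SQL queries” can look like, here is an illustrative Java fragment that runs plain SQL against the imported table through Hive’s JDBC driver; the HiveServer2 endpoint, table and column names are assumptions, and the setup reflects later Hive releases rather than the Hive of the time:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class QueryHive {
        public static void main(String[] args) throws Exception {
            // Hypothetical HiveServer2 endpoint; assumes the Hive JDBC
            // driver (org.apache.hive.jdbc.HiveDriver) is on the classpath.
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://hivehost:10000/default", "user", "");
                 Statement stmt = conn.createStatement();
                 // Plain SQL against the table that was imported into Hive.
                 ResultSet rs = stmt.executeQuery(
                         "SELECT customer, SUM(total) FROM orders GROUP BY customer")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
                }
            }
        }
    }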
While SQOOP works with many types of databases, Cloudera has also developed a specific optimization for MySQL. Bisciglia noted that the trade-off with importing data over JDBC is that it works with everything but is not the fastest way to get data into Hadoop.
The MySQL support makes use of MySQL’s mysqldump command, which exports database content in a form that SQOOP can consume directly. The plan is to provide similar support for other databases over time.
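Conceptually, a direct-mode import amounts to streaming mysqldump output straight into HDFS rather than going row by row over JDBC. The sketch below is a loose illustration of that idea, not SQOOP’s actual implementation; the mysqldump flags, host and paths are assumptions, and the real direct mode also parses the dump into records:

    import java.io.InputStream;
    import java.io.OutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DirectDump {
        public static void main(String[] args) throws Exception {
            // Hypothetical invocation: dump data-only, compact output for
            // the orders table of the sales database.
            Process p = new ProcessBuilder(
                    "mysqldump", "--skip-opt", "--compact", "--no-create-info",
                    "--host=dbhost", "--user=user", "sales", "orders")
                    .start();
            FileSystem fs = FileSystem.get(new Configuration());
            try (InputStream in = p.getInputStream();
                 OutputStream out = fs.create(
                         new Path("/user/demo/orders/dump-00000"), true)) {
                // Stream the dump output straight into HDFS (Java 9+).
                in.transferTo(out);
            }
            int exit = p.waitFor();
            if (exit != 0) {
                throw new RuntimeException("mysqldump exited with status " + exit);
            }
        }
    }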
What’s next for Cloudera?
The Apache Hadoop project is at version 0.20, and Cloudera is now in the early stages of packaging that release for its own users. Bisciglia was quick to note that there is a direct line between what Apache Hadoop offers and what the company’s Hadoop distribution provides.
“What we have in our distribution is all code that is either Hadoop core or on its way to Hadoop core,” Bisciglia said. “We will never diverge from the core Apache release.”
The reason Cloudera will remain with the Apache Hadoop core is directly related to why Hadoop is successful in the first place.
“One of the things that makes Hadoop work so well is the fact that many organizations are contributing,” Bisciglia said. “Yahoo is the largest contributor to Hadoop and they run it through some extensive workloads. Many customers don’t run networks as big as Yahoo, but they benefit from the stress testing that Yahoo provides. Lots of people are using Hadoop, and we deliver a lot of value to our customers as a result of that.”