RealTime IT News

Michael Baum, CEO, Splunk

Thanks to the ongoing war between Google, Yahoo and Microsoft, search is becoming a technology juggernaut few can ignore.

With investors and analysts wrapped up in that three-horse race, the value of finding and cataloging information at the corporate level sometimes goes unnoticed.

But what about navigating the terabytes of IT data logged by the infrastructure in your data center? This can be important in an era of stringent compliance laws that call for IT data to be saved and quickly recalled.

Enter Splunk, a company determined to move from the fringe to the epicenter of the search movement.

The startup, whose name is a twist on the term "spelunking," makes software that helps IT administrators sift and search through logs and IT data from any application, server or network device on the fly.

The company just delivered a 110 percent boost in indexing and search speed and added a distributed search feature that lets multiple Splunk servers in a data center locate each other, Napster-style.

The idea is to save admins precious time, allowing them to focus on more important tasks to drive the business. Internetnews.com recently caught up with Splunk CEO Michael Baum to discuss Splunk and what it hopes to achieve.

How did Splunk come about and what problems are you trying to solve?

The company was started three years ago by three of us who grew up building search engines for the Internet at places like InfoSeek (Baum and co-founder Erik Swan), Yahoo (Baum) and Taligent (co-founder Rob Das and Baum).

After building those, we had the joyful task of managing all of the infrastructure to keep them running. My last job at Yahoo, a little over three years ago, was running the e-commerce business.

I had about 12,000 servers running Yahoo Finance, Yahoo Stores, Yahoo PayDirect, and it was a nightmare to keep all of this stuff going.

When I bought technology at Yahoo, I was terribly frustrated with vendors coming in and saying: "Pay me $500,000 and we'll install it and see if it works for you."

I was also frustrated with downloading open source tools and having to support them on my own.

When you have these large, distributed computing environments where you get above 100 servers, you run into a lot of problems on a regular basis.

You do something like that for 10 years, bang your head against a wall and think there must be a better way. One of the things we noticed is the people working to build and manage these kinds of environments were spending a lot of time looking through logging data.

The reason they do that is they're trying to figure out what the heck is going on.

So we got to thinking about building a search engine for all of this IT data that was being logged by our network devices, our servers, our applications, as a way of letting people search through it but also navigate and follow the different events through the infrastructure to get a very quick read on what was happening.

So your software goes into computer networks and finds what network resources are available?

No. We actually go into someone's datacenter and find and index all of the log files, networking traffic, messages coming across SOAP queues and Web service transactions.

We take basically any event data that a machine generates. It could be a database creating a log of all the slow queries hitting it. It could be the audit table in the database.

Every time a write transaction hits the database, it's logged in an audit table because you need to be able to do rollback.

It can be a firewall looking at people trying to log into the VPN [virtual private network]. All of the data ends up being logged by the machinery. The problem is there is a huge amount of it.

In some of the datacenters we're working in, we're talking about terabytes of this kind of data generated every day.

It's really hard for people to work with because it's in lots of different formats and it's scattered all over the place in the datacenter. It's kind of like the Web.

If you can imagine trying to use the Internet without a search engine, it would be pretty difficult.

What devices do you search?

The three primary categories are application, server and network management. So, we can tell you what one server is saying to another and what an application is saying to the network because all of that is being logged.

IBM makes enterprise search software. How would you differentiate their applications from your software?

They do content-based search. It's completely orthogonal to what we do with machine-generated data.

If you take Web pages on the Internet, documents, content sitting in the database, it's very structured information to go out and try to index and search.

When you're talking about the data that's generated by an Oracle database, or an Apache Web server, or a JBoss application server, they're highly, highly unstructured.

In fact, the structure changes fairly frequently, because in every installation people tune the software or the device to log differently.

Whose products would you compare this to on the market?

We think it's a whole new space.

Our competition is people doing this the old-fashioned way -- by hand. They use tools under Unix and Linux, like grep, which is a text-based search tool that can look in a single file at a time.

The tools they have at their disposal are pretty primitive and were really born for a different generation of computing, not really the Web generation.

A lot of these guys have written libraries of homegrown scripts like we used to do at Yahoo. It's very tedious and brittle. As soon as you change your IT infrastructure you have to go modify your scripts and you don't get a whole lot out of it.

It works for one problem, but not for the next.
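To make that contrast concrete, here is a minimal sketch of the kind of homegrown, grep-style script Baum is describing. It is purely an illustration, not anything shipped by Splunk; the log path and failure pattern are hypothetical examples.

```python
# A minimal sketch of the "old-fashioned way" described above: a homegrown
# script that greps one log file at a time for a hard-coded pattern.
# The file path and pattern below are hypothetical, not from the article.
import re
import sys

LOG_FILE = "/var/log/apache/error.log"          # one file, one format
PATTERN = re.compile(r"timed out|connection refused", re.IGNORECASE)

def scan(path: str) -> None:
    """Print every line that matches the hard-coded failure pattern."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        for lineno, line in enumerate(fh, start=1):
            if PATTERN.search(line):
                print(f"{path}:{lineno}: {line.rstrip()}")

if __name__ == "__main__":
    scan(sys.argv[1] if len(sys.argv) > 1 else LOG_FILE)
```

The brittleness Baum describes shows up immediately: as soon as the log format or location changes, the path and pattern have to be edited by hand, and nothing connects events in this one file to related events elsewhere in the infrastructure.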

We index the data by keywords and by time. We classify the events into common and dissimilar buckets to come up with, basically, "these events look alike, these events over here all look different," which is a dynamic classification of everything that's occurring in your machinery.

So there are lots of different ways that we index that data that are very particular to this kind of machine-generated data.
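Splunk hasn't published the details of that classification, but as a rough illustration of the idea, the sketch below masks out variable tokens (numbers, hex IDs, IP addresses) so that log lines sharing the same constant "shape" fall into one bucket. The regex and sample lines are invented for the example and are not Splunk's actual algorithm.

```python
# A rough illustration (not Splunk's actual algorithm) of bucketing log events
# by similarity: variable tokens such as numbers, hex IDs and IP addresses are
# masked, and lines that reduce to the same "shape" land in one bucket.
import re
from collections import defaultdict

VARIABLE_TOKEN = re.compile(
    r"\b\d{1,3}(?:\.\d{1,3}){3}\b"   # IPv4 addresses
    r"|\b0x[0-9a-fA-F]+\b"           # hex identifiers
    r"|\b\d+\b"                      # plain numbers, timestamps, ports
)

def template(line: str) -> str:
    """Reduce a raw log line to its constant 'shape'."""
    return VARIABLE_TOKEN.sub("*", line.strip())

def bucket(lines):
    """Group raw lines under their shared template."""
    buckets = defaultdict(list)
    for line in lines:
        buckets[template(line)].append(line)
    return buckets

if __name__ == "__main__":
    sample = [
        "2005-12-01 10:15:02 login failed for user 4412 from 10.0.0.17",
        "2005-12-01 10:15:09 login failed for user 981 from 10.0.0.93",
        "2005-12-01 10:16:40 disk /dev/sda1 at 97% capacity",
    ]
    for shape, members in bucket(sample).items():
        print(f"{len(members):3d}  {shape}")
```

Run on the sample, the two login failures collapse into one bucket ("these events look alike") while the disk warning stands apart ("these look different"), which is the dynamic classification Baum describes, done here with a crude regex rather than whatever Splunk actually uses.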

How do you sell your software?

We sell our software much the same way the cellular phone companies sell mobile phone service.

We charge strictly based on the amount of data that you're indexing. So you come to us and you buy a data plan for a year. If in a month you need more, you can come back and upgrade your plan, and we start another one-year contract with you.

You can download our server in five minutes, install it on a cheap Linux or Unix box [no Windows yet] and be off and running pretty quickly.

The user interface borrows a lot from our experience in the Web world; it's an interactive AJAX user interface in the Web browser. It's really like a next-generation search engine interface.

I'm sure that my friends at Yahoo and people at Google and other places will eventually copy a bunch of stuff we've done for the Web space.

But we really believe that if you want to market into large enterprises, you've got to go about it in a different kind of way. The old way of going top down into these companies just doesn't work.