Michael Baum, CEO, Splunk

Thanks to the ongoing war between Google, Yahoo and Microsoft, search is becoming a technology juggernaut few can ignore.


With investors and analysts wrapped up in that three-horse race, the value
of finding and cataloging information at the corporate level sometimes goes
unnoticed.


But what about navigating the terabytes of IT data logged by the
infrastructure in your data center? This can be important in an era of
stringent compliance laws that call for IT data to be saved and quickly
recalled.


Enter Splunk, a company determined to move from the fringe to the epicenter
of the search movement.


The startup, whose name is a twist on the term “spelunking,” makes software that helps IT administrators sift and search through logs and IT data from any application, server or network device on the fly.


The company just delivered a 110 percent boost in indexing and search speed
and added a distributed search feature that lets multiple Splunk servers in a
datacenter locate each other, Napster-style.


The idea is to save admins precious time, allowing them to focus on more
important tasks to drive the business.
Internetnews.com recently caught up with Splunk CEO Michael Baum to
discuss Splunk and what it hopes to achieve.


How did Splunk come about and what problems are you trying to solve?


The company was started three years ago by three of us who grew up building
search engines for the Internet at places like InfoSeek (Baum and co-founder
Erik Swan), Yahoo (Baum), and Taligent (co-founder Rob Das and Baum).

After
building those, we had the joyful task of managing all of the infrastructure
to keep them running. My last job at Yahoo was running the e-commerce
business over three years ago.

I had about 12,000 servers running Yahoo
Finance, Yahoo Stores, Yahoo PayDirect, and it was a nightmare to keep all of
this stuff going.

When I bought technology at Yahoo, I was terribly
frustrated with vendors coming in and saying: “Pay me $500,000 and we’ll
install it and see if it works for you.”

I was also frustrated with
downloading open source tools and having to support them on my own.


When you have these large, distributed computing environments where you get
above 100 servers, you run into a lot of problems on a regular basis.

You do
something like that for 10 years, bang your head against a wall and think
there must be a better way. One of the things we noticed is the people
working to build and manage these kinds of environments were spending a lot
of time looking through logging data.

The reason they do that is they’re
trying to figure out what the heck is going on.


So we got to thinking about building a search engine for all of this IT data
that was being logged by our network devices, our servers, our applications,
as a way of letting people search through it but also navigate and follow the
different events through the infrastructure to get a very quick read on what
was happening.


So your software goes into computer networks and finds what network
resources are available?


No. We actually go into someone's datacenter and we find and index all
of the log files, networking traffic, messages coming across SOAP queues and
Web service transactions.

We take basically any event data that a machine
generates. It could be a database creating a log of all the slow queries
hitting it. It could be the audit table in the database.

Every time a write
transaction hits the database, it's logged in an audit table because
you need to be able to roll back.


It can be a firewall looking at people trying to log into the VPN [virtual
private network]. All of the data ends up being logged by the machinery. The
problem is there is a huge amount of it.

In some of the datacenters we're working in, we're talking about terabytes a
day of this kind of data.

It's really hard for people to work with it because it's in lots of
different formats and scattered all over the datacenter.
It's kind of like the Web.

If you can imagine trying to use the Internet
without a search engine, it would be pretty difficult.


What devices do you search?


The three primary categories are application, server and
network management. So, we can tell you what one server is saying to another
and what an application is saying to the network because all of that is
being logged.


IBM makes enterprise search software. How would you differentiate their
applications from your software?


They do content-based search. It’s completely orthogonal to what we do with
machine-generated data.

If you take Web pages on the Internet, documents,
content sitting in the database, it’s very structured information to go out
and try to index and search.

When you're talking about the data that's
generated by an Oracle database, or an Apache Web server, or a JBoss
application server, it's highly, highly unstructured.

In fact, the
structure changes fairly frequently, because in every installation people tune
the software or the device to log differently.


Whose products would you compare this to on the market?


We think it’s a whole new space.

Our competition is people doing this the
old-fashioned way, by hand. They use tools under Unix and Linux, like grep, a text-based
search tool typically run against one file at a time.


The tools they have at their disposal are pretty primitive and were really
born for a different generation of computing, not really the Web generation.

A lot of these guys have written libraries of homegrown scripts like we used
to do at Yahoo. It’s very tedious and brittle. As soon as you change your IT
infrastructure you have to go modify your scripts and you don’t get a whole
lot out of it.

It works for one problem, but not for the next.
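The brittleness Baum describes is easy to illustrate. Below is a minimal Python sketch of the kind of single-purpose, hand-written check he's talking about; the log format, pattern, and function name are invented for illustration, not taken from any real deployment:

```python
import re

# One hard-coded pattern for one known failure mode in one log format.
# If an upgrade changes the message wording, this script silently stops
# matching -- the classic brittle homegrown tool.
FAILURE = re.compile(r"ERROR .*connection refused")

def count_failures(lines):
    """Count log lines matching the single hard-wired failure pattern."""
    return sum(1 for line in lines if FAILURE.search(line))

logs = [
    "2005-10-03 12:00:01 ERROR backend connection refused",
    "2005-10-03 12:00:02 INFO request served in 12 ms",
    "2005-10-03 12:00:03 ERROR backend connection refused",
]
```

Each new problem needs a new pattern, and any change to the log format invalidates all of them, which is exactly the "works for one problem, but not for the next" complaint.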

We index the data by keywords, by time. We classify the events into common
and dissimilar buckets to come up with basically, “these events look alike,
these events all look different over here,” which is a dynamic
classification of everything that’s occurring in your machinery.

So there
are lots of different ways that we index that data that are very particular
to this kind of machine-generated data.
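The dynamic classification Baum describes can be sketched in a few lines of Python. This is an illustrative reconstruction, not Splunk's actual algorithm: it masks the variable parts of each event (just the digits, in this toy version) so events that share a "shape" fall into the same bucket:

```python
import re
from collections import defaultdict

def template(line):
    """Reduce a raw log line to a rough shape by masking numbers, so
    lines differing only in timestamps, IDs, or counts bucket together."""
    return re.sub(r"\d+", "#", line)

def bucket_events(lines):
    """Group log lines into buckets of similar-looking events."""
    buckets = defaultdict(list)
    for line in lines:
        buckets[template(line)].append(line)
    return buckets

events = [
    "ERROR db query took 5041 ms",
    "ERROR db query took 312 ms",
    "WARN connection 42 reset by peer",
]
buckets = bucket_events(events)
# The two slow-query lines share one bucket; the reset line sits alone.
```

A real system would also index each event by its keywords and timestamp so that "these events look alike / these look different" can be answered over a time range, not just over a static file.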


How do you sell your software?


We sell our software much the same way the cellular phone companies sell
mobile phone service.

We charge strictly based on the amount of data that
you're indexing. So you come to us and you buy a data plan for a year. If in
a month you need more, you can come back and upgrade your plan, and we start
another one-year contract with you.


You can download our server in five minutes, install it on a cheap Linux or
Unix box [no Windows yet] and be off and running pretty quickly.

The user
interface borrows a lot from our experience in the Web world; it's an
interactive, AJAX user interface in the Web browser. It's
really like a next-generation search engine interface.


I’m sure that my friends at Yahoo and people at Google and other places will
eventually copy a bunch of stuff we’ve done for the Web space.

But we really
believe that if you want to market into large enterprises, you've got to
go about it in a different kind of way. The old way of going top down into
these companies just doesn't work.
