Jim Gray is a distinguished engineer in Microsoft’s Scaleable Servers
Research Group and manager of its Bay Area Research Center. Gray is one of
six Microsoft researchers based in San Francisco, three of whom work on
managing personal media; the others, including Gray, concentrate on working
with large data sets. He focuses on building supercomputers with commodity
components, including fast networks, huge Web servers and inexpensive, very
high-performance storage servers.
Gray is working with the astronomy community to create SkyServer, an
effort to make the entire world’s astronomy data accessible as a single
distributed database on the Internet. Another project, TerraServer-USA,
provides free public access to a vast data store of maps and aerial
photographs of the United States. TerraServer also is available as a Web
service.
Gray recently sat down with internetnews.com to discuss what’s on the company’s cutting edge.
Q: Your work seems to be moving toward providing data as a public
utility.
Exactly. Our premise is that a lot of the excitement about grid computing is
misplaced. Many people talk about outsourcing computation. Frankly, it’s not
very interesting to outsource computation. But people have questions they
want to outsource. Most grid computing we’ll see in the future will be a
data grid. You’ll go places for the data and not the computes.
TerraServer is an example of a Web services project, and one of our most
successful subscriptions to it is MapPoint, a geo-location service. If you
give MapPoint an address, it will tell you what the longitude and latitude
are, and it gives you points nearby. There’s a bit of enthusiasm for it,
especially for cell phones. The cell phone knows where it is, so now it can
ask what’s the closest gas station, and how do I get there? It’s already
part of MSN.
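As an illustration of the "closest gas station" question described above, here is a minimal sketch of a nearest-point lookup. It is not MapPoint's or TerraServer's actual interface: the function names, the `STATIONS` list and its coordinates are all hypothetical, and a real service would query a spatial index rather than scan a list. The sketch assumes the phone already knows its latitude and longitude and uses the haversine formula for great-circle distance.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def closest(here, places):
    """Return the (name, lat, lon) entry in places nearest to here=(lat, lon)."""
    return min(places, key=lambda p: haversine_km(here[0], here[1], p[1], p[2]))


# Hypothetical sample data, for illustration only.
STATIONS = [
    ("Station A", 37.80, -122.41),
    ("Station B", 37.77, -122.42),
    ("Station C", 37.33, -121.89),
]

if __name__ == "__main__":
    phone_position = (37.78, -122.42)  # assumed position reported by the phone
    name, lat, lon = closest(phone_position, STATIONS)
    print(name, round(haversine_km(*phone_position, lat, lon), 2), "km away")
```

A linear scan like this is fine for a handful of points; for a national data set the same distance function would sit behind a spatial index or a database query.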
Q: You’ve said that federating the astronomy archives presents
interesting challenges for computer scientists. What are some of those
challenges?
There are technical challenges and non-technical challenges. As with all
things, the non-technical challenges, the people, are the biggest problems.
For example, getting all the astronomers to agree there are stars and
galaxies is easy. But getting them to agree to what exactly a galaxy is and
what its properties are and how to measure them? Now, we’re getting close to
deep beliefs about the way astronomy should be done. The biggest challenge
we face in all human endeavors is getting people to agree. Once people
agree, it’s just engineering.
Q: What are the goals of TerraServer and SkyServer? Are they to provide
public access to the kinds of scientific data that used to be hidden within
the scientific community?
No. It was a crass attempt to show off the scalability of Microsoft
software. When I came to Microsoft, it was even truer than it is today that
we were a desktop operating system and productivity tools company. We had
some server software already in place — Exchange and SQLServer were
available — but they did not get very much respect. We’ve been working for
close to a decade to change that. One way is to dog-food your own stuff, to
build large servers and experiment with them, and see what works and what
doesn’t work.
It’s interesting to see astronomers working with PostgreSQL, MySQL, Oracle,
DB2 and SQLServer. You get a pretty honest comparison. SQLServer’s strength
is that it’s a fairly complete implementation and very easy to install and
manage. We’re learning a lot about what we could do better.
Q: You mentioned that the software to process SkySurvey data, to
present the catalogs to the public and to federate it with other archives,
often comprises more than 25 percent of the project’s budget. Is that an
opportunity for Microsoft?
It is more than 25 percent, and that’s a lot of money, especially for the
astronomers. The software is more expensive than the telescope, which came
as a shock to me. But it’s custom software, peculiar to what astronomers are
doing with their data. There is some generic software in there, the
operating system and some programming languages — certainly, we’re in that
business — but that is a tiny part of the whole software bill.
Q: Will Web services make that custom code more reusable?
My enthusiasm about this work, why I think it’s good for Microsoft to be
funding the astronomy community, is that it’s a fairly vendor-neutral, open
collaboration way of experimenting with some new ideas like Web services.
Many people go off and implement Web services, but they can’t tell you much
about what they did, because it’s a competitive advantage. Here, people talk
about how they implement these ideas, talk about how the implementation
works and show people the code, so others can copy it.
People in private industry can look at what the astronomers have done and
apply it to their widgets. It’s not so much that we’re hoping to make a buck
on the astronomers, but that we’re hoping to learn what works and doesn’t
about the technology. SkyServer is like a sandbox we’re playing in. We’re
building prototypes of what we think will be typical of other enterprises in
the future.
Q: SkyServer seemed pretty fast to me — and the user interface was
fairly friendly. How are you addressing the presentation layer issues that
still bedevil grid computing? Applications remain bandwidth hogs, even when
people are pooling CPUs in a grid format.
It’s a challenge. The TerraServer is fundamentally limited by how much
we’re willing to spend on the phone bill. We design given the phone bill. We
don’t do fly-throughs or show movies. We insist people click for every
screen they get, and that limits how much bandwidth you can soak up. It’s a
significant part of the cost of running the site.
Q: What got you interested in working with large bodies of data?
I’m puzzled to this day by what existence is about. How do we know
anything? When I went to college, I started as a philosophy major.
Philosophers were trying to understand reasoning using predicate calculus.
But thought is very complicated. Computers at their base operate in a very
simple way, but it was clear even in 1960 that we were not going to be able
to explain human thought with predicate calculus. I’m not sure we’re any
closer today to understanding how thought works, but it’s a noble goal to do
that and a fascinating one.