In the beginning, Google was all about indexing Web pages. Then it grew to
news, blogs and video.
Now Google is embarking on, perhaps, its most ambitious indexing venture yet: indexing countless billions of lines of code as part of the new Google Code Search.
Google Code Search on Google Labs gives users a place to search for publicly
accessible source code. It looks like a regular Google Search page but
instead of searching Web pages, it’s going to search billions of lines of code.
“The two ways that source code lives on the Internet is in archives, things
like Zip files, gzip, etc. And then in software-control repositories like
SourceForge.net, Google’s code hosting, and other places,” Google product manager Tom Stocky said.
“We’ll be crawling all of that.”
Google isn’t just going to index the Zip archive files. It is actually
going to open the archives and index all the individual files within them.
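
As a rough illustration of what opening an archive and indexing the files inside it can look like, here is a minimal Python sketch. It is not Google's implementation: the function names `iter_archive_files` and `build_line_index` are hypothetical, and the sketch assumes the archive is a Zip file already held in memory and that its contents can be decoded as text.

```python
import io
import zipfile


def iter_archive_files(archive_bytes):
    """Yield (file name, decoded text) for each regular file in a Zip archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no source code
            raw = archive.read(info)
            # Assumption: treat every member as text; undecodable bytes are replaced.
            yield info.filename, raw.decode("utf-8", errors="replace")


def build_line_index(archive_bytes):
    """Map each file name inside the archive to its list of lines."""
    return {
        name: text.splitlines()
        for name, text in iter_archive_files(archive_bytes)
    }
```

The point of the sketch is only that each file inside the archive becomes its own searchable unit, rather than the archive being indexed as one opaque download.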
In the case of software-control repositories like CVS and SVN, Google will use
their public access to index the individual files within them.
Google’s regular Googlebot crawler is being used to find and identify the
Zip files. For software-control repositories, Stocky noted, a different kind
of crawler has to access the CVS or SVN server and speak a different
protocol to retrieve the information.
The total task is staggering.
Stocky was unable to provide an exact figure, but he did note that Google
Code Search indexes billions of lines of code.
“We’re not getting more specific than that, but it is a significant number,”
Stocky said.
Google Code Search will offer users a number of different ways to find the code
they are looking for. Users can narrow search queries by software
license, programming language and file name.
“We also support regular expressions, so instead of searching for keywords
you can search for patterns of words,” Stocky explained. “For people that
know how to use regular expressions well you can get really specific search
and search over some really obscure stuff.”
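
To make the idea of pattern search with filters concrete, here is a small Python sketch. It is only an illustration under stated assumptions, not the Code Search query syntax or API: the `SourceFile` record, the `search` function and its `language`, `license` and `file_name` parameters are all hypothetical names chosen for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class SourceFile:
    # Hypothetical record for one indexed file.
    path: str
    language: str
    license: str
    text: str


def search(files, pattern, language=None, license=None, file_name=None):
    """Return (path, line number, line) for every line matching the regex."""
    regex = re.compile(pattern)
    name_regex = re.compile(file_name) if file_name else None
    hits = []
    for f in files:
        # Apply the optional metadata filters before scanning the text.
        if language and f.language != language:
            continue
        if license and f.license != license:
            continue
        if name_regex and not name_regex.search(f.path):
            continue
        for number, line in enumerate(f.text.splitlines(), start=1):
            if regex.search(line):
                hits.append((f.path, number, line))
    return hits


corpus = [
    SourceFile("lib/util.c", "c", "gpl",
               "int add(int a, int b)\n{\n    return a + b;\n}\n"),
    SourceFile("app/util.py", "python", "mit",
               "def add(a, b):\n    return a + b\n"),
]

# A pattern rather than a keyword: "return", an identifier, then a plus sign.
c_hits = search(corpus, r"return\s+\w+\s*\+", language="c")
```

A keyword search for "return" would match both files; the pattern plus the language filter narrows the result to the C source only.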
Google is also releasing an API for Code Search at launch. The
API will use Google’s GData format.
At launch there will be no Google AdSense ads on the results pages, and the
Code Search results are not integrated into the main Google index.
One of the possible uses of Google Code Search is for developers to do
searches for their own code and see where people are using it. It may also
help to combat plagiarism and software license violations.
“If you own code and someone else is posting illegally, there is a process
where we can remove it from the index,” Stocky noted.
Most of the code indexed by Google Code Search is open source-licensed.
Stocky noted that Google doesn’t believe that much, if any, is
proprietary since it’s all posted in public places.
“In the case of CVS and SVN there is a password capability so we believe if
someone didn’t want it to be seen by the outside world they would either
have a password or not post it to a public place,” Stocky commented.
Google Code Search is available on Google Labs or via the advanced search option on Google.com.