RealTime IT News

Google Launches Code Search

In the beginning, Google was all about indexing Web pages; then it grew to cover news, blogs and video.

Now Google is embarking on, perhaps, its most ambitious indexing venture yet: indexing countless billions of lines of code as part of the new Google Code Search.

Google Code Search on Google Labs gives users a place to search for publicly accessible source code. It looks like a regular Google Search page but instead of searching Web pages, it's going to search billions of lines of code.

"The two ways that source code lives on the Internet is in archives, things like Zip files, gzip, etc. And then in software-control repositories like SourceForge.net, Google's code hosting, and other places," Google product manager Tom Stocky told internetnews.com.

"We'll be crawling all of that."

Google isn't just going to index the Zip archive files. It's actually going to open up the archives and index all the individual files within them.
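
As a rough illustration of what indexing the files inside an archive involves, the Python sketch below opens a Zip file held in memory and maps each identifier-like word to the files and line numbers where it appears. It is not Google's implementation; the function names index_zip_archive and build_sample_archive and the sample archive contents are hypothetical.

```python
import io
import re
import zipfile
from collections import defaultdict


def index_zip_archive(data: bytes) -> dict:
    """Map each identifier-like token to (file name, line number) pairs.

    Hypothetical helper for illustration; not Google's actual indexer.
    """
    index = defaultdict(list)
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries hold no code
            text = archive.read(info).decode("utf-8", errors="replace")
            for line_no, line in enumerate(text.splitlines(), start=1):
                # set() records each token once per line
                for token in set(re.findall(r"[A-Za-z_]\w*", line)):
                    index[token].append((info.filename, line_no))
    return dict(index)


def build_sample_archive() -> bytes:
    """Create a tiny in-memory Zip file to exercise the indexer."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("src/hello.c", "int main(void) {\n    return 0;\n}\n")
        archive.writestr("README", "hello world\n")
    return buffer.getvalue()


if __name__ == "__main__":
    sample_index = index_zip_archive(build_sample_archive())
    print(sample_index["main"])
```

The point of the sketch is that the unit being indexed is each file inside the archive, so a hit can be reported by file name and line rather than by the name of the Zip file alone.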

In the case of software-control repositories like CVS and SVN, Google will go into the public access and index the individual files within them.

Google's regular Googlebot crawler is being used to find and identify the Zip files. Software-control repositories, Stocky noted, need a different kind of crawler, one that connects to the CVS or SVN server and speaks that server's protocol to retrieve the files.

The total task is staggering.

Stocky was unable to provide a figure, but he did note that Google Code Search indexes billions of lines of code.

"We're not getting more specific than that, but it is a significant number," Stocky noted.

Google Code Search will offer users a number of different ways to find the code they are looking for. Users can narrow search queries by software license, programming language and file name.
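
A minimal sketch of how such a filter might narrow a search, assuming a small in-memory set of files: the Python below restricts matches by file name and then tests each line against a query, written here as a regular expression, which Code Search also supports. The search_code function, its parameters and the sample files are assumptions made for illustration; they do not reflect Google Code Search's actual query syntax or API.

```python
import fnmatch
import re

# Hypothetical corpus: file name -> file contents.
SAMPLE_FILES = {
    "util/strings.c": "char *str_dup(const char *s);\nint str_len(const char *s);\n",
    "util/strings.py": "def str_len(s):\n    return len(s)\n",
    "main.c": "int main(void) { return str_len(\"hi\"); }\n",
}


def search_code(files, pattern, file_glob="*"):
    """Return (file name, line number, line) for each line matching pattern.

    Only files whose names match file_glob are searched.
    """
    regex = re.compile(pattern)
    hits = []
    for name in sorted(files):
        if not fnmatch.fnmatchcase(name, file_glob):
            continue  # file name filter excludes this file
        for line_no, line in enumerate(files[name].splitlines(), start=1):
            if regex.search(line):
                hits.append((name, line_no, line))
    return hits


if __name__ == "__main__":
    # Find C declarations of str_* functions taking a const char argument.
    for hit in search_code(SAMPLE_FILES, r"\bstr_\w+\(const char", "*.c"):
        print(hit)
```

A plain keyword such as str_len would match every file that mentions the function; the pattern plus the file name filter narrows the results to the C declarations only.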

"We also support regular expressions, so instead of searching for keywords you can search for patterns of words," Stocky explained. "For people that know how to use regular expressions well you can get really specific search and search over some really obscure stuff."

Google is also releasing an API for Code Search at launch. The API will use Google's GData API format.

At launch there will be no Google AdSense ads on the results pages and the Code Search results are not integrated into the main Google index.

One possible use of Google Code Search is for developers to search for their own code and see where other people are using it. It may also help to combat plagiarism and software license violations.

"If you own code and someone else is posting illegally, there is a process where we can remove it from the index," Stocky noted.

Most of the code indexed by Google Code Search is open source-licensed. Stocky noted that Google doesn't believe that much, if any, is proprietary since it's all posted in public places.

"In the case of CVS and SVN there is a password capability so we believe if someone didn't want it to be seen by the outside world they would either have a password or not post it to a public place," Stocky commented.

Google Code Search is available on Google Labs or via the advanced search option on Google.com.