MSNbot Crawling For MSN Search

Microsoft has unveiled a $100 million investment in search that, among other things, is delivering a significant upgrade to its MSN Search portal.

“With this significant upgrade to MSN Search, we are delighted to now offer what we believe is the best search service available for our 350 million MSN customers. Among the many improvements, we’re particularly excited to increase the relevancy of many search query results by nearly 50 percent,” Yusuf Mehdi, Microsoft corporate vice president of MSN, said in a statement.

“This massive investment kicks off a wave of innovation from MSN that will move search beyond its current, limited offering to delivering the next generation search experience.”

The most obvious difference to users is the MSN user interface, which now more clearly separates paid or sponsored listings from the algorithmic search results. MSN’s own search technology, including its MSNbot Web spider, is still not powering the main MSN Search index, though MSN announced in a technology preview that it does utilize the new crawler’s results. According to Microsoft, the new algorithmic search engine technology is expected to launch within the next year, and its index currently contains over one billion documents.

MSNbot is Microsoft’s next generation Web search crawler that was announced last year. It now appears to be aggressively crawling the Web as part of the algorithmic search effort to be included at a future point in the newly revised MSN Search portal.

For the last two months, Webmasters in various online forums have been posting comments about increased MSNbot activity on their sites.

“They’ve been planning for several months now to update the presentation of their search results in July, and they’ve also been saying for a while that their own organic algorithm would be ready towards the end of this year,” said Jupiter Research analyst Nate Elliot.

According to some postings by Webmasters, the frequency of MSNbot’s visits and the amount of bandwidth the crawler consumed became an issue. Apparently in response to that concern, Microsoft programmed its crawler to respect a new crawl delay parameter that stretches out the time between visits. The parameter is set in the robots exclusion file (robots.txt), which a site publishes to tell crawlers what not to index.
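As a rough illustration of how a crawl delay setting in robots.txt might be read, here is a minimal sketch using Python’s standard `urllib.robotparser` module. The robots.txt content, the 10-second value, the `examplebot` agent and the helper name `crawl_delay_for` are all assumptions made up for this example; the `Crawl-delay` spelling is the form of the directive commonly seen in robots.txt files, not something quoted from Microsoft.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only: asks msnbot to wait
# 10 seconds between requests and keeps all crawlers out of /private/.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow: /private/
"""


def crawl_delay_for(robots_txt, user_agent):
    """Return the crawl delay (in seconds) that applies to user_agent, or None."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


if __name__ == "__main__":
    print("msnbot delay:", crawl_delay_for(ROBOTS_TXT, "msnbot"))
    print("examplebot delay:", crawl_delay_for(ROBOTS_TXT, "examplebot"))
```

A crawler that honors the setting would pause for the returned number of seconds between requests to the same host, and fall back to its own default when the result is `None`.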

Although some in the Web development community criticized Microsoft for introducing a new definition for robots.txt, others, including a noted open source contributor, were more positive about the move. Doug Cutting, lead developer of the open source search engine project Nutch, said he gets the sense Microsoft is playing by the rules.

“The robots exclusion protocol (robots.txt) is the accepted ruleset,” Cutting explained. “Unfortunately, it is vague in some areas, in particular, in how long a crawler should pause between accesses to a host.”

“Microsoft has unilaterally added a crawl delay parameter to robots.txt. This is mostly a good thing, since it clarifies this murky area. It would be best if they worked to make this a part of the standard robots.txt specification, but that is difficult, since the specification is not actively maintained.”

MSNbot is not the first crawler that Webmasters have complained about for visiting their sites too frequently. According to Cutting, there was a time when people complained about how often Google crawled their sites as well.

“In the past, some folks have complained that Google accesses their sites too frequently, but the complaints have been fewer and more muted, since most folks see the direct benefits of having Google crawl their site, namely, more folks finding and visiting their site,” Cutting said. “Microsoft has not yet launched their search service, so they’re not yet giving anything back to sites they crawl, so folks have more incentive to complain.”
