RealTime IT News

Hungry Bot Behind Microsoft's New Search

Noticed a spike in your Web traffic lately? You're not the only one.

Webmasters across the Internet have been tracking a surge in spider activity coming from Microsoft's MSNbot over the past few months and sleuthing over what kind of new search offerings the software giant is cooking up.

MSNbot, the next-generation Web search crawler that Microsoft announced last year, appears to be aggressively crawling the Web as part of the algorithmic search effort to be included at a future point in its newly revised MSN Search portal.

In a sample of more than a dozen Website traffic logs made available to internetnews.com, we saw a substantial increase in MSNbot activity in May. In most cases, the number of visits more than doubled. Activity rose again in June, though by a smaller percentage. On at least one site in May, MSNbot accounted for more hits than either of its rival search crawlers, Yahoo! Slurp or Googlebot.

Microsoft ended some of the speculation over how deeply the bots were indexing sites with the unveiling of "significant" upgrades to the MSN Search service Wednesday.

The confirmation came as Webmasters in various online forums had been posting comments about increased MSNbot activity on their sites and tracking the nascent Microsoft search crawler through their site logs.

On Wednesday, Microsoft unveiled its enhanced search portal, as well as a $100 million investment in search. So far, the announcement ranks as its "most significant upgrade to the MSN Search portal in its history."

The most obvious difference to users is the interface, which now more clearly separates paid or sponsored listings from the algorithmic search results. MSN's own search technology, fed by the MSNbot Web spider, is still not powering the main MSN Search index. But MSN's technology preview does use the new crawler's results. According to a statement issued by Microsoft, the new algorithmic search technology is expected to launch within the next year and currently contains over one billion documents.

As with its search rivals Google and Yahoo, Microsoft's own indexing approach and formulas are a closely guarded secret, given the increasing stakes in the white-hot search sector.

According to some postings by Webmasters, the frequency of MSNbot's visits and the amount of bandwidth the crawler consumed became an issue. Apparently in response to that concern, Microsoft programmed its crawler to respect a new crawl delay setting that stretches out the time between visits. The crawl delay is set in the robots exclusion file (robots.txt), which a site publishes to tell crawlers what not to index.
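As a rough illustration of how a site might set such a delay and how a crawler could read it, here is a minimal Python sketch. It assumes the directive is written as `Crawl-delay` under a `User-agent: msnbot` record; the 10-second value, the `/private/` path and the `examplebot` name are made up for the example and are not taken from Microsoft's statement. Parsing is done with the standard library's `urllib.robotparser`, not with anything MSNbot itself is known to use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the delay value and paths are illustrative only.
SAMPLE_ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 10
Disallow: /private/

User-agent: *
Disallow: /private/
"""


def read_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = read_rules(SAMPLE_ROBOTS_TXT)

# Seconds a well-behaved crawler should wait between requests,
# or None when the site sets no delay for that user agent.
msnbot_delay = rules.crawl_delay("msnbot")
other_delay = rules.crawl_delay("examplebot")

# The file's original job: telling crawlers which paths to stay out of.
may_fetch_private = rules.can_fetch("msnbot", "/private/report.html")
may_fetch_home = rules.can_fetch("msnbot", "/index.html")

print(msnbot_delay, other_delay, may_fetch_private, may_fetch_home)
```

A crawler that honors the setting would sleep for the returned number of seconds between requests to the same host, and fall back to its own default when the value is `None`.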

Although some in the Web development community criticized Microsoft for introducing a new definition for robots.txt, others, including a noted open source contributor, were more positive about the move. Doug Cutting, lead developer of the open source search engine project Nutch, told internetnews.com that he gets the sense Microsoft is playing by the rules.

"The robots exclusion protocol (robots.txt) is the accepted ruleset," Cutting explained. "Unfortunately, it is vague in some areas, in particular, in how long a crawler should pause between accesses to a host. Microsoft has unilaterally added a crawl delay parameter to robots.txt. This is mostly a good thing, since it clarifies this murky area. It would be best if they worked to make this a part of the standard robots.txt specification, but that is difficult, since the specification is not actively maintained."

MSNbot is not the first crawler that Webmasters have accused of visiting their sites too frequently. According to Cutting, there was a time when people complained about Google's crawl frequency as well.

"In the past, some folks have complained that Google accesses their sites too frequently; but the complaints have been fewer and more muted, since most folks see the direct benefits of having Google crawl their site, namely, more folks finding and visiting their site," Cutting said. "Microsoft has not yet launched their search service, so they're not yet giving anything back to sites they crawl, so folks have more incentive to complain."

Still, for now at least, Cutting called Google's crawler the state of the art. "Most sites want Google to crawl them, and hence cater their sites to what Google's crawler does."