RealTime IT News

Everybody's a Sundance Critic

As if stopping spam weren't challenging enough, imagine trying to use e-mail filtering software to pick the winners and losers at this year's Sundance Film Festival.

That's what Unspam Technologies, an anti-spam consulting firm based in Park City, Utah, home of Sundance, tried to do.

Unspam used a modified version of POPFile, an open-source program that employs a Bayesian classification system to identify junk mail, to separate the cinematic wheat from the chaff among contenders at the festival, which runs through Sunday, Jan. 29.

Bayesian filtering, a popular tool for stopping spam in the last few years, uses statistical methods based on the work of the 18th-century mathematician the Rev. Thomas Bayes. Many e-mail programs, such as Mozilla Thunderbird, and anti-spam programs, such as SpamAssassin, use Bayesian spam filtering techniques.

A Bayesian spam filter picks out spam by identifying words and phrases that commonly appear in unwanted messages. It draws on a large number of variables so that, for example, it can tell the difference between a message from a friend that contains the word "Viagra" and a Viagra ad from a spammer.

The advantage of Bayesian systems is that they become more accurate over time as they are "trained," in effect, by being exposed to an increasing number of e-mail messages that are identified as either spam or not by the recipient.
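To make the training idea concrete, here is a minimal sketch of a naive Bayes text classifier in Python. It is not POPFile's code or Unspam's modified version; the class name, the add-one smoothing and the four toy messages are all assumptions chosen for illustration.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy two-class naive Bayes filter (hypothetical, for illustration only)."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, text, label):
        # Each message the recipient labels adds to the per-class word tallies.
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def spam_probability(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in self.LABELS:
            # Log prior for the class, with add-one smoothing.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                # Smoothed likelihood of the word given the class.
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_scores[label] = score
        # Convert the two log scores into a probability of spam.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])


# Made-up training messages, not real data.
spam_filter = NaiveBayesFilter()
spam_filter.train("buy cheap viagra now", "spam")
spam_filter.train("cheap viagra discount offer", "spam")
spam_filter.train("lunch tomorrow with mom", "ham")
spam_filter.train("dad asked his doctor about viagra at lunch", "ham")

ad_score = spam_filter.spam_probability("cheap viagra offer")
friend_score = spam_filter.spam_probability("lunch with dad about viagra")
```

In this toy run the ad scores well above 0.5 and the friend's note well below it, even though both mention "Viagra," because the surrounding words pull the estimate in opposite directions. Each additional labeled message refines the word counts, which is the sense in which such a filter is "trained."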

With that in mind, Matthew Prince, CEO of Unspam, had his employees analyze film guides containing descriptions of Sundance films over the last 10 years. The group looked at some 200 variables in the descriptions, including the time of day that the film premiered, the location of the theater and the author of the description.

"We took all these data points and fed them into a database," Prince said. "Then we looked at how successful the films were, based on a variety of indicators, including the number of awards it received, how much money it made at the box office and how many people had voted for a particular film on the Internet Movie Database."

The idea was to see which words and other characteristics showed up in the descriptions of films that later proved successful. That information was then used to predict this year's winners and losers.
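As a rough illustration of that step, the sketch below tallies how often each characteristic appeared alongside a past hit and turns the tally into a smoothed hit rate. It is a simple per-feature count, not Unspam's actual model; the function name, the feature labels and the four sample records are hypothetical.

```python
from collections import defaultdict


def feature_hit_rates(history, prior_hits=1, prior_total=2):
    """Return {feature: smoothed share of past films with that feature that were hits}.

    history is a list of (features, was_hit) pairs. The prior_* values are an
    assumed add-one style smoothing so that rarely seen features are not
    scored as certain hits or certain flops.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for features, was_hit in history:
        for feature in set(features):
            totals[feature] += 1
            hits[feature] += int(was_hit)
    return {
        feature: (hits[feature] + prior_hits) / (totals[feature] + prior_total)
        for feature in totals
    }


# Invented records for illustration; not drawn from any festival guide.
history = [
    ({"format:digital", "author:A", "slot:evening"}, False),
    ({"format:digital", "author:B", "slot:morning"}, False),
    ({"format:film", "author:A", "slot:evening"}, True),
    ({"format:film", "author:A", "slot:morning"}, True),
]

rates = feature_hit_rates(history)
# Features sorted from most to least associated with past hits.
ranked = sorted(rates.items(), key=lambda item: item[1], reverse=True)
```

A feature whose rate sits near 0.5, such as the time slots in this invented sample, carries little signal, while one far from 0.5 is a candidate predictor for this year's lineup.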

Some of the variables, like the number of adjectives found in a particular description, turned out to be meaningless. But others, like the name of the person who wrote the description, turned out to be significant. In some cases, half of the films whose descriptions were written by certain authors turned out to be successful.

Although Sundance has made a huge push to promote digital filmmaking, digital films flopped 72 percent of the time, while only 19 percent were hits, Prince said.

Ken Dunham, director of malicious code at iDefense, Inc., said the idea of picking films using a Bayesian system was intriguing, but the concept required further study.

"It sounds like a fun project," he said. "But even if a bunch of their predictions come true, it still doesn't prove that the Bayesian process worked. They'll need to reproduce the results and measure against a [control] group. It's difficult to do that with one test."