RealTime IT News

IBM Hashes Out Privacy Preserving Scheme

It's one of those dicey, give-and-take situations. E-commerce merchants rely on customer data to customize services, while customers are reluctant to offer the honest goods. But have no fear: IBM scientists are tucked away in labs, working on new techniques to eliminate the problem of customers who offer disingenuous personal data to protect their privacy from malevolent hackers.

Ideally, if the work at IBM's Privacy Research Institute comes to fruition, e-commerce merchants would be able to cull accurate data without compromising customer privacy. As it stands now, many customers lie when Web sites ask for data such as name, gender and age.

What Rakesh Agrawal and Ramakrishnan Srikant, researchers at IBM's Almaden Research Center in California, are developing is a method that uses data mining to circumvent the dilemma. They call it "Privacy-Preserving Data Mining," and it hinges on the idea that a customer's personal data can be scrambled before it is relayed to a Web server or database.

With this, a retailer could generate accurate data models -- so valuable in helping e-businesses customize services according to demographics or tastes -- without ever seeing personal information.

IBM Thursday offered an idea of how it works. Suppose a Web user wants to enter a piece of personal data, such as age. Take the age 35. Upon entry, that number is scrambled, or "randomized," by IBM software: the software takes the number that was entered and adds or subtracts a random value. That 35-year-old's age may be randomized to 24, for instance, while a 40-year-old's entry may be randomized to 28.
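The age example can be sketched in a few lines of Python. This is only an illustration, not IBM's software: the function name `randomize_age`, the choice of a uniformly distributed offset and the `spread` of 15 years are all assumptions made for the sketch.

```python
import random

def randomize_age(age, spread, rng):
    """Mask an age by adding a random whole-number offset in [-spread, +spread]."""
    offset = rng.randint(-spread, spread)  # uniform offset is an assumption
    return age + offset

rng = random.Random(2024)   # seeded only so that a run is repeatable
spread = 15                 # hypothetical privacy setting, not an IBM value
for true_age in (35, 40):
    print(true_age, "->", randomize_age(true_age, spread, rng))
```

Each reported age lands within `spread` years of the real one, so a 35-year-old could plausibly show up as 24.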

Now, what remains constant is the allowed range of the randomization, which is linked to the desired level of privacy. A larger range increases the uncertainty, and with it the privacy of the users, but naturally costs some accuracy in the results produced by a data mining algorithm that uses the randomized data as input.

Agrawal maintains that this is a trade-off. Experiments indicate only a 5 to 10 percent loss in accuracy, even for 100 percent randomization, after the data mining algorithm has applied corrections to the randomized distributions.
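Why aggregate results can survive heavy per-record noise is easy to see in a small simulation. The sketch below is not the correction procedure the IBM researchers use, which this article does not describe. It assumes made-up ages, uniform zero-mean noise and a hypothetical range of plus or minus 30 years, and shows only that the average barely moves and that the known noise variance can be subtracted back out.

```python
import random
import statistics

rng = random.Random(7)
spread = 30.0        # hypothetical randomization range: +/- 30 years
n = 20000            # number of simulated customers

true_ages = [rng.randint(18, 80) for _ in range(n)]          # made-up data
masked_ages = [a + rng.uniform(-spread, spread) for a in true_ages]

# Zero-mean noise leaves the average roughly unchanged.
true_mean = statistics.mean(true_ages)
masked_mean = statistics.mean(masked_ages)

# Uniform noise on [-s, +s] has variance s**2 / 3, so subtracting that
# from the masked variance gives an estimate of the true variance.
noise_variance = spread ** 2 / 3
true_variance = statistics.pvariance(true_ages)
estimated_variance = statistics.pvariance(masked_ages) - noise_variance

print(f"mean:     true {true_mean:.2f}, masked {masked_mean:.2f}")
print(f"variance: true {true_variance:.1f}, corrected estimate {estimated_variance:.1f}")
```

No individual masked age is trustworthy here, yet the population-level figures stay close to the real ones because the merchant knows the randomization parameters.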

"Our research institutionalizes the notion of fibbing on the Internet, and does so to preserve the overall reality behind the data," said Agrawal.

IBM also provided another example using, fittingly, an IT manager's salary. In short, the software could apply a randomization range of -$15,000 to +$15,000. If John Doe wants to enter his salary on a Web merchant's site, he may do so without fear of the actual salary being revealed. Suppose he earns $90,000. He enters it, but the software scrambles it and reports a value somewhere between $75,000 and $105,000. Hence, John Doe's true salary is masked. The merchant's software has access only to the randomized values and the parameters of randomization, and nothing else.

On the back end, the merchant has a rough idea of what John Doe's salary is and can customize its services accordingly.
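The merchant's "rough idea" can be made concrete. In the sketch below, which assumes a uniform offset and uses hypothetical function names, the merchant sees only the reported figure and the range of -$15,000 to +$15,000, and from those can infer an interval that must contain the true salary.

```python
import random

SPREAD = 15_000  # the -$15,000 to +$15,000 range from the salary example

def randomize_salary(salary, rng):
    """Report the salary shifted by a random offset within SPREAD (uniform, by assumption)."""
    return salary + rng.randint(-SPREAD, SPREAD)

def possible_range(reported):
    """Every true salary that is consistent with a reported value."""
    return reported - SPREAD, reported + SPREAD

rng = random.Random()
reported = randomize_salary(90_000, rng)   # John Doe's true salary never leaves his side
low, high = possible_range(reported)
print(f"merchant sees {reported}; true salary lies in [{low}, {high}]")
```

The interval is always $30,000 wide, which is the sense in which a wider randomization range buys more privacy.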