Aug10 '15

A repeated, and somewhat pessimistic, conclusion is that the more we privitize data, the more we lose the signal in that data. That is, the safer the data (for sharing) the worse it ebcomes (for making conclusions)

Recent results have addressed this issue. Former RAISE-member (now working on her post-doc) Fayola Peters presented her novel privacy algorithms called LACE2. In recent work with Dr. Tim Menzies, presented at the International Confernce on Software Engineering, Dr Peters applies instance-based learning methods to learns how much (and how litte) we can mutate data without changing the conclusions we might learn from that data. Based on that work, she offers three laws of trusted data mining.

To explain our three laws, we must first introduce the concept of “corners” in a data set. Many researchers in machine learning offer the same conclusion: when learning models from data, it is not necessary to share all rows and columns within tables of data. It turns out that most of the signal in a data set tables can be represented as small “corners” of the data; i.e. just a few columns and just a few rows. While the exact numbers vary from data set to data set:

  • The usual instance selection result is that rows of data contain redundancies; i.e. repeated instances of a similar example. Hence, M1 rows can be approximated using M2=M1/5 (or fewer) rows by (for example), clustering the data then replacing each cluster with the median point within that cluster.
  • The usual feature selection result , is most of the signal in N1 cols comes from N2=sqrt(N1) columns or less (and the remaining data is either noisy or closely correlated to the data in the selected N2 columns).

We say that the “corner” of a data set are just the small number of rows and columns found via instance and row selection. Using the corners, we can state our first law of cost-effective trusted data sharing:

First Law: don’t share everything; just share the corners.

This law is interesting since, in the usual case, that this “corner” is very small compared to the original data set. For example, consider a table of data with 1000 rows and 100 columns. Note that this data has 1000*100 cells. If this data set compresses according to the instance/feature selection ratios listed above, then the “corners” of this data would hold just 200 rows and 10 columns; i.e. 2000 cells in all– which is 2% of the original data. That is, if we just shared the “corners” of this data, then most of the data is never revealed to an outside party (in this case, we’d share 2% and hide 98%).

Of course, if this “corner” contains the essence of the data, then it is important to apply a second law of cost-effective trusted data sharing:

Second Law: anonymize the data in the “corners”.

Our research has suggested an interesting way to implement this second law. Our experience with this approach1,2,3 is that as move from all the data into the “corners”, then this increases the distance between:

  • an example of some class A
  • and the nearest example of some class B.

Halfway between these examples is the class boundary where the classification might flip from A to B. Note that, to privatize data, we could mutate the example A anywhere up to that class boundary without changing what the conclusions of a nearest neighbor algorithm working this dataset. Accordingly, we offer the third law of cost-effective trusted data sharing:

Third Law: while anonymizing, never mutate examples across the class boundary.

Note that, using this third law, our privatization methods achieved better results that standard anonymization algorithms such as k-anonymity , at least for data taken from software engineering projects.

In their recent paper, Dr Peters and Dr. Menzies we simulated a consortium of 20+ data owners, each with a separate data set. A “pass the parcel” system was implemented in which each data owner incrementally added their data to a parcel of shared data- but only the parts of their data that was somehow outstandingly different from the data already in the parel. To define “different”, various instance-based reasoning operators were employed such that when we said some data was “different”, we based that comparison on the most informative attributes. In all, that shared parcel held just 5% of the data owned by all members of the consortium– yet when we build software quality predictors from this 5%, those predictors performed better than predictors built from all that data.

The most significant aspect of that work was that after applying our three laws of data sharing we built predictive models from a privatized version of the shared data and the shared privatized data generated better predictions than the raw data.

So, not only is privitization necessary, it can actually boost the value of the data.