
Predicting anonymity with machine learning in social networks

2015.08.18. 05:29:12 | Gulyás Gábor


Measuring the level of anonymity is not an easy task. It can be done in some special cases, but there is no general recipe. For example, in an anonymized database we could measure the level of anonymity with anonymity set sizes: how many user records share the same properties that could make them identifiable. (And here is the point where differential privacy fans raise their voices, but that story is worth another post.) However, this is much harder for high-dimensional datasets, where a single user has hundreds of attributes (think of movie ratings, for example).
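As a toy illustration of the anonymity set idea (the records below are made up for this sketch), the set sizes fall out of simply grouping records by their quasi-identifying attributes:

from collections import Counter

# Hypothetical anonymized records: (ZIP prefix, birth year, gender)
records = [
    ("14", 1985, "F"), ("14", 1985, "F"),
    ("14", 1990, "M"), ("75", 1985, "F"),
]

# The anonymity set size of a record is the number of records
# sharing exactly the same attribute values
set_sizes = Counter(records)
for rec in records:
    print(rec, "-> anonymity set size:", set_sizes[rec])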

 

 

Introduction

Measuring anonymity in social networks is also non-trivial, even if we restrict our analysis to simple structural properties. While we could measure anonymity by checking the anonymity or similarity set sizes of node properties such as degree, this is problematic: even if we measure node degree, an attacker could fingerprint nodes in another way. Consider the network on the right-hand side of the following figure. Judged by degree alone, nodes v1 and v6 are in the same anonymity set, and thus cannot be unambiguously linked to nodes on the left-hand side. However, it is a different case if we use the degree list of their neighborhoods as a fingerprint: f1 = {5, 3} and f6 = {5, 4}.

Example of attacker background knowledge (left) and the anonymized dataset (right). [source]
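For illustration, here is a minimal sketch of such a neighborhood degree fingerprint on a made-up toy graph (not the one from the figure), using networkx:

import networkx as nx

def degree_fingerprint(G, v):
    # Degrees of v's neighbors, in decreasing order
    return sorted((G.degree(u) for u in G.neighbors(v)), reverse=True)

# Toy graph: nodes 3 and 4 share the same degree,
# but their neighborhood fingerprints differ
G = nx.Graph([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6)])
for v in sorted(G.nodes):
    print(v, "deg:", G.degree(v), "fingerprint:", degree_fingerprint(G, v))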

Furthermore, Narayanan & Shmatikov proposed an approach that overcomes such problems and can re-identify a significant fraction of large networks quite efficiently. It works on a pair of datasets as depicted above, as follows. In the initial phase (called seeding), the attacker identifies mappings between nodes of the two networks that have outstanding global properties, such as a high degree. In this example, this could be the re-identification of v3 as Dave, since deg(Dave) = deg(v3). While this could work for many nodes, it would not for the majority, and in addition, due to computational issues, it does not scale to large datasets. Thus, after re-identifying a sufficient number of nodes this way, the attack moves to the next phase, the propagation phase, where new nodes are iteratively re-identified by checking only a small set of neighboring nodes. In our example, v1 could not be identified by its degree alone; however, as it is the only degree-two node in the neighborhood of v3 (by now identified as Dave), its identity can easily be resolved as Harry.
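To make the two phases concrete, here is a deliberately simplified sketch of the idea. The real attack uses scored, probabilistic matching; this toy version only accepts unique, exact degree matches:

import networkx as nx

def seed_and_propagate(G_aux, G_anon):
    # Toy re-identification: map anonymized nodes to auxiliary nodes
    mapping = {}

    # Seeding phase: match nodes whose degree is unique in both graphs
    def by_degree(G):
        buckets = {}
        for u in G:
            buckets.setdefault(G.degree(u), []).append(u)
        return buckets

    aux_deg, anon_deg = by_degree(G_aux), by_degree(G_anon)
    for deg, anon_nodes in anon_deg.items():
        aux_nodes = aux_deg.get(deg, [])
        if len(anon_nodes) == 1 and len(aux_nodes) == 1:
            mapping[anon_nodes[0]] = aux_nodes[0]

    # Propagation phase: extend the mapping through already mapped pairs,
    # checking only the small set of their neighbors
    changed = True
    while changed:
        changed = False
        for v, u in list(mapping.items()):
            for v2 in G_anon.neighbors(v):
                if v2 in mapping:
                    continue
                mapped = set(mapping.values())
                cands = [u2 for u2 in G_aux.neighbors(u)
                         if u2 not in mapped
                         and G_aux.degree(u2) == G_anon.degree(v2)]
                if len(cands) == 1:  # unambiguous match, accept it
                    mapping[v2] = cands[0]
                    changed = True
    return mapping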

 

Exotic measures for anonymity

This probably gives enough insight to see that measuring anonymity against such attacks is even less trivial. In my dissertation (see Section 4.1, or a related paper here) I argued that correlation could be used as an underestimate of identifiability: the correlation between the number of times a node is re-identified (in a series of experiments) and some of its structural properties.

In my work, I showed that two particular structural properties correlate highly with re-identification counts: node degree and the average similarity of a node to its second-hop neighbors. (For the curious, here are some more details. The original attack compares a node with others that share at least one common neighbor with it. Thus, the more similar a node is to its second-hop neighbors, the less likely it is to be identified. We could also say it depends on how well the node blends into its neighborhood.)

Measures of anonymity.
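One possible reading of the second measure (the average similarity of a node to its second-hop neighbors) as code; this is a sketch only, as Jaccard similarity of neighbor sets is an assumption here, and the dissertation defines the exact similarity function used:

import networkx as nx

def jaccard(G, u, v):
    # Similarity of two nodes = overlap of their neighbor sets
    nu, nv = set(G.neighbors(u)), set(G.neighbors(v))
    union = nu | nv
    return len(nu & nv) / len(union) if union else 0.0

def avg_second_hop_similarity(G, v):
    # Nodes that share at least one common neighbor with v
    second_hop = {w for u in G.neighbors(v) for w in G.neighbors(u)} - {v}
    if not second_hop:
        return 0.0
    return sum(jaccard(G, v, w) for w in second_hop) / len(second_hop)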

 

Based on these measures, I used the Spearman rank correlation to determine whether they could predict anonymity. The following figure shows the formulas and the correlation values I got for different recall rates (the fraction of the network the attacker could successfully re-identify). As you can see, both measures led to convincingly high correlation when the attacker could re-identify a significant part of the network (say, at least 5%).

 

Correlation rates with empirical re-identification rates of nodes. [source]
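Computing such a rank correlation is straightforward with scipy; the arrays below are placeholders rather than the actual experimental data:

import numpy as np
from scipy.stats import spearmanr

# Placeholder data: per-node re-identification counts over a series of
# attack simulations, and a structural measure (here: degree) per node
reid_counts = np.array([0, 3, 17, 20, 1, 9])
degrees = np.array([2, 4, 15, 18, 3, 7])

rho, p = spearmanr(reid_counts, degrees)
print(f"Spearman rho = {rho:.3f}, p-value = {p:.3g}")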

 

Thus, the next question is: can we do better with machine learning?

 

Machine learning comes in

There are a couple of papers now that use machine learning for re-identification. For example, here is a paper on stylometry that trains random forests to distinguish among a large set of C++ authors based on a large set of features. Here, random forests are used quite efficiently to glue together ego-networks that were previously exported from the same network. As random forests are often considered an off-the-shelf machine learning technique, why not use them for predicting anonymity?

To experiment with this, I used the random forest implementation of the scikit-learn package in Python (with 100 trees and otherwise default settings), and networks available in the SNAP repository. To be more exact, I reused the experimental results behind the previous figure.

The next question is what features should be given to the forest. Node degree alone seemed to be a rather small piece of information, while providing the two-hop ego-net of a node to the forest would be too much (note: due to small network diameters, such an ego-net can consist of tens of thousands of nodes). As a compromise, I selected a node fingerprint that consists of the degree of the node and the degrees of its neighbors. The fingerprint vector was restricted to a length of 251; thus I omitted nodes with more than 250 neighbors from my experiments (note: this is not cheating, as it is easy to tell for high-degree nodes that they are almost always re-identified). Here is an example of such a fingerprint:

f = {10, 0, 0, ..., 0, 0, 2, 7, 16, 20, 71, 72, 121, 123, 131, 324}
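For illustration, a fingerprint of this form can be assembled as follows (a sketch using networkx; the actual extraction code in the download may differ):

import networkx as nx

MAX_NEIGHBORS = 250  # nodes with more neighbors were omitted

def fingerprint(G, v):
    # [degree, zero padding, neighbor degrees in increasing order],
    # fixed length of 1 + MAX_NEIGHBORS = 251
    neighbor_degrees = sorted(G.degree(u) for u in G.neighbors(v))
    if len(neighbor_degrees) > MAX_NEIGHBORS:
        return None  # node omitted from the experiments
    padding = [0] * (MAX_NEIGHBORS - len(neighbor_degrees))
    return [G.degree(v)] + padding + neighbor_degrees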

As the sketch shows, the fingerprint consists of the degree, padding zeros, and then the neighbors' degree values in increasing order. I used perturbed variants of the Slashdot network for training (if interested, read about perturbation here in Section 6.2.1), and prediction was tested on perturbed variants of the Epinions dataset. Then I ran two types of experiments:

  1. ML-1: the random forest was trained once, and the same forest was used for all experiments. The training set was the Slashdot network with parameters αv=0.5 and αe=0.5 (see the sketch after this list).
  2. ML-*: the random forest was retrained before making each prediction; the training and test datasets shared the same αv and αe values.
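Here is a minimal sketch of the ML-1 setup with scikit-learn. The feature matrix and targets are synthetic placeholders, and using a regressor on empirical re-identification rates is an assumption about the target encoding:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder data: rows are 251-long fingerprints, targets are the
# empirical re-identification rates of the corresponding nodes
rng = np.random.default_rng(0)
X_train = rng.integers(0, 300, size=(1000, 251))
y_train = rng.random(1000)
X_test = rng.integers(0, 300, size=(200, 251))

# ML-1: train once (100 trees, otherwise default settings) and reuse
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
predicted_anonymity = forest.predict(X_test)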

The following figure compares the results of the previous measures with the correlation achieved by the machine-learning predictions. (Due to memory problems I could not work on some datasets, but the figure still clearly shows what we have here.) When the attacker was less successful (he could only re-identify smaller fractions), the machine-learning-based prediction gave clearly better results. For higher values of recall, the ML-* method still produced reasonably good results, almost as good as the others. Thus we could say that machine learning can do a better job than the manually crafted measures, even though it had access to less information than LTAA.

Machine learning does a pretty good job compared to the manually crafted measures.

 

Downloads

Machine learning did a good job, and with more sophisticated models it could probably have done even better. If you want to run some experiments yourself, you can download the datasets and attack simulation results I used here, together with the Python code for machine learning and correlation checking. Download it here.

If you like and use the concepts discussed here, please cite this related paper of mine, or my PhD thesis.

This post originally appeared on my professional blog.

Tags: anonymity, machine learning, anonymity measure

Permalink: https://pet-portal.eu/blog/read/668/2015-08-18-Predicting-anonymity-with-machine-learning-in-social-...



Comments

1 comment.

Namrata Nayak [ nnayak3@asu.edu ] 2017.03.12. 20:02:46
Hi, I'm trying to understand this tool you developed for learning purposes, however I'm not able to understand how the csv files in the SimuData folders were created. Can you please let me know the program or algorithm that was used to create them?
Much appreciated.





