In a previous blog entry, I described how random forests could be used to predict the level of empirical identifiability. I have also been experimenting with neural networks to see how they could be used to solve the same problem. As there is a myriad of great tutorials and ebooks on the topic, I'll simply continue from the previous post. Here, instead of the scikit-learn package, I used the keras package for modeling artificial neural networks, which relies on theano. (Theano allows efficient execution by using the GPU. Currently only NVIDIA CUDA is supported.)
The current setting is the same as described in the previous post: node neighborhood-degree fingerprints are used to predict how likely it is that the nodes would be re-identified by the state-of-the-art attack. As I've seen examples using raw image data for character classification (as for the MNIST dataset) with a Multi-Layer Perceptron structure, I decided to use a simple, fully connected MLP network, where the whole node fingerprint is fed to the net. Thus the network consists of an input layer of 251 neurons (with rectified linear unit activation, or relu for short) and a hidden layer of 128 neurons (also with relu). To achieve classification, I used 21 output neurons to cover all possible score values in the range -10, ..., 10. Here, I used a softmax layer, as an output that behaves like a probability distribution is easier to handle for classification. See the image below for a visual recap.
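To make the layer structure concrete, here is a minimal sketch of the forward pass in plain Python. It is not the keras code used for the experiments. The weights are random and untrained, the fingerprint length of 251 is an assumption (the post only gives the layer sizes), and all function names are hypothetical.

```python
import math
import random

# The 21 output classes correspond to the scores -10, ..., 10.
SCORES = list(range(-10, 11))

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(v):
    # Subtract the maximum for numerical stability.
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def dense(x, weights, biases):
    # One fully connected layer: each output is a weighted sum of all inputs.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    # Small random weights; a real model would learn these by training.
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def build_mlp(fingerprint_len, rng):
    # Layer sizes from the post: 251 (relu), 128 (relu), 21 (softmax).
    sizes = [fingerprint_len, 251, 128, len(SCORES)]
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

def predict_distribution(model, fingerprint):
    h = fingerprint
    for i, (w, b) in enumerate(model):
        h = dense(h, w, b)
        h = softmax(h) if i == len(model) - 1 else relu(h)
    return h

def predicted_score(distribution):
    # Pick the score whose output neuron has the highest probability.
    best = max(range(len(distribution)), key=distribution.__getitem__)
    return SCORES[best]

rng = random.Random(0)
model = build_mlp(251, rng)  # assumed fingerprint length
fingerprint = [rng.random() for _ in range(251)]  # placeholder input
dist = predict_distribution(model, fingerprint)
print(len(dist), predicted_score(dist))
```

The softmax output sums to one, so taking the index of the largest entry and shifting it by -10 gives the predicted score class.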
I did all the training and testing as last time: the perturbed Slashdot networks were used for training, and perturbations of the Epinions network served as test data. In each round, with a different level of perturbation (i.e., a different level of anonymization or strength of attacker background knowledge), I retrained the network with Stochastic Gradient Descent (SGD), using the dropout technique; you can find more of the details in the python code. As the figure below shows, this simple construction (hint: it was also the first successful try) could beat the previous results, although with some constraints.
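For readers unfamiliar with the two training ingredients mentioned above, the following is a small illustrative sketch of dropout and of the SGD update rule in plain Python. It is not taken from the post's code: the dropout rate, the learning rate, and the toy objective are assumptions chosen only for illustration, and the function names are hypothetical. In real SGD the gradient would come from a mini-batch of training fingerprints rather than from a fixed formula.

```python
import random

def dropout(activations, rate, rng, training=True):
    # Inverted dropout: zero each unit with probability `rate` and rescale
    # the survivors so the expected activation stays the same.
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def sgd_step(params, grads, learning_rate):
    # Plain SGD update: move each parameter against its gradient.
    return [p - learning_rate * g for p, g in zip(params, grads)]

rng = random.Random(42)

# Dropout on a hypothetical hidden layer: about half the units are zeroed.
hidden = [1.0] * 1000
dropped = dropout(hidden, rate=0.5, rng=rng)

# Update rule on a toy objective f(w) = sum(w_i ** 2), whose gradient is 2 * w.
w = [1.0, -2.0]
for _ in range(100):
    w = sgd_step(w, [2.0 * wi for wi in w], learning_rate=0.1)
print(w)
```

At test time dropout is switched off (`training=False`), which is why the surviving activations are rescaled during training.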
In the high recall region, this simple MLP-based prediction approach proved to be better than all previous ones. For the simulation of weak attackers (i.e., small recall, where perturbation overlaps are small), however, random forests are clearly the best choice. You can grab the new code here (you will also need the datasets from here).
This post originally appeared in the professional blog of Gábor Gulyás.