Invited speakers

Abstracts

  • Borja Balle, "Secure Multi-Party Linear Regression on High-Dimensional Data"

    The goal of secure multi-party computation (MPC) is to facilitate the evaluation of functionalities that depend on the private inputs of several distrusting parties in a privacy-preserving manner. I will start my talk by discussing potential applications of secure MPC to machine learning and the relation between MPC and other well-known privacy frameworks like differential privacy. Then I will discuss our recent work on secure MPC protocols for linear regression on distributed databases. By combining several tools from the MPC literature (garbled circuits, oblivious transfers, correlated randomness), we obtain scalable solutions that can solve problems with millions of records and hundreds of features in a matter of minutes. Some crucial implementation details will be discussed, including the role of fixed-point arithmetic and a robust conjugate gradient descent solver for private linear systems. An implementation of our protocols based on the Obliv-C framework is available as open source.
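
    As background to the fixed-point arithmetic mentioned above, here is a minimal plaintext sketch of how reals are typically encoded into a finite ring for MPC, with a truncation step after each product. The modulus and precision below are illustrative assumptions, not the parameters of the actual protocol; in a real protocol the ring elements would be secret-shared and truncation would itself require a dedicated sub-protocol.

    ```python
    # Illustrative fixed-point encoding over a ring, as commonly used in MPC.
    # MOD and FRAC are assumed values, not those of the talk's implementation.

    MOD = 2 ** 64   # ring size
    FRAC = 16       # number of fractional bits

    def encode(x: float) -> int:
        """Map a real to a ring element with FRAC bits of fractional precision."""
        return round(x * 2 ** FRAC) % MOD

    def decode(v: int) -> float:
        """Map a ring element back to a real, treating the top half as negative."""
        if v >= MOD // 2:
            v -= MOD
        return v / 2 ** FRAC

    def fp_mul(a: int, b: int) -> int:
        """Fixed-point product: multiply in the ring, then truncate FRAC bits."""
        prod = (a * b) % MOD
        if prod >= MOD // 2:   # recover the signed value before truncating
            prod -= MOD
        return (prod >> FRAC) % MOD

    assert abs(decode(fp_mul(encode(1.5), encode(-2.25))) + 3.375) < 2 ** -FRAC
    ```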

  • Yves-Alexandre de Montjoye, "Computational Privacy: The privacy bounds of human behavior"

    We're living in an age of big data, a time when most of our movements and actions are collected and stored in real time. Large-scale data coming from mobile phones, credit cards, browsers, or the IoT dramatically increases our capacity to measure, understand, and potentially affect the behavior of individuals and collectives. The use of this data, however, raises legitimate privacy concerns. In this talk, I will first show how the mere absence of obvious identifiers such as name or phone number is often not enough to prevent re-identification. I will then discuss how, as the use of this data progresses, it will become increasingly important to consider whether sensitive information can be inferred from apparently innocuous data. Finally, I will discuss the impact of metadata on society and some of the solutions we are developing to allow metadata to be used in a privacy-conscientious way.
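
    To make the re-identification point concrete, the sketch below estimates the "unicity" of toy mobility traces: how often a handful of spatiotemporal points from one trace single it out in the whole dataset (in the speaker's published work, a few such points sufficed to uniquely identify most individuals). The trace format and sampling here are illustrative assumptions, not real mobility data.

    ```python
    import random

    # Toy unicity estimate: each trace is a set of (cell_id, hour) points.
    # Real mobility data is far richer; this only illustrates the mechanism.

    def unicity(traces, p, trials=1000):
        """Fraction of sampled traces pinned down by p of their own points."""
        hits = 0
        for _ in range(trials):
            owner = random.choice(traces)
            points = set(random.sample(list(owner), min(p, len(owner))))
            # the owner is re-identified if no other trace contains these points
            hits += sum(1 for t in traces if points <= t) == 1
        return hits / trials

    traces = [{(random.randrange(100), random.randrange(24)) for _ in range(40)}
              for _ in range(500)]
    print(unicity(traces, p=4))
    ```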

  • Chris Skinner, "Differential Privacy from the Perspective of Statistical Disclosure Control"

    Differential privacy has recently attracted considerable attention in the computer science literature. It provides a means of measuring confidentiality protection in a mathematically rigorous framework under 'worst case' assumptions. This talk will introduce the idea in the context of statistical disclosure control and illustrate its potential application in the dissemination of frequency tables from a census.
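
    A minimal sketch of how differential privacy could be applied to a frequency table: since adding or removing one respondent changes a single cell count by one, the table has sensitivity 1, and adding Laplace(1/epsilon) noise to each cell gives epsilon-differential privacy. The table values and epsilon below are illustrative, and an agency would typically post-process the noisy counts (rounding, clamping at zero), which does not weaken the guarantee.

    ```python
    import numpy as np

    # Laplace mechanism for a frequency table (toy values, epsilon assumed).

    def laplace_release(counts, epsilon):
        """Add Laplace(1/epsilon) noise to each cell.

        One person contributes to exactly one cell, so the table's
        sensitivity is 1 and this release is epsilon-DP.
        """
        counts = np.asarray(counts, dtype=float)
        noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)
        return counts + noise

    table = np.array([[120, 35], [54, 9]])   # toy census frequency table
    print(laplace_release(table, epsilon=1.0))
    ```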

  • Morten Dahl, "What privacy has meant for Snips"

    Privacy has been a guiding component at Snips from its very beginning, motivated partly by a core belief but also by a business rationale. In this talk we will outline some of the arguments that have led to our decisions and illustrate some of the challenges we have faced. We will further discuss a few concrete techniques by which we have aimed to overcome these challenges, tailored to the fact that Snips is primarily focused on mobile and IoT.

  • Pedro Esperança, "Encrypted regression analysis: acceleration and Bayesian methods"

    We discuss two methods for regression analysis under privacy constraints.

    In the first, we analyse coordinate and accelerated gradient descent algorithms capable of fitting least squares and penalised ridge regression models on data encrypted under an FHE scheme, and show that the characteristics of encrypted computation favour a non-standard acceleration technique. We give details on several computational aspects, as well as theoretical bounds that help select the cryptographic parameters essential for correct decryption.
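
    For intuition, here is a plaintext sketch of standard Nesterov-accelerated gradient descent for ridge regression. An FHE version would run the same fixed number of iterations on ciphertexts; the talk's non-standard acceleration scheme is not reproduced here, and the iteration count and prior scaling are illustrative.

    ```python
    import numpy as np

    # Plaintext accelerated gradient descent for
    # minimising 0.5*||y - X b||^2 + 0.5*lam*||b||^2.

    def ridge_agd(X, y, lam, iters=100):
        d = X.shape[1]
        L = np.linalg.norm(X, 2) ** 2 + lam        # Lipschitz constant of the gradient
        b = z = np.zeros(d)
        for k in range(iters):
            grad = X.T @ (X @ z - y) + lam * z     # gradient at the extrapolated point
            b_new = z - grad / L
            z = b_new + (k / (k + 3.0)) * (b_new - b)  # Nesterov momentum
            b = b_new
        return b
    ```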

    In the second, we propose a new computational framework for cooperative Bayesian regression with horizontally partitioned, distributed data encrypted under an FHE scheme. In this framework, several data-owning parties interact sequentially with a semi-trusted party to compute the global posterior mean and variances (i.e., incorporating all data) in a privacy-preserving manner using recursive Bayesian updating. Suitable approximations to the posterior mean and variance are studied, and details on computational time, memory requirements, and the selection of cryptographic parameters are provided. The opportunity to incorporate differential privacy as an additional security layer is discussed.
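
    The recursion itself is simple in the clear: with a Gaussian prior and known noise variance, each party contributes only its sufficient statistics, which is what would travel in encrypted form to the semi-trusted party. The prior and noise settings below are illustrative assumptions.

    ```python
    import numpy as np

    # Plaintext recursive Bayesian updating for linear regression with
    # horizontally partitioned data. Prior: b ~ N(0, tau2 * I); the noise
    # variance sigma2 is assumed known. All values are illustrative.

    def global_posterior(parties, sigma2=1.0, tau2=10.0):
        """Fold in each party's (X, y) in turn; return posterior mean and covariance."""
        d = parties[0][0].shape[1]
        precision = np.eye(d) / tau2              # prior precision
        rhs = np.zeros(d)
        for X, y in parties:                      # sequential pass over parties
            precision += X.T @ X / sigma2
            rhs += X.T @ y / sigma2
        cov = np.linalg.inv(precision)
        return cov @ rhs, cov
    ```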

  • Louis Aslett, "Doing machine learning blindfolded"

    The prevalence of data today presents challenges not only of computational intractability, but also of ensuring the privacy of an ever-growing amount of potentially sensitive data which is increasingly being stored with third-party 'cloud' providers. The ideal solution is to store only encrypted versions of the data, but this appears to preclude any analysis being performed without first decrypting and risking revealing the data.

    Recent advances in cryptography enable limited computational operations to be performed without first decrypting, opening up the prospect of fully encrypted data analysis without compromising security. However, the constraints associated with these cryptographic schemes mean that many traditional machine learning models cannot simply be fitted encrypted without modification. Furthermore, the substantial computational burden of encrypted calculations makes scalability an important constraint.

    Here we present statistical machine learning methods designed to learn on such fully homomorphic encrypted (FHE) data. We overview the two tailored machine learning algorithms we have recently proposed: completely random forests, involving a new cryptographic stochastic fraction estimator; and naïve Bayes, involving a semi-parametric model for the class decision boundary; and we show how they can be used to learn and predict from encrypted data. We demonstrate that these techniques perform competitively on a variety of classification data sets and provide an example fitted using a 1,152 CPU core cluster on Amazon EC2 to demonstrate the practicality of the methods.
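
    The key observation that makes completely random forests FHE-friendly is that the tree structure is drawn without looking at the data, so fitting reduces to counting, an additive operation. The toy sketch below assumes features scaled to [0, 1]; the cryptographic stochastic fraction estimator is not reproduced.

    ```python
    import random

    # Toy completely random tree: splits are chosen blindly, so "training"
    # is just counting class labels per leaf (homomorphic additions under FHE).

    def random_tree(depth, n_features):
        if depth == 0:
            return {"counts": {}}
        return {"f": random.randrange(n_features), "t": random.random(),
                "left": random_tree(depth - 1, n_features),
                "right": random_tree(depth - 1, n_features)}

    def leaf(tree, x):
        while "f" in tree:
            tree = tree["left"] if x[tree["f"]] <= tree["t"] else tree["right"]
        return tree

    def fit(tree, X, y):
        for xi, yi in zip(X, y):
            counts = leaf(tree, xi)["counts"]
            counts[yi] = counts.get(yi, 0) + 1    # encrypted increment under FHE

    def predict(forest, x):
        votes = {}
        for tree in forest:
            for label, c in leaf(tree, x)["counts"].items():
                votes[label] = votes.get(label, 0) + c
        return max(votes, key=votes.get) if votes else None
    ```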

    All our illustrations are run in an open source R package, with all cryptographic functions coded in high-performance parallelised C++ to mitigate some of the computational cost associated with performing homomorphic operations on encrypted data.

  • Emiliano De Cristofaro, "Building and Measuring Privacy-Preserving Mobility Analytics"

    Location data can be extremely useful to study commuting patterns and disruptions, as well as to predict real-time traffic volumes. At the same time, however, the fine-grained collection of user locations raises serious privacy concerns, as it can reveal sensitive information about users, such as lifestyle, political and religious inclinations, or even identities. In our paper, we study the feasibility of crowd-sourced mobility analytics over aggregate location information: users periodically report their location, using a provably secure, privacy-preserving aggregation protocol, so that the server can only recover aggregates -- i.e., how many, but not which, users are in a region at a given time. We experiment with real-world mobility datasets obtained from the Transport for London authority and the San Francisco Cabs network, and present a novel methodology based on time series modeling that is geared to forecast traffic volumes in regions of interest and to detect mobility anomalies in them. In the presence of anomalies, we also make enhanced traffic volume predictions by feeding our model with additional information from correlated regions. Finally, we discuss challenges related to possible privacy leakage from the aggregates themselves, as well as other applications of privacy-friendly analytics from aggregate statistics.
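
    One common way to realise such aggregation (possibly differing in detail from the talk's actual protocol) is pairwise masking: every pair of users derives a shared random mask that cancels in the server's sum, so only per-region counts survive. The seed_for_pair function below stands in for a pairwise shared key, e.g. from a Diffie-Hellman exchange, and is an illustrative assumption.

    ```python
    import random

    # Additive aggregation via pairwise masking: the server learns only
    # per-region totals, never which region an individual user reported.

    MOD = 2 ** 32

    def masked_reports(locations, n_regions, seed_for_pair):
        """locations[u] = region of user u; returns one masked vector per user."""
        n = len(locations)
        reports = []
        for u in range(n):
            vec = [0] * n_regions
            vec[locations[u]] = 1                      # one-hot location
            for v in range(n):
                if v == u:
                    continue
                rng = random.Random(seed_for_pair(min(u, v), max(u, v)))
                mask = [rng.randrange(MOD) for _ in range(n_regions)]
                sign = 1 if u < v else -1              # masks cancel pairwise
                vec = [(x + sign * m) % MOD for x, m in zip(vec, mask)]
            reports.append(vec)
        return reports

    def aggregate(reports):
        """Server sums masked reports; the masks cancel, leaving the counts."""
        return [sum(col) % MOD for col in zip(*reports)]

    locs = [0, 2, 0, 1]
    print(aggregate(masked_reports(locs, 3, lambda a, b: a * 7919 + b)))  # [2, 1, 1]
    ```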

  • Peter Jones, "ONS research into methods for linking pseudonymised administrative data"

    The ONS Administrative Data Census Project has been set up to research alternatives to traditional census taking after 2021. Switching to an Administrative Data Census in future will depend on linking multiple administrative data sources and surveys to a very high level of quality. There are many challenges associated with the linkage process, including: the absence of common identifiers across government datasets; the number of records that need to be linked (hundreds of millions); the quality and timeliness of administrative records; and requirements to ensure the privacy of personal information.

    So far, ONS have concentrated on a pseudonymisation approach which 'hashes' sensitive information relating to name, date of birth and address prior to the linkage process. This preserves the privacy of individuals, but impedes the application of scientifically proven linkage methods based on probabilistic and clerical matching. In this paper we present methods developed at ONS that were designed specifically to optimise linkage between pseudonymised records. These include:

    • the use of deterministic match-keys to resolve minor discrepancies between administrative records (sketched in code after this list)
    • the construction of 'similarity tables' during pre-processing to derive string comparison scores between match fields
    • the use of training data to fit logistic regression models to determine match status between candidate pairs
    • automated threshold setting within the Fellegi-Sunter probabilistic linkage framework
    • associative matching, where links are identified by collective resolution within a household
    • early developmental work using graph databases and machine learning
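
    As a concrete illustration of the first bullet, the sketch below derives several salted hashes per record so that true matches with minor recording discrepancies still collide on at least one key. The field choices, salt, and discrepancy rules are hypothetical, not ONS's actual specification.

    ```python
    import hashlib

    # Hypothetical hashed match-keys tolerant of minor recording differences.
    # The salt would be agreed between data owners; rules are illustrative.

    SALT = b"shared-secret-salt"

    def h(*fields):
        return hashlib.sha256(SALT + b"|".join(f.encode() for f in fields)).hexdigest()

    def match_keys(name, dob, postcode):
        """Several keys per record so near-matches still share one exactly."""
        initial_surname = name[0] + name.split()[-1]
        return {h(name, dob, postcode),            # full agreement
                h(name, dob),                      # postcode moved or missing
                h(initial_surname, dob, postcode), # forename recorded as an initial
                h(name, postcode)}                 # date of birth missing

    def link(records_a, records_b):
        """Pair up record ids that share at least one match-key."""
        index = {}
        for rid, rec in records_a.items():
            for k in match_keys(*rec):
                index.setdefault(k, set()).add(rid)
        return {(a, b)
                for b, rec in records_b.items()
                for k in match_keys(*rec)
                for a in index.get(k, ())}
    ```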

    We present a series of quality assurance exercises that have been undertaken to test these pseudonymisation methods against 'gold standard' links identified from clerical matching. While early results were encouraging, an alternative to an approach that relies solely on pseudonymisation is likely to be needed to reduce linkage errors to an acceptable level. ONS has an interest in exploring other ways of managing data during the linkage process (for example, Trusted Third Party matching models) and the use of pseudonymisation and access control measures to maintain privacy once records have been linked.