DAPPER Project report

The ‘DAPPER’ research project (Marie Curie Career Integration Grant 618202) has recently concluded its four-year duration. Here is the final public summary of the activity:

DAPPER: Delivering Anonymization Practically, Privately, Effectively and Reusably

Introduction. There is currently a tug-of-war going on surrounding data releases. On one side, many strong reasons pull towards releasing data to other parties: business factors, freedom of information rules, and scientific sharing agreements. On the other side, concerns about individual privacy pull back, and seek to limit releases. Privacy technologies such as differential privacy have been proposed to resolve this deadlock, and there has been much study of how to perform private release of data in various forms. The focus of such works has been largely on the data owner: what process should they apply to ensure that the released data preserves privacy whilst still capturing the input data distribution accurately. Less attention has been paid to the needs of the data user, who wants to make use of the released data within their existing suite of tools and data. The difficulty of making use of data releases is a major stumbling block for the widespread adoption of data privacy technologies.

This Marie Curie career integration project considered the whole data release process, from the data owner to the data user.  It laid out a set of principles for privacy tool design that highlight the requirements for interoperability, extensibility and scalability. The aim of the project was captured in its title: Delivering Anonymization Practically, Privately, Effectively and Reusably (DAPPER).  It produced published results under the following four themes:

  • Synthetic Private Data. New methods were developed for providing synthetic data in the form of (social) networks, based on anonymized versions of real data under the strong privacy guarantee of differential privacy [SIGMOD16].  The fellow also proposed a new privacy definition called personalized differential privacy (PDP), a generalization of differential privacy in which users specify a personal privacy requirement for their data, and introduced several novel mechanisms for achieving PDP [ICDE15].
  • Correlated Data Modelling. Many analysis and machine learning tasks require the availability of marginal statistics on multidimensional datasets while providing strong privacy guarantees for the data subjects. Applications for these statistics range from finding correlations in the data to fitting sophisticated prediction models. The fellow provided a set of algorithms for materializing marginal statistics under the strong model of local differential privacy [SIGMOD18], as well as developing PrivBayes, a differentially private method for releasing high-dimensional data [SIGMOD14,TODS17].
  • Data Utility Enhancement. The fellow worked on the core problem of count queries, and designed randomized mechanisms to release count data associated with a group of n individuals [ICDE18]. The fellow also gave new algorithms to provide statistical information about graphs based on a ‘ladder’ distribution [SIGMOD15].
  • Trajectory Data. The fellow presented DPT, a system to synthesize mobility data based on raw GPS trajectories of individuals while ensuring strong privacy protection in the form of ε-differential privacy [VLDB15].

The work of the fellow has had real impact on the state of the art: it is used within the private data collection software developed by Apple and deployed to millions of iOS and macOS users around the world.  The fellow is now a Professor at the University of Warwick in the UK, and leads a group of researchers working on topics related to privacy and data analysis.

Website and Contact Details.  Activities and news about the project are posted by the fellow at the website https://gcormode.wordpress.com/.


Privacy Work at SIGMOD

A paper on marginal release under the model of local differential privacy has been accepted for presentation at the SIGMOD conference.  A related tutorial will also be presented on the model of Local Differential Privacy (LDP).  Details and links are as follows:

Marginal release under local differential privacy (Cormode, Kulkarni, Srivastava)

Many analysis and machine learning tasks require the availability of marginal statistics on multidimensional datasets while providing strong privacy guarantees for the data subjects. Applications for these statistics range from finding correlations in the data to fitting sophisticated prediction models. In this paper, we provide a set of algorithms for materializing marginal statistics under the strong model of local differential privacy. We prove the first tight theoretical bounds on the accuracy of marginals compiled under each approach, perform empirical evaluation to confirm these bounds, and evaluate them for tasks such as modeling and correlation testing. Our results show that releasing information based on (local) Fourier transformations of the input is preferable to alternatives based directly on (local) marginals.
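To give a flavour of why Fourier (Hadamard) coefficients are a natural representation for marginals, here is a minimal Python sketch. It is not code from the paper and omits the privacy noise entirely; it only illustrates that the marginal over a set of binary attributes is determined by the few Fourier coefficients indexed by subsets of that set. The toy dataset, parameters and helper names are illustrative.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)

# Toy dataset: n users, d binary attributes, with attributes 0 and 1 correlated.
n, d = 5_000, 3
data = rng.integers(0, 2, size=(n, d))
data[:, 1] = (data[:, 0] + (rng.random(n) < 0.25)) % 2   # attribute 1 copies attribute 0, flipped 25% of the time

def fourier_coefficient(data, subset):
    """Estimate F(S) = E[(-1)^(sum of the attributes in S)] from the data."""
    signs = (-1) ** data[:, list(subset)].sum(axis=1)
    return signs.mean()

def marginal_from_fourier(data, attrs):
    """Rebuild the marginal on `attrs` using only coefficients F(S) for S inside attrs."""
    k = len(attrs)
    marginal = {}
    for y in itertools.product([0, 1], repeat=k):
        total = 0.0
        for r in range(k + 1):
            for S in itertools.combinations(range(k), r):
                coeff = fourier_coefficient(data, [attrs[i] for i in S])
                total += coeff * (-1) ** sum(y[i] for i in S)
        marginal[y] = total / (2 ** k)
    return marginal

print(marginal_from_fourier(data, [0, 1]))
# Agrees with the empirical two-way marginal over attributes 0 and 1.
```

Roughly speaking, in the local setting each user's contribution to these coefficients is perturbed before aggregation, which is where the privacy guarantee comes in.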

Privacy at scale: Local differential privacy in practice (G. Cormode, S. Jha, T. Kulkarni, N. Li, D. Srivastava, and T. Wang). Tutorial at SIGMOD 2018 and KDD 2018

Local differential privacy (LDP), where users randomly perturb their inputs to provide plausible deniability of their data without the need for a trusted party, has been adopted recently by several major technology organizations, including Google, Apple and Microsoft. This tutorial aims to introduce the key technical underpinnings of these deployed systems, to survey current research that addresses related problems within the LDP model, and to identify relevant open problems and research directions for the community.
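As a concrete (if simplified) illustration of the LDP model described above, the classic randomized response mechanism lets each user perturb a single bit locally, so that the aggregator can estimate population statistics without learning any individual's value reliably. This is a minimal sketch, not code from any of the deployed systems; the choice of ε and the function names are illustrative.

```python
import math
import random

def randomized_response(bit, epsilon):
    """Report the true bit with probability e^eps / (e^eps + 1), otherwise flip it.
    This gives eps-local differential privacy for a single binary value."""
    p_true = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return bit if random.random() < p_true else not bit

def estimate_frequency(reports, epsilon):
    """Debias the noisy reports to estimate the true fraction of 1s."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    observed = sum(reports) / len(reports)
    # The expected observed rate is p*f + (1-p)*(1-f); solve for f.
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)

# Example: 10,000 users, 30% hold the sensitive attribute, eps = 1.0
random.seed(0)
true_bits = [random.random() < 0.3 for _ in range(10_000)]
noisy = [randomized_response(b, 1.0) for b in true_bits]
print(round(estimate_frequency(noisy, 1.0), 3))   # close to 0.30
```

The deployed systems use more elaborate encodings to handle large domains and repeated collection, but the perturb-then-debias pattern is the same.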

Postdoc position in Algorithms Research (closes February 14th 2018)

We are seeking to recruit a postdoctoral research fellow to work in the area of designing algorithms for analysing large data sets.

You will be expected to perform high quality research under the supervision of Professor Graham Cormode, as part of the ERC funded project ‘Small Summaries for Big Data’. This can encompass streaming algorithms, sketching and dimensionality reduction, distributed monitoring and mergeable summaries, verification of outsourced computation, or other related topics. The expectation is that you will produce breakthrough research results in the summarisation of large volumes of data, and publish these results in top rated venues.

You will possess a PhD or an equivalent qualification in Computer Science or a very closely-related discipline (or you will shortly be obtaining it). You should have a strong background in one or more of the following areas: randomized and approximation algorithms; communication complexity and lower bounds; streaming or sublinear algorithms.

The post is based in the Department of Computer Science at the University of Warwick, but collaborations with closely related research organisations such as the Centre for Discrete Mathematics and its Applications (DIMAP), the Warwick Institute for the Science of Cities (WISC), and the newly formed Alan Turing Institute (ATI) will be strongly encouraged. You will join a team of researchers led by Professor Cormode, including other postdoctoral researchers and PhD students.

Candidates should provide with their application form a CV, a list of publications and a research statement.

Closing date: 14th February 2018

More information: https://atsv7.wcn.co.uk/search_engine/jobs.cgi?owner=5062452&ownertype=fair&jcode=1710356&vt_template=1457&adminview=1

Conference and Journal publications

A number of conference and journal papers have been accepted for publication over the winter break.  These include:

G. Cormode and J. Dark. Fast sketch-based recovery of correlation outliers. In International Conference on Database Theory, 2018.

Many data sources can be interpreted as time-series, and a key problem is to identify which pairs out of a large collection of signals are highly correlated. We expect that there will be few, large, interesting correlations, while most signal pairs do not have any strong correlation. We abstract this as the problem of identifying the highly correlated pairs in a collection of n mostly pairwise uncorrelated random variables, where observations of the variables arrive as a stream. Dimensionality reduction can remove dependence on the number of observations, but further techniques are required to tame the quadratic (in n) cost of a search through all possible pairs. We develop a new algorithm for rapidly finding large correlations based on sketch techniques with an added twist: we quickly generate sketches of random combinations of signals, and use these in concert with ideas from coding theory to decode the identity of correlated pairs. We prove correctness and compare performance and effectiveness with the best LSH (locality sensitive hashing) based approach.
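The decoding machinery in the paper is more involved, but the basic reason sketches help with correlation search can be shown in a few lines: random projections approximately preserve inner products, and the correlation of two standardized signals is exactly their inner product. The sketch below is a hedged illustration with made-up parameters, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def standardize(x):
    """Zero-mean, unit-norm version of a signal, so dot products equal correlations."""
    x = x - x.mean()
    return x / np.linalg.norm(x)

def sketch(x, proj):
    """Project a long signal down to a short sketch; inner products are roughly preserved."""
    return proj @ x

n_obs, k = 20_000, 256                        # observations per signal, sketch size
proj = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, n_obs))

a = rng.normal(size=n_obs)
b = 0.8 * a + 0.6 * rng.normal(size=n_obs)    # correlation with a is about 0.8
c = rng.normal(size=n_obs)                    # essentially uncorrelated with a

sa, sb, sc = (sketch(standardize(x), proj) for x in (a, b, c))
print(round(float(sa @ sb), 2))               # roughly 0.8: the interesting pair stands out
print(round(float(sa @ sc), 2))               # close to 0: a typical noise pair
```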

G. Cormode and C. Hickey. Cheap checking for cloud computing: Statistical analysis via annotated data streams. In AISTATS, 2018.

As the popularity of outsourced computation increases, questions of accuracy and trust between the client and the cloud computing services become ever more relevant. Our work aims to provide fast and practical methods to verify analysis of large data sets, where the client’s computation and memory costs are kept to a minimum. Our verification protocols are based on defining “proofs” which are easy to create and check. These add only a small overhead to reporting the result of the computation itself. We build up a series of protocols for elementary statistical methods, to create more complex protocols for Ordinary Least Squares, Principal Component Analysis and Linear Discriminant Analysis. We show that these are very efficient in practice.
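The protocols in the paper are built on annotated data streams; as a loosely related illustration of how cheap randomized checking can beat recomputation, here is Freivalds' classic test for verifying an outsourced matrix product. This is not the paper's method, just a sketch of the general flavour; sizes and tolerances below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def freivalds_check(A, B, C, trials=20):
    """Probabilistically verify the claim C == A @ B without recomputing the product.
    Each trial costs O(n^2) (three matrix-vector products) instead of O(n^3),
    and a wrong product is caught in each trial with probability at least 1/2."""
    n = C.shape[1]
    for _ in range(trials):
        r = rng.integers(0, 2, size=n)          # random 0/1 challenge vector
        if not np.allclose(A @ (B @ r), C @ r):
            return False
    return True

n = 300
A, B = rng.normal(size=(n, n)), rng.normal(size=(n, n))
C = A @ B
print(freivalds_check(A, B, C))    # True: the claimed product is consistent
C[0, 0] += 1.0                     # simulate a corrupted answer from the cloud
print(freivalds_check(A, B, C))    # almost surely False
```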

G. Cormode, T. Kulkarni, and D. Srivastava. Constrained differential privacy for count data. In International Conference on Data Engineering (ICDE), 2018.

Concern about how to aggregate sensitive user data without compromising individual privacy is a major barrier to greater availability of data. The model of differential privacy has emerged as an accepted model to release sensitive information while giving a statistical guarantee for privacy. Many different algorithms are possible to address different target functions. We focus on the core problem of count queries, and seek to design mechanisms to release data associated with a group of n individuals. Prior work has focused on designing mechanisms by raw optimization of a loss function, without regard to the consequences on the results. This can lead to mechanisms with undesirable properties, such as never reporting some outputs (gaps), and overreporting others (spikes). We tame these pathological behaviors by introducing a set of desirable properties that mechanisms can obey. Any combination of these can be satisfied by solving a linear program (LP) which minimizes a cost function, with constraints enforcing the properties. We focus on a particular cost function, and provide explicit constructions that are optimal for certain combinations of properties, and show a closed form for their cost. In the end, there are only a handful of distinct optimal mechanisms to choose between: one is the well-known (truncated) geometric mechanism; the second is a novel mechanism that we introduce here; and the remainder are found as the solution to particular LPs. These all avoid the bad behaviors we identify. We demonstrate in a set of experiments on real and synthetic data which mechanism is preferable in practice, for different combinations of data distributions, constraints, and privacy parameters.
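For readers who have not met it, here is a minimal sketch of the (truncated) geometric mechanism mentioned above: add two-sided geometric noise with parameter α = e^(−ε) to the true count, then clamp the result to the valid range [0, n]. The sampling routine and parameter choices below are illustrative, not taken from the paper.

```python
import math
import random

def two_sided_geometric(alpha):
    """Sample integer noise with P(Z = z) proportional to alpha^|z|, for 0 < alpha < 1.
    The difference of two i.i.d. geometric variables has exactly this distribution."""
    p = 1.0 - alpha
    def geometric():
        return int(math.floor(math.log(1.0 - random.random()) / math.log(1.0 - p)))
    return geometric() - geometric()

def truncated_geometric_mechanism(true_count, n, epsilon):
    """Release a noisy count in [0, n]: add two-sided geometric noise with
    alpha = e^(-epsilon), then clamp (truncate) the result to the valid range."""
    alpha = math.exp(-epsilon)
    noisy = true_count + two_sided_geometric(alpha)
    return min(max(noisy, 0), n)

random.seed(42)
print([truncated_geometric_mechanism(37, 100, 1.0) for _ in range(5)])   # noisy counts near 37
```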

G. Cormode, A. Dasgupta, A. Goyal, and C. H. Lee. An evaluation of multi-probe locality sensitive hashing for computing similarities over web-scale query logs. PLOS ONE, 2018.

Many modern applications of AI such as web search, mobile browsing, image processing, and natural language processing rely on finding similar items from a large database of complex objects. Due to the very large scale of data involved (e.g., users’ queries from commercial search engines), computing such near or nearest neighbors is a non-trivial task, as the computational cost grows significantly with the number of items. To address this challenge, we adopt Locality Sensitive Hashing (a.k.a. LSH) methods and evaluate four variants in a distributed computing environment (specifically, Hadoop). We identify several optimizations which improve performance, suitable for deployment in very large scale settings. The experimental results demonstrate that our variants of LSH achieve robust performance with better recall compared with “vanilla” LSH, even when using the same amount of space.
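As a hedged illustration of the basic LSH idea (not the multi-probe variants evaluated in the paper), the random-hyperplane scheme for cosine similarity hashes each vector to a short bit signature; similar vectors agree on most bits and therefore tend to collide in hash buckets. Dimensions and signature length below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)

def lsh_signature(vec, hyperplanes):
    """Hash a vector to a bit signature, one bit per random hyperplane (SimHash).
    Vectors with a small angle between them agree on most bits."""
    return tuple(bool(b) for b in (hyperplanes @ vec) >= 0)

dim, n_bits = 100, 16
hyperplanes = rng.normal(size=(n_bits, dim))

query = rng.normal(size=dim)
near = query + 0.1 * rng.normal(size=dim)   # a close neighbour of the query
far = rng.normal(size=dim)                  # an unrelated vector

sig_q, sig_near, sig_far = (lsh_signature(v, hyperplanes) for v in (query, near, far))
print(sum(a == b for a, b in zip(sig_q, sig_near)))   # most of the 16 bits agree
print(sum(a == b for a, b in zip(sig_q, sig_far)))    # around 8 of 16 bits agree
```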

The three conference presentations will take place over the coming months.

Postdoc position in algorithms (closing March 31st 2016)

We are seeking to recruit a postdoctoral research fellow to work in the area of designing algorithms for analysing large data sets.

You will be expected to perform high quality research under the supervision of Professor Graham Cormode, as part of the ERC funded project ‘Small Summaries for Big Data’. This can encompass streaming algorithms, sketching and dimensionality reduction, distributed monitoring and mergeable summaries, verification of outsourced computation, or other related topics. The expectation is that you will produce breakthrough research results in the summarisation of large volumes of data, and publish these results in top rated venues.

You will possess a PhD or an equivalent qualification in Computer Science or a very closely-related discipline (or you will shortly be obtaining it). You should have a strong background in one or more of the following areas: randomized and approximation algorithms; communication complexity and lower bounds; streaming or sublinear algorithms.

The post is based in the Department of Computer Science at the University of Warwick, but collaborations with closely related research organisations such as the Centre for Discrete Mathematics and its Applications (DIMAP), the Warwick Institute for the Science of Cities (WISC), and the newly formed Alan Turing Institute (ATI) will be strongly encouraged. You will join a team of researchers led by Professor Cormode, including other postdoctoral researchers and PhD students.

Candidates should provide with their application form a CV, a list of publications and a research statement.

Closing date: 31st March 2016

More information: http://www.jobs.ac.uk/job/AUC577/research-fellow-77780-026/

PhD positions in Data Summarization available

A funded studentship is available in the area of computer science to work on algorithms and data structures for summarization of massive data sets. Funding is provided through the prestigious ERC program, under project 647557, “Small Summaries for Big Data”.

Project Overview:
A fundamental challenge in processing the massive quantities of information generated by modern applications is in extracting suitable representations of the data that can be stored, manipulated and interrogated on a single machine. A promising approach is in the design and analysis of compact summaries: data structures which capture key features of the data, and which can be created effectively over distributed data sets. Popular summary structures include the Bloom filter, which compactly represents a set of items, and sketches which allow vector norms and products to be estimated.
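As a minimal illustration (not project code), a basic Bloom filter takes only a few lines; the bit-array size, number of hashes and hash construction below are illustrative choices.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array plus k hash functions.
    Membership tests may return false positives, but never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k positions from a salted cryptographic hash (an illustrative choice).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice")
bf.add("bob")
print("alice" in bf, "carol" in bf)   # True, (almost certainly) False
```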
Such structures are very attractive, since they can be computed in parallel and combined to yield a single, compact summary of the data. Yet the potential of summaries is far from being fully realized. Professor Cormode is recruiting a team to work on important problems around creating Small Summaries for Big Data. The goal is to substantially advance the state of the art in data summarization, to the point where accurate and effective summaries are available for a wide array of problems, and can be used seamlessly in applications that process big data. PhD students can work on a variety of topics related to the project, including:
• The design and evaluation of new summaries for fundamental computations such as large matrix computations
• Summary techniques for complex structures such as massive graphs
• Summaries that allow the verification of outsourced computation over big data.
• Application of summaries in the context of monitoring distributed, evolving streams of data
The expectation is that this will lead to novel results in the summarization of large volumes of data, which will be published in top-rated venues.
You will possess a degree in Computer Science, Mathematics or a very closely related discipline (or you will shortly be obtaining it). You should have good knowledge of one or more of the following areas: algorithm design and analysis; randomized and approximation algorithms; communication complexity and lower bounds; streaming or sublinear algorithms. The post is based in the Department of Computer Science at the University of Warwick, but collaborations with closely related research organizations such as the Centre for Discrete Mathematics and its Applications (DIMAP), the Warwick Institute for the Science of Cities (WISC), and the newly formed Alan Turing Institute (ATI) will be strongly encouraged.
For examples of relevant research and related topics, please consult Prof. Cormode’s web pages at http://www2.warwick.ac.uk/fac/sci/dcs/people/Graham_Cormode
Eligibility:
Candidates should hold a degree in Computer Science, Mathematics or closely related discipline, or expect to complete one before the commencement of the studentship. The degree should show a high level of achievement (1st or 2.1 level).
Funding level:
Funding is available to support stipend and fees at the UK/EU level for 4 years (this does not cover fees for non-EU students, see http://www2.warwick.ac.uk/study/postgraduate/funding/fees/ for more information).
Application details:
Please send a CV giving details of your education and achievements to date, including details of performance in relevant university-level subjects (such as Algorithms, Data Structures, Complexity, Mathematical Analysis of Algorithms, Linear Algebra, and so on). Please also include a covering note explaining how your background and interests make you well suited to the aims of the project.
Applications will be reviewed as they are received, with an initial deadline of November 30th 2015, and a final deadline of 31st March 2016.
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 647557).


Hossein Jowhari joins as postdoc

Dr Hossein Jowhari has joined the ERC project “Small Summaries for Big Data”, led by Professor Graham Cormode, as a postdoctoral research assistant. Dr Jowhari has made foundational contributions to the area of data summarization, most notably his work on Lp sampling.  Prior to joining Warwick, Hossein completed his PhD at Simon Fraser University, and spent time as a researcher at the MADALGO centre.