ianfoster/safedata

Introduction

Access to new types of data has revolutionized much of science. That revolution, however, has not yet fully reached the scientific study of human beings and their interactions, where progress has been hindered by legal, technical, and operational obstacles to sharing and accessing sensitive data about individuals. New paradigms and platforms are required to enable sensitive data from different sources to be discovered, integrated, and analyzed in an appropriately controlled manner, while also allowing researchers to share analysis methods, results, and expertise in ways not easily possible today. Success in this endeavor may enable a fundamental change in how data on human beings from governments, statistical agencies, research institutions, and other organizations are made available for research, and thus a flowering of new methods for studying human subjects.

Our goal in this class is to study the technical problems raised by sensitive data and the technical solutions that have been developed for working with this data. We will read and discuss scientific papers in the area and hear from distinguished invited speakers with experience in the development and application of those solutions. We’ll investigate not just the technologies but also the practical applications that make secure access to sensitive data so important: applications such as new scientific approaches to understanding human beings and human societies, and evidence-based policy for healthcare, poverty, and crime. Topics to be discussed include:

  • Opportunities for social science, evidence-based policy, program evaluation, and healthcare
  • Building and operating secure data enclaves
  • Secure multi-party computation
  • Statistical disclosure methods such as differential privacy
  • Privacy-preserving data mining
  • Search and discovery with sensitive data
  • Data de-identification, linkage, and re-identification
  • Deployment scenarios in cities, state governments, and the federal government
  • Regulatory challenges and regimes
  • Methods for sharing code and reproducible research
  • The Globus safe data platform

Class organization

The class is held Tue/Thu 9:00-10:20am in Ryerson 277. It runs from March 28 to June 1. The instructor is Ian Foster ([email protected]), whose office is in Searle 222. Please feel free to email anytime with questions or to set up a meeting.

Along with this website, we'll use Piazza for course announcements, submitting paper reviews, posting lecture slides, and general discussion and questions about course material.

Grading

  • Paper reviews — 25%
  • Paper presentations — 20%
  • Participation — 5%
  • Course project — 50%

Separate pages provide guidance for paper reviews, presentations, and class projects.

Schedule (subject to change)

Date Content
3/28 Introduction. Defining the space. Slides.
3/30 Technological challenges and opportunities. Required: Privacy and security with big data, Simson Garfinkel, 2017; Privacy protecting research: Challenges and opportunities, Daniel Goroff and Jules Polonetsky, 2017.
4/4 Guest lecture: Matt Gee, Harris School. Read this blog post and this paper on Enigma.
4/6 Guest lecture: Charlie Catlett, Argonne National Laboratory. The Array of Things and opportunities and challenges in urban data. Please read and comment on this paper on AoT.
4/11 Safe data enclaves and related topics. Required: Five safes: designing data access for research, Tanvi Desai, Felix Ritchie, Richard Welpton, 2016. (Notes); Research infrastructures for the safe analysis of sensitive data, Ian Foster, 2017; Recommended: NORC Data Enclave: Providing Secure Remote Access to Sensitive Microdata, Julia Lane et al., 2009; Data Access in a Cyber World: Making Use of Cyberinfrastructure, Julia Lane et al., 2008.
4/13 Safe data enclaves, contd. Required: Advancing Integrated Data Systems by States and Local Governments, Dennis Culhane et al., 2017; Cloud Kotta: Enabling Secure and Scalable Data Analytics in the Cloud, Yadu Babuji et al., 2016.
4/18 Guest lecture (remote): Julia Lane, New York University. Big data for public policy: The quadruple helix. Background reading: P1, P2, P3, P4, P5.
4/20 Homomorphic encryption. Three papers: Technical Perspective: A First Glimpse of Cryptography's Holy Grail, Daniele Micciancio, 2010; Computing Arbitrary Functions of Encrypted Data, Craig Gentry, 2010; What is Homomorphic Encryption, and Why Should I Care?, Craig Stuntz, 2010.
4/25 More on homomorphic encryption and related techniques. CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and Accuracy, Dowlin et al., 2016; CryptDB: Protecting Confidentiality with Encrypted Query Processing, Raluca Ada Popa et al., 2011.
4/27 Differential privacy. Privacy by the Numbers: A New Approach to Safeguarding Data, Erica Klarreich, 2012; A firm foundation for private data analysis, Cynthia Dwork, 2011; Privacy-Preserving Data Analysis for the Federal Statistical Agencies, John Abowd et al., 2017; The algorithmic foundations of differential privacy, Cynthia Dwork and Aaron Roth, 2014.
5/2 Synthetic data and statistical disclosure limitation. How Protective Are Synthetic Data?, Abowd and Vilhuber, 2008; Statistical Disclosure Control for Survey Data, Chris Skinner, 2009.
5/4 Guest lecture: Simson Garfinkel, US Census Bureau. Technical challenges in disclosure control.
5/9 Differential privacy. On Significance of the Least Significant Bits For Differential Privacy, Ilya Mironov, 2012. Verifiable Differential Privacy, Arjun Narayan et al., 2015.
5/11 Multiparty computation and computing on masked data. Multiparty Computation Goes Live, Peter Bogetoft et al., 2008. Computing on Masked Data to improve the Security of Big Data, Vijay Gadepally et al., 2015.
5/16 Project reviews
5/18 Guest lecture: Brett Goldstein, University of Chicago. Responsible data mining.
5/23 Guest lecture: Bruce Meyer, University of Chicago.
5/25 Malaria paper; Communication-Efficient Learning of Deep Networks from Decentralized Data, McMahan et al., 2016; see also blog post, 2017.
5/30 Project presentations for those graduating this quarter
6/1 Reading period
6/6 No class (Ian at conference)
6/8 Project presentations

Papers to be discussed (a work in progress)

Overview

Safe data enclaves

Anonymization and de-identification

Reidentification risks

Statistical disclosure control (Notes)

Secure multi-party computation

Computing on masked data

Residual information in documents

Responsible data

Other
