Access to new types of data has revolutionized much of science. However, that revolution has not yet fully reached the scientific study of human beings and their interactions, where progress has been hindered by the legal, technical, and operational obstacles to sharing and accessing sensitive data about individuals. New paradigms and platforms are required to enable sensitive data from different sources to be discovered, integrated, and analyzed in an appropriately controlled manner, while also allowing researchers to share analysis methods, results, and expertise in ways not easily possible today. Success in this endeavor may enable a fundamental change in how data on human beings from governments, statistical agencies, research institutions, and other organizations are made available for research, and thus a flowering of new methods for studying human subjects.
Our goal in this class is to study the technical problems raised by sensitive data and the technical solutions that have been developed for working with this data. We will read and discuss scientific papers in the area and hear from distinguished invited speakers with experience in the development and application of those solutions. We’ll investigate not just the technologies but also the practical applications that make secure access to sensitive data so important: applications such as new scientific approaches to understanding human beings and human societies, and evidence-based policy for healthcare, poverty, and crime. Topics to be discussed include:
- Opportunities for social science, evidence-based policy, program evaluation, and healthcare
- Building and operating secure data enclaves
- Secure multi-party computation
- Statistical disclosure limitation methods such as differential privacy
- Privacy-preserving data mining
- Search and discovery with sensitive data
- Data de-identification, linkage, and re-identification
- Deployment scenarios in cities, state governments, and the federal government
- Regulatory challenges and regimes
- Methods for sharing code and reproducible research
- The Globus safe data platform
The class is held Tue/Thu 9:00-10:20am in Ryerson 277. It runs from March 28 to June 1. The instructor is Ian Foster, whose office is in Searle 222. Please feel free to email anytime with questions or to set up a meeting.
Along with this website, we'll use Piazza for course announcements, submitting paper reviews, posting lecture slides, and general discussion and questions about course material.
Grading
- Paper reviews — 25%
- Paper presentations — 20%
- Participation — 5%
- Course project — 50%
Separate pages provide guidance for paper reviews, presentations, and class projects.
Date | Content |
---|---|
3/28 | Introduction. Defining the space. Slides. |
3/30 | Technological challenges and opportunities. Required: Privacy and security with big data, Simson Garfinkel, 2017; Privacy protecting research: Challenges and opportunities, Daniel Goroff and Jules Polonetsky, 2017. |
4/4 | Guest lecture: Matt Gee, Harris School. Read this blog post and this paper on Enigma. |
4/6 | Guest lecture: Charlie Catlett, Argonne National Laboratory. The Array of Things and opportunities and challenges in urban data. Please read and comment on this paper on AoT. |
4/11 | Safe data enclaves and related topics. Required: Five safes: designing data access for research, Tanvi Desai, Felix Ritchie, Richard Welpton, 2016. (Notes); Research infrastructures for the safe analysis of sensitive data, Ian Foster, 2017. Recommended: NORC Data Enclave: Providing Secure Remote Access to Sensitive Microdata, Julia Lane et al., 2009; Data Access in a Cyber World: Making Use of Cyberinfrastructure, Julia Lane et al., 2008. |
4/13 | Safe data enclaves, contd. Required: Advancing Integrated Data Systems by States and Local Governments, Dennis Culhane et al., 2017; Cloud Kotta: Enabling Secure and Scalable Data Analytics in the Cloud, Yadu Babuji et al., 2016. |
4/18 | Guest lecture (remote): Julia Lane, New York University. Big data for public policy: The quadruple helix. Background reading: P1, P2, P3, P4, P5. |
4/20 | Homomorphic encryption. Three papers: Technical Perspective: A First Glimpse of Cryptography's Holy Grail, Daniele Micciancio, 2010; Computing Arbitrary Functions of Encrypted Data, Craig Gentry, 2010; What is Homomorphic Encryption, and Why Should I Care?, Craig Stuntz, 2010. |
4/25 | More on homomorphic encryption and related topics. CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and Accuracy, Dowlin et al., 2016; CryptDB: Protecting Confidentiality with Encrypted Query Processing, Raluca Ada Popa et al., 2011. |
4/27 | Differential privacy. Privacy by the Numbers: A New Approach to Safeguarding Data, Erica Klarreich, 2012; A firm foundation for private data analysis, Cynthia Dwork, 2011; Privacy-Preserving Data Analysis for the Federal Statistical Agencies, John Abowd et al., 2017; The algorithmic foundations of differential privacy, Cynthia Dwork and Aaron Roth, 2014. |
5/2 | Synthetic data and statistical disclosure limitation. How Protective Are Synthetic Data?, Abowd and Vilhuber, 2008; Statistical Disclosure Control for Survey Data, Chris Skinner, 2009. |
5/4 | Guest lecture: Simson Garfinkel, US Census Bureau. Technical challenges in disclosure control. |
5/9 | Differential privacy, contd. On Significance of the Least Significant Bits For Differential Privacy, Ilya Mironov, 2012; Verifiable Differential Privacy, Arjun Narayan et al., 2015. |
5/11 | Multiparty computation and computing on masked data. Multiparty Computation Goes Live, Peter Bogetoft et al., 2008; Computing on Masked Data to improve the Security of Big Data, Vijay Gadepally et al., 2015. |
5/16 | Project reviews |
5/18 | Guest lecture: Brett Goldstein, University of Chicago. Responsible data mining. |
5/23 | Guest lecture: Bruce Meyer, University of Chicago. |
5/25 | Malaria paper; Communication-Efficient Learning of Deep Networks from Decentralized Data, McMahan et al., 2016 (see also the blog post, 2017). |
5/30 | Project presentations for those graduating this quarter |
6/1 | Reading period |
6/6 | No class (Ian at conference) |
6/8 | Project presentations |
- New approaches to confidentiality protection, John Abowd, Julia Lane, 2004.
- Privacy and confidentiality in the use of administrative and survey data, 2016. (Report by OMB.)
- NORC Data Enclave: Providing Secure Remote Access to Sensitive Microdata, Julia Lane et al., 2009. (Slides)
- Data Access in a Cyber World: Making Use of Cyberinfrastructure, Julia Lane et al., 2008.
- How To Break Anonymity of the Netflix Prize Dataset, Arvind Narayanan and Vitaly Shmatikov, 2006.
- A systematic review of re-identification attacks on health data, Khaled El Emam et al., 2011.
- Unique in the Crowd: The privacy bounds of human mobility, Yves-Alexandre de Montjoye et al., 2013.
- Unique in the shopping mall: On the reidentifiability of credit card metadata, Yves-Alexandre de Montjoye et al., 2014.
- Not So Unique in the Crowd: a Simple and Effective Algorithm for Anonymizing Location Data, Yi Song et al., 2014.
- The "Re-identification" of Governor William Weld's Medical Information: A Critical Re-examination of Health Data Identification Risks and Privacy Protections, Then and Now, Daniel Barth-Jones, 2012. (Notes)
Statistical disclosure control (Notes)
- Statistical Disclosure Control for Survey Data, Chris Skinner, 2009.
- Multiparty Computation Goes Live, Peter Bogetoft et al., 2008.
- Computing on Masked Data: a High Performance Method for Improving Big Data Veracity, Jeremy Kepner et al., 2014.
- Leaking sensitive information in complex document files—and how to prevent it, Simson Garfinkel, 2014.
- Data, Responsibly: Fairness, Neutrality and Transparency in Data Analysis, Julia Stoyanovich et al., 2016. (3 pages)
- Privacy: Theory meets Practice on the Map, Ashwin Machanavajjhala et al., 2008.
- Multiple imputation for statistical disclosure limitation, TE Raghunathan et al., 2003.
- The Anonymization Debate Should Be About Risk, Not Perfection, Hartzog et al., 2017, and a longer version.