This is a list of materials and resources on configuration management for cloud and Internet systems. Some of the early work did not exactly target modern cloud systems, but I find the ideas relevant and inspiring.
The list does not cover other forms of configuration, such as network device configuration, feature flags, or user preferences (fonts and background themes).
- Fail at Scale: Reliability in the face of rapid change (CACM, 2015)
- How Hadoop clusters break (IEEE Software, 2013)
- What Takes Us Down? (;login:, 2012)
- An Empirical Study on Configuration Errors in Commercial and Open Source Systems (SOSP, 2011)
- Why do Internet services fail, and what can be done about it? (USITS, 2003)
- MobileConfig: Remote Configuration Management for Mobile Apps at Hyperscale (NSDI, 2024)
- Configuration Design and Best Practices (The Site Reliability Workbook, 2018)
- Configuration Specifics (The Site Reliability Workbook, 2018)
- Holistic Configuration Management at Facebook (SOSP, 2015)
- ACMS: The Akamai Configuration Management System (NSDI, 2005)
- STRIDER: A Black-box, State-based Approach to Change and Configuration Management and Support (LISA, 2003) - Microsoft CCMS
- Test Selection for Unified Regression Testing (ICSE, 2023)
- Finding Heterogeneous-Unsafe Configuration Parameters in Cloud Systems (EuroSys, 2021) - Detecting incompatible configuration in distributed systems
- Testing Configuration Changes in Context to Prevent Production Failures (OSDI, 2020) - Connecting production system configurations to software tests
- Usable Declarative Configuration Specification and Validation for Applications, Systems, and Cloud (Middleware, 2017) - IBM ConfigValidator
- Early Detection of Configuration Errors to Reduce Failure Damage (OSDI, 2016) - Generating configuration checks
- ConfValley: A Systematic Configuration Validation Framework for Cloud Services (EuroSys, 2015) - A declarative framework for writing configuration validation code
- Understanding and Detecting On-the-Fly Configuration Bugs (ICSE, 2023)
- Automated Reasoning and Detection of Specious Configuration in Large Systems with Symbolic Execution (OSDI, 2020)
- Rex: Preventing Bugs and Misconfiguration in Large Services Using Correlated Change Analysis (NSDI, 2020) - Correlated-change analysis for Microsoft Office 365 and Azure
- PracExtractor: Extracting Configuration Good Practices from Manuals to Detect Server Misconfigurations (USENIX ATC, 2020) - Using NLP to learn good practices and detect bad practices
- EnCore: Exploiting System Environment and Correlation Information for Misconfiguration Detection (ASPLOS, 2014) - Checking correlations between configuration values and the deployment environment (VM images)
- Context-based Online Configuration-Error Detection (USENIX ATC, 2011) - Detecting abnormal configuration event sequences
- Proactive Detection of Inadequate Diagnostic Messages for Software Configuration Errors (ISSTA, 2015) - Using NLP to evaluate log message quality
- Do Not Blame Users for Misconfigurations (SOSP, 2013) - Generating misconfigurations based on constraints inferred from source code
- ConfErr: A Tool for Assessing Resilience to Human Configuration Errors (DSN, 2008) - Generating misconfigurations based on a human error model
- Automated Diagnosis of Software Configuration Errors (ICSE, 2013) - Identifying behavior deviations caused by misconfiguration
- X-ray: Automating Root-Cause Diagnosis of Performance Anomalies in Production Software (OSDI, 2012) - Inferring causality between performance anomalies and configuration values
- Precomputing Possible Configuration Error Diagnoses (ASE, 2011)
- Automating configuration troubleshooting with dynamic information flow analysis (OSDI, 2010) - Inferring causality between failures and configuration values
- Configuration Debugging as Search: Finding the Needle in the Haystack (OSDI, 2004) - Debugging by time traveling
- Automatic Misconfiguration Troubleshooting with PeerPressure (OSDI, 2004) - Integrated in Windows troubleshooting toolkit for Windows registry
- Learning Patterns in Configuration (ASE, 2021)
- An Evolutionary Study of Configuration Design and Implementation in Cloud Systems (ICSE, 2020) - Evolution of configuration design and implementation
- Understanding and Discovering Software Configuration Dependencies in Cloud and Datacenter Systems (ESEC/FSE, 2020) - Configuration dependency analysis
- Statically Inferring Performance Properties of Software Configurations (EuroSys, 2020)
- Synthesizing Configuration File Specifications with Association Rule Learning (OOPSLA, 2017) - Synthesizing configuration specifications
- Probabilistic Automated Language Learning for Configuration Files (CAV, 2016) - Learning a language model of configuration
- ConfSeer: Leveraging Customer Support Knowledge Bases for Automated Misconfiguration Detection (VLDB, 2015) - Using NLP to find configuration KB articles; integrated in Microsoft Operations Management Suite
- Hey, You Have Given Me Too Many Knobs! Understanding and Dealing with Over-Designed Configuration in System Software (ESEC/FSE, 2015) - Statistics of configuration files in the field
- Which Configuration Option Should I Change? (ICSE, 2014) - Re-configuration due to software evolution
- KungFu: Making Training in Distributed Machine Learning Adaptive (OSDI, 2020)
- Understanding and Auto-Adjusting Performance-Sensitive Configurations (ASPLOS, 2018)
- Datasize-Aware High Dimensional Configurations Auto-Tuning of In-Memory Cluster Computing (ASPLOS, 2018)
- CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics (NSDI, 2017)
- BestConfig: Tapping the Performance Potential of Systems via Automatic Configuration Tuning (SoCC, 2017)
- Automatic Database Management System Tuning Through Large-scale Machine Learning (SIGMOD, 2017)
- Transfer Learning for Performance Modeling of Configurable Systems: An Exploratory Analysis (ASE, 2017)
- Forensic Analysis in Access Control: Foundations and a Case-Study from Practice (CCS, 2020)
- Towards Continuous Access Control Validation and Forensics (CCS, 2019)
- How Do System Administrators Resolve Access-Denied Issues in the Real World? (CHI, 2017)
- Detecting and Resolving Policy Misconfigurations in Access-Control Systems (TISSEC, 2011)
- Baaz: A System for Detecting Access Control Misconfigurations (USENIX Security, 2010)
- Configuration Dataset - Both configuration files and user-reported configuration issues collected from ServerFault, StackOverflow, and mailing lists
- Ctest Dataset - Historical configuration-related JIRA issues
- Mining Container Image Repositories for Software Configurations and Beyond - You can collect configuration files from Docker images

Feel free to open an issue or send me a PR if you have any suggestions or feedback.