This repository holds the official code for the paper Fair Canonical Correlation Analysis.
This work investigates fairness and bias in Canonical Correlation Analysis (CCA), a widely used statistical technique for examining the relationship between two sets of variables. We present a framework that alleviates unfairness by minimizing the correlation disparity error associated with protected attributes. Our approach enables the CCA model to learn global projection matrices from all data points while ensuring that these matrices yield comparable correlation levels to group-specific projection matrices. Experimental evaluation on both synthetic and real-world datasets demonstrates the efficacy of our method in reducing unfairness without compromising CCA model accuracy. These findings emphasize the importance of considering fairness in CCA applications to real-world problems.
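CCA is a way of inferring information from cross-covariance matrices. If we have two vectors of random variables, X = (X_1, ..., X_n) and Y = (Y_1, ..., Y_m), and there are correlations among the variables, then CCA finds linear combinations of X and Y that have maximum correlation with each other.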
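To make the idea of "comparable correlation levels" concrete, the following is a minimal sketch of plain (non-fair) CCA in base MATLAB. It is not the repository's Fair CCA code. The toy data, the two-group attribute `g`, and the variable names are all assumptions made for illustration, and the gap computed here is only a rough stand-in for the correlation disparity error; the paper's exact definition may differ.

```matlab
% Baseline (non-fair) CCA in base MATLAB; illustrative only, not the repository's code.
rng(0);                                   % reproducible toy data
n = 200; r = 1;                           % samples, number of canonical pairs kept
z = randn(n, 1);                          % shared latent signal linking the two views
X = [z + 0.5*randn(n, 1), randn(n, 3)];   % view 1: n-by-4
Y = [z + 0.5*randn(n, 1), randn(n, 2)];   % view 2: n-by-3
g = [ones(n/2, 1); 2*ones(n/2, 1)];       % hypothetical protected attribute (2 groups)

% Index sets: all samples first, then each group.
idx = {true(n, 1), g == 1, g == 2};
U = cell(1, 3); V = cell(1, 3); rho = zeros(1, 3);
for k = 1:3
    Xc = X(idx{k}, :) - mean(X(idx{k}, :), 1);   % center each view
    Yc = Y(idx{k}, :) - mean(Y(idx{k}, :), 1);
    [Qx, Rx] = qr(Xc, 0);                        % economy-size QR
    [Qy, Ry] = qr(Yc, 0);
    [A, S, B] = svd(Qx' * Qy, 'econ');           % singular values are the canonical correlations
    U{k} = Rx \ A(:, 1:r);                       % projection for view 1
    V{k} = Ry \ B(:, 1:r);                       % projection for view 2
    rho(k) = S(1, 1);                            % leading canonical correlation
end

% Correlation each group attains under the global projections (U{1}, V{1}),
% compared with the best correlation that group could attain on its own.
gap = zeros(1, 2);
for k = 1:2
    c = corrcoef(X(idx{k+1}, :) * U{1}, Y(idx{k+1}, :) * V{1});
    gap(k) = rho(k+1) - abs(c(1, 2));
end
fprintf('global rho = %.3f, per-group gaps = [%.3f %.3f]\n', rho(1), gap);
```

A large difference between the two entries of `gap` would suggest that the global projections serve one group noticeably better than the other, which is the kind of imbalance a fairness-aware method aims to reduce.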
In our paper, we use three real-world datasets listed as follows.
- Mental Health and Academic Performance Survey (MHAAPS): This dataset contains three psychological variables, four academic variables (standardized test scores), and sex information for a cohort of 600 college freshmen. The goal is to examine the relationship between the psychological variables and the academic indicators while accounting for the potential influence of sex.
- National Health and Nutrition Examination Survey (NHANES): We used the 2005-2006 subset of the NHANES database, which includes physical measurements and self-reported questionnaires from participants. We partitioned the data into two subsets by feature type to study the individual and collective impact of different factors on health outcomes. The 'Phenotypic-Demographic' dataset contains physical traits and indicators, such as height, weight, BMI, and waist circumference, alongside socioeconomic and demographic variables. The 'Environmental-Demographic' dataset contains environmental exposure variables, which indicate exposure to specific environmental elements and are derived from NHANES questionnaire responses, together with demographic factors.
- Alzheimer's Disease Neuroimaging Initiative (ADNI): The ADNI was launched in 2003 as a public-private partnership led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial MRI, PET, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early AD. All participants provided written informed consent, and study protocols were approved by each participating site’s Institutional Review Board (IRB). Up-to-date information about the ADNI is available at www.adni-info.org. We utilized AV45 (amyloid) and AV1451 (tau) positron emission tomography (PET) data from the ADNI database.
The algorithm is implemented in MATLAB and requires the Optimization Toolbox from MathWorks. For installation details, see https://www.mathworks.com/products/optimization.html.
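Before running the code, it may help to confirm that the toolbox is available. The snippet below is a small sketch using the standard MATLAB functions `license` and `ver`; the variable names and messages are illustrative and not part of the repository.

```matlab
% Check whether the Optimization Toolbox is licensed and installed.
hasLicense  = license('test', 'Optimization_Toolbox');  % 1 if a license exists, 0 otherwise
isInstalled = ~isempty(ver('optim'));                    % true if the toolbox is present in this installation

if hasLicense && isInstalled
    disp('Optimization Toolbox found.');
else
    warning('Optimization Toolbox not found; see the MathWorks page for installation details.');
end
```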
Two methods, multi_cca.m and single_cca.m, are introduced in this work for Fair CCA. Please see synthetic_example.m for detailed examples based on the synthetic data. Synthetic data can be generated by synthetic_data_generation.m. For reproducibility, we already provide synthetic_data.mat.
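A typical session might look like the sketch below. It assumes that the repository root is the current folder and that `synthetic_data_generation.m` writes `synthetic_data.mat` there; neither assumption is confirmed here. It deliberately does not call `multi_cca` or `single_cca` directly, because their signatures are documented in `synthetic_example.m` rather than in this README.

```matlab
% Inspect the provided synthetic data, then run the bundled example.
% Assumes the repository root is the current folder (not verified here).
dataFile = 'synthetic_data.mat';

% Regenerate the data only if the provided file is missing and the generator is present.
if exist(dataFile, 'file') ~= 2 && exist('synthetic_data_generation.m', 'file') == 2
    run('synthetic_data_generation.m');
end

% List the variables stored in the file without loading them into the workspace.
if exist(dataFile, 'file') == 2
    whos('-file', dataFile)
else
    disp('synthetic_data.mat not found in the current folder.');
end

% Run the example script that demonstrates multi_cca.m and single_cca.m.
if exist('synthetic_example.m', 'file') == 2
    run('synthetic_example.m');
else
    disp('synthetic_example.m not found; change to the repository folder first.');
end
```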
Data used in this study are obtained from NHANES and ADNI.

The authors Zhuoping Zhou, Davoud Ataee Tarzanagh, and Bojian Hou have contributed equally to this paper.
Zhuoping Zhou ([email protected])