Article Source
Data Heterogeneity Analysis for Out-of-Distribution Generalization
Abstract
Data heterogeneity is a key determinant of the performance of ML systems. Standard algorithms that optimize for average-case performance do not consider the presence of diversity within data. As a result, variations in data sources, data generation mechanisms, and sub-populations lead to unreliable decision-making, poor generalization, unfairness, and false scientific discoveries. Carefully modeling data heterogeneity is a necessary step in building reliable data-driven systems. Its rigorous study is a nascent field of research spanning several disciplines, including statistics, causal inference, machine learning, economics, and operations research. In this tutorial, we develop a unified view of the disparate intellectual threads developed by different communities. We aim to foster interdisciplinary research by providing a unified view based on a shared language. Drawing upon several separate literatures, we establish a taxonomy of heterogeneity and present quantitative measures and learning algorithms that consider heterogeneous data. To spur empirical progress, we conclude by discussing validation protocols and benchmarking practices.