The problem of estimating the proportion of each class in an unlabeled data set, using a regression model trained on past labeled data, is studied. It is shown that if the training-data class proportions differ substantially from those in the unlabeled sample, then the straightforward estimate can be highly biased. Unbiased methods are described under the assumption that the predictor-variable distributions given the class labels are the same in both samples. Diagnostics are presented for checking the validity of this assumption, leading to a simple test for concept drift. It is shown that modifying the classification rule to reflect the actual class proportions in the unlabeled sample can often substantially improve classification accuracy on that sample. The method is generalized to the regression setting for estimating the marginal distribution of the outcome variable in samples with unknown outcome values.

About the Speaker
Jerome H. Friedman is a professor of statistics at Stanford University. He received his bachelor's and Ph.D. degrees from the University of California, Berkeley, both in physics. His main research interests are in machine learning and data mining.
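The abstract does not spell out the estimator, but a standard approach to the same problem (under the same assumption that predictor distributions given class labels are shared between samples) is to re-estimate the class priors on the unlabeled sample and rescale the classifier's posteriors accordingly. The sketch below is an illustrative EM-style prior-shift correction in the spirit of the talk, not the speaker's exact method; all function names are my own.

```python
import numpy as np

def estimate_priors_em(probs, train_priors, n_iter=100, tol=1e-8):
    """Estimate class proportions in an unlabeled sample by EM,
    given posterior probabilities `probs` (n_samples x n_classes)
    from a model trained under class priors `train_priors`.
    A generic prior-shift sketch, not the talk's exact estimator."""
    train_priors = np.asarray(train_priors, dtype=float)
    priors = train_priors.copy()
    for _ in range(n_iter):
        # E-step: re-weight posteriors to reflect current prior estimate.
        w = probs * (priors / train_priors)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: new priors are the average corrected posteriors.
        new_priors = w.mean(axis=0)
        if np.max(np.abs(new_priors - priors)) < tol:
            priors = new_priors
            break
        priors = new_priors
    return priors

def adjust_posteriors(probs, train_priors, new_priors):
    """'Modify the classification rule': rescale each posterior by the
    ratio of new to training priors, then renormalize each row."""
    w = probs * (np.asarray(new_priors) / np.asarray(train_priors))
    return w / w.sum(axis=1, keepdims=True)
```

Classifying with `adjust_posteriors(...).argmax(axis=1)` instead of the raw posteriors is the kind of modification the abstract says can substantially improve accuracy when the class mix has shifted.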
Young (and some not-so-young) researchers often wonder how to extract good research ideas and develop useful methodologies from solving real-world problems. The path is rarely straightforward, and its success depends on circumstances, tenacity, and luck. I will use three examples to illustrate how I trod the path. The first involved an attempt to find optimal growth conditions for nanostructures (e.g., wires, belts, saws). It led to the development of a new method, "sequential minimum energy design" (SMED), which exploits an analogy to the potential energy of charged particles. After a few years of frustrated effort and relentless pursuit, we realized that SMED is better suited to generating samples adaptively to mimic an arbitrary distribution than to optimization. The main objective of the second example was to build an efficient statistical emulator based on finite element simulation results with two mesh densities in cast foundry operations. It eventually led to the development of a class of nonstationary Gaussian process models that can be used to connect simulation data of different precisions and speeds. The third example hails from cell biology. In a T-cell adhesion experiment at Georgia Tech, the biologist was not satisfied with the use of graphical methods to understand the serial dependency of cell adhesion over repeated trials. It led to the development of hidden Markov models with new features that reflect the nature of the experiment. In each example, the developed methodology has broader applications beyond the original problem. I will explain the thought process in each example but do not promise any general observations. Finally, I will reward the audience with a suggested path to enhance your creativity.

About the Speaker
C. F. Jeff Wu is Professor and Coca-Cola Chair in Engineering Statistics at the School of Industrial and Systems Engineering, Georgia Institute of Technology. He was the first academic statistician elected to the National Academy of Engineering (2004) and is a Member (Academician) of Academia Sinica (2000). He is a Fellow of the American Society for Quality, the Institute of Mathematical Statistics, INFORMS, and the American Statistical Association. He received the COPSS (Committee of Presidents of Statistical Societies) Presidents' Award in 1987, the COPSS Fisher Lecture in 2011, the Deming Lecture in 2012, and numerous other awards and honors. Professor Wu has published more than 160 research articles and supervised 40 Ph.D. students. He has published two books, "Experiments: Planning, Analysis, and Parameter Design Optimization" (with Hamada) and "A Modern Theory of Factorial Designs" (with Mukerjee).