Transcript Carlos.ppt
Columbia University – Advanced Machine Learning & Perception – Fall 2006 Term Project
Nonlinear Dimensionality Reduction and K-Nearest Neighbor Classification Applied to Global Climate Data
Carlos Henrique Ribeiro Lima
New York – Dec/2006

Outline
1. Goals
2. Motivation and Dataset
3. Methodology
4. Results
   1. Low-Dimensional Manifold
   2. KNN on the Low-Dimensional Manifold
5. Conclusions

1. Goals
1. Use kernel PCA based on Semidefinite Embedding to identify the low-dimensional, nonlinear manifold of climate data sets and the main modes of spatial variability;
2. Classify on the feature space to obtain predictions on the original space (KNN method).

2. Motivation
Dataset of monthly Sea Surface Temperature (SST).
Extreme El Niño events (e.g. 1997) have huge economic and social impacts.
Need for forecasting models!

2. Dataset
Monthly Sea Surface Temperature (SST) data from Jan/1856 to Dec/2005.
1. Latitudinal band: 25°S–25°N;
2. Grid with 599 cells;
3. Training data: Jan/1856 to Dec/1975 = 120 years;
4. Testing set: Jan/1976 to Dec/2005 = 30 years;
5. Input matrix: X = [x_ij], with n = 1440 rows (monthly points) and m = 599 columns (grid cells, dimensions), so row i is the SST field (x_i1, ..., x_im).

3. Methodology
1) Semidefinite Embedding (code from K. Q. Weinberger); the learned kernel matrix satisfies:
   - positive semidefiniteness;
   - inner products centered on the origin;
   - isometry: local distances of the input space are preserved on the feature space.
   (A standard formulation is sketched after the Conclusions.)
2) KNN with Euclidean distance (see the forecast sketch after the Conclusions);
3) Probabilistic forecasting skill score (RPS) (written out after the Conclusions).

4. Results – Low-Dimensional Manifold

4. Results – Labeling on the feature space

4. Results – Forecasts on the Testing Set: KNN method and skill score
E.g. March 1997:
1) Want to predict the class of Niño3 in Dec/1997 (lead time = 9 months);
2) KNN on the feature space (March of 1856 to 1975);
3) Take the classes and weights of the k neighbors;
4) Compute the skill score.

4. Results – Forecasts on the Testing Set: KNN method and skill score – El Niño events of 1982 and 1997

5. Conclusions
1. Semidefinite Embedding performs well on the SST data (from the high-dimensional input down to just 3 dimensions with ~90% of the explained variance);
2. The KNN method provides very good classification and forecasts;
3. Need to check sensitivity to changes in some parameters (number of local neighbors, number of KNN neighbors);
4. Plan to extend to other climate datasets;
5. Try other metrics, multivariate data, etc.
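The Methodology slide lists the constraints of Semidefinite Embedding but not the optimization itself. As a reference, a standard formulation of Weinberger's Semidefinite Embedding (maximum variance unfolding), which the cited code implements, is sketched below; the number of local neighbors used to build the neighborhood graph is one of the parameters flagged in the Conclusions.

\[
\begin{aligned}
\max_{K}\; & \operatorname{tr}(K) \\
\text{s.t. }\; & K \succeq 0 && \text{(positive semidefiniteness)} \\
& \textstyle\sum_{i,j} K_{ij} = 0 && \text{(inner products centered on the origin)} \\
& K_{ii} + K_{jj} - 2K_{ij} = \lVert x_i - x_j \rVert^{2} \;\text{ for neighboring } i,j && \text{(local isometry)}
\end{aligned}
\]

The low-dimensional coordinates then follow from kernel PCA on the learned K: with eigenpairs \((\lambda_a, v_a)\), the a-th coordinate of point i is \(y_{ia} = \sqrt{\lambda_a}\, v_{ai}\); the top three eigenvalues are what account for the ~90% of explained variance reported in the Conclusions.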
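The KNN forecast steps on the Results slide (find the k nearest neighbors of the current month in the feature space, take their classes and weights, form a probabilistic forecast) can be sketched as below. This is a minimal illustration, not the author's code: the function and variable names are hypothetical, and the inverse-distance weighting is an assumption, since the slide does not say how the neighbor weights are defined.

```python
import numpy as np

def knn_probabilistic_forecast(z_query, Z_train, labels, k=10, n_classes=3):
    """Probabilistic class forecast by KNN in the low-dimensional feature space.

    z_query  : embedded coordinates of the current month (e.g. March 1997)
    Z_train  : (n_train, d) embedded coordinates of the same calendar month
               in the training years (e.g. March of 1856-1975)
    labels   : (n_train,) integer class of the target at the forecast lead
               (e.g. the Dec Nino3 class, 9 months ahead)
    Returns a length-n_classes vector of forecast probabilities.
    """
    # Euclidean distances in the feature space, as on the Methodology slide
    dist = np.linalg.norm(Z_train - z_query, axis=1)
    nn = np.argsort(dist)[:k]              # indices of the k nearest neighbors
    w = 1.0 / (dist[nn] + 1e-12)           # inverse-distance weights (assumption)
    w /= w.sum()
    probs = np.zeros(n_classes)
    for i, weight in zip(nn, w):
        probs[labels[i]] += weight         # accumulate neighbor weight per class
    return probs
```

For the March 1997 example, z_query would be the embedded March 1997 SST field, Z_train the embedded March fields of 1856–1975, and labels the corresponding December Niño3 classes, yielding a 9-month-lead probabilistic forecast.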
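The slides name the probabilistic forecasting skill score only as "RPS". Assuming this refers to the ranked probability score commonly used in forecast verification, for C ordered Niño3 categories it can be written as

\[
\mathrm{RPS} = \sum_{c=1}^{C} \left( \sum_{j=1}^{c} p_j - \sum_{j=1}^{c} o_j \right)^{2},
\]

where p_j is the forecast probability of category j (e.g. from the KNN sketch above) and o_j equals 1 for the observed category and 0 otherwise. Lower RPS means a better forecast; a skill score is typically formed against a climatological reference as 1 - RPS/RPS_clim.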