Lecture 6: Linear Regression II
Machine Learning, CUNY Graduate Center

Extension to polynomial regression
• Polynomial regression is the same as linear regression in D dimensions.

Generate new features
• Standard linear model: y(x, w) = w0 + w1·x
• Polynomial with coefficients, w: y(x, w) = w0 + w1·x + w2·x^2 + … + wD·x^D
• Risk: R(w) = ½ Σi (y(xi, w) − ti)^2

Generate new features
• Feature trick: to fit a D-dimensional polynomial, create a D-element vector from xi, e.g. φ(xi) = (xi, xi^2, …, xi^D).
• Then run standard linear regression in D dimensions.

How is this still linear regression?
• The regression is linear in the parameters, despite projecting xi from one dimension to D dimensions.
• Now we fit a plane (or hyperplane) to a representation of xi in a higher-dimensional feature space.
• This generalizes to any set of functions φ.

Basis functions as feature extraction
• These functions are called basis functions.
  – They define the bases of the feature space.
• This allows a linear decomposition of any type of function fit to the data points.
• Common choices:
  – Polynomial
  – Gaussian
  – Sigmoids
  – Wave functions (sine, etc.)

Training data vs. Testing Data
• Evaluating the performance of a classifier on training data is meaningless.
• With enough parameters, a model can simply memorize (encode) every training point.
• To evaluate performance, the data is divided into training and testing (or evaluation) sets.
  – Training data is used to learn model parameters.
  – Testing data is used to evaluate performance.

Overfitting

Overfitting performance

Definition of overfitting
• Overfitting occurs when the model describes the noise rather than the signal.
• How can you tell the difference between overfitting and a bad model?

Possible detection of overfitting
• Stability
  – An appropriately fit model is stable under different samples of the training data.
  – An overfit model gives inconsistent performance.
• Performance
  – A good model has low test error.
  – A bad model has high test error.

What is the optimal model size?
• The best model size is the one that generalizes best to unseen data.
• Approximate this by testing error.
• One way to optimize parameters is to minimize testing error.
  – This uses the testing data as tuning or development data.
• This sacrifices training data in favor of parameter optimization.
• Can we do this without explicit evaluation data?

Context for linear regression
• Simple approach
• Efficient learning
• Extensible
• Regularization provides robust models

Linear Regression
• Identify the best parameters, w, for a regression function.

Overfitting
• Recall: overfitting happens when a model captures idiosyncrasies of the data rather than generalities.
  – It is often caused by too many parameters relative to the amount of training data.
  – E.g., an order-N polynomial can pass through any N+1 data points exactly.

Dealing with Overfitting
• Use more data
• Use a tuning set
• Regularization
• Be a Bayesian

Regularization
• In a linear regression model, overfitting is characterized by large weights.

        M = 0    M = 1    M = 3          M = 9
  w0     0.19     0.82     0.31           0.35
  w1             -1.27     7.99         232.37
  w2                     -25.43       -5321.83
  w3                      17.37       48568.31
  w4                                 -231639.30
  w5                                  640042.26
  w6                                -1061800.52
  w7                                 1042400.18
  w8                                 -557682.99
  w9                                  125201.43

Penalize large weights
• Introduce a penalty term in the loss function.
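To make the penalty concrete, here is a minimal NumPy sketch (not from the slides): it fits a degree-9 polynomial with and without an L2 penalty and compares the sizes of the learned weights. The synthetic sine data, the degree, and the penalty strength are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical synthetic data (not from the lecture): noisy samples of a sine curve.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

def poly_features(x, degree):
    """Map each scalar x_i to the feature vector (1, x_i, x_i^2, ..., x_i^degree)."""
    return np.vander(x, degree + 1, increasing=True)

def fit(x, t, degree, lam=0.0):
    """Least-squares polynomial fit; lam > 0 adds an L2 penalty on the weights (ridge)."""
    Phi = poly_features(x, degree)
    if lam == 0.0:
        return np.linalg.lstsq(Phi, t, rcond=None)[0]
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ t)

w_plain = fit(x, t, degree=9)             # M = 9 with no penalty: weights typically blow up
w_ridge = fit(x, t, degree=9, lam=1e-3)   # same model with a small penalty: weights stay moderate
print(np.abs(w_plain).max(), np.abs(w_ridge).max())
```

The first line of the fit is exactly the "feature trick" from the earlier slides; the penalized branch is the closed-form solution of the penalized loss introduced next.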
Regularized Regression (L2-Regularization or Ridge Regression)
• Add an L2 penalty on the weights to the squared-error loss:
  E(w) = ½ Σi (y(xi, w) − ti)^2 + (λ/2) ‖w‖^2

Regularization Derivation

Regularization in Practice

Regularization Results

More regularization
• The penalty term defines the style of regularization:
  – L2-Regularization
  – L1-Regularization
  – L0-Regularization
• The L0-norm corresponds to selecting the optimal subset of features.

Curse of dimensionality
• Increasing the dimensionality of the features increases the data requirements exponentially.
• For example, if a single feature can be accurately approximated with 100 data points, optimizing the joint over two features requires 100 × 100 = 10,000 data points.
• Models should be small relative to the amount of available data.
• Dimensionality reduction techniques – feature selection – can help.
  – L0-regularization is explicit feature selection.
  – L1- and L2-regularization approximate feature selection.

Bayesians vs. Frequentists
• What is a probability?
• Frequentists
  – A probability is the likelihood that an event will happen.
  – It is approximated by the ratio of the number of observed events to the total number of events.
  – Assessment is vital to selecting a model.
  – Point estimates are absolutely fine.
• Bayesians
  – A probability is a degree of believability of a proposition.
  – Bayesians require that probabilities be prior beliefs conditioned on data.
  – The Bayesian approach "is optimal", given a good model, a good prior, and a good loss function; don't worry so much about assessment.
  – If you are ever making a point estimate, you've made a mistake. The only valid probabilities are posteriors based on evidence given some prior.

Bayesian Linear Regression
• The previous MLE derivation of linear regression uses point estimates for the weight vector, w.
• Bayesians say, "hold it right there."
  – Use a prior distribution over w to estimate parameters.
• Alpha is a hyperparameter over w: alpha is the precision (inverse variance) of the prior distribution, p(w | α) = N(w | 0, α⁻¹ I).
• Now optimize the posterior: p(w | x, t, α) ∝ p(t | x, w) p(w | α)

Optimize the Bayesian posterior
• As usual, it is easier to optimize after a log transform:
  ln p(w | x, t, α) = ln p(t | x, w) + ln p(w | α) + const

Optimize the Bayesian posterior
• Ignoring terms that do not depend on w, the objective reduces to a squared-error term plus the prior penalty (α/2) wᵀw.
• This is an IDENTICAL formulation to L2-regularization.

Context
• Overfitting is bad.
• Bayesians vs. Frequentists
  – Is one better?
  – Machine learning uses techniques from both camps.

Next Time
• Logistic Regression
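As a closing check of the equivalence claimed above, here is a small NumPy sketch (not from the slides): it computes the MAP weights under a Gaussian prior with precision alpha, assuming the standard Gaussian-noise likelihood with precision beta (beta is not named in the lecture; it is an assumption of this sketch), and confirms they match the ridge solution with lambda = alpha / beta. The data and the values of alpha and beta are made up for illustration.

```python
import numpy as np

# Hypothetical setup (not from the slides): synthetic data, prior precision alpha,
# and noise precision beta are assumed values for illustration only.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.vander(x, 10, increasing=True)   # degree-9 polynomial features

alpha, beta = 1e-2, 25.0                  # prior precision and noise precision (assumed)

# MAP estimate: maximize log N(t | Phi w, beta^-1 I) + log N(w | 0, alpha^-1 I)
w_map = np.linalg.solve(alpha * np.eye(10) + beta * (Phi.T @ Phi), beta * (Phi.T @ t))

# Ridge estimate with lambda = alpha / beta
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(10) + Phi.T @ Phi, Phi.T @ t)

print(np.allclose(w_map, w_ridge))        # True: MAP with a Gaussian prior == L2-regularized fit
```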