NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA Lecture 4 Regression Analysis
John Birks
REGRESSION ANALYSIS
Introduction, aims, and main uses
Response model
Types of response variables y
Types of predictor variables x
Types of response curves
Transformations
Types of regression
Null hypothesis, alternative hypothesis, type I and II errors, α and β
Quantitative response variable
  Nominal explanatory (predictor) variables
  Quantitative explanatory (predictor) variables
General linear model
REGRESSION ANALYSIS continued
Presence/absence response variable
  Nominal explanatory (predictor) variables
  Quantitative explanatory (predictor) variables
Generalised linear model (GLM)
Multiple linear regression
Multiple logit regression
Selecting explanatory variables
Nominal or nominal and quantitative explanatory variables
Assessing assumptions of regression model
Simple weighted average regression
Model II regression
Software for basic regression analysis
INTRODUCTION
Explore relationships between species and their environment: presence/absence (+/–) or abundances of individual species (responses) in relation to one or more environmental variables (predictors).
Species abundance or presence/absence – response variable Y
Environmental variables – explanatory or predictor variables X
AIMS
1. To describe the response variable as a function of one or more explanatory variables. This RESPONSE FUNCTION usually cannot be chosen so that it will predict responses without error. Try to make these errors as small as possible and to average them to zero.
2. To predict the response variable under some new value of an explanatory variable. The value predicted by the response function is the expected response, the response with the error averaged out. cf. CALIBRATION
3. To express a functional relationship between two variables thought, a priori, to be related by a simple mathematical relationship, but where only one of the variables is known exactly. cf. MODEL II REGRESSION
MAIN USES
(1) Estimate ecological parameters for species, e.g. optimum, amplitude (tolerance): ESTIMATION AND DESCRIPTION
(2) Assess which explanatory variables contribute most to a species' response and which explanatory variables appear to be unimportant. Statistical testing: MODELLING
(3) Predict species responses (+/–, abundance) at sites with observed values of the explanatory variables: PREDICTION
(4) Predict environmental variables from species data: CALIBRATION or ‘INVERSE REGRESSION’
Sokal & Rohlf (1995), Fox (2002), Draper & Smith (1981), Montgomery & Peck (1992), Crawley (2002, 2005)
RESPONSE MODEL
Systematic part – regression equation. Error part – statistical distribution of the error ε.

Y = b0 + b1x + ε

b0, b1 are fixed but unknown coefficients: b0 = intercept, b1 = slope. Y = response variable, x = explanatory variable.

SYSTEMATIC PART: Ey = b0 + b1x
The error part is the distribution of ε, the random variation of the observed response around the expected response.

The aim is to estimate the systematic part from data while taking account of the error part of the model. In fitting a straight line, the systematic part is simply estimated by estimating b0 and b1.
Least-squares estimation – error part assumed to be normally distributed.
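As a minimal illustration (hypothetical numbers, not the lecture's data), the least-squares estimates of b0 and b1 can be obtained in R with lm():

```r
# Minimal sketch: estimating the systematic part Ey = b0 + b1*x by least squares.
# The vectors x and y are hypothetical, purely for illustration.
x <- c(10, 20, 30, 40, 50, 60)         # explanatory variable
y <- c(4.1, 3.5, 3.2, 2.6, 2.0, 1.8)   # response variable
fit <- lm(y ~ x)                       # least-squares fit of the straight line
coef(fit)                              # b0 (intercept) and b1 (slope)
```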
TYPES OF RESPONSE VARIABLES
Quantitative (log transformation)
Percentage (%) data – quantitative
Nominal, including +/–
TYPES OF EXPLANATORY or PREDICTOR VARIABLES
Quantitative
Nominal
Ordinal (ranks) – treat as nominal (1/0 dummy variables) if few classes, as quantitative if many classes
TYPES OF RESPONSE CURVES
If there is one explanatory variable x, regression consists of fitting curves through the data.
What type of curve?
(i) EDA scatter plots of y and x.
(ii) Underlying theory and available knowledge.
TYPES OF RESPONSE CURVES
Shapes of response curves. The expected response (Ey) is plotted against the environmental variable (x). The curves can be constant (a: horizontal line), monotonic increasing (b: sigmoid curve; c: straight line), monotonic decreasing (d: sigmoid curve), unimodal (e: parabola; f: symmetric Gaussian curve; g: asymmetric curve and a block function) or bimodal (h).
Response curves derived from a bimodal curve by restricting the sampling interval. The curve is bimodal in the interval a-f, unimodal in a-c and in d-f, monotonic in b-c and c-e and almost constant in c-d. Ey = expected response; x = environmental variable.
TRANSFORMATIONS
Usually needed. The appropriate transformation depends on the type of response variable y (quantitative or +/–).
TYPES OF REGRESSION
Explanatory variable(s) x    Response variable y: quantitative           Response variable y: nominal (+/–)
One, nominal                 ANOVA                                       χ² test
One, quantitative            Linear and non-linear regression            Logit regression
Many, nominal                Multiple LR with nominal dummy variables    [Log-linear contingency tables]
Many, quantitative           Multiple LR                                 Multiple logit regression

(LR = linear regression.) Also weighted averaging regression and model II regressions.
NULL HYPOTHESIS, ALTERNATIVE HYPOTHESIS, TYPE I ERROR (α), AND TYPE II ERROR (β)
Null hypothesis H0: ‘y not correlated with x’. No difference, no association, no correlation. The hypothesis to be tested, usually by some type of significance test.
Alternative hypothesis H1: postulates a non-zero difference, association, or correlation. The hypothesis against which the null hypothesis is tested.
Tests of statistical hypotheses are probabilistic. We can just as well estimate the degree to which an effect is felt as judge whether the effect exists or not. As a result, we can compute the probabilities of two types of error.
Type I error (α): the probability that we have mistakenly rejected a true null hypothesis.
Type II error (β): the probability that we have mistakenly failed to reject a false null hypothesis.

DECISION      TRUTH: H0 true       TRUTH: H0 false
Accept H0     No error: 1 – α      Type II error: β
Reject H0     Type I error: α      No error: 1 – β
The power of a test is simply the probability of not making a type II error, namely 1 – β. The higher the power, the more likely it is to show, statistically, an effect that really exists.
α = 0.05, 0.01, 0.001
Type I error (α): the error that results when the null hypothesis is FALSELY REJECTED.
Type II error (β): the error that results when the null hypothesis is FALSELY ACCEPTED. Rarely estimated; a function of α, the sample size, and the magnitude of the effect being looked for.
QUANTITATIVE RESPONSE VARIABLE, NOMINAL EXPLANATORY VARIABLE
Relative cover (log-transformed) of a plant species in relation to the soil types clay, peat and sand. The horizontal arrows indicate the mean value in each type. The solid vertical bars show the 95% confidence interval for the expected values in each type and the dashed vertical lines the 95% prediction interval for the log-transformed cover in each type.
QUANTITATIVE RESPONSE VARIABLE, NOMINAL EXPLANATORY VARIABLE
Plant cover y; 3 soil types x.
Response model – Systematic part: 3 expected responses, one for each soil type. Error part: observed responses vary around the expected responses in each soil type, normally distributed, with the same variance within each soil type.
Assume responses are independent.
ANALYSIS OF VARIANCE (ANOVA)
Estimate the expected responses in the 3 soil types by least squares: the sum over all sites of the squared differences between observed and expected response is to be minimal. The parameter that minimises this SS is the mean.
The difference between Ey and the observed response is the residual. Least squares minimises the sum of squared vertical distances: the residual SS.
Ey, standard error, and 95% confidence interval = estimate ± t0.05(v) × s.e., where t0.05(v) is the 5% critical value in a 2-tailed test with degrees of freedom v = n – q parameters.
Means and ANOVA table of the transformed relative cover of the above figure.

Term                      Clay           Peat           Sand           Overall
mean                      1.70           3.17           2.33           2.33
s.e.                      0.33           0.38           0.38
95% confidence interval   (1.00, 2.40)   (2.37, 3.97)   (1.53, 3.13)

Estimate ± t0.05(v) × s.e. The value of t0.05(v) depends on the number of degrees of freedom (v) of the residual; with v = 17, t0.05(17) = 2.11.

ANOVA table
             d.f.         s.s.     m.s. (= ss/df)   F
Regression   q – 1 = 2    7.409    3.704            4.24
Residual     n – q = 17   14.826   0.872
Total        n – 1 = 19   22.235   1.17

F = (ms regression) / (ms residual) with df (2, 17); the critical value of F at the 5% level is 3.59. q = number of parameters = 3; n = number of objects = 20.
Total ss (n – 1 = 19 df) = Regression ss (q – 1 = 2 df) + Residual ss (n – q = 17 df).
R² = 1 – (residual sum of squares / total sum of squares) = 1 – (14.826 / 22.235) = 0.333
R²adj = 1 – (residual variance / total variance) = 1 – (0.872 / 1.17) = 0.25
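A sketch of this kind of one-way ANOVA in R, assuming a hypothetical data frame (the lecture's 20 observations are not reproduced here):

```r
# One-way ANOVA as least-squares regression on a nominal variable.
# 'veg' is hypothetical; the class means are the least-squares estimates.
veg <- data.frame(
  logcover = c(1.5, 1.9, 1.7, 3.0, 3.4, 3.1, 2.1, 2.5, 2.4),
  soil     = factor(rep(c("clay", "peat", "sand"), each = 3))
)
fit <- lm(logcover ~ soil, data = veg)
anova(fit)                            # regression/residual d.f., SS, MS and F
tapply(veg$logcover, veg$soil, mean)  # expected response in each soil type
```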
QUANTITATIVE RESPONSE VARIABLE, QUANTITATIVE EXPLANATORY VARIABLE
Straight line fitted by least-squares regression of log-transformed relative cover on mean water-table. The vertical bar on the far right has length equal to twice the sample standard deviation sT; the other two smaller vertical bars are twice the length of the residual standard deviation sR. The dashed line is a parabola fitted to the same data (●).
Error part – responses independent and normally distributed around the expected values.
Straight line fitted by least-squares: parameter estimates and ANOVA table for the transformed relative cover of the figure above.

Term          Parameter   Estimate   s.e.      t (= estimate/s.e.)
Constant      b0          4.411      0.426     10.35
Water-table   b1          -0.037     0.00705   -5.25

ANOVA table
             d.f.                   s.s.    m.s.    F
Regression   parameters – 1 = 1     13.45   13.45   27.56 (df 1, 18)
Residual     n – parameters = 18    8.78    0.488
Total        n – 1 = 19             22.23   1.17

R² = 0.61; R²adj = 0.58; r = –0.78 (negative, since the slope is negative)
QUANTITATIVE RESPONSE VARIABLE, QUANTITATIVE EXPLANATORY VARIABLE
Does the expected response depend on water-table?
F = MS regression / MS residual = 27.56 >> 4.4, the critical value of F at 5% with df (1, 18).
Does the slope b1 = 0? t of b1 = b1 / s.e.(b1) = –5.25. |t| = 5.25 > 2.10 = t0.05(18), the absolute value of the critical value of a two-tailed t-test at 5% with df = n – parameters = 18. So b1 is not equal to 0 [exactly equivalent to the F test: (b1 / s.e.(b1))² = F].
Construct a 95% confidence interval for b1: estimate ± t0.05(v) × s.e. = (–0.052, –0.022). This does not include 0, so 0 is an unlikely value for b1.
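A sketch of these t-tests and confidence intervals in R, using simulated data whose coefficients resemble the table above (the original data are not given in the lecture):

```r
# Simulated stand-in for the water-table example above.
set.seed(1)
watertable <- seq(20, 90, length.out = 20)
logcover   <- 4.4 - 0.037 * watertable + rnorm(20, sd = 0.7)
fit <- lm(logcover ~ watertable)
summary(fit)    # estimates, s.e., t (= estimate/s.e.), R2 and adjusted R2
anova(fit)      # regression and residual SS with the F ratio (df 1, 18)
confint(fit)    # 95% CI: estimate +/- t0.05(v) * s.e.; does it include 0?
```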
Check assumptions of response model
Plot residuals against x and Ey
Could we fit a curve to these data better than a straight line?
Parabola: Ey = b0 + b1x + b2x²
[Figure as above: straight line and parabola fitted by least-squares regression of log-transformed relative cover on mean water-table, with sample and residual standard deviation bars.]
Polynomial regression
Parabola fitted by least-squares regression: parameter estimates and ANOVA table for the transformed relative cover of the above figure.

Term             Parameter   Estimate    s.e.       t
Constant         b0          3.988       0.819      4.88
Water-table      b1          -0.0187     0.0317     -0.59
(Water-table)²   b2          -0.000169   0.000284   -0.59

b1 and b2 not different from 0. One extra parameter, so one less residual d.f.

ANOVA table
             d.f.   s.s.    m.s.    F
Regression   2      13.63   6.815   13.97
Residual     17     8.61    0.506
Total        19     22.23   1.17

R²adj = 0.57 (R²adj = 0.58 for the linear model)
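The same comparison in R, continuing the simulated water-table data from the sketch above; anova() on the two nested models gives the F test for the extra parameter:

```r
fit1 <- lm(logcover ~ watertable)                    # straight line
fit2 <- lm(logcover ~ watertable + I(watertable^2))  # parabola: 1 extra parameter
summary(fit2)       # t of the quadratic coefficient b2
anova(fit1, fit2)   # F test for the extra parameter (1 less residual d.f.)
```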
GENERAL LINEAR MODEL – Regression Analysis Summary
Response variable Y = EY + e, where EY is the expected value of Y for particular values of the predictors and e is the variability ("error") of the true values around the expected values EY.
The expected value of the response variable is a function of the predictor variables: EY = f(X1, ..., Xm).
EY = systematic component; e = stochastic or error component.

Simple linear regression:   EY = f(X) = b0 + b1X
Polynomial regression:      EY = b0 + b1X + b2X²
Null model:                 EY = b0
Multiple regression:        EY = Ŷ = b0 + Σ(j=1 to p) bj Xj
Fitted values allow you to estimate the error component, the regression residuals: ei = Yi – Ŷi.
Total sum of squares (the variability of the response variable): TSS = Σ(i=1 to n) (Yi – Ȳ)², where Ȳ = mean of Y.
This can be partitioned into:
(i) the variability of Y explained by the fitted model, the regression or model sum of squares: MSS = Σ(i=1 to n) (Ŷi – Ȳ)²
(ii) the residual sum of squares: RSS = Σ(i=1 to n) (Yi – Ŷi)² = Σ(i=1 to n) ei²
Under the null hypothesis that the response variable is independent of the predictor variables, MSS ≈ RSS if both are divided by their respective numbers of degrees of freedom.
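The partition TSS = MSS + RSS can be verified directly, continuing the simulated water-table example above:

```r
fit <- lm(logcover ~ watertable)
TSS <- sum((logcover - mean(logcover))^2)      # total sum of squares
MSS <- sum((fitted(fit) - mean(logcover))^2)   # model (regression) SS
RSS <- sum(residuals(fit)^2)                   # residual SS
c(TSS = TSS, MSS_plus_RSS = MSS + RSS)         # equal up to rounding error
```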
PARABOLA FITTED TO LOG-ABUNDANCE DATA, fitting a Gaussian unimodal response curve to original abundance data
z = c exp[–0.5(x – u)² / t²]
Gaussian response curve with its three ecologically important parameters: maximum (c), optimum (u) and tolerance (t). Vertical axis: species abundance (y). Horizontal axis: environmental variable (x). The range of occurrence of the species is seen to be about 4t.

loge z = b0 + b1x + b2x² = loge(c) – 0.5(x – u)² / t²

Optimum u = –b1 / (2b2)
Tolerance t = 1 / √(–2b2)
Maximum c = exp(b0 + b1u + b2u²)

If b2 > 0, the fitted curve has a minimum rather than a maximum. Approximate SEs of u and t can be calculated.
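A sketch of this back-transformation in R, with simulated abundances generated from a known Gaussian curve (u = 50, t = 10, c = 20) so the recovered parameters can be checked:

```r
set.seed(2)
x <- seq(10, 90, by = 2)
z <- 20 * exp(-0.5 * (x - 50)^2 / 10^2) * exp(rnorm(length(x), sd = 0.05))
b <- unname(coef(lm(log(z) ~ x + I(x^2))))  # b[1] = b0, b[2] = b1, b[3] = b2 (< 0)
u.hat   <- -b[2] / (2 * b[3])               # optimum
tol.hat <- 1 / sqrt(-2 * b[3])              # tolerance
c.hat   <- exp(b[1] + b[2] * u.hat + b[3] * u.hat^2)  # maximum
c(optimum = u.hat, tolerance = tol.hat, maximum = c.hat)
```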
PRESENCE-ABSENCE RESPONSE VARIABLE, NOMINAL EXPLANATORY VARIABLE
Numbers of fields in which Achillea ptarmica is present and absent in meadows with different types of agricultural use and frequency of occurrence of each type (unpublished data from Kruijne et al., 1967). The types are pure hayfield (ph), hay pastures (hp), alternate pasture (ap) and pure pasture (pp).
Response variable: presence/absence of Achillea ptarmica. Explanatory variable: agricultural use.
χ² = Σ (o – e)² / e, where o = observed frequency and e = expected frequency.

        Present   Absent   Total   Frequency
ph      37        109      146     0.254
hp      40        356      396     0.101
ap      27        402      429     0.063
pp      9         558      567     0.016
Total   113       1425     1538    0.073

(r – 1)(c – 1) degrees of freedom. The relative frequency of occurrence is 113/1538 = 0.073.
Under the null hypothesis, the expected number of fields with Achillea ptarmica present is: pure hayfield (ph) 0.073 × 146 = 10.7, hay pasture (hp) 0.073 × 396 ≈ 29.1, etc. Calculated χ² = 102.1, compared with the critical value of 7.81 at the 0.05 level with 3 df. Conclude that the occurrence of A. ptarmica depends on field type.
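The same test in R using the table above; chisq.test() computes the expected frequencies from the margins and should reproduce χ² = 102.1 with 3 d.f.:

```r
counts <- matrix(c(37, 109,    # ph: present, absent
                   40, 356,    # hp
                   27, 402,    # ap
                    9, 558),   # pp
                 nrow = 2,
                 dimnames = list(c("present", "absent"),
                                 c("ph", "hp", "ap", "pp")))
chisq.test(counts)             # compare with qchisq(0.95, df = 3) = 7.81
```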
PRESENCE-ABSENCE RESPONSE VARIABLE, QUANTITATIVE EXPLANATORY VARIABLE
Sigmoid curve fitted by logit regression of the presences (● at p = 1) and absences (● at p = 0) of a species on acidity (pH). In the display, the sigmoid curve looks like a straight line but it is not. The curve expresses the probability (p) of occurrence of the species in relation to pH.
1: Ey = b0 + b1x – can be negative
2: Ey = exp(b0 + b1x) – can be > 1
3: Ey = p = exp(b0 + b1x) / [1 + exp(b0 + b1x)], with (b0 + b1x) the linear predictor
Straight line (a), exponential curve (b) and sigmoid curve (c) represent equations 1, 2 and 3, respectively.
Systematic part – defined as shown. Error part – the response can take only two values, therefore a binomial error distribution. Cannot estimate the parameters by least-squares regression, as the errors are not normally distributed and do not have constant variance. LOGIT REGRESSION – a special case of a GENERALISED LINEAR MODEL (GLM).
GENERALISED LINEAR MODEL (GLM)
Not the same as the General Linear Model – more generalised.
Logit link: loge[p / (1 – p)] = linear predictor, i.e. p = exp(linear predictor) / [1 + exp(linear predictor)]
Estimation in GLM is by maximum likelihood. The likelihood is defined, for a set of parameter values, as the probability of the responses actually observed when that set of values is the true set of parameter values. ML chooses the set of parameter values for which the likelihood is maximum.
The deviation of observed responses from fitted responses is measured not by the residual SS, as in least squares, but by the RESIDUAL DEVIANCE.
[The least-squares principle is equivalent to ML if errors are independent and follow a normal distribution.] Least-squares regression is one type of GLM.
Solved iteratively. Software: GLIM, GENSTAT, R or S-PLUS.
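A minimal logit-regression sketch in R; the presence/absence data are simulated (the lecture's pH data are not reproduced):

```r
set.seed(3)
pH   <- runif(35, 3.5, 7.5)
pres <- rbinom(35, 1, plogis(2.0 - 0.5 * pH))  # true sigmoid response
fit  <- glm(pres ~ pH, family = binomial)      # binomial error, logit link
summary(fit)   # estimates, s.e., z values and residual deviance
```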
Sigmoid curve fitted by logit regression: parameter estimates and deviance table for the presence-absence data of the above figure.

Term       Parameter   Estimate   s.e.    t
Constant   b0          2.03       1.98    1.03
pH         b1          -0.484     0.357   -1.36 (not > |2.03|, the 5% critical value with 33 df)

           d.f.   Deviance   Mean deviance
Residual   33     43.02      1.304

Not different from a horizontal line, as the t-test of b1 = 0 is not rejected.
Parabola (a), Gaussian curve (b) and Gaussian logit curve (c) represent the corresponding equations. If we take for the linear predictor the logit transformation of p:
loge[p / (1 – p)] = linear predictor, i.e. p = exp(linear predictor) / [1 + exp(linear predictor)]
For a parabola (b0 + b1x + b2x²) we get
p = exp(b0 + b1x + b2x²) / [1 + exp(b0 + b1x + b2x²)]
or loge[p / (1 – p)] = b0 + b1x + b2x²: the GAUSSIAN LOGIT CURVE.
Gaussian logit curve fitted by logit regression of the presences (● at p = 1) and absences (● at p = 0) of a species on acidity (pH). u = optimum; t = tolerance; pmax = maximum probability of occurrence.
Gaussian logit curve fitted by logit regression: parameter estimates and deviance table for the presence-absence data.

Term       Parameter   Estimate   s.e.   t
Constant   b0          -128.8     51.1   -2.52
pH         b1          49.4       19.8   2.50
pH²        b2          -4.68      1.90   -2.47

All |t| > 1.96.

           d.f.   Deviance   Mean deviance
Residual   32     23.17      0.724

u = –b1 / (2b2); t = 1 / √(–2b2); pmax = 1 / {1 + exp(–b0 – b1u – b2u²)}

t-tests of b2, b1 and b0. Deviance tests: Gaussian logit curve model → linear logit (sigmoid) model → null model. Drop in deviance > χ²0.05(1) = 3.84.
The residual deviance of a model is compared with that of an extended model. The additional parameters in the extended model (e.g. Gaussian logit) are significant when the drop in residual deviance is larger than the critical value of a χ² distribution with k degrees of freedom (k = number of additional parameters).
Example: Gaussian logit model – residual deviance = 23.17; sigmoid model – residual deviance = 43.02. 43.02 – 23.17 = 19.85, which is >> χ²0.05(1) = 3.84.
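The deviance test in R, continuing the simulated pH data from the sketch above:

```r
fit1 <- glm(pres ~ pH, family = binomial)            # sigmoid (linear logit)
fit2 <- glm(pres ~ pH + I(pH^2), family = binomial)  # Gaussian logit
anova(fit1, fit2, test = "Chisq")  # drop in residual deviance vs chi-square(1)
b <- unname(coef(fit2))
-b[2] / (2 * b[3])                 # optimum u (meaningful only if b[3] < 0)
```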
RESPONSE VARIABLE WITH MANY ZERO VALUES
Counts 0, 1, 2, 3, ...
log Ey = linear predictor: log-linear or Poisson regression. The linear predictor can be (b0 + b1x), giving an exponential curve, or (b0 + b1x + b2x²), giving a Gaussian curve (if b2 < 0). [Poisson error distribution, link function log.]
Can transform to PSEUDOSPECIES (as in TWINSPAN) and use as +/– response variables in logit regression.
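A one-model sketch of Poisson regression in R, with hypothetical counts:

```r
set.seed(4)
x   <- seq(1, 10, length.out = 40)
cnt <- rpois(40, lambda = exp(1.5 - 0.1 * x))   # hypothetical counts
fit <- glm(cnt ~ x + I(x^2), family = poisson)  # log link; Gaussian curve if b2 < 0
summary(fit)
```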
QUANTITATIVE RESPONSE VARIABLE, MANY QUANTITATIVE EXPLANATORY VARIABLES
Response variable expressed as a function of two or more explanatory variables. Not the same as separate analyses because of correlations between explanatory variables and interaction effects.
MULTIPLE LEAST-SQUARES LINEAR REGRESSION
Planes: Ey = b0 + b1x1 + b2x2, where x1 and x2 are explanatory variables.
b0 – expected response when x1 and x2 = 0
b1 – rate of change in expected response along the x1 axis
b2 – rate of change in expected response along the x2 axis
b1 measures the change of Ey with x1 for a fixed value of x2; b2 measures the change of Ey with x2 for a fixed value of x1.
A plane displays the linear relation between the abundance value (y) of a species and two environmental variables (x1 and x2), fitted to artificial data (●).
A straight line displays the linear relationship between the abundance value (y) of a species and an environmental variable (x), fitted to artificial data (●). (a = intercept; b = slope or regression coefficient.)
Three-dimensional view of a plane fitted by least-squares regression of responses (●) on two explanatory variables x1 and x2. The residuals, i.e. the vertical distances between the responses and the fitted plane, are shown. Least-squares regression determines the plane by minimisation of the sum of these squared vertical distances.
Estimates of b0, b1, b2 with standard errors and t (= estimate / s.e.). ANOVA: total SS, residual SS, regression SS.
R² = 1 – (Residual SS / Total SS); R²adj = 1 – (Residual MS / Total MS)
Ey = b0 + b1x1 + b2x2 + b3x3 + b4x4 + ... + bm xm
MULTICOLLINEARITY
Selection of explanatory variables: forward selection, backward elimination, ‘best-set’ selection.
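A sketch of multiple regression with automatic selection in R, using hypothetical correlated predictors to mimic multicollinearity (step() uses AIC rather than significance tests):

```r
set.seed(5)
x1 <- rnorm(50)
x2 <- 0.6 * x1 + rnorm(50)   # correlated with x1: multicollinearity
x3 <- rnorm(50)              # irrelevant predictor
y  <- 1 + 2 * x1 - x2 + rnorm(50)
full <- lm(y ~ x1 + x2 + x3)
summary(full)                       # b estimates, s.e., t, R2 and adjusted R2
step(full, direction = "backward")  # backward elimination
```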
REGRESSION AND ANOVA
In multiple regression, where the yi are independent variables (the response), the familiar linear model is:
yi = β0 + β1xi1 + β2xi2 + ... + βkxik + εi   (A1)
where the xij (k predictor variables) are known constants, β0, β1, ..., βk are unknown parameters and the εi are independent normal random variables. In matrix notation, the model is written as y = Xβ + ε, with matrices:
y = [y1, y2, ..., ynT]ᵀ
X = the nT × (k + 1) matrix whose ith row is (1, xi1, xi2, ..., xik)
β = [β0, β1, ..., βk]ᵀ
ε = [ε1, ε2, ..., εnT]ᵀ
where nT = total number of replicates. The least-squares estimates b of the parameters are obtained from the normal equations:
X'Xb = X'y   (A2)
and, taking the inverse of X'X, we have:
b = [X'X]⁻¹[X'y]   (A3)
In a similar fashion, consider the linear model for a one-way ANOVA:
Yij = μ + αi + εij   (A4)
where yij is the value of the jth replicate in the ith treatment, μ is the overall parametric mean, αi is the effect of the ith treatment and εij is the random normal error associated with that replicate. The model for the expectation of y in any particular treatment is:
E(yi) = μ + ti   (A5)
with ti the ith treatment effect. If there were, for example, three treatments, the model could be written as:
E(y) = μX0 + t1X1 + t2X2 + t3X3   (A6)
The values of Xi required to reproduce the model E(yi) = μ + ti for a given yi, using equation A6, are: X0 = 1, and Xi = 1 if the ith treatment is applied, otherwise Xi = 0.
This can be expressed by the following matrices: y = [y11, y21, y31, ..., y1j, y2j, y3j]ᵀ; X = the design matrix whose columns correspond to X0, X1, X2 and X3, respectively (a column of 1s followed by 0/1 indicator columns for the three treatments); b = [μ, t1, t2, t3]ᵀ.
A least-squares solution may again be obtained by the equation:
X'Xb = X'y   (A7)
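A sketch verifying the normal equations (A2)–(A3) numerically in R, against lm() on the same hypothetical data:

```r
set.seed(6)
n <- 30
X <- cbind(1, rnorm(n), rnorm(n))    # design matrix with an intercept column
y <- X %*% c(2, 1, -0.5) + rnorm(n)  # true beta = (2, 1, -0.5) plus error
b <- solve(t(X) %*% X, t(X) %*% y)   # b = (X'X)^-1 X'y
cbind(normal_eq = as.vector(b),
      lm = unname(coef(lm(y ~ X[, 2] + X[, 3]))))  # identical estimates
```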
RESPONSE SURFACES
PARABOLA: Ey = b0 + b1x + b2x². If log y: Gaussian curve.
QUADRATIC SURFACE: Ey = b0 + b1x1 + b2x1² + b3x2 + b4x2² (5 parameters)
Bivariate Gaussian response surface if b2 and b4 are both negative.
t-tests: test if the surface is unimodal in the direction of x1 – null hypothesis b2 ≥ 0 against b2 < 0 (t of b2). Similarly for b4 – test if the surface is unimodal in the direction of x2.
Can also test if x2 influences the abundance of y in addition to x1, i.e. do b3 and b4 = 0?
MORE COMPLEX MODELS: Ey = b0 + b1x1 + b2x1² + b3x2 + b4x2² + b5x3 + b6x3² + ... + bt xm²
Hence the need for selecting explanatory variables.
PRESENCE-ABSENCE RESPONSE VARIABLE, MANY QUANTITATIVE EXPLANATORY VARIABLES: MULTIPLE LOGIT REGRESSION
Multiple logit regression (2 explanatory variables):
loge[p / (1 – p)] = b0 + b1x1 + b2x2
Test for effects of x1 and x2: t-tests of b1 and b2.
Bivariate Gaussian logit surface (2 explanatory variables):
loge[p / (1 – p)] = b0 + b1x1 + b2x1² + b3x2 + b4x2²
loge[p / (1 – p)] = b0 + b1x1 + b2x1² + b3x2 + b4x2²
Three-dimensional view of a bivariate Gaussian logit surface with the probability of occurrence (p) plotted vertically and the two explanatory variables x1 and x2 plotted in the horizontal plane.
Elliptical contours of the probability of occurrence p plotted in the plane of the explanatory variables x1 and x2. One main axis of the ellipses is parallel to the x1 axis and the other to the x2 axis. Gaussian logit surface.
INTERACTION EFFECTS OF X1 AND X2
Product terms x1x2:
Ey = b0 + b1x1 + b2x2 + b3x1x2 = (b0 + b2x2) + (b1 + b3x2)x1
Intercept and slope, and hence the effect of x1, depend on x2; the effect of x2 likewise depends on x1. If b3 = 0, NO INTERACTION between x1 and x2.
Quadratic surface:
Ey = b0 + b1x1 + b2x1² + b3x2 + b4x2² + b5x1x2
If b2 + b4 < 0 and 4b2b4 – b5² > 0, we have a unimodal surface with ellipsoidal contours, but the axes are not necessarily orthogonal. Can calculate the overall optimum (u1, u2):
u1 = (b5b3 – 2b1b4) / d
u2 = (b5b1 – 2b3b2) / d
d = 4b2b4 – b5²
Gaussian logit surface:
loge[p / (1 – p)] = b0 + b1x1 + b2x1² + b3x2 + b4x2² + b5x1x2
If b5 ≠ 0, the optimum with respect to x1 does depend on the value of x2. If b5 = 0, the optimum with respect to x1 does not depend on the value of x2, i.e. NO INTERACTION.
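A sketch of a product-term model in R; x1:x2 is the interaction term, and its t-test is the test of b3 = 0 in the simple product model above (hypothetical data):

```r
set.seed(7)
x1 <- runif(40)
x2 <- runif(40)
y  <- 1 + 2 * x1 + 3 * x2 - 4 * x1 * x2 + rnorm(40, sd = 0.3)
fit <- lm(y ~ x1 * x2)   # expands to b0 + b1*x1 + b2*x2 + b3*x1:x2
summary(fit)             # the x1:x2 row tests the interaction coefficient
```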
SELECTING EXPLANATORY VARIABLES
• If the model is balanced, parameters can be entered or removed in any order
• Adequate model: non-significantly different from the best model
• Best-subset method for selecting variables: try all possible combinations, select the best, but look at the others as well
• Automatic selection of variables does not necessarily give the best subset
  Backward elimination: start with all variables, then remove variables starting with the worst, and continue until all remaining variables are significant
  Forward selection: start with nothing, add the best variable, and continue as long as the new variables are significant
  Stepwise: start with forward selection, but try backward elimination after every step
J.D. Olden & D.A. Jackson (2000) Torturing data for the sake of generality: how valid are our regression models? Ecoscience 7, 501-510.
AKAIKE INFORMATION CRITERION (AIC)
Index of fit that takes account of the parsimony of the model by penalising for the number of parameters. The more parameters in a model, the better the fit. You get a perfect fit if you have a parameter for every data point but the model has no explanatory power.
Trade-off between goodness of fit and the number of parameters required by parsimony.
AIC useful as it explicitly penalises any superfluous parameters in the model by adding 2p to the variance or deviance.
AIC = -2 x (maximised log-likelihood) + 2 x (number of parameters)
Small values are indicative of a good fit to the data.
In multiple regression, AIC is just the residual variance plus twice the number of regression coefficients (including the intercept).
Used to compare the fit of alternative models with different numbers of parameters, and thus useful in model selection.
The smaller the AIC, the better the fit.
Given the alternative models involving different numbers of parameters, select the model with the lowest AIC.
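In R, AIC() compares alternative models directly; continuing the hypothetical x1, x2, x3, y from the multiple-regression sketch earlier:

```r
AIC(lm(y ~ x1), lm(y ~ x1 + x2), lm(y ~ x1 + x2 + x3))  # smallest AIC wins
fit <- lm(y ~ x1 + x2)
as.numeric(-2 * logLik(fit)) + 2 * (length(coef(fit)) + 1)  # matches AIC(fit);
# the +1 counts the residual variance as an estimated parameter
```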
MANY EXPLANATORY NOMINAL OR NOMINAL AND QUANTITATIVE VARIABLES
Three soil types – clay, peat, sand: k classes, k – 1 dummy variables.
Clay – reference class
Peat – dummy variable x2 = 1 when peat, 0 when clay or sand
Sand – dummy variable x3 = 1 when sand, 0 when clay or peat
Systematic part: Ey = b1 + b2x2 + b3x3
b1 = expected response in the reference class (clay)
b2 = difference in expected response between peat and clay
b3 = difference in expected response between sand and clay
Multiple logit regression – +/– response variable, one continuous variable (x1) and one nominal variable (3 classes: x2, x3):
loge[p / (1 – p)] = b0 + b1x1 + b2x1² + b3x2 + b4x3
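In R the k – 1 dummy variables are built automatically from a factor; the first level is the reference class. A minimal sketch:

```r
soil <- factor(c("clay", "peat", "sand", "peat", "clay", "sand"),
               levels = c("clay", "peat", "sand"))  # clay = reference class
model.matrix(~ soil)  # columns: intercept, soilpeat (x2), soilsand (x3)
```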
Response curves for Equisetum fluviatile fitted by multiple logit regression of the occurrence of E. fluviatile in freshwater ditches on the logarithm of electrical conductivity (EC) and soil type surrounding the ditch (clay, peat, sand). Data from de Lange (1972).
Residual deviance tests to test if the maxima differ, by dropping x2 and x3.
ASSESSING ASSUMPTIONS OF REGRESSION MODEL
Regression diagnostics – Faraway (2005) chapter 4.
Linear least-squares regression assumes:
1. the relationship between Y and X is linear, perhaps after transformation
2. the variance of the random error is constant for all observations
3. the errors are normally distributed
4. the errors for the n observations are independently distributed
Assumption (2) is required to justify choosing the estimates of the b parameters so as to minimise the residual SS, and is needed in tests of t and F values. Clearly, in minimising the SS of residuals, it is essential that no residuals should be systematically larger (more variable) than others.
Assumption (3) is needed to justify significance tests and confidence intervals.
RESIDUAL PLOTS
Plot the residuals (Y – Ŷ) against Ŷ or X.
RESIDUAL PLOTS
Residual plots from the multiple regression of gene frequencies on environmental variables for Euphydryas editha: (a) standardised residuals plotted against Ŷ values from the regression equation, (b) standardised residuals against X1, (c) standardised residuals against X2, (d) standardised residuals against X3, (e) standardised residuals against X4, and (f) normal probability plot.
Normal probability plot – plot the ordered standardised residuals against their expected values assuming a standard normal distribution. If (Yi – Ŷi) is the standardised residual for observation i, its expected value is the value of the standard normal distribution that exceeds the proportion (i – ⅜) / (n + ¼) of values in the full population.
Standardised residual = (Y – Ŷ) / √MSE
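A sketch of these diagnostic plots in R for any least-squares fit (here reusing the hypothetical multiple-regression fit from earlier); rstandard() gives internally studentised residuals, a refinement of (Y – Ŷ)/√MSE:

```r
fit  <- lm(y ~ x1 + x2 + x3)
sres <- rstandard(fit)       # standardised (internally studentised) residuals
plot(fitted(fit), sres)      # residuals against fitted values
qqnorm(sres); qqline(sres)   # normal probability plot
```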
SIMPLE WEIGHTED AVERAGE REGRESSION
OPTIMA
+/– data:        ûk = (1/n) Σ(i=1 to n) xi   (mean of x over the n sites where species k is present)
Abundance data:  ûk = Σ(i=1 to n) yik xi / Σ(i=1 to n) yik,   where yik = abundance of species k at site i

TOLERANCES
+/– data:        t̂k = [(1/n) Σ(i=1 to n) (xi – ûk)²]^½
Abundance data:  t̂k = [Σ(i=1 to n) yik (xi – ûk)² / Σ(i=1 to n) yik]^½

Software: WACALIB, CALIB, C2
DISREGARDS ABSENCES – DEPENDS ON DISTRIBUTION OF EXPLANATORY VARIABLE X
ter Braak & Looman (1986) Vegetatio 65: 3-11
+/– data – WA is just as good as GLR (Gaussian logit regression) when:
1. the species is rare and has a narrow tolerance
2. the distribution of the environmental variable amongst the sites is reasonably homogeneous over the range of species occurrences
3. the site scores (xi) are closely spaced in comparison with the species' amplitude or tolerance
Abundance data:
1. abundances Poisson distributed
2. sites homogeneously distributed
WEIGHTED AVERAGES ARE GOOD ESTIMATES
... of species optima if:
1. sites x are evenly distributed about the optimum u
2. sites are close to each other
... of gradient values if:
1. species optima u are evenly distributed about the site value x
2. all species have equal response widths (tolerances) t
3. all species have equal maximum abundance h
4. optima u are close to each other
Conditions are strictly true only for infinite gradients.
J. Oksanen (2002)
BIAS AND TRUNCATION IN WEIGHTED AVERAGING
Weighted averages are usually good estimates of Gaussian optima, unless the response is truncated. Overestimation at the low end of the gradient, underestimation at the high end of the gradient.
Slight bias towards the gradient centre: shrinkage of WA estimates. [Figure panels compare WA and GLR estimates.] J. Oksanen (2002)
MODEL II REGRESSION
When both the response and the predictor variables of the model are random (not controlled by the researcher), there is error associated with the measurements of both x and y. This is model II regression.
Examples: body mass and length; in vivo fluorescence and chlorophyll a; respiration rate and biomass.
We want to estimate the parameters of the equation that describes the relationship between pairs of random variables.
Must use model II regression for parameter estimation, as the slope found by ordinary least-squares regression (model I regression) may be biased by the presence of measurement error in the predictor variable.
MODEL II REGRESSION METHODS
Choice of model II regression method depends on the reasons for use and on the features of the data.

Method   Use and data                                                      Test possible
OLS      Error on y >> error on x                                         Yes
MA       Distribution is bivariate normal; variables are in the same
         physical units or dimensionless; variance of error about the
         same for x and y                                                 Yes
RMA      Distribution is bivariate normal; error variance on each axis
         proportional to the variance of the corresponding variable;
         check scatter diagram: no outliers                               Yes
SMA      Correlation r is significant; distribution is not bivariate
         normal; relationship between x and y is linear                   No
OLS      To compute forecasted (fitted) or predicted y values
         (regression equation and confidence intervals are irrelevant)    Yes
MA       To compare observations to model predictions                     Yes

OLS = ordinary least-squares regression; MA = major axis regression; SMA = standard major axis regression; RMA = ranged major axis regression
MODEL II
(www.fas.umontreal.ca/biol/legendre)
MODEL II REGRESSION METHODS (continued)
(1) Major axis regression (MA) is the first principal component of the scatter of points. This axis minimises the squared Euclidean distances between the points and the regression line, instead of the vertical distances as in OLS.
(2) Standard major axis regression (SMA) is a way to make the variables dimensionally homogeneous prior to regression:
i) standardise variables x and y (subtract mean, divide by standard deviation)
ii) compute MA regression on the standardised x and y
iii) back-transform the slope estimate to the original units by multiplying it by sy/sx, where sy and sx are the standard deviations of y and x.
MODEL II REGRESSION METHODS (continued)
(3) Ranged major axis regression (RMA). A disadvantage of SMA regression is that the standardisation makes the variances equal. In RMA, variables are made dimensionally homogeneous by ranging:
y'i = (yi – ymin) / (ymax – ymin)
i) transform variables x and y by ranging
ii) compute MA regression on the ranged y and x
iii) back-transform the slope estimate to the original units by multiplying it by the ratio of the ranges, (ymax – ymin) / (xmax – xmin)
(4) Ordinary least-squares regression (OLS) assumes no error on x. If the error on y >> the error on x, OLS can be used to estimate the slope parameter.
STATISTICAL TESTING FOR MODEL II REGRESSION
Confidence intervals – with all methods, confidence intervals are large when n is small. They become smaller as n reaches about 60, after which they change very slowly. Model II regression should ideally be used with data sets of 60 or more observations. Confidence intervals for the slope and intercept are possible for MA, SMA, RMA, and OLS.
Statistical significance of the slope – can be assessed by permutation tests for the slopes of MA, OLS, and RMA and for the correlation coefficient r. The slope in SMA cannot be tested by permutation, as the slope estimate is sy/sx and for all permuted data sy/sx is constant. All one can do is test the correlation rxy instead of testing bSMA.
General advice is to compute MA, RMA, SMA, and OLS and evaluate the results carefully in the light of the features of the data (magnitude of errors, distributions) and the purpose of the regression.
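A sketch using the lmodel2 package for R (Legendre's implementation of these methods, assumed installed from CRAN), which reports OLS, MA, SMA and RMA slopes with permutation tests:

```r
library(lmodel2)
set.seed(8)
length.mm <- rnorm(60, 100, 15)                      # both variables random,
mass.g    <- 0.02 * length.mm + rnorm(60, sd = 0.5)  # both measured with error
lmodel2(mass.g ~ length.mm,
        range.y = "interval", range.x = "interval",  # ranging needed for RMA
        nperm = 99)                                  # permutation tests
```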
Legendre & Legendre (1998) pp. 500-517; McArdle (1988) Can. J. Zool. 66, 2329-2339
COMPUTING SOFTWARE FOR REGRESSION ANALYSIS
Basic regression: MINITAB, SYSTAT, GENSTAT or GLIM, STATISTIX (SX), R or S-PLUS
Weighted average regression: C2
Model II regression: MODEL II