Transcript Document
A fuzzy clustering approach to improve the accuracy of Italian students’ data
An experimental procedure to correct the impact of the outliers on assessment test scores
Claudio Quintano, Rosalia Castellano, Sergio Longobardi
UNIVERSITY OF NAPLES “PARTHENOPE”

OUTLINE
This work considers the data on students’ performance assessments collected by the Italian National Evaluation Institute of the Ministry of Education (INVALSI).

THE INVALSI SURVEY
3 AREAS: reading, mathematics and science
5 SCHOOL LEVELS
- 2nd and 4th year of primary school
- 1st year of lower secondary
- 1st and 3rd year of upper secondary
• OUTLIER UNITS, at class level, which lead to biased distributions of the average scores by class
• The AIM is to MITIGATE THE PRESENCE of outliers and to correct the overestimation of children’s ability

DISTRIBUTIONS OF MEAN SCORES AT CLASS LEVEL (MATHEMATICS ASSESSMENT)
MATHEMATICS CLASS MEAN SCORE - S.Y. 2004/05
[Figure: five histograms of class mean scores (x-axis: class mean score, 0 to 100; y-axis: frequency), one per school level: II class primary school, IV class primary school, I class lower secondary school, I class upper secondary school, III class upper secondary school. Legend for II class primary school: Mean = 74,71; Std. Dev. = 14,133; N = 30.097. Legend values of the other four panels, in extraction order and with uncertain panel assignment: Mean = 59,57; 52,21; 71,65; 51,24. Std. Dev. = 10,382; 15,229; 16,15; 14,451. N = 27.437; 9.280; 8.454; 29.559.]

CLASS MEAN SCORE
II CLASS - PRIMARY SCHOOL
[Figure: six histograms of class mean scores (x-axis: class mean score, 0 to 100; y-axis: frequency): Reading, Mathematics and Science for s.y. 2004/05 and for s.y. 2005/06. Legend for Mathematics s.y. 2004/05: Mean = 74,71; Std. Dev. = 14,133; N = 30.097. Other legends, in extraction order and with uncertain panel assignment: Mean = 87,81, Std. Dev. = 8,14, N = 30.031; Mean = 77,72, Std. Dev. = 13,029, N = 29.802; Mean = 81,5, Std. Dev. = 11,439, N = 29.816.]

STEP I
Deletion of the micro units (students) considered as “PSEUDO NON RESPONDENTS”: students who have not given the minimum number of answers needed to compute a performance score.
The share of these units varies from 9% to 16%.

COMPUTATION OF CLASS LEVEL INDICATORS
In the first step the micro units considered as “pseudo non respondents” are dropped from the dataset; then the following indexes are computed for each student class:
• Class mean score: $\bar{p}_j = \frac{1}{N_j} \sum_{i=1}^{N_j} p_{ij}$
• Standard deviation of mean score: $\sigma_j = \sqrt{\frac{1}{N_j} \sum_{i=1}^{N_j} (p_{ij} - \bar{p}_j)^2}$
• Class non response rate: $MC_j = \frac{\sum_{i=1}^{N_j} M_{ij}}{N_j Q_j}$
• Index of answers’ homogeneity: $\bar{E}_j = \frac{1}{Q} \sum_{s=1}^{Q} E_{sj}$
where:
- $p_{ij}$ is the score of the ith student of the jth class
- $N_j$ is the number of respondent students of the jth class
- $M_{ij}$ is the number of both item non responses and invalid responses for the ith student of the jth class
- $Q_j$ is the number of items administered to the jth class
- $E_{sj}$ is the Gini measure of heterogeneity computed for each sth test question administered to each student of the jth class

PRINCIPAL COMPONENT ANALYSIS (PCA)
Through the PCA we are able to describe the answer behaviour of each student class by means of two variables.
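The four class-level indicators that feed the PCA can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names, the toy data and the use of `None` for a missing or invalid answer are hypothetical, the standard deviation divides by N_j (the slides do not say which form is used), and the Gini measure follows the definition E_sj = 1 - sum_t (n_t / N_j)^2 given in the slides.

```python
from collections import Counter
from math import sqrt


def gini_heterogeneity(answers, n_students):
    """E_sj = 1 - sum_t (n_t / N_j)^2 for one question.

    `None` marks a missing or invalid answer (an assumption of this sketch).
    """
    counts = Counter(a for a in answers if a is not None)
    return 1.0 - sum((n_t / n_students) ** 2 for n_t in counts.values())


def class_indicators(scores, answers):
    """Compute the four indicators for one class j.

    scores  -- p_ij, one score per respondent student
    answers -- one row per student, one chosen answer per administered item
    """
    n_students = len(scores)          # N_j
    n_items = len(answers[0])         # Q_j
    mean_score = sum(scores) / n_students
    # Population form (divide by N_j): an assumption, see the lead-in.
    std_score = sqrt(sum((p - mean_score) ** 2 for p in scores) / n_students)
    missing = sum(row.count(None) for row in answers)
    non_response_rate = missing / (n_students * n_items)
    homogeneity = sum(
        gini_heterogeneity([row[s] for row in answers], n_students)
        for s in range(n_items)
    ) / n_items
    return {
        "mean_score": mean_score,
        "std_score": std_score,
        "non_response_rate": non_response_rate,
        "homogeneity": homogeneity,
    }


# Hypothetical class where every student gives the same answers and top score.
suspicious = class_indicators([100.0] * 4, [["A", "C", "B"]] * 4)

# Hypothetical class with varied answers and one missing response.
ordinary = class_indicators(
    [50.0, 75.0, 100.0, 75.0],
    [
        ["A", "C", "B"],
        ["B", "C", None],
        ["A", "D", "B"],
        ["C", "C", "B"],
    ],
)

print(suspicious)
print(ordinary)
```

The first toy class shows the profile the procedure is looking for: a very high mean score together with zero variability in both scores and answers.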
[Figure: plane of the first two principal components. FIRST component: OUTLIERS IDENTIFICATION AXIS. SECOND component (class non response rate): INDEX OF CLASS COLLABORATION TO SURVEY.]

PRINCIPAL COMPONENT ANALYSIS (PCA)
The outlier classes of students can be detected graphically.
[Figure: projection of the second-year primary school classes on the plane of the first two factorial axes, with the OUTLIER CLASSES highlighted.]

THE FUZZY K-MEANS APPROACH
On the basis of the two factorial dimensions, the students’ classes are classified into 8 clusters by a FUZZY K-MEANS algorithm.
The algorithm computes a fuzzy partition matrix, which gives for each students’ class (rows of the matrix) the degree of belonging to each cluster (columns of the matrix).

DETECTION OF OUTLIERS
[Figure: projection of the centroids computed by fuzzy k-means, with the OUTLIER CLUSTER highlighted.]
The outlier cluster is characterised by:
• High negative scores on the “outliers identification axis” (x-axis), which indicate high class average scores and minimum within-class variability of both scores and test answers
• Factorial scores close to zero on the “index of class collaboration to survey”

DETECTION OF OUTLIERS
Denoting the outlier cluster by “a”, the degree of belonging of the jth class to this cluster is µja.
This measure is considered as the “outlier probability” of the jth class. Alternatively, it can be interpreted as the “outlier level” of each class.

CORRECTION PROCEDURE
On the basis of the degree of belonging to the outlier cluster, a weighting factor is developed:
Wj = 1 - µja (weighting factor = 1 - outlier probability)
Wj varies from 0 to 1.
A students’ class with a high probability of belonging to the outlier cluster will have a low weight, while a class very far from this cluster will have a weight close to 1.

EFFECTS OF THE CORRECTION PROCEDURE
[Figure: ORIGINAL DISTRIBUTION compared with ADJUSTED DISTRIBUTION.]

THE INSPIRATION PRINCIPLE
Go beyond the dichotomous logic OUTLIER / NOT OUTLIER.
FUZZY APPROACH: compute an “OUTLIER LEVEL” measure for each unit to calibrate the correction.

RELATIONSHIP BETWEEN THE SCHOOL LOCATION AND THE PRESENCE OF OUTLIER CLASSES
[Figure: box plot of the outlier level µja, the degree of belonging to the outlier cluster (cluster n. 2).]
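The fuzzy partition matrix and the weighting factor Wj = 1 - µja from the correction procedure can be sketched as follows. This is a minimal, hedged illustration and not the authors' implementation: it uses the standard fuzzy c-means update rules with fuzzifier m = 2, only 2 clusters instead of the 8 used in the slides, hypothetical factorial scores, hand-picked starting centroids, and a simplified rule that takes the cluster with the most negative first-axis centroid as the outlier cluster.

```python
from math import dist


def memberships(points, centroids, m=2.0):
    """Fuzzy partition matrix: row j holds the degrees of belonging of class j."""
    power = 2.0 / (m - 1.0)
    matrix = []
    for p in points:
        d = [dist(p, c) for c in centroids]
        if 0.0 in d:
            # The point coincides with a centroid: assign crisp membership.
            hits = [1.0 if x == 0.0 else 0.0 for x in d]
            matrix.append([h / sum(hits) for h in hits])
        else:
            inv = [x ** -power for x in d]
            matrix.append([v / sum(inv) for v in inv])
    return matrix


def update_centroids(points, u, m=2.0):
    """Centroids as means of the points weighted by membership ** m."""
    dims = range(len(points[0]))
    centroids = []
    for c in range(len(u[0])):
        w = [row[c] ** m for row in u]
        total = sum(w)
        centroids.append(
            tuple(sum(wi * p[k] for wi, p in zip(w, points)) / total for k in dims)
        )
    return centroids


def fuzzy_c_means(points, init_centroids, m=2.0, iterations=50):
    """Alternate membership and centroid updates for a fixed number of rounds."""
    centroids = list(init_centroids)
    for _ in range(iterations):
        u = memberships(points, centroids, m)
        centroids = update_centroids(points, u, m)
    return memberships(points, centroids, m), centroids


# Hypothetical factorial scores: (outliers identification axis, collaboration index).
classes = [(-4.0, 0.1), (-3.8, -0.1), (0.5, 0.3), (0.7, -0.2), (0.4, 1.5), (-1.7, 0.2)]
u, centroids = fuzzy_c_means(classes, [(-4.0, 0.0), (0.5, 0.0)])

# Simplified choice of the outlier cluster "a": most negative first-axis centroid.
a = min(range(len(centroids)), key=lambda c: centroids[c][0])
weights = [1.0 - row[a] for row in u]   # W_j = 1 - mu_ja

for point, w in zip(classes, weights):
    print(point, round(w, 3))
```

Classes lying close to the outlier centroid receive a weight near 0, classes far from it a weight near 1, and classes in between are only partially down-weighted, which is the point of going beyond the outlier / not outlier dichotomy.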
RELATIONSHIP BETWEEN THE SCHOOL LOCATION AND THE PRESENCE OF OUTLIER CLASSES
[Figure: class average score distributions only for the Northern and Central regions.]

REGIONAL SCORES
[Figure: two bar charts of regional mean scores for II class primary school, mathematics, s.y. 2004/05 (y-axis from 50 to 86): NOT WEIGHTED AVERAGE (mean) and WEIGHTED AVERAGE (weighted mean). Regions shown: Sicilia, Sardegna, Puglia, Basilicata, Calabria, Lazio, Abruzzo, Molise, Campania, Valle D'Aosta, Piemonte, Lombardia, Trentino Alto Adige, Friuli Venezia Giulia, Veneto, Liguria, Emilia Romagna, Toscana, Umbria, Marche.]

INDEX OF ANSWERS’ HOMOGENEITY
$\bar{E}_j = \frac{1}{Q} \sum_{s=1}^{Q} E_{sj}$
The index is the mean of the Q Gini indexes ($E_{sj}$) computed for each sth test question administered to each student of the jth class, where $E_{sj}$ is a Gini measure of heterogeneity:
$E_{sj} = 1 - \sum_{t=1}^{h} \left( \frac{n_t}{N_j} \right)^2$
Here $n_t / N_j$ denotes the proportion of students of the jth class who gave the tth answer to the sth question.
The Gini measure is equal to zero when all students of the jth class have given the same answer to the sth question. It reaches its maximum value, $(h-1)/h$ (where h is the number of alternative answers to the sth question), when there is perfect heterogeneity of answers to the sth question in the jth class.

EFFECTS OF THE CORRECTION PROCEDURE
                 Original distribution   Adjusted distribution
MEAN             74,71                   71,67
MODE             100,00                  68,75
I QUARTILE       64,42                   63,12
MEDIAN           73,61                   71,09
III QUARTILE     85,94                   80,69
KURTOSIS
SKEWNESS
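How a set of weights Wj can shift the summary statistics of the class mean scores, as in the comparison of original and adjusted distributions, can be illustrated with a short Python sketch. It rests on assumptions the slides do not confirm: the weights are applied directly to the class mean scores, the weighted quantile uses the "smallest value whose cumulative weight share reaches q" convention (one of several in use), and all scores, weights and function names are hypothetical.

```python
def weighted_mean(values, weights):
    """Mean of the values, each counted according to its weight."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def weighted_quantile(values, weights, q):
    """Smallest value whose cumulative weight share reaches q (0 < q <= 1)."""
    pairs = sorted(zip(values, weights))
    total = sum(weights)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= q * total:
            return value
    return pairs[-1][0]


# Hypothetical class mean scores; the two classes scoring 100 are suspicious.
class_means = [100.0, 100.0, 80.0, 70.0, 60.0]
# Hypothetical weights W_j = 1 - mu_ja from the fuzzy partition matrix.
w = [0.0, 0.5, 1.0, 1.0, 1.0]
equal = [1.0] * len(class_means)

original_mean = weighted_mean(class_means, equal)
adjusted_mean = weighted_mean(class_means, w)
original_median = weighted_quantile(class_means, equal, 0.5)
adjusted_median = weighted_quantile(class_means, w, 0.5)

print(original_mean, adjusted_mean)
print(original_median, adjusted_median)
```

In this toy example the down-weighting of the two top-scoring classes lowers both the mean and the median, the same direction of change reported in the table above.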