Transcript Slide 1
The importance of data management
Paul Lambert, 31st January 2012
Talk to the seminar ‘Data management in the social sciences and the contribution of the DAMES Node’, a session organised as part of the Data Management through e-Social Science ESRC research Node
www.dames.org.uk
DAMES, 31/JAN/2012, T1

Today’s session (2V1/2V3)

‘Data Management through e-Social Science’
DAMES – www.dames.org.uk
ESRC funded research Node
Funded 2008-11, with ongoing work into 2012 with the NeISS (www.neiss.org.uk) and ‘eStat’ (www.bristol.ac.uk/cmm/research/estat/) projects
Aim: Useful social science provisions
Specialist data topics – occupations; education qualifications; ethnicity; social care; health
Computer science research on secure data models; metadata and linking data; workflows
Programme of case studies and provisions

‘Data management’ means…
‘the tasks associated with linking related data resources, with coding and re-coding data in a consistent manner, and with accessing related data resources and combining them within the process of analysis’ […DAMES Node..]
Usually performed by social scientists themselves
Most overt in quantitative survey data analysis
• ‘variable constructions’, ‘data manipulations’
• navigating abundance of data – thousands of variables
Usually a substantial component of the work process
Here we differentiate from archiving / controlling data itself

Some components…
Manipulating data
  Recoding categories / ‘operationalising’ variables
Linking data
  Linking related data (e.g. longitudinal studies)
  Combining / enhancing data (e.g. linking micro- and macro-data)
Secure access to data
  Linking data with different levels of access permission
  Detailed access to micro-data cf. access restrictions
Harmonisation standards
  Approaches to linking ‘concepts’ and ‘measures’ (‘indicators’)
  Recommendations on particular ‘variable constructions’
Cleaning data
  ‘missing values’; implausible responses; extreme values

Example – recoding data
Count. Columns are educ4: -9.00; 1.00 Degree; 2.00 Diploma; 3.00 Higher school or vocational; 4.00 School level or below

Highest educational qualification  -9.00  1.00  2.00  3.00  4.00  Total
-9 Missing or wild                   323     0     0     0     0    323
-7 Proxy respondent                  982     0     0     0     0    982
1 Higher Degree                        0   425     0     0     0    425
2 First Degree                         0  1597     0     0     0   1597
3 Teaching QF                          0     0   340     0     0    340
4 Other Higher QF                      0     0  3434     0     0   3434
5 Nursing QF                           0     0   161     0     0    161
6 GCE A Levels                         0     0     0  1811     0   1811
7 GCE O Levels or Equiv                0     0     0     0  2518   2518
8 Commercial QF, No O Levels           0     0     0   331     0    331
9 CSE Grade 2-5, Scot Grade 4-5        0     0     0     0   421    421
10 Apprenticeship                      0     0     0   257     0    257
11 Other QF                          102     0     0     0     0    102
12 No QF                               0     0     0     0  2787   2787
13 Still At School No QF             138     0     0     0     0    138
Total                               1545  2022  3935  2399  5726  15627

Example - Linking data (on related adults in the BHPS)
Outcomes: Used health services in last year (Y=43%) | GHQ score
Models within each outcome, in order: indv, cp, hh, xhid

Female: 0.63 0.77 0.69 0.65 | 1.36 1.36 1.36 1.53
Age: 0.02 0.03 0.02 0.02 | 0.13 0.13 0.14 0.14
Age-squared(*100): -0.12 -0.13 -0.13 -0.13
Cohabiting: -0.58 -0.58 -0.54 -0.59
Ln(household inc.): -0.09 -0.14 -0.12 -0.11 | -0.63 -0.62 -0.63 -0.62
Constant: -0.65 -0.67 -0.59 -0.55 | 12.9 12.8 12.6 12.6
ICC L2% (VC): 0 6.3 8.8 7.9 | 0 22.9 15.8 7.8
Mean cluster size: 1 1.4 1.8 4.6 | 1 1.4 1.8 4.5
L2:sd(cons): 0.61 0.51 0.53 | 2.54 1.91 1.15
L2:sd(fem): 2.00 0.82 0.00 | 2.81 2.32 1.64
L1:sd(cons): 1.81 1.81 1.81 1.81 | 5.40 4.30 4.76 5.28
-Log-like (-40k): 9648 9625 9624 9632 | 3529 3383 3410 3512

‘The significance of data management for social survey research’
The data manipulations described above are a major component of the social survey research workload
Pre-release manipulations performed by distributors / archivists
• Coding measures into standard categories; dealing with missing records
Post-release manipulations performed by researchers
• Re-coding measures into simple categories
• All serious researchers perform extended post-release management (and have the scars to show for it)
We do have existing tools, facilities and expert experience to help us…but we don’t make a good job of using them efficiently or consistently
So the ‘significance’ of DM is about how much better research might be if we did things more effectively…

..being more effective probably involves..
Knowing about, using and citing previous standard measures/strategies
Effective documentation/dissemination of information on the approach used
Being proactive (not just relying on the most convenient measure to hand)
Trying a few alternatives – sensitivity analysis

‘Documentation’ (and its dissemination) is probably the key…
By documentation we mean the ‘paper trail’ (such as data & syntax files during secondary survey research)
For scientists, this is the log book / journal / laboratory notebook
For social sciences, there are few agreed standards
Effective documentation is possible, but requires some effort (e.g. Long, 2009)
Image of Alexander Graham Bell’s 1876 notebook, taken from: http://sandacom.wordpress.com/2010/03/11/the-face-rings-a-bell/

..good levels of documentation are not engrained in the social sciences!
“…Little or nothing is systematically archived from these electronic sources. How many of us routinely keep copies of our old wordprocessing files once they are no longer of current relevance for research or teaching activities. We have been reminded…of the insecurity and non-survival of departmental and professional files stored in broom cupboards, but how many electronic files even get into that cupboard in the first place?” (p142 of Scott, J. (2005) ‘Some principal concerns in the shaping of sociology’, in Halsey, A.H. and Runciman, W. (eds) British Sociology: Seen from without and within. London: British Academy)
...Yet, ‘documentation for replication’ is a reasonable expectation for a scientific model of research (e.g. Steuer, Dale, Freese)…
Steuer, M. (2003). The Scientific Study of Society. Boston: Kluwer Academic.
Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2), 143-158.
Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods & Research, 36(2), 153-71.

A bit of focus…
Most of the DAMES applications aim to facilitate one of two data management activities, their documentation, and the dissemination of that documentation:
1) Variable constructions
  o Coding and re-coding values
2) Linking datasets
  o Internal and external linkages

‘Documentation for replication’ supports replication of..
Your own analysis (in response to comments, revisions, requests for access)
Others’ analysis
  To build upon – cumulative science
  To critique / cross-examine
In secondary survey research
  Complex data is often updated (new related records; revised and re-released; re-weighted or re-standardised; new levels of access/linkage)
  New analysis feasible - variable operationalisations; new statistical methods
Most documentation requirements are achieved by effective use of software (‘syntax’ programming)
See our training workshops, www.dames.org.uk/workshops

Keep clear records of your DM activities!
Reproducible (for self)
Replicable (for all)
Paper trail for whole lifecycle
In survey research, this means using clearly annotated syntax files (e.g. SPSS/Stata)
Syntax Examples: www.dames.org.uk/workshops

We’ve written a guide for researchers...
‘Software Session 1: Documentation and workflows with popular software packages’ (www.dames.org.uk/workshops/stir10/docs_workflows_2010.html)
Dozens of sample command files in SPSS, Stata and R from DAMES Node workshops at www.dames.org.uk

For data distributors, the provision of systematic metadata is also beneficial
Example of DDI format metadata (see also talk 5)

NESSTAR

What more is needed for good data management?
1) Good standards in the operationalisation of variables
See yesterday’s workshop sessions (www.dames.org.uk)
Most options have already been studied!
Using GEODE/GEMDE/GEEDE to facilitate sensitivity analysis and comparisons of alternative plausible measures
• Collect documentation/metadata on specialist records
• Promote more effective measurement options e.g. effect proportional scaling; replication of measures used before; derivation of recommended standards

DAMES ‘GESDE’ tools: online services for data coordination/organisation
Tools for handling variables in social science data
Recoding measures; standardisation / harmonisation; Linking; Curating

[Figure: Predictors of ‘poor health’ in Sweden (comparison of different occupation-based measures, from DAMES, TP 2011-1). Series plotted: Increase in R-squared; Increase in BIC. Measures compared: ES2 E6 E3 G13 G10 G5 G2 R7 WR9 O8 MN I99 CF CF2 ISEI AWM WG2 GN1 ES5 E9 E5 E2 G11 G7 G3 K4 WR O17 O4 I9 CM CM2 CG SIOP WG1 WG3]

What more is needed for good data management?
2) Incentives/disincentives
Arguably, good data management is penalised at present (‘Don’t get it right, get it published’)
Few formalised requirements of documentation or data management activity (cf. metadata publishing standards such as DDI)
Citation rankings might incentivise here (citation of your do files..)
Prospects are probably rather bleak for good science..!!

Summary
The ‘significance’ of DM is about how much better research might be if we did things more effectively…
Can (try to) provide data oriented facilities supporting improved data management
May also need a cultural change in expectations…
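
Illustration: a documented recode
To make the advice on clearly annotated syntax files concrete, the educ4 recode from the recoding example can be sketched as a short script. This is a minimal sketch in Python, not the SPSS/Stata syntax used in the DAMES workshops. The names RECODE_EDUC4, recode_educ4 and tabulate are hypothetical, the category mapping is read off the cross-tabulation in the talk, and treating the unlabelled -9.00 category as ‘missing’ is an assumption.

```python
from collections import Counter

# educ4 categories as labelled in the talk's cross-tabulation.
# ASSUMPTION: -9.00 carries no label there; it is treated as missing here.
EDUC4_LABELS = {
    -9: "Missing (assumed label)",
    1: "Degree",
    2: "Diploma",
    3: "Higher school or vocational",
    4: "School level or below",
}

# Source code for highest educational qualification -> educ4 category.
# Each entry follows the single non-zero cell in that row of the table.
RECODE_EDUC4 = {
    -9: -9,  # Missing or wild
    -7: -9,  # Proxy respondent
    1: 1,    # Higher Degree
    2: 1,    # First Degree
    3: 2,    # Teaching QF
    4: 2,    # Other Higher QF
    5: 2,    # Nursing QF
    6: 3,    # GCE A Levels
    7: 4,    # GCE O Levels or Equiv
    8: 3,    # Commercial QF, No O Levels
    9: 4,    # CSE Grade 2-5, Scot Grade 4-5
    10: 3,   # Apprenticeship
    11: -9,  # Other QF
    12: 4,   # No QF
    13: -9,  # Still At School No QF
}


def recode_educ4(code):
    """Return the educ4 category for a source code.

    Codes not listed in RECODE_EDUC4 fall back to -9 (a choice made
    for this sketch, not something stated in the talk).
    """
    return RECODE_EDUC4.get(code, -9)


def tabulate(source_counts):
    """Sum source-code frequencies into educ4 category totals."""
    totals = Counter()
    for code, n in source_counts.items():
        totals[recode_educ4(code)] += n
    return dict(totals)


# Row totals from the cross-tabulation, used to check the recode.
SOURCE_COUNTS = {
    -9: 323, -7: 982, 1: 425, 2: 1597, 3: 340, 4: 3434, 5: 161,
    6: 1811, 7: 2518, 8: 331, 9: 421, 10: 257, 11: 102, 12: 2787, 13: 138,
}

EDUC4_TOTALS = tabulate(SOURCE_COUNTS)

if __name__ == "__main__":
    for category in sorted(EDUC4_TOTALS):
        print(category, EDUC4_LABELS[category], EDUC4_TOTALS[category])
```

Keeping the mapping as annotated data, and checking its totals against the published column totals (1545, 2022, 3935, 2399, 5726), is one way to leave the kind of paper trail the talk asks for: the recode can be re-run, cited, or swapped for an alternative coding in a sensitivity analysis.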