Vernon Gayle, Roxanne Connelly, Chris Playford
University of Edinburgh
There is a buzz around administrative data and many people seem to be using them. Are you missing out?
Administrative data are records and information that are gathered in order to organise, manage or deliver a service. Although they are not primarily collected for research, some administrative data resources contain information on individuals that has great potential for sociological research. These datasets are best described as ‘administrative social science datasets’.
It is becoming increasingly common for administrative data to be linked to existing large-scale social survey datasets, but historically social scientists have only had highly restricted access to administrative records. The ESRC have funded the Administrative Data Research Network (ADRN), which aims to open up appropriate access to a wealth of data that have been locked away in databases and files. The goal is to provide researchers with access to data from government departments and other agencies that routinely collect data relevant to social research.
The new ADRN will allow researchers to gain carefully supervised access to data to undertake studies that are ethical and feasible. A critical feature of administrative social science data is that people cannot be identified and data cannot be linked back to individuals. This ensures that nobody’s privacy is infringed. The bar for gaining access to administrative social science data is set high because a great deal of work is required to link data and to get de-identified data ready for researchers to analyse. The outcome however will be unparalleled new sources of social science data suitable for sociological research. These data will support detailed empirical analyses of social and economic life in contemporary Britain.
Examples of plausible sociological analyses that could be undertaken with administrative social science data are legion, but we list a few to illustrate the diversity of administrative social science data sources, and to prime the reader’s sociological imagination.
- Better understanding intergenerational income mobility with tax records from parents and their children later in adult life.
- Investigating flows into and out of child poverty with linked tax and benefits records.
- Investigating potential relationships between being in care in childhood and criminal careers in adulthood with linked data from social services and the criminal justice system.
- Exploring the relationship between weather and children’s behaviour with linked meteorological data and school exclusion data.
Currently there is a growing fervour to describe administrative social science datasets as ‘big data’. This is rather unhelpful since the term ‘big data’ has been sloppily deployed to describe data as diverse as outputs from the Large Hadron Collider, scraped Twitter feeds, Facebook status updates, transactional data from supermarkets, and sensor and tracking information from mobile phones and GPS devices.
There is a risk in forcing administrative social science data under the ‘big data’ umbrella, and it should be avoided. There is an emerging idea that ‘big data’ heralds the end of the requirement for comprehensive statistical analyses because the sheer volume of data means that simple correlations are sufficient. The suggestion is that knowing ‘what’ but not ‘why’ will be more important in the ‘big data’ era. The recent poor predictions from Google Flu Trends are probably the most striking cautionary tale as to why reliance on simple correlations should be avoided. We could also point to any one of the humorous spurious correlations that are usually used when teaching sociology undergraduates (our favourite is the correlation between storks and fertility).
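The storks-and-fertility point can be demonstrated in a few lines. This is a minimal sketch with entirely synthetic data of our own construction (the variable names and effect sizes are illustrative, not drawn from any real study): two variables with no direct link will correlate strongly whenever both are driven by a shared third factor.

```python
import numpy as np

# Synthetic illustration: a shared "rurality" factor drives both variables.
rng = np.random.default_rng(42)
rurality = rng.normal(size=10_000)

# Stork sightings and birth rates both rise with rurality, not with each other.
storks = 2.0 * rurality + rng.normal(scale=0.5, size=10_000)
births = 1.5 * rurality + rng.normal(scale=0.5, size=10_000)

# A naive analyst sees a strong correlation and infers a causal link.
r = np.corrcoef(storks, births)[0, 1]
print(f"correlation between storks and births: {r:.2f}")
```

The correlation here is around 0.9 despite storks having no effect on births whatsoever, which is precisely why ‘what’ without ‘why’ is a poor basis for sociological claims.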
The majority of sociological studies using administrative social science datasets will examine a variable-by-case matrix very similar to those provided by large-scale social surveys. Nothing convinces us that, when analysing a variable-by-case matrix of administrative social science data, we can ignore the helpful lessons that have emerged from decades of sociology, statistics and econometrics. For example, if an administrative dataset has repeated measurements on the same individuals, the usual problems associated with non-independence of observations or the possibility of residual heterogeneity will not vanish simply because of the size of the dataset, or because the data are from an administrative source rather than a social survey. Anyone who suggests that administrative social science datasets (or ‘big data’ more generally) have special immunity will have to work hard to persuade those of us who are knowledgeable and experienced social science data analysts.
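The non-independence problem is easy to make concrete by simulation. The sketch below uses synthetic data of our own construction (the sample sizes and variance components are arbitrary choices): repeated measurements on the same individuals share a stable person-level component, so the naive standard error, which treats every observation as independent, understates how much an estimate actually varies from sample to sample.

```python
import numpy as np

# Synthetic repeated-measures data: 200 people each observed in 5 waves.
rng = np.random.default_rng(0)
n_people, n_waves, n_reps = 200, 5, 2000

naive_ses, means = [], []
for _ in range(n_reps):
    person_effect = rng.normal(scale=1.0, size=n_people)     # stable trait per person
    noise = rng.normal(scale=1.0, size=(n_people, n_waves))  # wave-level noise
    y = person_effect[:, None] + noise                       # 200 x 5 observations
    means.append(y.mean())
    # Naive SE treats all 1000 observations as independent.
    naive_ses.append(y.std(ddof=1) / np.sqrt(y.size))

true_se = np.std(means)        # how the sample mean actually varies across datasets
naive_se = np.mean(naive_ses)  # what the independence assumption reports
print(f"naive SE: {naive_se:.4f}, actual sampling SD of the mean: {true_se:.4f}")
```

The actual sampling variability is markedly larger than the naive standard error suggests, and no increase in the number of waves per person will close that gap: this is exactly why clustering must be modelled rather than assumed away, however large the dataset.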
The most notable positive feature of administrative social science datasets is that they are usually large in scale (although they are seldom as large as the ‘big datasets’ that are produced in areas such as particle physics). In many instances they will provide many more cases than existing large social surveys. Because the data are collected for administrative or organisational purposes, measures may be more accurate than if they had been collected from a participant in a study. For example, we can envisage that personal income information within an HMRC dataset might be more accurate than information collected in a semi-structured interview.
In some cases administrative datasets will not be a sample but will be an entire population. This is sometimes referred to as n=all. Therefore some of the well-known issues relating to representativeness (including sample selection bias, unit non-response and sample attrition) will be less prevalent.
The positive features of administrative social science datasets are appealing, however like every other source of data relevant to sociological research there are also notable negative features. At a practical level it may simply not be possible to link together some sources of administrative data because there is not a suitable unique identifier available on each dataset to act as a key.
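The role of the unique identifier can be sketched with a toy join. All records, field names and identifiers below are hypothetical inventions for illustration; they do not correspond to any real administrative schema.

```python
# Hypothetical records from two administrative sources, keyed on a made-up
# "person_id". Linkage is only possible where the key appears in both.
tax_records = [
    {"person_id": "A01", "income": 24_000},
    {"person_id": "A02", "income": 31_500},
]
benefit_records = [
    {"person_id": "A02", "benefit": "housing"},
    {"person_id": "A03", "benefit": "child"},  # no matching tax record
]

# Index one source by the key, then join the other against it.
tax_by_id = {r["person_id"]: r for r in tax_records}
linked = [
    {**tax_by_id[b["person_id"]], **b}
    for b in benefit_records
    if b["person_id"] in tax_by_id  # records without a shared key drop out
]
print(linked)
```

Only the record for "A02" survives the join: "A01" has no benefit record and "A03" has no tax record. Without a shared, reliable key, exact linkage of this kind is simply impossible, whatever the size of the two datasets.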
In comparison with large-scale social surveys, especially omnibus surveys such as Understanding Society (the UK Household Longitudinal Study), which is specifically designed to support multi-disciplinary secondary data analyses, the number of available explanatory variables in most administrative datasets will be extremely limited. For example, variables measuring a person’s ethnicity or level of education, which are implicated in many sociological analyses, might be completely absent from a dataset because they are administratively irrelevant.
Much of the information in large-scale social surveys is sociologically informed, and specialised measures and variables are collected. Such sociologically informed measures might not be available in administrative social science datasets; some administrative datasets will contain suitable proxy measures, but in others the available proxies may be less suitable. For example, in an education-related dataset the measure of ‘eligibility for free school meals’ might seem like a suitable proxy for household disadvantage. ‘Eligibility for free school meals’ will, however, perform relatively poorly when compared with a sociologically informed measure such as the National Statistics Socio-Economic Classification in an analysis of school examination performance.
It is often assumed that administrative data will naturally be of high quality in terms of both validity and reliability. Sociologists have a long track record of being reflexive and concerned about the research value of data, and similar thought must be given to the quality of measures within administrative datasets. The inaccuracies that individuals detect, and the errors that emerge, in transactions with benefits agencies, tax authorities, transport agencies, the National Health Service and local authorities, coupled with the errors, miscalculations and inaccuracies that occur in transactions with service providers such as banks, credit card companies, utility companies, and delivery and transport providers, all hang a reasonable question mark over the quality of some administrative data for sociological research.
In some instances an administrative dataset will be an entire population, but this cannot be assumed. Individuals may be missing from a dataset and will be hard to detect, and these missing individuals are likely to be a special subset of the population. There may also be missing information on some measures within the datasets. This ‘missingness’ might be unimportant; however, the narrow range of other explanatory variables in most administrative datasets will greatly restrict the scope for applying formal statistical methods for missing data.
The ADRN will take on much of the burden of negotiating and securing access to data; however, some sources of data may still remain inaccessible. Gaining access to specific datasets held by some organisations can be prohibitively slow, and therefore disadvantageous for shorter projects. Many datasets will only be available to be analysed in approved ‘safe settings’, and this places an extra burden on the data analyst. Because of the sensitivity of the data, some data providers will place controls on the output that sociologists produce, and typically extra time will be required for researchers to get results cleared for presentations and publications.
Training in the standard range of statistically informed multivariate data analysis techniques required for analyses of large-scale social surveys is also required for analysing administrative datasets. We contend that, given the scope and the restrictions of administrative data, some of the most fruitful sociological enterprises are likely to involve analyses of administrative data that have been linked to existing, well-designed large-scale social science studies. The ADRN is ground-breaking and offers support for intellectually exciting opportunities for sociological research using administrative data.
Don’t swipe left on administrative data.