
Research with our trousers down: Publishing sociological research as a Jupyter Notebook

Roxanne Connelly, University of Warwick

Vernon Gayle, University of Edinburgh

On the 29th of May 2017 the University of Edinburgh hosted the ‘Social Science Gold Rush Jupyter Hackathon’. This event brought together social scientists and computer scientists with the aim of developing our research and data handling practices to promote increased transparency and reproducibility in our work. At this event we contemplated whether it might ever be possible to publish a complete piece of sociological work, in a mainstream sociology journal, in the form of a Jupyter Notebook. This November, six months after our initial idea, we are pleased to report that the paper ‘An investigation of social class inequalities in general cognitive ability in two British birth cohorts’ was accepted by the British Journal of Sociology, accompanied by a Jupyter Notebook which documents the entire research process.

Jupyter Notebooks allow anyone to interactively reproduce a piece of research. They are already used effectively in ‘big science’; for example, the Nobel Prize winning LIGO project makes its research available as Jupyter Notebooks. Providing statistical code (e.g. Stata or R code) with journal outputs would be a major step forward in sociological research practice. Jupyter Notebooks take this a step further by providing a fully interactive environment. Once a researcher has downloaded the required data from the UK Data Archive, they can rerun all of our analyses on their own machine. Jupyter Notebooks also encourage literate programming, clearly documenting the research process for humans and not just computers, which greatly facilitates the future use of the code by other researchers.

When presenting the results of social science data analyses in standard journal articles we are painfully confined by word limits, and are unable to describe all of the steps we have taken in preparing and analysing complex datasets. Hundreds of research decisions are made in the process of analysing existing data, particularly when using complex longitudinal datasets. We make decisions on which variables to use, how to code and operationalise them, which cases to include in an analysis, how to deal with missing data, and how to estimate models. However, only a brief overview of the research process and how analyses have been conducted can be presented in a final journal article.

There is currently a replication crisis in the social sciences, where researchers are unable to reproduce the results of previous studies. One reason for this is that social scientists generally do not prepare and share detailed audit trails of their work, which would make all of the details of their research available to others. Researchers currently tend to place little emphasis on undertaking their research in a manner that would allow other researchers to repeat it, and approaches to sharing details of the research process are ad hoc (e.g. on personal websites) and rarely used. This is particularly frustrating for users of infrastructural data resources (e.g. the UK’s large-scale longitudinal datasets provided by the UK Data Service), as these data can be downloaded and used by any bona fide researcher. It should therefore be straightforward, and commonplace, for us to duplicate and replicate research using these data, but sadly it is not. We see the possibility of a future of social science research in which we can access full information about a piece of research, and duplicate or replicate it, ultimately developing research more efficiently and effectively to the benefit of knowledge and society.

The replication crisis is also accompanied by concerns about scientific malpractice. It is our observation that P-hacking is a common feature of social science research in the UK; this is not a statistical problem but a problem of scientific conduct. Human error is a further possible source of inaccuracy in our research outputs, as much quantitative sociological research is carried out by single researchers in isolation. Whilst co-authors may carefully examine outputs produced by colleagues and students, it is still relatively rare for anyone to ask to examine the code. In developing our Jupyter Notebook we borrowed two techniques from software development, ‘pair programming’ and ‘code peer review’. Each of us repeated the research process independently using a different computer and software set-up. This was a laborious process, but labour well spent in order to develop robust social science research, and it made apparent several problems which would otherwise have been overlooked. At one point we were repeating our analysis whilst sharing the results over Skype, and frustratingly the models estimated in Edinburgh contained 7 fewer cases than the models estimated in Coventry. After many hours of investigation we discovered that we had used different versions [1] of the same dataset, downloaded from the UK Data Archive, which contained slightly different numbers of cases.
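A simple guard against this kind of version ambiguity is to record a fingerprint of each downloaded file alongside the analysis. The short Python sketch below, of the sort that could sit in a notebook cell, is a minimal illustration of the idea; the file name shown is hypothetical.

import datetime
import hashlib
from pathlib import Path

def fingerprint(path):
    """Return a checksum and timestamp for a downloaded data file,
    so the exact version used can be documented in the notebook."""
    data = Path(path).read_bytes()
    return {
        "file": str(path),
        "sha256": hashlib.sha256(data).hexdigest(),
        "recorded": datetime.datetime.now().isoformat(timespec="seconds"),
    }

# Hypothetical usage with an illustrative file name:
# print(fingerprint("bcs70_sweep2.dta"))

Printing the fingerprint into the notebook output means that any future discrepancy in case numbers can be traced immediately to a differing checksum.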

We describe this work as ‘research with our trousers down’ [2], as publishing our full research process leaves us open to criticism. We have already faced detailed questions from reviewers which would not have arisen had they not had access to the full research code. It is also possible that other researchers will find problems with our code, or question the decisions we have made. But criticism is part of the scientific process, and we should be placing ourselves in a position where our research can be tested and developed. British sociology lags behind several disciplines, such as politics and psychology, in the drive to improve transparency and reproducibility. As far as we are aware, no sociology journal requires researchers to provide their code in order to publish their work. It is most likely only a top-down change from journals, funding bodies or data providers that would develop the practices within our discipline. Whilst British sociologists are not yet talking about the ‘reproducibility crisis’ with the same concern as psychologists and political scientists, we have no doubt that increased transparency will bring great benefits to our discipline.

[1] This problem is additionally frustrating because the UK Data Service does not currently have an obvious version control protocol, and does not routinely make available sufficient metadata for users to identify precise versions of files and variables. We have therefore recorded the date and time at which datasets were downloaded and documented this in our Jupyter Notebook. The adoption of a clear and consistent version control protocol by the UK Data Service would doubtless be of great benefit to the research community, as it would accurately locate data within the audit trail.
[2] We thank our friend Professor Robin Samuel for this apposite term.

The Determinants of Charity Misconduct

Diarmuid McDonnell & Alasdair Rutherford, 2017

As Corrado “Junior” Soprano, plagiarising a Chinese curse of dubious provenance, puts it: may you live in interesting times. Charities in the UK have been the subject of intense media, political and public scrutiny in recent years, resulting in three parliamentary inquiries. Public confidence and trust in the sector have been questioned in light of various “scandals”, including unethical fundraising practices (resulting in the establishment of a new fundraising regulator for England and Wales in 2016), high levels of chief executive pay, politically-motivated lobbying and advocacy work, and poor financial management. Using novel data supplied by the Office of the Scottish Charity Regulator (OSCR), my colleague Dr Alasdair Rutherford and I describe the nature and extent of alleged and actual misconduct by Scottish charities, and ask which organizational and financial factors are associated with these outcomes.

Background

First, some background on what we mean when we say “charity”. The Scottish Charity Register is maintained by OSCR, which was established in 2003 as an Executive Agency and took up its full powers when the Charities and Trustee Investment (Scotland) Act 2005 came into force in April 2006. In Scotland, a charity is defined (under statute) as an organization that is listed on the Register after demonstrating that it passes the charity test: it must have only charitable purposes; it must provide, or intend to provide, some form of public benefit; it must not allow its assets to be used for non-charitable purposes; it cannot be governed or directed by government ministers; and it cannot be a political party. One of OSCR’s main responsibilities is to identify and investigate apparent misconduct and protect charity assets. It operationalises this duty by opening an investigation (what it terms an inquiry) into the actions of a charity suspected of misconduct and other misdemeanours.

Investigations are mainly initiated as a result of a public complaint, but they can also be opened by a referral from a department within OSCR or from another regulator. For example, one of the founders of the charity The Kiltwalk reported the organization to OSCR on the grounds that he had concerns over the amount of the funds raised by the organization that was being spent on meeting the needs of beneficiaries. OSCR can only deal with concerns that relate to charity law – such as damage to charitable assets or beneficiaries, misconduct or misrepresentation – though it can refer cases to other bodies, such as when criminal activity is suspected. Finally, an outcome is recorded for each investigation. Outcomes are varied and often specific to each investigation, but most can be related to three common categories: no action taken or necessary; advice given; and regulatory intervention.

Method

This study examines two dimensions of charity misconduct that deserve greater attention: regulatory investigation and subsequent action. Regulatory action can take two broad forms: the provision of advice (e.g. recommending that a charity improve its financial controls to counteract the threat of fraud or misappropriation) and the use of OSCR’s formal regulatory powers (e.g. reporting the charity to prosecutors or suspending trustees). This study overcomes many of the limitations outlined previously by utilising a novel administrative dataset, derived from OSCR, covering the complete population (current and historical) of registered Scottish charities. It is constructed from three sources: the Scottish Charity Register, which is the official, public record of all charities that have operated in Scotland; annual returns, which are used to populate many of the fields on the Register (e.g. annual gross income); and internal OSCR departmental data relating to misconduct investigations. Once linked using each observation’s Scottish Charity Number, the dataset contains 25,611 observations over the period 2006–2014.

The outcome of being investigated by the regulator is measured using a dichotomous variable that takes the value 1 if a charity has been investigated and 0 if not. The other two dependent variables are also dichotomous: regulatory action takes the value 1 if a charity has had regulatory action taken against it and 0 if not; and intervention takes the value 1 if a charity is subject to regulatory intervention and 0 if not (i.e. it received advice instead). We model the probability of investigation using binary logistic regression, as a function of organization size, age, institutional form, field of operations and geographical base. For the sub-sample of organizations that were investigated, we then model the probability of regulatory action, and of its different forms, being taken, based on the same characteristics plus the source of the complaint made.
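For readers unfamiliar with this set-up, the sketch below shows the general shape of such a model in Python’s statsmodels, fitted to toy data; the variable names and categories are illustrative, not the actual fields in the linked OSCR dataset.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data standing in for the linked OSCR dataset; the variable
# names are illustrative, not the actual fields.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "investigated": rng.integers(0, 2, n),
    "log_age": np.log(rng.integers(1, 50, n)),
    "size_band": rng.choice(["small", "medium", "large"], n),
})

# Binary logistic regression of the dichotomous investigation outcome.
model = smf.logit("investigated ~ log_age + C(size_band)", data=df).fit()
print(model.summary())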

Describing Investigations and Regulatory Action

There have been 2,109 regulatory investigations of 1,566 Scottish charities over the study period: this represents six percent of the total number of organizations active during this time. The number of investigations increased steadily during OSCR’s early years and then plateaued at around 400 per year until 2013/14, when the figure declined slightly. The majority of investigations (78 percent) concerned charities that were investigated only once in their history. A little over 30 percent of investigations resulted in regulatory action being taken against a charity: 16 percent received advice and 13 percent experienced intervention by OSCR.

It is a member of the public that is most likely to contact OSCR with a concern about a charity. Internal stakeholders of the charity account for 31 percent of all investigation initiators, though this disregards the strong possibility that many of those recorded as anonymous are involved in the running of the charity they have a concern about. The concerns that prompt these actors to raise a complaint with OSCR are numerous and diverse. Figure 1 below visualizes the associations between the most common types of complaint and the response of the regulator. The overriding concern is general governance, as well as associated issues such as the duties of trustees and adherence to the founding document. Financial misconduct also ranks highly, particularly the misappropriation of funds and suspicion of financial irregularity.

Figure 1. Association between type of complaint and regulator response


Note: Each complaint can have two types, and maps to one of the regulatory responses. The fifteen most common complaint types are shown. The thickness of the line is proportional to the number of complaints leading to each regulatory response.

Modelling the Risk of Investigation and Action

In Table 1, we report odds ratios (exponentiated coefficients) rather than log odds, as they approximate the relative risk of each outcome occurring. This is appropriate not only for ease of interpretation but because the absolute chance of either outcome occurring is low (i.e. it is more informative to know which charities are at greater risk relative to their peers). The category with the most observations is chosen as the base category for each nominal independent variable.
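Continuing the toy sketch above, the conversion from log odds to odds ratios is a one-line exponentiation, and the base category of a nominal predictor can be fixed explicitly in the model formula (the reference level named below is an illustrative choice):

import numpy as np

# Odds ratios are the exponentiated coefficients, with confidence
# intervals transformed the same way.
print(np.exp(model.params))
print(np.exp(model.conf_int()))

# The base category of a nominal predictor can be set explicitly in
# the formula, e.g. C(size_band, Treatment(reference="small")) makes
# "small" (an illustrative choice) the reference level.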

Table 1. Results of Logistic Regression on dependent variables


We first examine the effects of organization age and size on the outcomes. The coefficient for age varies across the three outcomes: a one-unit increase in the log of age results in a five percent decrease in the odds of being investigated or being subject to regulatory action; however, the odds of experiencing intervention compared to receiving advice are higher for older charities. There appears to be a clear income gradient in the investigation model: as organization size increases, so do the odds of being investigated compared to the reference category. With regard to the actor that initiates an investigation, it appears that stakeholders with a monitoring role (e.g. funders, auditors or other regulators) are more likely than members of the public to report concerns that warrant some form of regulatory action; in contrast, internal charity stakeholders such as employees and volunteers have higher odds of identifying concerns that merit the provision of advice by OSCR and lower odds of triggering regulatory intervention in their charity. While size predicts complaints, the source of the complaint is the more reliable predictor of the need for regulators to take action.

A more nuanced examination of the effect of organization size is possible by comparing categories of this variable to each other, and not just to the base category (shown in Figure 2). Drawing on suggestions by Firth (2003), Firth and Menezes (2004), and Gayle and Lambert (2007), we employ quasi-variance standard errors to ascertain whether categories of organization size are significantly different from each other. Unsurprisingly, the largest charities have significantly higher odds than all other categories; however, it appears that the middle categories (charities with income between £100,000 and £1m) are not significantly different from each other, and neither are organizations between £500,000 and £10m.
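Readers wishing to compute quasi-variances will usually reach for existing tools (Firth’s qvcalc package for R, for instance). As a rough Python sketch of the underlying idea, assuming you already have the covariance matrix of the category coefficients (with the reference category included as a zero row and column), the quasi-variances can be found by a small least-squares problem:

import numpy as np
from scipy.optimize import least_squares

def quasi_variances(cov):
    """Approximate quasi-variances (Firth 2004) from the covariance
    matrix of a set of category coefficients. Finds q minimising the
    squared differences, on the log scale, between q_i + q_j and
    var(b_i - b_j) over all pairs of categories."""
    k = cov.shape[0]
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    contrast_var = np.array(
        [cov[i, i] + cov[j, j] - 2 * cov[i, j] for i, j in pairs]
    )

    def residuals(log_q):
        q = np.exp(log_q)
        return np.log([q[i] + q[j] for i, j in pairs]) - np.log(contrast_var)

    start = np.full(k, np.log(contrast_var.mean() / 2))
    return np.exp(least_squares(residuals, start).x)

# Example with a toy 3x3 covariance matrix (reference category first):
cov = np.array([[0.00, 0.00, 0.00],
                [0.00, 0.04, 0.01],
                [0.00, 0.01, 0.09]])
print(quasi_variances(cov))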

Figure 2. Quasi-Variance log odds of being investigated


Conclusion

The results of the multivariate analysis point to the factors associated with charity investigation and misconduct, showing the mismatch between those predicting complaints and those predicting regulatory action. This has considerable implications for charity regulators seeking to deploy their limited resources effectively and in a way that ultimately protects and enhances public confidence. By revealing the disconnect between the level of complaints and concerns that require regulatory action, we argue there is much work to do for practitioners in the sector with regards to charity reputation and stakeholder communication. Charity boards are ultimately responsible for the governance of their organization, and must ensure that adequate policies and procedures are in place. This includes reducing the risk of misconduct occurring, taking corrective action in response to guidance from the regulator, and developing the management and reporting functions required to deal with the consequences. Recognition should also be given to the role that stakeholders such as funders and auditors must play in self-regulation of the sector, given their proximity to charities through their day-to-day activities. It is no longer sufficient (if indeed it ever was) to rely on charity status to convey trust and inspire confidence in the conduct of an organization.

All maps are inaccurate but some have very useful applications: Thoughts on Complex Social Surveys

Vernon Gayle, University of Edinburgh


This blog post provides some thoughts on analysing data from complex social surveys, but I will begin with an extended analogy about maps.

All maps are inaccurate. Orienteering is a sport that requires navigational skills to move (usually running) from point to point in diverse and often unfamiliar terrain. It would be ridiculous to attempt to compete in an orienteering event using a road map drawn at a scale of 1:250,000, because 1 cm of the map represents 2.5 kilometres. Similarly, it would be inappropriate to drive from Edinburgh to London using orienteering maps, which are commonly drawn at a scale of 1:15,000; on an orienteering map 1 cm represents 150 metres of land.

Hillwalking is a popular pastime in Scotland. Despite having similar aims, many hillwalkers use the standard Ordnance Survey (OS) 1:50,000 map (the Landranger series), while others prefer the 1:25,000 OS map. These maps are not completely accurate, but they have useful applications for the hillwalker. For some hillwalking excursions the extra detail offered by the 1:25,000 map is useful; for other journeys the extra detail is superfluous and coverage of a larger geographical area is more useful. When possible I prefer to use the Harvey’s 1:25,000 Superwalker maps. This is because they are printed on waterproof paper and tend to cover whole geographic areas, so walks are usually contained on a single map. I also find the colour scheme helpful in distinguishing features (especially forests and farmland), and the enlargements (for example the 1:12,500 chart of the Aonach Eagach ridge on the reverse of the Glen Coe map) aid navigation in difficult terrain.

The London Underground (or Tube) map is probably one of the best known schematic maps. It was designed by Harry Beck in 1931. Beck realised that because the network ran underground, the physical locations of the stations were largely irrelevant to a passenger who simply wanted to know how to get from one station to another. Therefore only the topology of the train route mattered. It would be unusual to use the Tube map as a general navigational aid but it has useful applications for travel on the London Underground.

The Tube map has undergone various evolutions; however, the 1931 edition would still be an adequate guide for a journey on the Piccadilly Line from Turnpike Lane to Earls Court. By contrast, a journey from Turnpike Lane station to Southwark station using the 1931 map would prove confusing, since the map does not include the Jubilee Line, and Southwark station was not opened until the 1990s. A traveller using the 1931 map would also not be aware that Strand station on the Northern Line was closed in the early 1970s.

Contemporary versions of the Tube map include the fare zones, a useful addition for journey planning. More recent editions include the Docklands Light Railway and Overground trains, which extend the applications of the Tube map for journeys in the capital.

Here are two further thoughts on the accuracy of the Tube map and its applications. First, when I was a schoolboy growing up in London I was amused that what appeared on the Tube map to be the shortest journey from Euston Square station to Warren Street station involved three stops and one change. I knew that in reality the stations were less than 400 metres apart (my father was a London taxi driver). Walking rather than taking the Tube would save both time and money.

Second, more recently I have become aware of the journey from Finchley Road tube station to Hampstead tube station, which involves travelling on the Jubilee Line and changing onto the Victoria Line and then the Northern Line. The estimated journey on the Transport for London website is about 30 minutes. Consulting a London street map reveals that the stations are less than a mile apart; a moderately fit traveller could easily walk that distance in less than half an hour. The street map (like the Tube map) is unlikely to warn the traveller that the journey is uphill, however: Finchley Road underground station is 217 feet above sea level and Hampstead station is 346 feet above sea level (see here).

This preamble hopefully reinforces my opening point that all maps are inaccurate, but that some have very useful applications. Some readers will know the statistician George Box’s statement that all models are wrong but some are useful. This statement is especially helpful in reminding us that models are representations of the social world, not accurate depictions of it. Similarly, a map is not the territory. When thinking about samples of social science data I find the analogy with maps useful as a heuristic device.

All samples of social science data are inaccurate, especially those that are either small or have been selected unsystematically. Some samples are both small and unsystematically selected. Small and unsystematic samples may prove useful in some circumstances, but their design places limitations on how accurately the data represent the population being studied. Large-scale samples that are selected systematically will tend to be more accurate and better represent target populations. The usefulness of any sample of social science data, much like a map, will depend on its use (e.g. the research question being addressed).

Some large-scale social surveys use simple statistical techniques to select participants, and the data within these surveys can be analysed relatively straightforwardly. Many more contemporary large-scale social surveys have complex designs and use more sophisticated statistical techniques to select participants. The motivation is usually to better represent the target population, to minimise the costs of data collection, and to allow meaningful analyses of subpopulations (or smaller groups). These are positive features, but they come at the cost of making the data from complex surveys more difficult to analyse.

It is possible to approach the analysis of data from complex social surveys naively and treat them as if they were produced by a simple design and selection strategy. For some analyses this will be an adequate approach. This is analogous to using a suboptimal map but still being able to arrive close enough to your desired destination.

For other studies a naive approach to analysis will be inappropriate. Comparing naive results with results from a more sophisticated analysis can help us to assess the appropriateness of naive approaches. The difficulty is that reliable statements cannot easily be made a priori about the appropriateness of naive approaches. To draw further on the map analogy, when using an inadequate map it is difficult to assess how close you get to the correct destination unless you have previously visited that location.
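As an illustration of such a comparison, the sketch below fits the same regression to simulated clustered data twice, once with naive standard errors and once with cluster-robust standard errors; the data, cluster structure and variable names are all invented for the example. With strong within-cluster correlation, the naive standard error will typically be too small.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy clustered sample: outcomes correlated within primary sampling units.
rng = np.random.default_rng(1)
n_clusters, per_cluster = 50, 20
cluster = np.repeat(np.arange(n_clusters), per_cluster)
cluster_effect = rng.normal(0, 1, n_clusters)[cluster]
x = rng.normal(size=n_clusters * per_cluster)
y = 0.5 * x + cluster_effect + rng.normal(size=n_clusters * per_cluster)
df = pd.DataFrame({"y": y, "x": x, "psu": cluster})

# The same model, with naive and then cluster-robust standard errors.
naive = smf.ols("y ~ x", data=df).fit()
robust = smf.ols("y ~ x", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["psu"]}
)
print(naive.bse["x"], robust.bse["x"])  # compare the two standard errors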

The benefit of social surveys with complex designs is that they have complex designs. The drawback of social surveys with complex designs is that they have complex designs. All maps are inaccurate but some have very useful applications. All samples of social science data are inaccurate but some have very useful applications. The consideration of the usefulness of a set of social science data requires serious methodological thought and this will most probably be best supported by exploratory investigations and sensitivity analyses.

To learn more about analysing data from both non-complex and complex social surveys come to grad school at the University of Edinburgh (http://www.sps.ed.ac.uk/gradschool).

Using Quantitative Methods to study Big Data skills: considering relevant proxies for ‘Big Data’ skills

Alana McGuire, University of Stirling, 2016

Background

This blog is based upon work being undertaken for a PhD at the University of Stirling which explores the impact of Big Data on skill requirements for employers in Scotland. A version of this article was presented as a poster at the National Centre for Research Methods Festival, Bath, July 2016. The project research design applies mixed methods using a hybrid adaption of the explanatory sequential design (Creswell and Clark, 2011). Questions the study will address include: How is Big Data changing skill demands for employers? Is data becoming a more central part of organisations, and if so, is this causing changes in the job roles of employees in the organisation? Are there discrepancies between the skills that employees are being equipped with on training courses and the skills that employers are seeking? Is there evidence of social/gender/ethnic inequalities in Big Data skills?          

The definition of Big Data is contested. The term ‘Big Data’ in the context of this project refers to complex data that requires a change in what is actually perceived as data (Lagoze, 2014). These data may be structured in a conventional dataset or unstructured (for example, data from a health device or Twitter), and may take a variety of formats. The size of the data itself is not the defining characteristic for the purposes of my research.

Mellody (2014: 10) argues that the main skills needed to work with Big Data are ‘computing and software engineering’, ‘machine learning’ and ‘optimization’. Machine learning focuses on ‘how to get computers to program themselves’ (Mitchell, 2006: 1). By optimization, Mellody is referencing ‘database optimization’, that is, the programming of the database so that commands are executed and results obtained in the quickest way possible (Mullins, 2010). As well as these skills, Yiu (2012) argues that Big Data specialists must also have ‘soft’ skills such as good communication, collaboration and creativity. Further to this, Yiu suggests that critical consumption of data and statistical methods are skills which have been neglected by the literature exploring the abilities needed to work with Big Data. Although the need for these skills may not be unique to Big Data, it is essential when working with Big Data that the analyst understands which methods are appropriate and how to interpret their output.

Routinely collected and deposited data sources, such as the Labour Force Survey (ONS, 2016), do not capture variables which encompass the combination of skills discussed in the literature as necessary for analysis using Big Data. A key issue is therefore to find proxies that can robustly measure Big Data skills. Given this dearth of resources, a plausible alternative strategy may be available in the Employer Skills Survey (ESS). These data contain some information on skills shortages which can be used to assess need within sectors of the economy, for example data on numeracy, IT, and communication skill shortages (see UK Commission for Employment and Skills, 2016). In addition to the ESS, the 1970 British Cohort Study (BCS) tested ability in maths; several of the items used in this test relate to the abilities considered definitive of Big Data skills.

The remainder of this post outlines two proxy measures that could be relevant to understanding the prevalence of skills associated with working with Big Data.

Data and Methods

The Employer Skills Survey is a large-scale survey conducted annually by the UK Commission for Employment and Skills (2016). For the 2013 survey, 91,279 interviews were completed. This survey is one of the largest of its kind in the UK, providing a wealth of data on skills shortages.

The Employer Skills Survey was used to define a variable that measures the basic skills needed in an industry in order to potentially make use of data in that industry. This took the form of a score constructed from several skill shortage variables, including communication, numeracy, and IT skills. Graph 1, below, shows the distribution of this variable. The skill score variable has a mean of 1.89 and ranges between zero and six, with zero indicating no difficulty finding the Big Data base skills and six indicating difficulty finding every one of the Big Data base skills. If an industry scores highly on this variable, this can be taken to indicate that the particular industry or organisation finds it difficult to recruit employees with the skills identified as necessary for working with Big Data. It would be particularly problematic if the industries lacking these skills were ones which could benefit from the analysis of Big Data.
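The construction itself is straightforward; the toy sketch below illustrates the idea. Only communication, numeracy and IT shortages are named above, so the remaining three indicator names are placeholders rather than actual ESS field names.

import numpy as np
import pandas as pd

# Toy stand-in for the ESS employer-level file. Only communication,
# numeracy and IT shortages are named in the text; the other three
# indicator names here are placeholders.
rng = np.random.default_rng(1)
indicators = ["communication", "numeracy", "it",
              "shortage_4", "shortage_5", "shortage_6"]
ess = pd.DataFrame(rng.integers(0, 2, size=(500, 6)), columns=indicators)

# The skill score is the simple sum of the binary shortage indicators,
# ranging from 0 (no difficulty) to 6 (difficulty on every item).
ess["skill_score"] = ess[indicators].sum(axis=1)
print(ess["skill_score"].describe())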

Graph 1


Applying this variable makes it possible to assess which industries, or sectors, may be experiencing a shortfall in the recruitment of these core skills. A simple analysis is presented using OLS regression, controlling for organisation type (comparing non-market organisations with profit-making companies), the size of the organisation (being an SME, a small or medium-sized enterprise, or not), whether the organisation is based in Scotland compared to the rest of the UK, and an interaction term between being based in Scotland and being an SME. If there is a skills shortfall of this type, the effort and expenditure required to upskill staff from a poor base would be far greater.
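Continuing the toy sketch above, this specification can be expressed as a model formula in which sme * scotland expands to both main effects plus their interaction; the dummy variables are again invented for illustration.

import statsmodels.formula.api as smf

# Invented dummies for organisation type, size and location,
# added to the toy ESS data frame constructed earlier.
ess["non_market"] = rng.integers(0, 2, len(ess))
ess["sme"] = rng.integers(0, 2, len(ess))
ess["scotland"] = rng.integers(0, 2, len(ess))

# sme * scotland gives both main effects plus the interaction term.
ols = smf.ols("skill_score ~ non_market + sme * scotland", data=ess).fit()
print(ols.summary())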

Alongside the Employer Skills Survey analysis, I have undertaken some initial analysis using the British Cohort Study. This study takes a group of babies born in one week in 1970 and follows these individuals throughout their lives. A follow-up study administered an arithmetic test to a sample of the cohort at age sixteen. Many of the questions in this test are highly relevant for understanding statistics and data distributions. Further to this, there are also datasets from later sweeps which contain socioeconomic information on the same individuals. One avenue for these data in my study is to consider the score on the arithmetic test as a proxy for Big Data skills. This presumes that statistical literacy is a key element of Big Data skills, and at the moment it is unclear that this is the case. If we assume that mathematical and statistical abilities are important aspects of working with Big Data, then the distribution of these skills in the population could relate to whether sectors of the economy are able to tap into them. The National Statistics Socio-economic Classification (NS-SEC) seven-class schema is used to estimate the level of these skills by social class. The NS-SEC social class measure was captured during a follow-up to the original study conducted in 2004/05 (University of London, 2016b), at which point individuals would have reached an age of occupational maturity (around 34 years of age) (Goldthorpe, 1987).
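The kind of comparison presented later as Graph 2, mean scores by class with 95% confidence intervals, can be computed along the following lines; the class labels follow the NS-SEC analytic classes, but the scores are simulated purely for illustration.

import numpy as np
import pandas as pd

# Simulated stand-in for the linked BCS data: arithmetic scores and
# NS-SEC class (labels abbreviated from the seven analytic classes).
rng = np.random.default_rng(2)
classes = ["Higher managerial", "Lower managerial", "Intermediate",
           "Small employers", "Lower supervisory", "Semi-routine",
           "Routine"]
bcs = pd.DataFrame({
    "nssec": rng.choice(classes, 2000),
    "arithmetic": rng.normal(30, 8, 2000),
})

# Mean arithmetic score by class, with approximate 95% confidence
# intervals built from the standard error of each group mean.
summary = bcs.groupby("nssec")["arithmetic"].agg(["mean", "sem", "count"])
summary["ci95_low"] = summary["mean"] - 1.96 * summary["sem"]
summary["ci95_high"] = summary["mean"] + 1.96 * summary["sem"]
print(summary)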

Results and Discussion

Table 1 presents the results of an analysis of the ESS using OLS regression, with the skill score variable described above as the dependent variable. As described above, dummy variables are included which compare non-market organisations with profit-making companies; being an SME or not; being based in Scotland compared to the rest of the UK; and an interaction term between being based in Scotland and being an SME. The associations for non-market organisations and SMEs are statistically significant, which suggests that organisations with these characteristics are more likely to lack employees with a skill base capable of working with Big Data. This resonates with findings reported by E-Skills UK (2013) which suggest that SMEs are far less likely to make use of Big Data. Being based in Scotland is not significant, and neither is the interaction term, suggesting that organisations located in Scotland are no more likely to lack the Big Data base skills than organisations located elsewhere in the UK. Testing whether this finding is consistent is an important focus for my wider PhD study.

Table 1. OLS regression results; the dependent variable is the Big Data base skill score

                 Coefficient   Standard error   P-value
Non-market          0.554           0.098        0.000
SME                 0.233           0.082        0.005
Scotland            0.003          -0.35         0.989
SME*Scotland       -0.154          -0.64         0.532
Constant            1.659           1.53         0.000

I used the BCS to examine whether there are any suggestions in the data of social, gender, and ethnic inequalities in the distribution of the maths test results. In order to do this, I looked at associations between the arithmetic scores from 1986 (University of London, 2016a) and later data from the module including measures of social class. Graph 2 shows the mean arithmetic scores from 1986, with confidence intervals, by NS-SEC from 2004/05. In this graph, routine occupations is the lowest occupational social class in the NS-SEC in these data and higher managerial is the highest. A gradual decline in arithmetic scores in line with declining NS-SEC occupational social class is evident. This is indicative of a possible social divide in Big Data skills, although this only holds if statistical skills are a good indicator of Big Data skills, and more research on my part is necessary to find out if this is the case.

Graph 2


Conclusion

This post has proposed two proxy measures of Big Data skills using data from the Employer Skill Survey and the British Cohort Study. These proxies may be relevant for measuring the prevalence of Big Data skills in the general population and for assessing how social stratification relates to Big Data skills. Going forward, more research is needed to ensure that these measures are robust.

This work provides a starting point for me to examine social, gender, and ethnic inequalities in Big Data skills. Alongside my statistical analysis, I will be supplementing this with qualitative research in the form of interviews with skills providers, employers, and employees. My statistical measures will be revisited after interviews to examine if the measures that I have used thus far are valid proxy variables for Big Data skills. If this is not the case, I will collect additional primary data which can then be used in my analysis. I would be glad to receive any constructive feedback in respect of my study and to hear from anyone working on a related topic.

Acknowledgements: I would like to acknowledge the help of my project supervisors, Dr Alasdair Rutherford and Professor Paul Lambert. I would also like to thank Dr Roxanne Connelly for suggestions made on this paper. The PhD is funded by the ESRC.

Blog: https://alanainprogress.wordpress.com/
Twitter: @_AlanaMcGuire
Email: alana.mcguire@stir.ac.uk

References

Creswell, J.W., and Clark, V.L. (2011). Designing and Conducting Mixed Methods Research. Sage: London

Tashakkori, A., and Creswell, J.W. (2007). The new era of mixed methods. Journal of Mixed Methods Research. 1: pp.3-7.

E-Skills UK. (2013) Big Data Analytics: Adoption and Employment Trends, 2012-2017. Accessed online at <http://www.e-skills.com/Documents/Research/General/BigDataAnalytics_Report_Nov2013.pdf>

Goldthorpe, J. H. (1987) Social Mobility and Class Structure in Modern Britain, 2nd edition. Oxford: Clarendon Press.

Lagoze, C. (2014) Big Data, data integrity, and the fracturing of the control zone. Big Data & Society, pp.1-11.

Mellody, M. (2014). Training Students to Extract Value from Big Data: Summary of a Workshop. National Research Council.

Mitchell, T. (2006) The Discipline of Machine Learning. Accessed on 09/12/15 at <http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf>

Mullins, C. (2010) Defining Database Performance. Database Trends and Applications. Accessed on 9/12/15 at <http://www.dbta.com/Columns/DBA-Corner/Defining-Database-Performance-70236.aspx>

Office for National Statistics. Social Survey Division. (2016). Quarterly Labour Force Survey Household Dataset, January – March, 2016. [data collection]. UK Data Service. SN: 7991, http://dx.doi.org/10.5255/UKDA-SN-7991-1

UK Commission for Employment and Skills. (2016). Employer Skills Survey, 2013. [data collection]. 2nd Edition. UK Data Service. SN: 7484, doi: http://dx.doi.org/10.5255/UKDA-SN-7484-2.

University of London. Institute of Education. Centre for Longitudinal Studies. (2016a). 1970 British Cohort Study: Sixteen Year Follow-up, Arithmetic Test, 1986. [data collection]. 2nd Edition. UK Data Service. SN: 6095, doi: http://dx.doi.org/10.5255/UKDA-SN-6095-2.

University of London. Institute of Education. Centre for Longitudinal Studies. (2016b). 1970 British Cohort Study: Thirty-Eight-Year Follow-Up, 2008-2009. [data collection]. 4th Edition. UK Data Service. SN: 6557, doi: http://dx.doi.org/10.5255/UKDA-SN-6557-3.

Yiu, C. (2012). The Big Data Opportunity. Policy Exchange. Accessed online at <http://www.geomapix.com/pdf/big%20data.pdf>

Grammar Schools: Theresa May and the Rise of the Meritocracy

Roxanne Connelly, University of Warwick


On September 9th Theresa May declared her desire for Britain to be “the world’s great meritocracy – a country where everyone has a fair chance to go as far as their talent and their hard work will allow”. May’s vision of meritocracy could have been plagiarised word for word from the pages of Michael Young’s dystopian novel “The Rise of the Meritocracy”. Young describes a meritocratic system as a political ideal whereby social position is achieved through “ability and effort”.

It is disheartening that May’s comments echo Young’s description of the development of an unpleasant and dehumanising society. Misplaced use of the political concept of meritocracy is not new; it was also a favourite of the Blairite government, whose consistent misinterpretation of the meaning of meritocracy caused Young himself to exclaim “Down with Meritocracy”, as the term he coined was continually used to describe a political ideal which is the antithesis of its intended message. Here Young (1958, p.38) accurately and eerily describes the present day:

“Englishmen of the solid centre never believed in equality. They assumed that some men were better than others, and only waited to be told in what respect. Equality? Why, there would be no one to look up to any more. Most Englishmen believed, however dimly, in a vision of excellence which was part and parcel of their own time-honoured aristocratic tradition. It was because of this that the campaign for comprehensive schools failed. It was because of this that we have our modern society: by imperceptible degrees an aristocracy of birth has turned into an aristocracy of talent”.

May wants “Britain to be a place where advantage is based on merit not privilege; where it is your talent and hard work that matter not where you were born, who your parents are or what your accent sounds like”. May asks: “Where is the meritocracy in a system that advantages the privileged few over the many? How can a meritocratic Britain let this situation stand?” The irony, however, is that a meritocratic Britain means exactly that – the success of the few to the detriment of the many. Meritocracy is at its core a political ideology based on legitimised inequality. As Michael Young states, “It is hard indeed in a society that makes so much of merit to be judged as having none. No underclass has ever been left as morally naked as that”.

The Evidence?

Whilst May is not as explicit as Gove in stating that Britain has had enough of experts, she is certainly keen to ignore the exceptionally large volume of high-quality evidence on this issue that is being shouted from the rooftops by some of the most talented analysts in this field. The IFS report on the available evidence is very clear: there are benefits of the selective system for those who get in, and those who get in are likely to be from more advantaged families. Selective schools increase educational inequalities and inequalities in later earnings, as those who fail to get into grammar schools tend to do worse than they would in a comprehensive education system. Lindsay MacMillan also describes international evidence on school selection, stating that countries with grammar schools create greater earnings inequality, and that countries with selective education systems are more segregated in terms of socio-economic status.

The sociology of education also provides a wealth of accounts of why selecting on the basis of test scores at the end of primary school privileges those from advantaged backgrounds. Willis (1977) famously described the antagonistic relationship between working-class boys and education, which often led to disengagement and disinterest in educational attainment. This cultural approach to understanding processes of educational disadvantage has also been highlighted more recently by Reay (2006), who described the alienation and disaffection of working-class children. More advantaged children and young people also benefit from their parents’ knowledge of the education system and cultural capital. More advantaged parents often engage in focused, organised parenting practices to develop their children’s skills, encourage a wide range of cultural interests and foster an appreciation of education. Lareau (2011) describes these middle-class parenting practices as ‘concerted cultivation’. The pipelining of young people based on a test score clearly cannot overcome the complex and multifaceted nature of inequality, which will continue to be reproduced in the grammar school system.

What happens to those left behind?

What strikes me as most troubling is that we have heard very little about what happens to those children who do not get into the grammar schools. How are we preparing the education system to help these young people fulfil their potential? How are we designing the education system so that there are second, third and fourth chances, and so that no young person who could benefit from a grammar school education is neglected? It appears from May’s speech that there is no clear additional plan for those children deemed not suitable for grammar school education. May states that grammar schools will “provide a stretching education for the most academically able”; perhaps we are to assume from this that those who do not get into the grammar schools will not be given the opportunity to be stretched?

The plan for those who are not streamed into the grammar school system is that they will remain in the education system as it currently stands. Pupils will be distributed (unevenly) into free schools, university schools (I believe there are only two), faith schools, and of course you can always pay for an independent school. A review of the evidence on the effectiveness of these different types of schools tends to indicate that the best performing of them are successful because they are able to select a proportion of their intake.

If the current education system is not serving the needs of working class kids, how does a grammar school system help them fulfil their potential? The proposition seems to be that they will be left inhabiting a residual version of today’s education system, but one in which the highest performing kids are syphoned off. It is not clear to me how this is “a future in which Britain’s education system shifts decisively to support ordinary working class families”.


Young, M. (1958). The Rise of the Meritocracy. London: Thames and Hudson.

Lareau, Annette. (2011). Unequal childhoods: Class, race, and family life. Berkeley: University of California Press.

Reay, D. (2006). “The Zombie Stalking English Schools: Social Class and Educational Inequality.” British Journal of Educational Studies 54 (3):288-307.

Willis, P. (1977). Learning to Labour: How Working-Class Kids Get Working-Class Jobs. New York: Columbia University Press.


Analyses can only be as good as the measures which underlie them

Roxanne Connelly, University of Warwick

Quantitative sociological research hinges on the collection of data in the form of measured variables, and on its summary through statistical analysis of the ‘relationships between variables’ (e.g. Marsh, 1982). In recent decades, methodological innovations and analysis options in quantitative research have developed rapidly, alongside increasing computer power and software capabilities for the sophisticated analysis of the large volumes of micro-data we now have at our disposal. These methodological advances in social survey data analysis are well documented, and social researchers are increasingly able to deploy relatively complex and specialised statistical modelling techniques. Yet the results of analyses can only be as good as the measures which underlie them. Whilst most researchers will have a good justification for the way in which the variables most central to their analysis are operationalised, there are certain ‘key variables’ – measures that are routinely recorded and feature in a great many analyses, whether as explanatory or outcome variables – for which measurement and operationalisation is sometimes only briefly considered (and often inappropriately simplified). Indeed, from the 1950s to the present day, social survey methodologists have sounded the same warning on several occasions: that the construction and careful analysis of such ‘key variables’ has habitually been overlooked in the literature and in practice (Blumer, 1956; Bulmer, et al., 2012; Burgess, 1986; Stacey, 1969).

I recently published a series of papers on this issue with Professor Vernon Gayle and Professor Paul Lambert. The papers provide an overview of the measurement options available for the analysis of three ‘key variables’, namely measures based upon occupation, education and ethnicity. There are, of course, many more variables (e.g. gender, age, health, wellbeing, religiosity) which could be considered in detail. The three variables chosen as the focus of these papers are utilised very widely in quantitative research, either as explanatory or dependent variables, and they are also variables for which a range of measurement options are available. Furthermore, there is a degree of debate over how these three variables should be operationalised, and the complexities of their use are often overlooked in practice. These papers build on the reviews of Stacey (1969) and Burgess (1986) and the more recent contribution of Bulmer et al. (2010), and discuss contemporary approaches and issues in the construction and modelling of these measures.

One of the issues we sought to emphasise is that the manner in which a variable is constructed relies upon the decisions of the analyst and subsequently influences the form and outcomes of statistical models. The best research publications ought to show evidence of the evaluation of alternative measures and careful documentation of the route taken, which can easily be made available to the reader through electronic sources (as argued by Dale, 2006). This is especially important in areas of the social sciences where there are many, often disputed, measurement alternatives, leading to complex possibilities for the construction of variables. However, this practice is rarely carried out; the measures used in quantitative sociological research are neglected in discussions and can be poorly described.

It is widely noted that the data preparation and variable construction stage of the research process is the most time consuming. Methodologists generally recommend that researchers should take their time in constructing measures from a dataset in a clear, assiduous manner with every operation carefully documented through well annotated software command files (e.g. Long, 2009). If this is achieved, a clear trace of the variable construction process is developed which is readily replicable in the future, and after which the statistical analysis stage of the research can usually progress relatively swiftly. A common complaint, however, concerning social science research projects, is that the activities of variable construction are often neither well documented nor replicable by others (e.g. Treiman, 2009). This typically arises for two reasons. The first is the sub-optimal exploitation of software (for instance, due to researchers not using command files at all, or using them in a poorly organised sequence). This poor practice arguably represents long-term shortcomings in the training and information organisation skills of survey researchers (e.g. Long, 2009). The second issue is researchers’ lack of awareness (or at a minimum, their lack of inclination) to seek out, engage with, and ideally re-use, existing approaches to variable constructions. Researchers frequently invent new variable constructions ‘on the fly’ during the research process, in a manner which makes documentation and replication very difficult (see Lambert, et al., 2007).
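As a small illustration of the alternative, here is a minimal sketch of a documented variable construction step in the spirit of a well-annotated command file (Long, 2009); the variable names and coding scheme are invented for the example, not drawn from any of the papers discussed.

# A minimal, invented example of a documented variable construction
# step, keeping a visible decision log alongside the recode itself.
import pandas as pd

def construct_degree_flag(df, source="highest_qual"):
    """Derive a binary 'degree or higher' measure from highest_qual.

    Assumed source coding: 1 = degree or higher, 2 = A level,
    3 = GCSE, 4 = no qualifications.
    Decision log: categories 2-4 collapsed to 0; missing values are
    left missing rather than recoded, so any case loss is visible in
    later model estimation.
    """
    out = df.copy()
    out["degree"] = out[source].map({1: 1, 2: 0, 3: 0, 4: 0})
    return out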

There ought to be good news with regards to variable construction in quantitative research, insofar as many social scientists have already put a great deal of effort into the production of carefully constructed and tested measures. In most situations there are a range of suitable pre-existing variable constructions to choose from, and this is particularly true of ‘key measures’ in the social sciences. Throughout this series of papers we argue that clear documentation plays a central role in high quality social research, and provides a solid basis for replication and the incremental development of our knowledge base.

References

Blumer, H., 1956. Sociological analysis and the “variable”. American Sociological Review 21, 683-690.

Bulmer, M., Gibbs, J., Hyman, L., 2010. Social Measurement through Social Surveys: An Applied Approach. Ashgate, Farnham.

Bulmer, M., Gibbs, M.J., Hyman, L., 2012. Social Measurement through Social Surveys: An Applied Approach. Ashgate Publishing, Ltd., Farnham.

Burgess, R., 1986. Key Variables in Social Investigation. Routledge and Kegan Paul, London.

Dale, A., 2006. Quality Issues with Survey Research. International Journal of Social Research Methodology 9, 143-158.

Lambert, P., Gayle, V., Tan, L., Turner, K., Sinnott, R., Prandy, K., 2007. Data Curation Standards and Social Science Occupational Information Resources. The International Journal of Digital Curation 2, 73-91.

Long, J.S., 2009. The Workflow of Data Analysis Using Stata. Stata Press, College Station.

Marsh, C., 1982. The Survey Method: The Contribution of surveys to sociological explanation. Allen and Unwin, London.

Stacey, M., 1969. Comparability in Social Research. Heinemann, London.

Treiman, D.J., 2009. Quantitative Data Analysis: Doing Social Research to Test Ideas. John Wiley & Sons, San Francisco.

The concealed middle? An exploration of ordinary young people and school GCSE subject area attainment

Christopher J. Playford and Vernon Gayle, University of Edinburgh

School examination results were historically a private matter, and the awareness of results day was usually confined to pupils, teachers and parents. School exam results are now an annual newsworthy item in Britain and every summer the British media transmit live broadcasts of groups of young people receiving their grades. This recurrent event illustrates, and reinforces, the importance of school-level qualifications in Britain.

The General Certificate of Secondary Education (GCSE) is the standard qualification undertaken by pupils in England and Wales at the end of year 11 (age 15-16). School GCSE outcomes are worthy of sociological examination because, in the state education system, they mark the first major branching point in a young person’s educational career and play a critical role in determining pathways in education and employment.

In our paper, we turned our attention to exploring school GCSE attainment at the subject-area level, rather than looking at overall outcomes or outcomes in individual GCSE subjects. This is an innovative approach to studying school GCSE outcomes. The initial theoretical motivation was to explore if there were substantively interesting combinations or patterns of GCSE outcomes, which might be masked when the focus is either overall outcomes or outcomes in individual subjects. Within the sociology of youth there has been a growing interest in the experiences of ordinary pupils who have outcomes somewhere between the obviously successful and unsuccessful levels, and this group have been referred to as the ‘missing middle’.

The data used in the paper are from the Youth Cohort Study of England and Wales (YCS), a major longitudinal study that began in the mid-1980s. It is a large-scale nationally representative survey funded by the government and is designed to monitor the behaviour of young people as they reach the minimum school leaving age and either remain in education or enter the labour market. School GCSE outcomes are challenging to analyse because there are many GCSEs available, there is an element of pupil choice in the diet of GCSEs that a pupil undertakes, some pupils study more GCSEs than others, each GCSE subject is awarded an individual grade on an alphabetical scale (A* being the highest and G the lowest), and subject GCSE outcomes are highly correlated. We employ a latent variable approach as a practicable methodological solution to the messy and complex nature of school GCSE outcomes.
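For readers curious about the mechanics, the sketch below illustrates the latent-group idea in miniature. The paper itself fits a latent class model to categorical subject-area outcomes; this Python sketch instead fits a Gaussian mixture to simulated subject-area point scores, purely as a rough analogue of that approach.

import numpy as np
from sklearn.mixture import GaussianMixture

# Simulate subject-area point scores for pupils drawn from four
# underlying groups, echoing the four-group structure described below.
rng = np.random.default_rng(3)
scores = np.vstack([
    rng.normal([7, 7, 7], 1.0, (300, 3)),  # a 'good grades' group
    rng.normal([6, 3, 3], 1.0, (300, 3)),  # stronger in one subject area
    rng.normal([3, 6, 3], 1.0, (300, 3)),  # stronger in another
    rng.normal([2, 2, 2], 1.0, (300, 3)),  # a 'poor grades' group
])

# Fit a four-component mixture and assign each pupil to a modal group.
mixture = GaussianMixture(n_components=4, random_state=0).fit(scores)
assignments = mixture.predict(scores)
print(np.bincount(assignments))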

In the paper we identify substantively interesting subject-level patterns of school-level GCSE outcomes that would be concealed in analyses of overall measures, or analyses of outcomes within individual GCSE subjects (see Table 1). The modelling process uncovers four distinctive latent educational groups. The first latent group is characterised by good GCSE outcomes, and another latent group is characterised by poor GCSE outcomes. There are two further latent groups with ‘middle’ or ‘moderate’ GCSE outcomes. These two latent groups have similar levels of overall (or agglomerate) outcomes, but one group has better outcomes in science GCSEs and the other has better outcomes in arts GCSEs.

Table 1. Latent group model results (four-group model) for school GCSE subject-area outcomes.

Note: Youth Cohort Study of England and Wales, Cohort 6; all pupils gaining GCSE passes at grades A–G; n = 14,281; posterior probabilities and prior probabilities reported as percentages. Reproduced from Playford and Gayle (2016), Table 5, p.156.

Membership of the latent educational groups is highly stratified. Socially advantaged pupils are more likely to be assigned to group 1 ‘Good Grades’. In contrast, the pupils assigned to group 4 ‘Poor Grades’ are more likely to be from manual and routine socioeconomic backgrounds. The analyses uncovered two latent educational groups with similar levels of moderate overall school GCSE outcomes, but different overall patterns of subject level outcomes. A notable new finding is that pupils in latent educational group 2 ‘Science’, had a different gender profile to pupils in group 3 ‘Arts’, but both groups of pupils were from the same socioeconomic backgrounds.

Our paper is innovative because it documents a first attempt to explore patterns of school GCSE attainment at the subject-area level in order to investigate whether there are distinct groups of pupils with 'middle' levels of attainment. The sociologist Phil Brown made the pithy observation that there is an invisible majority of ordinary young people who neither leave their names engraved on the school honours board nor gouged into the tops of their desks. We conclude that such pupils are found in the two 'middle' latent educational groups. We see no obvious reason why school exam results will not continue to be an annual newsworthy item, and we suspect that the media focus is most likely to remain on pupils with exceptional outcomes rather than those with the more modest results that characterise the two 'middle' latent educational groups.

A new GCSE grading scheme is likely to be introduced from August 2017. A new set of grades ranging from 1 to 9 (with 9 being the highest) will replace the A*–G scheme. Early indications suggest that the older eight alphabetical grades (A*–G) will not map directly onto the new 1–9 grades, but there will be some general equivalence. Despite the potential reorganisation of GCSEs, and the proposed changes to the grading system, school-level GCSEs will continue to be complicated and messy, and the methodological approach used in this paper will be equally appealing for the analysis of more recent educational cohorts.

Playford, Christopher J., and Vernon Gayle. “The concealed middle? An exploration of ordinary young people and school GCSE subject area attainment.” Journal of Youth Studies 19.2 (2016): 149-168. DOI: 10.1080/13676261.2015.1052049

Administrative data is a bit like Tinder: Other people seem to be using it, are you missing out?

Vernon Gayle, Roxanne Connelly, Chris Playford

University of Edinburgh

There is a buzz around administrative data: many people seem to be using it. Are you missing out?

Administrative data are records and information gathered in order to organise, manage or deliver a service. Although they are not primarily collected for research, some administrative data resources contain information on individuals that has great potential for sociological research. These datasets are best described as 'administrative social science datasets'.

It is becoming increasingly common for administrative data to be linked to existing large-scale social survey datasets, but historically social scientists have only had highly restricted access to administrative records. The ESRC have funded the Administrative Data Research Network (ADRN) which aims to appropriately open up access to a plethora of data that have been locked away in databases and files. The goal is to provide researchers with access to data from Government Departments and other agencies that routinely collect data relevant to social research.

The new ADRN will allow researchers to gain carefully supervised access to data to undertake studies that are ethical and feasible. A critical feature of administrative social science data is that people cannot be identified and data cannot be linked back to individuals. This ensures that nobody’s privacy is infringed. The bar for gaining access to administrative social science data is set high because a great deal of work is required to link data and to get de-identified data ready for researchers to analyse. The outcome however will be unparalleled new sources of social science data suitable for sociological research. These data will support detailed empirical analyses of social and economic life in contemporary Britain.

Examples of plausible sociological analyses that could be undertaken with administrative social science data are legion, but we list a few to illustrate the diversity of administrative social science data sources, and to prime the reader’s sociological imagination.

  • Better understanding intergenerational income mobility with tax records from parents and their children later in adult life.
  • Investigating flows into and out of child poverty with linked tax and benefits records.
  • Investigating potential relationships between being in care in childhood and criminal careers in adulthood with linked data from social services and the criminal justice system.
  • Exploring the relationship between weather and children’s behaviour with linked meteorological data and school exclusion data.

Currently there is a growing fervour to describe administrative social science datasets as ‘big data’. This is rather unhelpful, since the term ‘big data’ has been sloppily deployed to describe data as diverse as outputs from the Large Hadron Collider, scraped Twitter feeds, Facebook status updates, transactional data from supermarkets, and sensory and tracking information from mobile phones and GPS devices.

There is a risk in forcing administrative social science data under the ‘big data’ umbrella, and this should be avoided. There is an emerging idea that ‘big data’ heralds the end of the requirement for comprehensive statistical analyses, because the sheer volume of data means that simple correlations are sufficient. The suggestion is that knowing ‘what’ but not ‘why’ will be more important in the ‘big data’ era. The recent poor predictions from Google Flu Trends are probably the most striking cautionary tale as to why reliance on simple correlations should be avoided. We could also point to any one of the humorous spurious correlations that are usually used when teaching sociology undergraduates (our favourite is the correlation between storks and fertility).

The majority of sociological studies using administrative social science datasets will be examining a variable-by-case matrix that is very similar to those provided by large-scale social surveys. There is absolutely nothing that convinces us that, when analysing a variable-by-case matrix of administrative social science data, we can ignore the helpful lessons that have emerged from decades of sociology, statistics and econometrics. For example, if an administrative dataset has repeated measurements on the same individuals, the usual problems associated with the non-independence of observations, or the possibility of residual heterogeneity, will not vanish simply because of the size of the dataset, or because the data are from an administrative source rather than a social survey. Anyone who suggests that administrative social science datasets (or ‘big data’ more generally) have special immunity will have to work hard to persuade those of us who are knowledgeable and experienced social science data analysts.
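To illustrate the point, the sketch below shows one standard way of acknowledging non-independence: a random-intercept model fitted in R with the lme4 package. The data frame and variable names are hypothetical.

```r
# A minimal sketch, assuming a long-format administrative extract
# 'records' with repeated observations per person: an outcome y, a
# covariate x, and a person identifier id (all names hypothetical)
library(lme4)

# A random intercept for each individual acknowledges that repeated
# observations on the same person are not independent, however large
# the dataset happens to be
m <- lmer(y ~ x + (1 | id), data = records)
summary(m)
```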

The most notable positive feature of administrative social science datasets is that they are usually large in scale (although they are seldom as large as the ‘big datasets’ produced in areas such as particle physics). In many instances they will provide many more cases than existing large social surveys. Because the data are collected for administrative or organisational purposes, measures may be more accurate than if they had been collected from a participant in a study. For example, we can envisage that personal income information within an HMRC dataset might be more accurate than information collected in a semi-structured interview.

In some cases an administrative dataset will not be a sample but an entire population; this is sometimes referred to as ‘n = all’. Therefore some of the well-known issues relating to representativeness (including sample selection bias, unit non-response and sample attrition) will be less prevalent.

The positive features of administrative social science datasets are appealing; however, like every other source of data relevant to sociological research, they also have notable negative features. At a practical level, it may simply not be possible to link together some sources of administrative data because a suitable unique identifier is not available in each dataset to act as a key.
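A minimal sketch of what deterministic linkage involves is given below, assuming two hypothetical extracts that share a unique identifier; where no such key exists, even this simple step is impossible.

```r
# A minimal sketch of deterministic record linkage in R, assuming two
# hypothetical extracts that share a unique identifier 'person_id'
tax <- read.csv("tax_records.csv")           # hypothetical file
benefits <- read.csv("benefit_records.csv")  # hypothetical file

# An inner join on the key; records without a match in both files drop out
linked <- merge(tax, benefits, by = "person_id")
```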

In comparison with large-scale social surveys, especially omnibus surveys such as Understanding Society (the UK Household Longitudinal Study), which is specifically designed to support multi-disciplinary secondary data analyses, the number of available explanatory variables in most administrative datasets will be extremely limited. For example, variables measuring a person’s ethnicity or level of education, which are implicated in many sociological analyses, might be completely absent from a dataset because they are administratively irrelevant.

Much of the information in large-scale social surveys is sociologically informed, and specialised measures and variables are collected. Such sociologically informed measures might not be available in administrative social science datasets. In some administrative datasets there will be suitable proxy measures, but in others the available proxies may be less suitable. For example, in an education-related dataset the measure of ‘eligibility for free school meals’ might seem like a suitable proxy for household disadvantage. ‘Eligibility for free school meals’ will perform relatively poorly, however, when compared with a sociologically informed measure such as the National Statistics Socio-Economic Classification in an analysis of school examination performance.

Much administrative data will naturally be of high quality in terms of both validity and reliability, but this cannot be taken for granted. Sociologists have a long track record of being reflexive and concerned about the research value of data, and similar thought must be given to the quality of the measures within administrative datasets. The inaccuracies that individuals detect, and the errors that emerge, in transactions with benefits agencies, tax authorities, transport agencies, the national health service and local authorities, coupled with the errors, miscalculations and inaccuracies that occur in transactions with service providers such as banks, credit card companies, utility companies, and delivery and transport providers, all hang a reasonable question mark over the quality of some administrative data for sociological research.

In some instances an administrative dataset will cover an entire population, but this cannot be assumed. Individuals may be missing from a dataset, and they will be hard to detect; it is likely that these missing individuals will be a special subset of the population. There may also be missing information on some measures within the datasets. This ‘missingness’ might be unimportant; however, the narrow range of other explanatory variables in the dataset will greatly restrict the scope for applying formal statistical methods for missing data.
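To make that restriction concrete, the sketch below shows multiple imputation in R with the mice package; the data frame and variables are hypothetical. With few auxiliary variables in an administrative extract, the imputation model has very little information to draw on.

```r
# A minimal sketch of multiple imputation with the mice package
# (the data frame 'd' and its variables are hypothetical)
library(mice)

imp <- mice(d, m = 5, seed = 2017)        # create five imputed datasets
fit <- with(imp, lm(income ~ age + sex))  # fit the model in each dataset
summary(pool(fit))                        # pool results via Rubin's rules
```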

The ADRN will take on much of the burden of negotiating and securing access to data; however, some sources of data may still remain inaccessible. Gaining access to specific datasets held by some organisations can be prohibitively slow, and therefore disadvantageous for shorter projects. Many datasets will only be available for analysis in approved ‘safe settings’, and this places an extra burden on the data analyst. Because of the sensitivity of the data, some data providers will place controls on the outputs that sociologists produce, and typically extra time will be required for researchers to get results cleared for presentations and publications.

Training in the standard range of statistically informed multivariate data analysis techniques required for the analysis of large-scale social surveys is also required for analysing administrative datasets. We contend that, given the scope and the restrictions of administrative data, some of the most fruitful sociological enterprises will involve analyses of administrative data that have been linked to existing well-designed large-scale social science studies. The ADRN is ground-breaking and offers support for intellectually exciting opportunities for sociological research using administrative data.

Don’t swipe left on administrative data.