
Research with our trousers down: Publishing sociological research as a Jupyter Notebook

Roxanne Connelly, University of Warwick

Vernon Gayle, University of Edinburgh

On the 29th of May 2017 the University of Edinburgh hosted the ‘Social Science Gold Rush Jupyter Hackathon’. This event brought together social scientists and computer scientists with the aim of developing our research and data handling practices to promote greater transparency and reproducibility in our work. At this event we contemplated whether it might ever be possible to publish a complete piece of sociological work, in a mainstream sociology journal, in the form of a Jupyter Notebook. This November, six months after our initial idea, we are pleased to report that the paper ‘An investigation of social class inequalities in general cognitive ability in two British birth cohorts’ was accepted by the British Journal of Sociology, accompanied by a Jupyter Notebook which documents the entire research process.

Jupyter Notebooks allow anyone to interactively reproduce a piece of research. They are already used effectively in ‘big science’; for example, the Nobel Prize-winning LIGO project makes its research available as Jupyter Notebooks. Providing statistical code (e.g. Stata or R code) alongside journal outputs would be a major step forward in sociological research practice, and Jupyter Notebooks take this a step further by providing a fully interactive environment. Once a researcher has downloaded the required data from the UK Data Archive, they can rerun all of our analyses on their own machine. Jupyter Notebooks also encourage literate programming: the research process is documented clearly for humans and not just for computers, which greatly facilitates the future use of the code by other researchers.
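
To give a flavour of what this looks like in practice, the sketch below shows the kind of literate-programming cell a notebook might contain. It is written in Python purely for illustration, and the file name, variable names and recoding are hypothetical placeholders rather than our actual code:

    import pandas as pd

    # Load cohort data previously downloaded from the UK Data Archive.
    # The file name is a placeholder, not the real study file.
    cohort = pd.read_stata("cohort_cognitive_ability.dta")

    # Document an operationalisation decision in code as well as in prose:
    # collapse a detailed parental social class measure into a binary indicator.
    cohort["advantaged"] = cohort["parental_class"].isin(["I", "II"]).astype(int)

    # Report the analytical sample size so anyone re-running the notebook
    # can confirm they are working with the same cases.
    print("Analytical sample:", len(cohort.dropna(subset=["advantaged", "cog_score"])))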

When presenting the results of social science data analyses in standard journal articles we are painfully confined by word limits, and we are unable to describe all of the steps taken in preparing and analysing complex datasets. Hundreds of research decisions are made in the course of analysing existing data, particularly when using complex longitudinal datasets. We make decisions about which variables to use, how to code and operationalise them, which cases to include in an analysis, how to deal with missing data, and how to estimate models. Yet only a brief overview of the research process and of how the analyses were conducted can be presented in the final journal article.

There is currently a replication crisis in the social sciences, with researchers unable to reproduce the results of previous studies. One reason for this is that social scientists generally do not prepare and share detailed audit trails of their work, which would make the details of their research available to others. Researchers currently place little emphasis on undertaking their research in a manner that would allow others to repeat it, and approaches to sharing details of the research process are ad hoc (e.g. on personal websites) and rarely used. This is particularly frustrating for users of infrastructural data resources (e.g. the UK’s large-scale longitudinal datasets provided by the UK Data Service), as these data can be downloaded and used by any bona fide researcher. It should therefore be straightforward, and commonplace, to duplicate and replicate research using these data, but sadly it is not. We see the possibility of a future for social science research in which we can access full information about a piece of research, and duplicate or replicate it, ultimately allowing research to develop more efficiently and effectively to the benefit of knowledge and society.

The replication crisis is also accompanied by concerns about scientific malpractice. It is our observation that p-hacking is a common feature of social science research in the UK; this is not a statistical problem but a problem of scientific conduct. Human error is another possible source of inaccuracy in our research outputs, as much quantitative sociological research is carried out by single researchers working in isolation. Whilst co-authors may carefully examine the outputs produced by colleagues and students, it is still relatively rare for them to ask to examine the code. In developing our Jupyter Notebook we borrowed two techniques from software development, ‘pair programming’ and ‘code peer review’. Each of us repeated the research process independently using a different computer and software set-up. This was a laborious process, but labour well spent in order to develop robust social science research, and it exposed several problems which would otherwise have been overlooked. At one point we were repeating our analysis whilst sharing the results over Skype, and frustratingly the models estimated in Edinburgh contained seven fewer cases than the models estimated in Coventry. After many hours of investigation we discovered that we had used different versions [1] of the same dataset, downloaded from the UK Data Archive, which contained slightly different numbers of cases.
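
One simple safeguard that this experience suggests is to make the expected sample size an explicit check in the notebook itself, so that any drift between machines or dataset versions fails loudly rather than silently. The sketch below illustrates the idea in Python; the file name, variable names and expected count are hypothetical placeholders, not figures from our study:

    import pandas as pd

    # Fail loudly if the analytical sample differs from the size recorded
    # when the analysis was first run (placeholder figure below).
    EXPECTED_N = 10000

    cohort = pd.read_stata("cohort_cognitive_ability.dta")  # placeholder file name
    analytical_sample = cohort.dropna(subset=["parental_class", "cog_score"])
    assert len(analytical_sample) == EXPECTED_N, (
        f"expected {EXPECTED_N} cases, found {len(analytical_sample)}"
    )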

We describe this work as ‘research with our trousers down’ [2], as publishing our full research process leaves us open to criticism. We have already faced detailed questions from reviewers that would not have arisen had they not had access to the full research code. It is also possible that other researchers will find problems with our code, or question the decisions we have made. But criticism is part of the scientific process, and we should be placing ourselves in a position where our research can be tested and developed. British sociology lags behind several disciplines, such as Politics and Psychology, in the drive to improve the transparency and reproducibility of our work. As far as we are aware, no sociology journal requires researchers to provide their code in order to publish their work. It is most likely that only top-down change from journals, funding bodies or data providers will shift the practices within our discipline. Whilst British sociologists are not yet talking about the ‘reproducibility crisis’ with the same concern as psychologists and political scientists, we have no doubt that increased transparency will bring great benefits to our discipline.

[1] This problem is additionally frustrating because the UK Data Service does not currently have an obvious version control protocol, and does not routinely publish sufficient metadata for users to identify precise versions of files and variables. We have therefore recorded the date and time that datasets were downloaded and documented this in our Jupyter Notebook. A clear and consistent version control protocol at the UK Data Service would doubtless be of great benefit to the research community, as it would allow data to be located precisely within the audit trail.
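
In the meantime, a workable stop-gap is to record a checksum of the exact data file alongside the download timestamp, so that the precise file used can at least be identified later. A minimal sketch in Python, with a placeholder file name, might look like this:

    import hashlib
    from datetime import datetime, timezone
    from pathlib import Path

    # Fingerprint the exact data file used and note when this record was made,
    # so the precise version can be identified later even without an official
    # version control protocol. The file name is a placeholder.
    data_file = Path("cohort_cognitive_ability.dta")
    sha256 = hashlib.sha256(data_file.read_bytes()).hexdigest()
    print("File:", data_file.name)
    print("SHA-256:", sha256)
    print("Recorded at:", datetime.now(timezone.utc).isoformat())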
[2] We thank our friend Professor Robin Samuel for this apposite term.