Statistics anxiety: Busting the anxious women myth?

Dr Vicky Gorton, Dr Kevin Ralston, June 2020

For many students, Statistics = Anxiety. This anxiety is often characterised as limiting students' engagement with statistics and harming their performance on quantitative methods courses at university. The relationships between age, gender and statistics anxiety are among the most examined in the research literature. A survey of these findings might lead us to reformulate the Statistics = Anxiety equation as Statistics + Women = Greater Anxiety, as previous research has tended to identify women as more likely to experience anxiety, and to experience it at greater levels.

In our article, 'Anxious women or complacent men? Examining statistics anxiety in UK sociology undergraduates', we wanted to revisit the core demographic variables of age and sex to examine their association with reported anxiety about statistics. Unlike most other research in the field, however, we modelled an interaction between these two variables. This allowed us to explore whether reported anxiety about statistics varies within and between the sexes by age (comparing under 25s with those 25 and over).

The research is based on a secondary analysis of a dataset on the attitudes of sociology and political science students towards quantitative methods. These data, gathered by Williams et al. (2009) and shared via the UK Data Archive, are amongst the most comprehensive ever collected on the attitudes of undergraduates to QM. Crucially for our aims, the students were asked whether they felt anxious about learning statistics. This made it possible to interrogate these data to explore in detail the relationship between age, gender and anxiety about statistics.

The methods we used for the analysis are the same general techniques that many social science undergraduates will learn about during their own quantitative methods courses – logistic regression models and bivariate analysis. Our paper provides a simple applied account of these methods, which would be a relevant example in learning-teaching settings.
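
To give a flavour of what this looks like in practice, a minimal Stata sketch (using hypothetical variable names, not the actual code from the paper) would be something like:

* anxious = 1 if the student reported anxiety about learning statistics;
* agegroup = under 25 vs 25 and over (both variable names are hypothetical)
logit anxious i.sex##i.agegroup
margins i.sex#i.agegroup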

The results indicate that it is older men, not women, who are most likely to report experiencing anxiety about statistics in social science contexts. This is only apparent when considering the interaction between age and gender; without this interaction there is no difference between men and women in the likelihood of experiencing statistics anxiety.

It is therefore possible that young men, who are less anxious, have driven the gender differences that have previously been reported in research. That is to say, rather than experiencing excessive anxiety, women may seem more anxious in previous studies because they are being compared with a group of more complacent young men.

The results call into question the potentially damaging 'anxious women' narrative that predominates in the literature on the teaching and learning of maths and statistics. We suggest that this paradigm may be misleading, distracting and an oversimplification. Despite the research focus on statistics anxiety, there is no strong evidence that it has a meaningfully negative influence on the learning of statistics for those on social science courses. By comparison, the pedagogical implications of an issue like complacency in this context have received little consideration. Overall, we argue that it is time to move away from the perception that women studying social sciences are excessively anxious about statistics. Our findings suggest that this is a myth in need of busting.

Photo by Priscilla Du Preez on Unsplash

 

Data driven: the employment rate of 16 to 24 year olds

Kevin Ralston 2019, York St John University

This series of posts will apply nationally collected, representative data to highlight some of the trends underlying the official employment rate.

This post employs data from Labour Market Statistics and the Labour Force Survey to chart the youth employment rate. The data are freely available from the Office for National Statistics (ONS) and the UK Data Service.

Figure 1: Youth_unemployment (available here)

The UK Government has been celebrating what is described as a record-high employment rate. The most recent estimate of the employment rate is 75.8%. It seems to be unalloyed good news. Yet a high employment rate is only part of the story. The high top-line level of employment has been accompanied by stagnating wages and increasing levels of extreme poverty. People have paid work, but many are still getting poorer.

Figure 1 illustrates that this high level of employment is not experienced equally by all groups. The youth employment rate is only 54%. The male youth employment rate is around ten percentage points lower than it was in 2001. The female youth employment rate is around seven percentage points lower than it was in 2001.

The youth labour market was already contracting from 2001-2002. The figure also indicates just how catastrophic the Great Recession of 2008 was for young people's employment prospects.

The chances of young men and women making an early transition into the labour force have declined substantially in the last twenty years. This is worth bearing in mind when you see it asserted that the UK employment rate is high. It is, but there are substantial problems underlying the top-line trend.

Technical note:
The data from 1993 to 2017 were published by ONS as Labour Market Statistics. This did not, at the time of writing, include information for 2018. The 2018 figure was taken from the first three quarters of the 2018 Labour Force Survey. In estimating the rate, the individual person weight [PWT17] was applied.
The code to generate the graph and the data are available on GitHub
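
For anyone who wants to produce a comparable estimate without visiting the repository, a minimal sketch of the calculation might look like the following (ILODEFR and AGE are assumed LFS variable names and the file name is hypothetical; PWT17 is the person weight named above):

use lfs_2018_q1.dta, clear
* flag those in employment under the ILO definition (assumed coding: 1 = in employment)
gen byte employed = (ILODEFR == 1) if !missing(ILODEFR)
* weighted employment rate for 16 to 24 year olds
mean employed [pweight = PWT17] if inrange(AGE, 16, 24)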

 

Funky data and complicated models

A Mediation analysis of a Poisson outcome with a binary mediator in Stata, using the PARAMED module

Kevin Ralston 2019, York St John University

This blog examines options available for undertaking mediation analyses in SPSS, R and Stata. Mediation analysis is a growing area of interest, and the blog considers the case of a mediation analysis with a count outcome and a binary categorical mediator. It is shown that the functionality to undertake analysis of such data is available in all three packages, and an example is given in Stata.

Background

During the long summer of 2018 a colleague contacted me asking if I knew how to undertake mediation analysis. Off the top of my head I did not, but a few years ago I had attended a course on causal modelling in Stata (it was a later version of this course). The functionality demonstrated on the course had only just become available in the most recent versions of Stata (13, I think). I had a couple of meetings with Dr Davis, who led the research, in which they explained what was needed.

It was very interesting. They are psychologists studying the adult experience of hallucinations and whether this is predicted by childhood imaginary play partners (imaginary friends). They wanted to know whether the relationship they observed in modelling, between having an imaginary friend in childhood and subsequent experience of hallucinations in adulthood, was mediated by having experienced abuse (‘childhood adversity’).

Sharma (2015) explains that mediation analysis refers to the estimation of the indirect effect of X on Y through an intermediary mediator variable M causally located between X and Y (i.e., a model of the form X → M → Y), or, as the graphic below describes it, IV → MV → DV.

This is exactly what they were looking to do. There were some complications, however: they wanted the outcome variable to be modelled as a Poisson count of the level of hallucination severity, the mediator was a binary indicator of abuse, and the explanatory variable was also a binary indicator of whether the individual had had an imaginary childhood friend. There were additional categorical control variables for gender and income.

mediation_gr-2

Finding an appropriate model

Mediation analysis has a long history: a version of it was outlined by Sewall Wright in 1934. As is often the case with statistical methods, however, the fact that extremely clever people were able to demonstrate the possibility of a method several generations ago is unfortunately not equivalent to making it accessible to those of lesser mathematical knowledge. Indeed, it is only in recent decades that the application of mediation analysis has expanded in fields such as psychology. This has been contingent on the growth of computing power, along with software which renders these tools accessible to applied analysts. Even given current computing power and the ability of standard statistical software to handle mathematical models of increasing complexity, mediation analysis has only relatively recently been absorbed into the most commonly used statistical packages.

Those involved in statistical analysis know that R is fantastically powerful software. It is very versatile and can handle a substantial range of data and models via packages such as mediation; the main drawback is that it requires a relatively high threshold of user knowledge to work in the environment. Stata comes somewhere between SPSS and R. Undertaking analysis in Stata requires more of a learning curve than SPSS but offers superior functionality and modelling capability; it offers less versatility and modelling capability than R, but nevertheless covers a wide variety of possibilities that are likely to meet the needs of most social scientists.

The mediation package in R would certainly do what we needed, but my experience as an analyst is that running into a problem in R can lead to substantial project delay. The knowledge threshold that R requires is comparatively high, so overcoming problems demands substantial outlays of time and mental capacity. This is not in itself an issue, but sometimes you have time to spend, and sometimes you need a result! My colleagues wanted the final piece of analysis for their paper.

A presentation by Grotta and Bellocco provided the solution in Stata. The presentation outlined several approaches to mediation analysis, including the PARAMED module. PARAMED enables mediation analysis with non-continuous outcomes, such as our Poisson count. Checking the documentation that accompanied the installed module confirmed that it allowed for exactly the model my colleagues needed. It also turned out that the PARAMED module is based on SAS and SPSS macros for running mediation analysis, so equivalent functionality exists in both SAS and SPSS.

I cannot claim to be a deep expert in SPSS, but that is the package the team were working in. They had been taking a look at the SPSS macro PROCESS. I understand that PROCESS has been written to allow mediation analysis and that version 3 handles categorical dependent variables. The developer of the PROCESS macro points those interested in using it to their book, Introduction to Mediation, Moderation, and Conditional Process Analysis. Although it looks as though the analysis would be possible in PROCESS, I have not yet found a worked example, though there may well be one in the book.

It is apparent that functionality to undertake the modelling required is generally available. Stata is my preferred package and an example of a Mediation analysis in Stata is given below.

Analysis

The analysis used Stata 15. Variables specified in the analysis are listed below. The variable names are somewhat esoteric, sorry about that:

  • UHRSUMPER is the count measure of hallucination severity
  • ICTWOWAY is the binary measure of imaginary childhood companion (described as CIC status – childhood imaginary companion)
  • SUMADVERSITY2WAY is the binary indicator of abuse
  • Income is in three categories
  • Male is a binary indicator of sex (men/women)

The first thing I did was to install PARAMED and check the documentation and help files.

ssc install paramed
help paramed

I tried various model specifications, starting with the simplest.
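
For instance, a first pass without the covariates or the bootstrap might look something like this (a sketch only; the specification actually used is the full model below):

paramed UHRSUMPER, avar(ICTWOWAY) mvar(SUMADVERSITY2WAY) a0(0) a1(1) m(1) yreg(poisson) mreg(logistic) nointer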

The Stata code below specifies the full model:

paramed UHRSUMPER, avar(ICTWOWAY) mvar(SUMADVERSITY2WAY) cvars(under10k ten_25k  male) a0(0) a1(1) m(1) yreg(poisson) mreg(logistic) nointer boot seed(1234)

paramed invokes the paramed routine in Stata; yreg(poisson) specifies that the dependent variable is a count; mreg(logistic) specifies that the mediator is binary; and cvars(under10k ten_25k male) includes the dummy variables for income and sex as controls. The code a0(0) a1(1) m(1) specifies the levels of the explanatory variable and the mediator at which effects are computed, and nointer specifies that no exposure-mediator interaction is included. boot requests a bootstrap procedure to compute bias-corrected bootstrap confidence intervals, and seed sets the seed for the bootstrap.

Mediation results output from Stata

       Estimate    Std. Err.   P>|z|   Lower 95% CI   Upper 95% CI
cde    1.253955    .10531446   0.032   1.0200853      1.5414429
nde    1.253955    .10531446   0.032   1.0200853      1.5414429
nie    1.088400    .03164556   0.007   1.0229427      1.158046
mte    1.3648047   .10552532   0.003   1.1098021      1.6784

cde = controlled direct effect, nde = natural direct effect, nie = natural indirect effect, mte = total effect

It was reported in the paper: 'The relationship between CIC status and hallucination symptoms was mediated by childhood adversity where the total effect was significant (Estimate = 1.36, CI, 1.11 to 1.68) p = .003, as well as the natural direct effect (Estimate = 1.25, CI, 1.02 to 1.54) p = .032, and the natural indirect effect (Estimate = 1.09, CI, 1.02–1.16) p = .007.'

Conclusions

This blog has discussed some of the options available for undertaking mediation analyses in SPSS, R and Stata. An example of a potentially problematic mediation analysis of a Poisson outcome has been outlined, and it has been shown that Stata can handle a tricky model like this via the user-written program PARAMED. In addition to giving readers an insight into the options available for mediation analysis, the blog provides an opportunity to give due credit to the authors of the PARAMED module, Richard Emsley and Hanhua Liu. Unfortunately the journal that published the research article would not allow the inclusion of the reference for the PARAMED module, although we were able to name-check the module in the text of the article. I have uploaded a pre-publication version of the paper with the reference attached, and the full reference is provided below. Thank you Professor Emsley and Dr Hanhua Liu.

The co-authors on the research article are Paige E. Davis, York St John University; Lisa A. D. Webster, Leeds Trinity University; Charles Fernyhough, Durham University; Helen J. Stain, Leeds Trinity University; and Susanna Kola-Palmer, University of Huddersfield.

**

Richard Emsley & Hanhua Liu, 2013. “PARAMED: Stata module to perform causal mediation analysis using parametric regression models,” Statistical Software Components S457581, Boston College Department of Economics, revised 26 Apr 2013.

A Categorical Can of Worms III

Examining categorical interactions in logit models using Marginal estimates and Marginsplot

Kevin Ralston 2018, York St John University

Introduction

This post is the third in a series of blogs which examine parameterisations of interactions in logit models. The first post outlined the generic, ‘conventional’ approach to including categorical interactions in logit models. The second post outlined an alternative specification of a categorical interaction in a logit. The current post outlines the application of marginal estimates and the marginsplot graph in the examination of categorical interactions in logit models.

Marginal estimates

Marginal estimates of categorical data are now part of the standard toolbox in sociological research outputs. Margins produce estimates which have a ready interpretation. This is helpful because, as we have seen, working out what a model is showing us when an interaction is included is not straightforward. Williams (2017) explains what a marginal probability shows us in a logit model:

In the logit, marginal results report the probability that a case is in the category coded 1 on the outcome. The MEM [marginal effect at means] for categorical variables therefore shows how P(Y=1) changes as the categorical variable changes from 0 to 1, holding all other variables at their means.

quietly logit class3 i.sex##i.ft i.qual c.age
margins i.sex#i.ft

To produce marginal estimates at means we first estimate the basic model we have specified previously, prefixing it with quietly so that Stata does not print the model output again (we have seen it already). We then follow this with the margins command, listing the variables included in the interaction.
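
As an aside, the MEM for a single categorical variable, as described in the quote from Williams, can be obtained directly with the dydx() option. A sketch using the same model:

quietly logit class3 i.sex##i.ft i.qual c.age
* discrete change in P(class III) as sex moves from its base level to its other level,
* holding the other covariates at their means
margins, dydx(sex) atmeans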

Table 1, Stata output, marginal estimates at means for an interaction from a logistic regression modelling membership of social class III, including independent variables sex, has a qualification, working full-time or part-time and age, also an interaction between sex and working FT/PT. Source is GHS 1995, teaching dataset
Margins1

In this case the margins are interpreted as the probability that an individual in each combination of categories is in social class III, at the average values (means) of the other variables included in the model.

A standard criticism of marginal estimates at means is that the average values at which the estimates are calculated may have no substantive meaning. For example, this model includes a categorical measure of whether an individual has qualifications or not. By coincidence this variable is balanced close to 50% in each category. In a model where, say, 30% of cases had no qualifications, the marginal probabilities would be computed for an individual who is '30% no qualifications'; in this model they are computed for an individual who is roughly '50% no qualifications'. This is problematic because we are referring to discrete categories: someone who is 50% no qualifications cannot exist.

quietly logit class3 i.sex##i.ft i.qual c.age
margins i.sex#i.ft, at(qual=1) post

It is also possible to estimate the margins at specific values of the independent variables, such as qualifications. These have been described as adjusted predictions or predictive margins. This is the specification I prefer, as it offsets the criticism made above. It does not, however, mean that anyone in the data necessarily occupies the combination of categories in the model: there may still be no part-time male workers with no qualifications at the mean age of the sample. If there were, we would expect them to have a probability of occupying social class III of .178 (quite low, closer to 0 than 1).

Table 2, Stata output, adjusted predictions for an interaction from a logistic regression modelling membership of social class III, including independent variables sex, has a qualification, working full-time or part-time and age, also an interaction between sex and working FT/PT. Source is GHS 1995, teaching dataset
Margins2

Marginsplot

The margins command has neat graphing functionality via marginsplot.

Figure 1 is a graphic of the marginal probability at means of being in social class III for the working full-time/part-time and sex interaction. The code for this is reported below.
Margins3

logit class3 i.ft##i.sex i.qual c.age
margins i.ft#i.sex
marginsplot, name(g2, replace) scheme(s1mono) ///
    title("Margins of ft/pt working and sex interaction") ///
    subtitle("Outcome: member of social class III") ///
    legend(pos(7) ring(0)) ///
    xtitle("") ytitle("") ///
    xlabel(, angle(45)) ///
    caption("Source: GHS 95 teaching dataset")

 

To produce this graph you might notice that I switched the position of the ft and sex variables in the model. The graphical specification seems more sensible with ft/pt depicted on the x-axis, showing the differences within and between men and women. Maybe I should switch all the models so they are consistent. I had originally included sex in the model first for two reasons. Firstly, people have a biological sex and a socially constructed gender, which influence their experience and choices before they have a full-time or part-time job. Secondly, gendered occupational segregation is the area of substantive interest.

Building an analysis is an iterative process. There are good reasons to include sex before ft in the model, but in this case the interaction is presented more sensibly when organised i.ft##i.sex. Constructing an analysis often involves making small decisions and trade-offs like this.

Conclusion

In conclusion, I would suggest that anyone fitting categorical interactions in logit models should both apply and report the marginal estimates. These have ready and relatively straightforward interpretations. They are certainly more intuitive than the raw output of a conventionally specified categorical interaction in a logit model in Stata.

Suggested reference should this post be useful to your work:

Ralston, K. 2018. A categorical can of worms III: Examining categorical interactions in logit models using Marginal estimates and Marginsplot. The Detective's Handbook blog. Available at: thedetectiveshandbook.wordpress.com/2018/10/15/a-categorical-can-of-worms-iii/ [Accessed: 15 October 2018].

 

Feynman on sociology

Kevin Ralston, York St John University, 2018

Richard Feynman

Richard P. Feynman (1918-1988) was a theoretical physicist who was part of the team that worked on the atomic bomb at Los Alamos. This year marked the 30th anniversary of his death. He won the Nobel Prize in Physics in 1965, which he shared with two others. He studied at MIT and Princeton before taking posts at Cornell and Caltech. By the time of his death he was one of the most famous scientists in the world.

From the standpoint of today, Feynman seems like an exceptionally high-spirited academic who had many diverse interests. Within physics he developed pedagogical materials and programs of study. Beyond physics he was involved in selecting resources for the high school science curriculum and sat on the inquiry into the Challenger space shuttle disaster. At times he also wrote about what he considered the contribution made by non-scientific fields of study. Perhaps, then, it is worth taking note of what a Nobel laureate wrote about an encounter he had with sociology. This occurred at a conference where he was the scientific representative among academics from various disciplines who had been brought together to discuss the ethics of equality.

There was this sociologist who had written a paper for us all to read ahead of time. I started to read the damn thing, and my eyes were coming out: I couldn’t make head nor tail of it! I figured it was because I hadn’t read any of the books on the list. I had this uneasy feeling of “I’m not adequate,” until finally I said to myself “I’m gonna stop, and read one sentence slowly so I can figure out what the hell it means.”

So I stopped-at random-and read the next sentence very carefully. I can’t remember it precisely, but it was very close to this: “The individual member of the social community often receives his information via visual, symbolic channels.” I went back and forth over it, and translated. You know what it means? “People read.”

Then I went over the next sentence, and realised that I could translate that one also. Then it became a kind of empty business: “Sometimes people read; sometimes people listen to the radio,” and so on, but written in such a fancy way that I couldn’t understand it at first, and when I finally deciphered it, there was nothing to it. 

Richard P. Feynman 1989 ‘Surely you’re joking, Mr. Feynman’, Unwin: London

As a sociology undergraduate and postgraduate I read a lot of theory in the original. I read Marx's Capital, the Penguin Classics edition, in three volumes. I read Foucault, and I remember quoting from the text in a seminar; the lecturer commented on how unusual it was to have a student do so. I read Ernest Mandel's Marxist Economic Theory. I was reading this as a postgraduate and got about half-way through, but by this point my views on the importance of reading this stuff in minute detail were shifting. I had been making notes in the margins; if I were to dig the book out of the box it is in, I could still find where I stopped reading! This is not an attempt to show off, but to establish that I have done some hard yards on theory and believe I have earned the right to be critical.

In my view a substantial proportion of sociology is exactly as Feynman described in the quote above. It is an exercise in obfuscation and the needless use of complex language for its own sake. It is a self-reinforcing construct (by this I mean that so many people are engaged in this that they perpetuate the practice in their own interests), intended to appear as if something important or profound is being communicated when, in reality, what has been written is mundane or simply empty. I can understand that people who have found a way to get paid 50k or 90k, say, to write in a stylised manner about general social life would logically keep that going, particularly if they are being told by their peers how wonderful their work is. For those not being directly paid (students, the public, Nobel laureates) there are almost certainly more useful things they could be doing than translating sociology into sensible language.

The final part of my own move away from believing that it is important to spend time deciphering the type of sociology that people have purposely worked to make difficult to understand was reading Colin Mills' blog on blah blah sociology. For me, blah blah sociology is writing in which the aim has become to express things in an obscure manner. Here Mills lamented the reality of a sociology conference where 'None of the talks seemed to have much truck with carefully articulated questions addressed with appropriate empirical evidence.' If you read Feynman, this is exactly the issue he had with the conference he attended on the ethics of equality. Feynman's description anticipated blah blah sociology perfectly.

Feynman and his wife, Gweneth Howarth, at the Nobel ball 1965.


 

Mortality by occupation: Is occupation no more than a convenient category?

Kevin Ralston, York St John University 2018

Like me, the sociologists I have worked with tend to place occupation at the centre of their examinations of the social world. This reflects a belief in the prominence of occupation as an indicator (and often determinant) of outcomes in people's lives. This belief is not necessarily shared by those from other disciplines.

I was fortunate to be involved in recently published work which estimated mortality in the UK by occupational group[1]. The research was led by Dr Srinivasa Vittal Katikireddi, and the analysis was undertaken for the field of public health.

A response to our article, published in the Lancet, asked the question 'why choose occupation as the category for analysis? Why not, for example, analyse according to main hobby, or main place of shopping? The answer is partly because occupational data are available' (Jessop 2017). The piece argued that categorising people by their main job is ambiguous and that other classifications, such as measures based on hobbies or shopping location, may produce more useful insights.

It is certainly possible to hypothesise causal pathways between shopping habits or hobbies and mortality. If we knew the average saturated fat content of the weekly shop we could predict an increased likelihood of a number of diseases and begin to think about specific public health interventions to influence levels of fat consumption. Similarly, whether people regularly participate in fun habits that involve groups and/or physical activity correlates with mental wellbeing and physical health. Knowledge of factors that stimulate involvement in sports or social networks can be used to improve health outcomes.

That being said, it is unlikely that general measures of hobbies or place of shopping would tell us more than knowing an individual's occupation. Connelly et al. (2016) describe occupation as the 'most powerful single indicator of levels of material reward, social standing and life chances'. Indeed, occupation is likely to be a reasonable proxy for hobby types and is associated with shopping habits. What is more, people's hobbies and shopping habits are outcomes influenced by occupational position. We know that social class background indicates whether people shop at Waitrose, play the violin or are members of a golf club. On the other hand, it is difficult to imagine a realistic scenario in which shopping at Sainsbury's, being a keen angler or belonging to a book club could have a systematic influence on whether people are employed as teachers, carers or medical doctors.

The ongoing importance of public health analyses based upon occupation could be defended on a number of grounds. Occupational analyses have a grand, long-run and robust theoretical underpinning; this is something categories such as hobby or favoured supermarket do not offer. This blog will not take the direction of constructing an argument in favour of occupation based on theory. Instead it makes a short, general, empirical justification in support of the use of occupation in public health analyses. I am (May 2018) working on a follow-up paper to our research examining mortality by occupation. I thought I would take a break from this to present a small piece of analysis which demonstrates something of the strength of association between an occupationally based measure and mortality.

Data

The data are from the ONS Longitudinal Study (LS), which contains linked census and life-event records for a 1% sample of the population of England and Wales. The LS has linked records at each census since the 1971 Census, for people born on one of four selected dates in a calendar year. These four dates were used to update the sample at the 1981, 1991, 2001 and 2011 Censuses. Life events data are also linked for LS members, including births to sample mothers, deaths and cancer registrations. New LS members enter the study through birth and immigration (if they are born on one of the four selected birth dates). From these data we have taken a sample of those present at the 2001 Census only. Death of a sample member is linked from administrative records. The outcome variable is the age-standardised all-cause mortality rate (per 100,000 person-years). The sample comprises men aged 20-59 years. Additional information on the sample can be found in the paper.

Occupation was self-reported in the 2001 Census, in response to the question "What is the full title of your main job?". Responses to this question were used to derive Standard Occupational Classification (SOC) 2000 codes, which are readily available in the data. The follow-up period for death was until 2011. Because of disclosure control issues we used SOC at the three-digit 'minor' level. There are 81 occupational groups coded at this level, of which we were able to report on 59. From this we calculated European age-standardised mortality rates and 95% confidence intervals by occupational group. The three-digit SOC codes were then used to apply a CAMSIS score to each occupational group. CAMSIS is an occupationally based measure of social stratification, in the form of a scale of social distance and occupational advantage. More advantaged occupations score more highly on the scale, which ranges from 0 to 100 and is designed to have a mean of 50 for the general population (if you have not heard of or used CAMSIS before I suggest you check it out HERE; I highly recommend the measure).
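
As a sketch of that final step, attaching CAMSIS scores amounts to merging a score lookup onto the three-digit SOC codes (the file and variable names here are hypothetical; CAMSIS lookup files are distributed for each SOC scheme):

* soc3 = three-digit SOC2000 minor group (assumed variable name)
merge m:1 soc3 using camsis_soc2000_minor.dta, keep(match master)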

Results

Figure 1
CAMSIS_MORTALITY_20180529

Figure 1 describes the relationship between CAMSIS and the mortality rate. The first graph shows estimated mortality with confidence intervals for each occupational group. The second shows only the point estimates for the occupational groups, with a linear fit line and a quadratic curve of the association with mortality. A strong correlation is evident between CAMSIS and the mortality rate (-.79). Although there is a good deal of overlap in confidence intervals for many occupations, the pattern of association is clear: more advantaged occupations tend to have lower estimated mortality.
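
The second panel of Figure 1 can be produced with standard Stata graphics; a sketch along the following lines (hypothetical file and variable names: asmr is the age-standardised mortality rate, camsis the CAMSIS score, one row per occupational group):

use occ_mortality_by_soc3.dta, clear
correlate asmr camsis
twoway (scatter asmr camsis) (lfit asmr camsis) (qfit asmr camsis), ///
    xtitle("CAMSIS score") ytitle("Age-standardised mortality per 100,000 person-years") ///
    legend(order(2 "Linear fit" 3 "Quadratic fit"))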

Conclusion

Correlation is not causation. Occupation in conjunction with all-cause mortality is limited in its utility for 'explaining' the gradient in mortality observed. The estimated differences will be due to a range of factors, many of which are not directly attributable to the occupation but which may be materially associated with it. That being said, it is certainly possible to identify direct, testable hypotheses based on occupation. For example, recent work has shown that it is likely that firefighters experience increased rates of cancer because of contaminated equipment. This built upon more general work noting a higher incidence of cancer amongst firefighters. Questions I often wonder about, but have not had time to take further, include: what is the risk of serious respiratory disease to delivery drivers who work in large cities versus those in rural areas? Are those in the new gig economy disproportionately affected?

These are points similar to those made by Jessop in commenting on our article. Nevertheless, it is necessary to firmly rebut the idea that we study occupation simply because it is what is available (whilst measures of hobby or favoured grocery shop are not). The small piece of analysis here demonstrates something of the magnitude of the association between an occupationally based measure and a measure of mortality. This is in line with Connelly et al.'s (2016) description of occupation as the 'most powerful single indicator of levels of material reward, social standing and life chances'. There has been a long history of interdisciplinary overlap between sociology and public health, and there is great potential for research drawing sociologically upon occupation as a basis for analyses of public health outcomes. Far from being a category that should be replaced, I would suggest occupation remains under-exploited in public health research.

Acknowledgments

This study received no specific funding. SVK is funded by an NHS Research Scotland Senior Clinical Fellowship (SCAF/15/02). SVK and AHL are funded by the Medical Research Council (MC_UU_12017/13 & MC_UU_12017/15) and the Scottish Government Chief Scientist Office (SPHSU13 & SPHSU15). DS is funded by the Wellcome Trust Investigator Award (100709/Z/12/Z) and the European Research Council (HRES-313590).

The permission of the Office for National Statistics (ONS) to use the Longitudinal Study is gratefully acknowledged, as is the help provided by staff of the Centre for Longitudinal Study Information and User Support (CeLSIUS). CeLSIUS is supported by the ESRC Census of Population Programme (award reference ES/K000365/1). The authors alone are responsible for the interpretation of the data.

Statistical data from ONS is Crown Copyright. Use of the ONS statistical data in this work does not imply the endorsement of the ONS in relation to the interpretation or analysis of the statistical data. This work uses research datasets that might not exactly reproduce ONS aggregates.

[1] The paper was also co-authored by Prof Alastair H Leyland, Prof Martin McKee and Prof David Stuckler

 

Research with our trousers down: Publishing sociological research as a Jupyter Notebook

Roxanne Connelly, University of Warwick

Vernon Gayle, University of Edinburgh

On the 29th of May 2017 the University of Edinburgh hosted the 'Social Science Gold Rush Jupyter Hackathon'. This event brought together social scientists and computer scientists with the aim of developing our research and data-handling practices to promote increased transparency and reproducibility in our work. At this event we contemplated whether it might ever be possible to publish a complete piece of sociological work, in a mainstream sociology journal, in the form of a Jupyter Notebook. This November, six months after our initial idea, we are pleased to report that the paper 'An investigation of social class inequalities in general cognitive ability in two British birth cohorts' has been accepted by the British Journal of Sociology, accompanied by a Jupyter Notebook which documents the entire research process.

Jupyter Notebooks allow anyone to interactively reproduce a piece of research. Jupyter Notebooks are already used effectively in 'big science'; for example, the Nobel Prize-winning LIGO project makes its research available as Jupyter Notebooks. Providing statistical code (e.g. Stata or R code) with journal outputs would be a major step forward in sociological research practice. Jupyter Notebooks take this a step further by providing a fully interactive environment. Once a researcher has downloaded the required data from the UK Data Archive, they can rerun all of our analyses on their own machine. Jupyter Notebooks encourage the researcher to engage in literate programming by clearly documenting the research process for humans and not just computers, which greatly facilitates the future use of the code by other researchers.

When presenting the results of social science data analyses in standard journal articles we are painfully confined by word limits, and are unable to describe all of the steps we have taken in preparing and analysing complex datasets. There are hundreds of research decisions made in the process of analysing a piece of existing data, particularly when using complex longitudinal datasets. We make decisions about which variables to use, how to code and operationalise them, which cases to include in an analysis, how to deal with missing data, and how to estimate models. However, only a brief overview of the research process and how the analyses have been conducted can be presented in a final journal article.

There is currently a replication crisis in the social sciences, where researchers are unable to reproduce the results of previous studies. One reason for this is that social scientists generally do not prepare and share detailed audit trails of their work which would make all of the details of their research available to others. Currently researchers tend to place little emphasis on undertaking their research in a manner that would allow other researchers to repeat it, and approaches to sharing details of the research process are ad hoc (e.g. on personal websites) and rarely used. This is particularly frustrating for users of infrastructural data resources (e.g. the UK's large-scale longitudinal datasets provided by the UK Data Service), as these data can be downloaded and used by any bona fide researcher. It should therefore be straightforward, and commonplace, for us to duplicate and replicate research using these data, but sadly it is not. We see the possibility of a future for social science research where we can access full information about a piece of research, and duplicate or replicate it, to ultimately develop research more efficiently and effectively, to the benefit of knowledge and society.

The replication crisis is also accompanied by concerns about scientific malpractice. It is our observation that p-hacking is a common feature of social science research in the UK; this is not a statistical problem but a problem of scientific conduct. Human error is another possible source of inaccuracy in our research outputs, as much quantitative sociological research is carried out by single researchers working in isolation. Whilst co-authors may carefully examine the outputs produced by colleagues and students, it is still relatively rare to ask to examine the code. In developing our Jupyter Notebook we borrowed two techniques from software development, 'pair programming' and 'code peer review'. Each of us repeated the research process independently using a different computer and software set-up. This was a laborious process, but labour well spent in order to develop robust social science research. The process made apparent several problems which would otherwise have been overlooked. At one point we were repeating our analysis whilst sharing the results over Skype, and frustratingly the models estimated in Edinburgh contained 7 fewer cases than the models estimated in Coventry. After many hours of investigation we discovered that we had used different versions [1] of the same dataset, downloaded from the UK Data Archive, which contained slightly different sample numbers.

We describe this work as 'research with our trousers down' [2], as publishing our full research process leaves us open to criticism. We have already faced detailed questions from reviewers which would not have arisen if they did not have access to the full research code. It is also possible that other researchers will find problems with our code, or question the decisions which have been made. But criticism is part of the scientific process; we should be placing ourselves in a position where our research can be tested and developed. British sociology lags behind several disciplines, such as politics and psychology, in the drive to improve transparency and reproducibility in our work. As far as we are aware there are no sociology journals which require researchers to provide their code in order to publish their work. It is most likely only a top-down change from journals, funding bodies or data providers that would shift the practices within our discipline. Whilst British sociologists are not yet talking about the 'reproducibility crisis' with the same concern as psychologists and political scientists, we have no doubt that increased transparency will bring great benefits to our discipline.

[1] This problem is additionally frustrating as the UK Data Service does not currently have an obvious version control protocol, and does not routinely make open sufficient metadata for users to be able to identify precise versions of files and variables. We have therefore recorded the date and time that datasets were downloaded in our Jupyter Notebook. Doubtless, the UK Data Service adopting a clear and consistent version control protocol would be of great benefit to the research community, as it would accurately locate data within the audit trail.
[2] We thank our friend Professor Robin Samuel for this apposite term.

Generations of worklessness, a myth that won’t die

Kevin Ralston, York St John University, 2017

The idea that there are multiple generations of the same family who have never had a job has popular, political and international resonance. In politics, UK Minister, Chris Grayling, is on record as stating there are ‘four generations of families where no-one has ever had a job’.

This belief in ‘generations of worklessness’ is often accompanied by the idea that there is an associated culture of worklessness. For example, Esther McVey, when she was Minister of State for Employment, made reference to the widespread notion that there is a ‘something for nothing culture’ among some of those claiming benefits.

Politicians of the red variety have also expressed similar sentiments. In a speech discussing levels of worklessness in the UK, former Labour Prime Minister Tony Blair claimed that, behind the statistics, there were some households with three generations who had never worked.

Ideas associated with generations of worklessness also regularly appear in the traditional UK print media. In 2013 the Daily Mail[1] reported a story about an individual who was convicted of burning down his house, which resulted in deaths. The report used his status as a benefits claimant in order to characterise living on welfare benefits as a 'lifestyle choice' for some. This point is irrelevant to the human tragedy described, but it is useful in spreading the notion of a benefits culture.


These recent examples have been foreshadowed by a long-running historical and academic debate. A report for the Department for Work and Pensions suggested that versions of ideas like generations or cultures of worklessness have been around for 120 years. Michael B. Katz argues that themes of these types have characterised U.S. welfare for 200 years.

In US politics the idea of the 'welfare queen' has been used to justify policy in a similar manner to the UK's 'benefits cheats' stereotype, and the general notion that there is a section of undeserving poor who should receive punishment or correction is a key aspect of neo-liberal politics.

Underclass theory provides a theoretical expression of the type of thinking present in the generations theses. Central to underclass theory is the idea that generations have been socialised into worklessness.  More widely the theory puts forward that problems of illegitimacy and crime negatively define sections of society (the underclass).

We have undertaken newly published research which searched for three generations of worklessness. This applied data collected over time to assess whether there is any truth in the sorts of claims made by people like Chris Grayling. The research was the first to use representative data (the British Household Panel Survey) to directly test whether three generations of worklessness could be identified in the UK. We found no evidence to support the belief that there are large numbers of families in which several generations have never worked.

Although ideas around generations of worklessness are widely expressed and have a long-running history, the evidence does not support the theory. Lindsey Macmillan, an economist from University College London, estimated the number of families, within the same household, in which there are two generations who have never worked. This was found to be a fraction of a percent. Other research has found similar results. A small-scale study, which also looked for three generations of worklessness within deprived areas, could not find any such families.

The idea that there are generations of workless people, living in a culture of worklessness, creates a picture of large numbers of people trained to expect 'something for nothing'. Arguments made in support of this type of thinking tend to be self-serving and are used to push an agenda that ignores the structural problems which lead to people being unemployed.

The available evidence is against the existence of generations of worklessness. There is an ethical imperative on those involved in journalism, or in formulating policy, to at least have an awareness of this evidence. Those in these fields who maintain these ideas are, at best, ignoring the available evidence and, at worst, wilfully misrepresenting reality.

In the absence of supporting evidence it is time to end over a century of debate. We need to do away with the pathological idea that there are large numbers of people in receipt of welfare benefits because they come from families that are too lazy to work.

 

[1] I have included this link here in a footnote, as I do not wish to encourage people to visit the Daily Mail web site and contribute to their advertising revenue: http://www.dailymail.co.uk/news/article-2304804/Mick-Philpott-benefits-culture-David-Cameron-backs-George-Osborne-saying-arson-case-raises-questions-welfare-lifestyle-choice.html <accessed 30/01/17>

Stata 15 Dynamic Documents: ‘.do files on steroids’

Roxanne Connelly, University of Warwick


Currently the transparency of social science research is poor, particularly in sociology. We tend to place little emphasis on undertaking research in a manner that would allow other researchers to repeat it, and approaches to sharing details of the research process are ad hoc and rarely used. To improve the transparency and reproducibility of sociological research I believe a step-change is required, not only in the way we present the results of our research, but in the research process itself. Producing documentation for replication throughout the research process seems to be a key way in which we can move transparency from being an afterthought to being front and centre in our research conduct.

Building research transparency into the research process is not new, and borrows from the principles of literate programming introduced by Knuth (1992) in the field of computing science. Literate programming involves the weaving of narratives directly into live computation, interleaving text and documentation (beyond simple comments) with code and results to construct complete and transparent computations. The goal is to explain to humans, rather than machines, in natural language, what processes are being undertaken. The idea of literate programming has been taken up within the scientific computing community as a means to share self-documenting reproducible workflows but is very rarely implemented in sociology.

There are some packages available that can facilitate this type of literate programming for social science research. A notable example is Jupyter Notebooks, a web-based application that supports literate programming in a wide variety of languages (over 50 at present), including data analysis languages widely used for longitudinal social science research (i.e. R and Stata). Jupyter notebooks can run code from different computer programs in a language agnostic environment and can incorporate text and images. These notebooks can be shared and researchers can re-run the notebook and examine the results for themselves. An introduction to Jupyter Notebooks is available here. I am a big fan of Jupyter Notebooks, but currently an important drawback of this application is that it is difficult to install and there is a steep learning curve to get it working, particularly for those of us with limited computing science skills.

There are other packages available within specific statistical computing environments that allow the combination of code, outputs and free text, e.g. R Markdown and knitr within R, or MarkDoc and Weaver in Stata. My main package is Stata, so I was very excited to hear that the latest release (Stata 15) incorporates the capacity to create dynamic documents using Markdown. This allows you to mix Markdown with Stata commands and create a document that interweaves the commands, output and text. Stata describes this as 'a do-file on steroids.'

This blog provides an initial demonstration of Stata's dynamic documents in action, and may serve as a useful start-up guide for some. I may add another blog once I have used it for the complete workflow of a real piece of data analysis. Here I describe the use of dyndoc, which turns a plain text document into an HTML document; there are also putdocx (to create Word documents) and putpdf (to create PDF files), but I have not looked at these yet.

Using dynamic documents is straightforward. First you create a plain text file containing the text you want in the document, along with the code. This file can include standard Markdown to create text formatting (e.g. bold, italics). When you have completed this file you run the dyndoc command (shown below) and your plain text file will be converted into an HTML document. You could then convert this to a PDF document using an HTML-to-PDF converter.

. dyndoc filename.txt, replace

To incorporate Stata code and output you use 'tags' in the plain text file which indicate whether the commands, and whether their output, should appear in the final document. To get the document formatted nicely you need to download the stylesheet 'stmarkdown.css' and the file 'header.txt' and save them in your working directory.
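
To give a flavour, a minimal source file might look something like the following (the tag names are as I understand them from the Stata 15 documentation; check help dyndoc for the exact syntax):

<<dd_version: 1>>
<<dd_include: header.txt>>

Some text, with *Markdown* formatting, describing the analysis.

<<dd_do>>
sysuse auto, clear
summarize mpg weight
<</dd_do>>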

Here is my plain text file: blogexample

Here is the file that is produced by dyndoc (saved as a pdf to post): blogexamplehtml2pdf

I am really impressed with dyndoc: it was super quick to learn and provides a really straightforward way to improve the reproducibility of your work. Right now I anticipate that I will use it to create a document that can be attached as supplementary material to journal publications. A dyndoc would greatly surpass a log file or .do file as a reader-friendly way to present the complete workflow of a piece of research. Of course, the effectiveness of a dyndoc for enabling reproducibility also requires the researcher to put the work in to provide sufficient annotation and description throughout the file. But if the dyndoc is cultivated throughout the research process this could be relatively painless.

There may be more eloquent ways to make use of dynamic documents in Stata and I am sure I will pick up more tricks as I use this more. I welcome comments from more experienced users of dyndoc!

A categorical can of worms II: Examining interactions in logit models in Stata

An ‘alternative specification’ of a categorical by categorical interaction

Kevin Ralston 2017

Introduction

This post outlines an alternative specification of a categorical interaction in a logit model. It is the second post in a series which considers options for specifying categorical interactions in logit models. The first post outlined the generic, 'conventional' approach to including categorical interactions in logit models. That model included an interaction between sex and full-time/part-time working and is included below as Additional Table 1. In that model the values reported for the sex category and the full-time/part-time category described a contrast between a comparison category and a base category. The base category in an interaction constructed like this is a composite of the base categories of both variables included in the interaction. In this case the coefficient for the interaction term reports how much the association changes at different levels of the other explanatory variable (Kohler and Kreuter 2009). This highlighted that the interpretation of interactions in logit models is different from the interpretation of interactions in ordinary least squares (OLS) models.

Data

The data used are consistent with the data more comprehensively described in the first blog post, which outlined the conventional interaction, and can be found here. The data are from the General Household Survey 1995 teaching dataset (Cooper and Arber 2000) – see Table 1. The dependent variable is dichotomous, derived from Registrar General's Social Class, indicating whether or not an individual is a member of class III. Age is included as a linear continuous variable; qualifications is dichotomous, indicating whether an individual has any qualification or none; sex is a male/female dichotomy; and a final variable indicates full-time or part-time working.

Table 1, distributions of variables of interest

CCW_II_T1

An 'alternative' parameterisation of a categorical by categorical interaction

An alternative way to specify this interaction is to estimate a model that defines all possible combinations of the categories of the variables included in the interaction. I describe this here as an 'alternative specification' of the interaction. Examples of this specification were provided in Additional Table 2 of the first post, as a means to check comparisons between categories.

Table 2 displays the alternatively specified interaction. In this instance the model was produced in Stata 13 by placing just one hash (#) between the variables to be interacted (i.sex#i.ft). It is also possible to create a composite variable of sex and ft, which is equivalent, and to include it in the model as a factor variable or as dummy categories, as sketched below.
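
A sketch of the composite-variable route, using the same variable names as above:

* build a single four-category variable from sex and ft, then enter it as a factor
egen sexft = group(sex ft), label
logit class3 i.sexft i.qual c.age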

The model in Table 2 is statistically identical to the model reported as the conventional interaction (Additional Table 1). For example, log likelihoods and pseudo R2 are exactly the same. The coefficients, and other estimated statistics, for the explanatory variables age and qualification, are also identical between the models. How the interaction term is reported in the model is different, however.

logit class3 i.sex#i.ft i.qual c.age

Table 2, Stata output, logistic regression modelling membership of social class III, including independent variables sex, has a qualification, working full-time or part-time and age, also an alternatively reported interaction between sex and working FT/PT. Source is GHS 1995, teaching dataset

CCW_II_T2

The alternative specification of the interaction, in Table 2, shows the estimates associated with combinations of the categories of sex and working full-time/part-time. There is a reference category and, again, in this instance it is men working part-time. It is often desirable to alter the reference category to check or describe contrasts of interest. In an analysis you may choose the reference category depending on the number of cases in each category; sometimes there may be a gradient of increasing estimates which tells a story and is neat to show; alternatively, you may choose the reference category because the contrasts are important for answering a research question.

Here the male/part-time category is the reference category and is contrasted with male/FT, female/PT and female/FT. In the model it can be seen that the coefficients for male/working full-time and female/working part-time are the same as the coefficients for sex and ft reported in the conventional interaction (Additional Table 1). The interaction in Table 2 also contains a category of females working full-time, which is not significantly different from the reference category.

The interaction as specified in Additional Table 1 is conventional in the sense that it is specified in the manner in which interactions in OLS models are generally specified. I find the alternative specification, in Table 2, preferable in helping to think through what a categorical by categorical interaction is showing. This parameterisation is not discussed by Kohler and Kreuter (2009), and Royston and Sauerbrei (2012) do not recommend it. In my view, however, the alternative specification provides clearer information: it is immediately apparent what the reference category is and what the contrasts represent.

It is also useful to switch the reference category and/or to estimate quasi-variances (see Connelly 2016) to check substantive associations. If you do this and take the time to think through the results, you are likely to build a strong understanding of the associations the model is representing. You are also likely to catch mistakes.
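
Switching the reference category only requires the ib. factor-variable prefix; for example (a sketch; which value to set as the base depends on how sex and ft are coded in the data):

* make the value 2 of ft the base category for the interaction
logit class3 i.sex#ib2.ft i.qual c.age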

Conclusions

This post outlines an 'alternative specification' for including categorical by categorical interactions in logit models. This is contrasted with a conventional specification (from the first post in this series). The alternative specification is shown to have a benefit over the conventional specification in that there is an intuitive interpretation for the levels of the interaction. As part of a sensitivity analysis I currently recommend that a researcher should model a categorical interaction using a range of specifications, including the 'alternative specification' outlined here. Being able to see the levels of the interacted variables, along with their significance in comparison to the reference, allows an analyst to usefully assess substantive as well as statistical importance. It is also possible to publish a model applying interactions specified in this way (e.g. Ralston et al. 2016; Popham and Boyle 2011).

References

Cooper, H. and Arber, S. 2000. General Household Survey, 1995: Teaching Dataset. [data collection]. 2nd Edition.

Kohler, U. and Kreuter, F. 2009. Data Analysis Using Stata: Second Edition. College Station, Tx: Stata Press.

Popham, F. and Boyle, P.J. 2011. Is there a ‘Scottish effect’ for mortality? Prospective observational study of census linkage studies. Journal of Public Health 33(3), pp. 453–458.

Ralston, K. et al. 2016. Do young people not in education, employment or training experience long-term occupational scarring? A longitudinal analysis over 20 years of follow-up. Contemporary Social Science, pp. 1–18.

Royston, P. and Sauerbrei, W. 2012. Handling Interactions in Stata, especially with continuous predictors. Available at: http://www.stata.com/meeting/germany12/abstracts/desug12_royston.pdf.

Additional Table 1, Stata output, logistic regression modelling membership of social class III, including independent variables sex, has a qualification, working full-time or part-time and age, also an interaction between sex and working FT/PT. Source is GHS 1995, teaching dataset

CCW_II_additional_table