Workshop 4 Worksheet
Learning outcomes
By the end of the session, you should be familiar with:
- running logistic regression in JASP
- the interpretation of logistic regression coefficients
- sub-setting/filtering data in JASP
- recoding/dichotomising a variable in JASP
Intro
This workshop introduces another type of regression analysis: binary/binomial “logistic” or “logit” regression. Binary logistic regression allows us to model dependent (“response”, “outcome”) \(y\) variables that are not measured on a continuous numeric scale, but instead on a simple dichotomous scale that allows only two values (e.g. “1”/“0”; “Yes”/“No”; “True”/“False”; “Present”/“Absent”; “Success”/“Failure”; “Agree”/“Disagree”; “Trusts”/“Doesn’t trust”; “Survived”/“Didn’t survive”, etc.). Such a variable is called dichotomous because its two value options present a “dichotomy”, but it is also often referred to by other names: binary or binomial (because it can take only two mutually exclusive values); indicator (because it “indicates” membership in a category, usually the category with the “positive” logical connotation, i.e. “yes”, “true”, “survived”, “trusts”, etc., coded as 1); or “dummy” (from the word’s meaning of a substitute or a proxy, as in “crash-test dummy” as a substitute for a human), to signal the variable’s role as a placeholder that indicates the presence (typically coded as 1) or absence (coded as 0) of a specific qualitative characteristic. I will be using these terms interchangeably, depending on the context, just to confuse you (and because these terms are all used in the published literature, which can be confusing).
Modelling such a variable using the linear regression approach from Workshop 3 is likely to result in inaccurate coefficients, because linear regression assumes a theoretically unbounded continuous numeric scale of measurement. When it encounters a binary outcome variable, it assumes that we only observed two values (0 and 1) on this unbounded scale, but that the scale nevertheless exists in theory and other values could have been observed if we collected more data. In fact, we know that no other values are theoretically possible.
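To make the problem concrete, here is a minimal Python sketch (not part of the JASP workflow) that fits an ordinary straight line to a binary outcome. The data are invented purely for illustration, and the variable names are hypothetical. The point is only that a linear fit can produce predicted values below 0 or above 1, which cannot be read as probabilities.

```python
import numpy as np

# Hypothetical toy data (invented for illustration):
# x is a numeric predictor, y is a binary outcome coded 0/1.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1], dtype=float)

# Ordinary least squares fit of the straight line y = b0 + b1 * x.
# np.polyfit returns coefficients from the highest degree down.
b1, b0 = np.polyfit(x, y, deg=1)


def linear_prediction(value):
    """Predicted y from the fitted straight line."""
    return b0 + b1 * value


# Predictions at a low, a middle and a high value of x.
predictions = {v: linear_prediction(v) for v in (0, 5, 15)}

for v, pred in predictions.items():
    print(f"x = {v:>2}: predicted y = {pred:.3f}")
```

With these made-up numbers, the prediction at x = 0 is negative and the prediction at x = 15 is greater than 1, even though the outcome itself can only ever be 0 or 1.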
Nevertheless, we can generalise the logic of linear regression to make it applicable to dichotomous dependent variables. In fact, the method we learn about today - binary/binomial logistic regression - is an elementary case of a broader category of statistical models called generalised linear models (GLMs). GLMs can be thought of as a two-stage modelling approach: first we model the response variable using a probability distribution (such as the binomial distribution in our case), and then we model the parameter of that distribution using a collection of predictors and a special form of multiple regression. In essence, with binary logistic regression we attempt to predict the probability that an observation falls into one of the two categories of a dichotomous dependent (outcome) variable based on one or more independent (predictor, explanatory) variables. Observations are predicted to fall into whichever of the two outcome categories is more probable for them given the values they have on the independent variable(s). Logistic modelling is thus often considered a classification method, especially in machine learning parlance.
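The two stages described above can be written down compactly. The block below is a sketch in standard textbook notation rather than anything specific to JASP; the symbols \(p_i\), \(\beta_0, \dots, \beta_k\) and \(x_{1i}, \dots, x_{ki}\) are labels chosen here for illustration. For a single 0/1 outcome, the binomial distribution reduces to its one-trial special case, the Bernoulli distribution.

```latex
\begin{align}
% Stage 1: the outcome follows a probability distribution with parameter p_i
y_i &\sim \mathrm{Bernoulli}(p_i) \\
% Stage 2: the log-odds (logit) of p_i are a linear function of the predictors
\log\!\left(\frac{p_i}{1 - p_i}\right) &= \beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki} \\
% Equivalently, solving for the probability itself
p_i &= \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki})}}
\end{align}
```

The last line always yields a value between 0 and 1, whatever the values of the predictors, which is what makes this form suitable for a dichotomous outcome.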
This all sounds very technical and it does involve some complex mathematics, but applying logistic regression in practice will feel very similar to what we have done with linear regression before. The challenge will be in finding the simplest way to interpret the statistical results accurately and meaningfully, given that the resulting regression coefficients refer to estimates on a mathematically transformed scale rather than the original scale of measurement of the variables, as was the case in linear regression.
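As a rough illustration of what "a mathematically transformed scale" means in practice, the Python sketch below converts logistic regression coefficients from the log-odds scale into an odds ratio and into predicted probabilities. The intercept and slope are hypothetical numbers made up for this example, not results from any of the workshop datasets.

```python
import math

# Hypothetical coefficients, invented for illustration only.
intercept = -1.0  # log-odds of the outcome when the predictor equals 0
slope = 0.5       # change in log-odds per one-unit increase in the predictor


def log_odds_to_probability(log_odds):
    """Inverse of the logit transformation: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Exponentiating a coefficient gives an odds ratio.
odds_ratio = math.exp(slope)

# Predicted probabilities at predictor values 0 and 1.
p_at_0 = log_odds_to_probability(intercept)
p_at_1 = log_odds_to_probability(intercept + slope)

print(f"Odds ratio for a one-unit increase: {odds_ratio:.3f}")
print(f"Predicted probability at x = 0: {p_at_0:.3f}")
print(f"Predicted probability at x = 1: {p_at_1:.3f}")
```

Note that the odds ratio is constant for every one-unit increase in the predictor, whereas the change in predicted probability depends on where on the scale you start. This is one reason why interpreting logistic regression results takes more care than interpreting linear regression coefficients.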
The Advanced topics readings assigned for this week outline these challenges for those who would like to gain a deeper understanding of the mathematics and mechanics of logistic regression; for the rest of us, Connelly, Gayle, and Lambert (2016) (on the reading list) provides a very approachable introduction to the major challenges, specifically for sociological analyses (see, specifically, the sections between Parameter estimates in logistic regression models and The presentation of logistic regression results).
In the exercises below, we return to some of the small individual-level datasets we used in Workshops 1 and 2, which contain selected variables from the British Social Attitudes Survey, the Citizenship Survey and the Community Life Survey (find them under the Questions of Trust tab on the Data page).
The exercises will also allow (require?) us to practise some basic data transformation techniques in JASP that may come in useful for the assignment work too (such as renaming, filtering, reordering, recoding/computing variables, and setting custom missing values). This will take us an important step closer to being able to follow the ten basic steps of the data analysis cycle, which are also the ones that will structure your assignment reports:
Identify the variables you want to use and provide some summary descriptive statistics for them;
Check some more detailed univariate descriptive tabulations and/or visualisations of the core variables (at least the outcome variable and the main explanatory variable);
Identify and perform any required data cleaning and transformations (e.g. rename variables to more human-readable names, set additional missing values, label and recode categories, reverse, centre or standardise scales);
Explore some initial bivariate descriptive tabulations and/or visualisations of the core variables (at least between the outcome variable and the main explanatory variable);
↺ Identify and perform any additionally required data cleaning and transformations, and return to Step 3
Fit a simple bivariate regression model to test the unadjusted statistical association/relationship between the outcome variable and the main explanatory variable;
↺ Identify and perform any additionally required data cleaning and transformations, and return to Step 3
Build a multiple regression model to test the partial statistical association/relationship between the outcome variable and the main explanatory variable adjusted for the effect of a number of other independent/control/explanatory variables;
↺ Identify and perform any additionally required data cleaning and transformations, and return to Step 3
Provide a comprehensive but concise statistical interpretation of the regression results: (a) the direction and strength of the relationships, (b) the explanatory strength of the model, (c) the reliability and generalisability of the model estimates;
Identify the main limitations of the statistical analysis and discuss how they could be improved with more appropriate measurements and statistical techniques;
Provide a sociological discussion of how the statistical results, acknowledging the limitations of the analysis, contribute to addressing the research question.
Exercise 4.3: Which of the assignment research questions could be addressed using a logistic regression model?
Let’s look again at the assignment research questions. Some of these questions imply a dependent variable which is measured as a numeric scale or at least a long-ish (e.g. 7-point +) ordinal scale in one of the surveys we will use for the assignment (ESS10, WVS7, EVS2017). Other questions imply dependent variables that are more strictly categorical, and as such, we cannot model them using linear regression. For those, we may be able to apply a logistic regression model.
If your chosen outcome variable is categorical with more than two but fewer than about seven categories, you will have to dichotomise it in order to use it as a dependent variable in a logistic regression. While there are specific modelling methods for “multinomial” type outcome variables, we are not covering them on this module.
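The logic of dichotomising can be sketched in a few lines of Python. In the workshop you will do this through the JASP interface, so the code below is only an illustration of the idea; the response categories and the chosen grouping are hypothetical and are not taken from any of the assignment surveys.

```python
# Hypothetical answers to a four-category survey item;
# None stands for a missing answer.
responses = [
    "Strongly agree",
    "Agree",
    "Disagree",
    "Strongly disagree",
    None,
    "Agree",
]

# Assumed grouping of the original categories into two.
AGREE = {"Strongly agree", "Agree"}
DISAGREE = {"Disagree", "Strongly disagree"}


def dichotomise(answer):
    """Recode an answer to 1 (agrees), 0 (disagrees) or None (missing)."""
    if answer in AGREE:
        return 1
    if answer in DISAGREE:
        return 0
    return None  # keep missing or unexpected values as missing


binary = [dichotomise(answer) for answer in responses]
print(binary)
```

Where to draw the line between the two new categories is a substantive decision that you should justify, and missing values should stay missing rather than being folded into the 0 category.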
In this exercise, explore the assignment dataset using the Variable search function on the website to identify any available variables for answering one/some of the questions below, and check how the implied dependent variable was measured:
- Are religious people more satisfied with life?
- Are older people more likely to see the death penalty as justifiable?
- What factors are associated with opinions about future European Union enlargement among Europeans?
- Is higher internet use associated with stronger anti-immigrant sentiments?
- How does victimisation relate to trust in the police?
- What factors are associated with belief in life after death?
- Are government/public sector employees more inclined to perceive higher levels of corruption than those working in the private sector?
JASP solutions
Below you can download JASP files with solutions to some of the exercises in this worksheet: