Workshop 4 Worksheet

Learning outcomes

By the end of the session, you should be familiar with:

running logistic regression in JASP
the interpretation of logistic regression coefficients
sub-setting/filtering data in JASP
recoding/dichotomising a variable in JASP

Intro

This workshop introduces another type of regression analysis: binary/binomial “logistic” or “logit” regression. Binary logistic regression allows us to model dependent (“response”, “outcome”) \(y\) variables that are not measured on a continuous numeric scale, but instead on a simple dichotomous scale that allows only two values (e.g. “1”/“0”; “Yes”/“No”; “True”/“False”; “Present”/“Absent”; “Success”/“Failure”; “Agree”/“Disagree”; “Trusts”/“Doesn’t trust”; “Survived”/“Didn’t survive”, etc.). Such a variable is called dichotomous because its two value options present a “dichotomy”, but it is also often referred to by other names: binary or binomial (because it can take only two mutually exclusive values), indicator (because it “indicates” membership in a category - usually the category with the “positive” logical connotation (i.e. “yes”, “true”, “survived”, “trusts”, etc.), coded as 1) or “dummy” (from the word’s meaning of a substitute or a proxy - as in “crash-test dummy” as a substitute for a human) - to signal the variable’s role as a placeholder that indicates the presence (typically coded as 1) or absence (coded as 0) of a specific qualitative characteristic. I will be using these terms interchangeably, depending on the context, just to confuse you (and because these terms are all used in the published literature, which can be confusing).

Modelling such a variable using the linear regression approach from Workshop 3 is likely to result in inaccurate coefficients, because linear regression assumes a theoretically unbounded continuous numeric scale of measurement; when it encounters a binary outcome variable, it assumes that we only observed two values (0 and 1) on this unbounded scale, but that the scale nevertheless exists in theory and other values could have been observed if we collected more data. But, in fact, we know that no other values are theoretically possible. One visible symptom is that a linear model can predict values below 0 or above 1, which make no sense for a dichotomous outcome.
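To make the problem concrete, here is a minimal sketch in Python (not part of the JASP workflow) that fits an ordinary least-squares line to a 0/1 outcome. The data and the variable names `x` and `y` are invented purely for illustration; the point is only that some fitted values land outside the 0 to 1 range.

```python
# Hypothetical toy data: x is a continuous predictor, y is a 0/1 outcome.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]


def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = ols_fit(x, y)

# Fitted values from the linear model for each observed x.
fitted = [intercept + slope * xi for xi in x]

# Fitted values that cannot be read as probabilities.
out_of_range = [f for f in fitted if f < 0 or f > 1]

print("fitted values:", [round(f, 3) for f in fitted])
print("outside [0, 1]:", [round(f, 3) for f in out_of_range])
```

With these made-up numbers the lowest and highest values of `x` receive fitted values below 0 and above 1 respectively, which is the kind of result a logistic model is designed to rule out.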

Nevertheless, we can generalise the logic of linear regression to make it applicable to dichotomous dependent variables. In fact, the method we learn about today - binary/binomial logistic regression - is an elementary case of a broader category of statistical models called generalised linear models (GLMs). GLMs can be thought of as a two-stage modelling approach, in which we first model the response variable using a probability distribution (the binomial distribution in our case), and then model the parameter of that distribution using a collection of predictors and a special form of multiple regression. In essence, with binary logistic regression we will attempt to predict the probability that an observation falls into one of the two categories of a dichotomous dependent (outcome) variable based on one or more independent (predictor, explanatory) variables. Observations are predicted to fall into whichever of the two outcome categories is more probable for them given the values they have on the independent variable(s). Logistic modelling is thus often considered a classification method, especially in machine learning parlance.
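The two stages described above can be written compactly as follows. The notation is introduced here only for illustration and does not come from the worksheet: \(p_i\) is the probability that observation \(i\) has \(y_i = 1\), \(x_{1i}, \dots, x_{ki}\) are its values on \(k\) predictors, and the \(\beta\)s are the regression coefficients.

\[
y_i \sim \text{Bernoulli}(p_i), \qquad
\log\!\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki}, \qquad
p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki})}}
\]

The first expression is the probability model for the outcome (the binomial distribution with a single trial), the second is the regression on the log-odds ("logit") scale, and the third is the same relationship solved for the probability. Under the usual classification rule, an observation is assigned to category 1 when \(p_i > 0.5\), which is equivalent to its log-odds being positive.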

This all sounds very technical and it does involve some complex mathematics, but applying logistic regression in practice will feel very similar to what we have done with linear regression before. The challenge will be in finding the simplest way to interpret the statistical results accurately and meaningfully, given that the resulting regression coefficients refer to estimates on a mathematically transformed scale rather than the original scale of measurement of the variables, as was the case in linear regression.
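As a rough illustration of what moving off the "transformed scale" involves, the sketch below converts a logistic regression coefficient into an odds ratio and into predicted probabilities. The intercept and coefficient are made-up numbers, not results from any of the workshop datasets, and `inv_logit` is a helper name chosen here for illustration.

```python
import math


def inv_logit(log_odds):
    """Convert a value on the log-odds (logit) scale to a probability."""
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical estimates on the log-odds scale (assumed, not real results):
intercept = -0.4  # log-odds of the outcome when the dummy predictor is 0
coef = 0.7        # change in log-odds when the dummy predictor is 1

# Exponentiating a coefficient gives an odds ratio.
odds_ratio = math.exp(coef)

# Predicted probabilities for the two groups defined by the dummy predictor.
p_reference = inv_logit(intercept)
p_group = inv_logit(intercept + coef)

print(f"odds ratio: {odds_ratio:.3f}")
print(f"P(y = 1 | x = 0): {p_reference:.3f}")
print(f"P(y = 1 | x = 1): {p_group:.3f}")
```

Note that the same coefficient corresponds to a constant odds ratio but not to a constant difference in probabilities: the probability difference depends on where the reference group starts, which is part of what makes interpretation harder than in linear regression.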

The Advanced topics readings assigned for this week outline these challenges for those who would like to gain a deeper understanding of the mathematics and mechanics of logistic regression; for the rest of us, Connelly, Gayle, and Lambert (2016) (on the reading list) provides a very approachable introduction to the major challenges, specifically for sociological analyses (see, specifically, the sections between Parameter estimates in logistic regression models and The presentation of logistic regression results).

In the exercises below, we return to some of the small individual-level datasets we used in Workshops 1 and 2, which contain selected variables from the British Social Attitudes Survey, the Citizenship Survey and the Community Life Survey (find them under the Questions of Trust tab on the Data page).

The exercises will also allow (require?) us to practise some basic data transformation techniques in JASP that may come in useful for the assignment work too (such as renaming, filtering, reordering, recoding/computing variables, and setting custom missing values). This will take us an important step closer to being able to follow the ten basic steps of the data analysis cycle, which are also the ones that will structure your assignment reports:

Identify the variables you want to use and provide some summary descriptive statistics for them;
Check some more detailed univariate descriptive tabulations and/or visualisations of the core variables (at least the outcome variable and the main explanatory variable);
Identify and perform any required data cleaning and transformations (e.g. rename variables to more human-readable names, set additional missing values, label and recode categories, reverse, centre or standardise scales);
Explore some initial bivariate descriptive tabulations and/or visualisations of the core variables (at least between the outcome variable and the main explanatory variable);

↺ Identify and perform any additionally required data cleaning and transformations, and return to Step 3
Fit a simple bivariate regression model to test the unadjusted statistical association/relationship between the outcome variable and the main explanatory variable;

↺ Identify and perform any additionally required data cleaning and transformations, and return to Step 3
Build a multiple regression model to test the partial statistical association/relationship between the outcome variable and the main explanatory variable adjusted for the effect of a number of other independent/control/explanatory variables;

↺ Identify and perform any additionally required data cleaning and transformations, and return to Step 3
Provide a comprehensive but concise statistical interpretation of the regression results: (a) the direction and strength of the relationships, (b) the explanatory strength of the model, (c) the reliability and generalisability of the model estimates;
Identify the main limitations of the statistical analysis and discuss how they could be improved with more appropriate measurements and statistical techniques;
Provide a sociological discussion of how the statistical results, acknowledging the limitations of the analysis, contribute to addressing the research question.
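
The scale transformations named in the data cleaning step (reversing, centring and standardising) are simple arithmetic. The sketch below shows them in Python on invented values; in the workshop itself you would do the equivalent through JASP's interface, and the variable names here are hypothetical.

```python
import statistics

# Hypothetical values: ages in years, and answers on a 1-5 agreement scale.
ages = [20, 30, 40, 50, 60]
agreement = [1, 2, 3, 4, 5]

# Reverse a 1-5 scale so that high values become low and vice versa:
# new value = (minimum + maximum) - old value.
reversed_agreement = [(1 + 5) - value for value in agreement]

# Centre: subtract the mean, so that 0 now means "average age".
mean_age = statistics.mean(ages)
centred_ages = [age - mean_age for age in ages]

# Standardise: divide the centred values by the sample standard deviation,
# so that one unit now means "one standard deviation".
sd_age = statistics.stdev(ages)
standardised_ages = [value / sd_age for value in centred_ages]

print("reversed:", reversed_agreement)
print("centred:", centred_ages)
print("standardised:", [round(z, 3) for z in standardised_ages])
```

Centring and standardising change the units in which a coefficient is expressed, not the strength of the underlying relationship.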

Exercise 4.2: Do young people (under-24s) have lower social trust in post-pandemic Britain?

You will use the following three datasets, in turn, to model social phenomena in “post-pandemic” Britain:

Community Life Survey 2023-24 (gb-cls23-trust_nm.sav)
British Social Attitudes Survey 2024 (gb-bsa24-trust.sav)
British Social Attitudes Survey 2023 (gb-bsa23-trust.sav)

Exercise 4.2.1. Community Life Survey, 2023-2024

Exercise 4.2.1 Dataset

Download the gb-cls23-trust_nm.sav dataset and open it in JASP. The dataset is also available on the Data page.

Figure 12: “Social trust” operationalised in the 2023-2024 Community Life Survey

Read the research question carefully and identify what it does and doesn’t tell you about the variables, data selection and modelling methods that are needed to address the question as directly as possible. Consider the following questions in particular:

Which “age” variable would you choose if you had more than one option available, and what kind of transformation may be useful in order to answer your research question?
Do you need to transform your outcome/dependent variable, or can it be modelled as it is? (You may have different options here!)
If you wanted to fit a binary logistic regression model, how would you need to transform the outcome variable, if at all?
Can you fit two different types of regression models and compare the results?
Do you have any other eligible “control” variables at your disposal in the dataset? If so, what would be the benefit of including them in the model?
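
One possible answer to the first three questions is sketched below in Python, purely to show the logic of the recoding before you do it in JASP. Everything in it is an assumption made for illustration: the records are invented, "trust" is imagined as a 0-10 score, the cut-off of 6 for "trusting" is arbitrary, and missing answers are represented by `None`. The actual variables and coding in the dataset may differ.

```python
# Hypothetical (age, trust score) records; None marks a missing answer.
records = [
    (19, 3), (22, 7), (23, None), (21, 4), (35, 8), (47, 6),
    (52, 2), (64, 9), (None, 5), (30, 7), (18, 8), (71, 5),
]

TRUST_CUTOFF = 6  # assumed threshold: scores of 6 or more count as "trusting"

# Keep only cases with valid values on both variables.
complete = [
    (age, trust) for age, trust in records
    if age is not None and trust is not None
]

# Recode: under-24 indicator (1 = aged under 24) and dichotomised trust.
recoded = [
    (1 if age < 24 else 0, 1 if trust >= TRUST_CUTOFF else 0)
    for age, trust in complete
]


def share_trusting(rows, young_flag):
    """Proportion coded as trusting within one age group."""
    group = [trusting for young, trusting in rows if young == young_flag]
    return sum(group) / len(group)


p_young = share_trusting(recoded, 1)
p_older = share_trusting(recoded, 0)

print(f"valid cases: {len(recoded)}")
print(f"share trusting, under 24: {p_young:.3f}")
print(f"share trusting, 24 and over: {p_older:.3f}")
```

Comparing the two proportions is the bivariate descriptive step; a logistic regression of the trust indicator on the under-24 indicator tests the same contrast on the log-odds scale.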

Task 1: Summarise the variables

Task 2: Univariate descriptive analysis

Task 3: Data transformations

Task 4: Bivariate descriptive analysis

Task 5: Simple regression model

Task 6: Multiple regression model

Task 7: Interpretation

Exercise 4.3: Which of the assignment research questions could be addressed using a logistic regression model?

Let’s look again at the assignment research questions. Some of these questions imply a dependent variable which is measured as a numeric scale or at least a long-ish (e.g. 7-point +) ordinal scale in one of the surveys we will use for the assignment (ESS10, WVS7, EVS2017). Other questions imply dependent variables that are more strictly categorical, and as such, we cannot model them using linear regression. For those, we may be able to apply a logistic regression model.

If your chosen outcome variable is categorical with more than two but fewer than about seven categories, you will have to dichotomise it in order to use it as a dependent variable in a logistic regression. While there are specific modelling methods for “multinomial” type outcome variables, we are not covering them on this module.
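
Dichotomising a categorical variable amounts to deciding which original categories count as 1, which count as 0, and which are treated as missing. A minimal Python sketch of that mapping is shown below; the category labels and the example responses are hypothetical and do not come from ESS10, WVS7 or EVS2017.

```python
from collections import Counter

# Assumed recoding scheme for a hypothetical five-category agreement item.
# Where the middle category goes is a substantive choice that needs justifying.
RECODE = {
    "Strongly agree": 1,
    "Agree": 1,
    "Neither agree nor disagree": 0,
    "Disagree": 0,
    "Strongly disagree": 0,
    "Don't know": None,  # treated as missing
}

# Invented responses for illustration.
responses = [
    "Agree", "Strongly agree", "Disagree", "Don't know",
    "Neither agree nor disagree", "Agree", "Strongly disagree",
]

# Apply the mapping, then drop the cases recoded to missing.
recoded = [RECODE[answer] for answer in responses]
valid = [value for value in recoded if value is not None]

counts = Counter(valid)
share_agree = counts[1] / len(valid)

print("recoded:", recoded)
print("counts:", dict(counts))
print(f"share coded 1: {share_agree:.3f}")
```

Writing the scheme down explicitly like this, whatever tool you use, makes it easy to report and defend the recoding decision in the assignment.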

In this exercise, explore the assignment dataset using the Variable search function on the website to identify any available variables for answering one/some of the questions below, and check how the implied dependent variable was measured:

Are religious people more satisfied with life?
Are older people more likely to see the death penalty as justifiable?
What factors are associated with opinions about future European Union enlargement among Europeans?
Is higher internet use associated with stronger anti-immigrant sentiments?
How does victimisation relate to trust in the police?
What factors are associated with belief in life after death?
Are government/public sector employees more inclined to perceive higher levels of corruption than those working in the private sector?

JASP solutions

Below you can download JASP files with solutions to some of the exercises in this worksheet:

Exercise 4.1.1: bsa10

References

Connelly, Roxanne, Vernon Gayle, and Paul S. Lambert. 2016. “Modelling Key Variables in Social Science Research: Introduction to the Special Section.” Methodological Innovations 9:2059799116637782. doi: 10.1177/2059799116637782.

Delhey, Jan, and Kenneth Newton. 2003. “Who Trusts?: The Origins of Social Trust in Seven Societies.” European Societies 5(2):93–137. doi: 10.1080/1461669032000072256.
