DFW Tutorial

Identifying Student Success Insights Through Data Analysis

Part 3: High-Enrollment, High-DFW Diagnostic Tutorial

The percentage of students earning grades of D, F, or withdrawing in a course (the DFW rate) is a key performance indicator of student success. In particular, DFW rates in high-enrollment courses can highlight areas where a large proportion of students are struggling. This report identifies (1) which courses have the highest number of DFW grades, (2) whether D and F grades are more frequent than withdrawals, (3) how DFW rates vary across course sections, and (4) the impact of instructional modality on DFW outcomes.

This tutorial will demonstrate how to create key performance indicators and data visualizations that offer insights into DFW rates in critical, high-enrollment courses. An example report is available, as are example data that can be used to follow along.

Note

This article assumes that you have some familiarity with R and {tidyverse} packages.

Visualization Overview

We will create five figures that help identify patterns in the data and effectively communicate the insights to leadership.

  1. The first figure is a bar chart that displays the total number of DFW grades within each course from 2019 to 2024.

  2. The second figure is a stacked bar chart, a figure type that is generally used to display the relationship between two categorical variables. In this case, one variable is the course and the other is the grade category (DF or W), with each bar segment sized by the percentage of the course's DFW grades in that category. This chart helps determine whether DF or W grades contribute more to the overall number of DFW grades.

  3. The third figure is a scatter plot with quadrants. A scatter plot displays the relationship between two continuous variables (one variable on the x-axis and one variable on the y-axis). Quadrants are used to help identify which of the courses have a relatively high average DFW rate and high variability (measured using standard deviation) across sections.

  4. The fourth figure is a dumbbell plot, which is used to show the difference in DFW rates between face-to-face and online/hybrid sections of the same course.

  5. The fifth figure is a lollipop plot that displays, for each course, the odds ratio for receiving a DFW grade in online/hybrid sections relative to face-to-face sections. An odds ratio above 1 means the odds of a DFW grade are higher in online/hybrid sections; an odds ratio below 1 means they are lower.

Initial Steps

Load the necessary packages and the data.

1. Load the packages

library(janitor)
library(dplyr)
library(tidyr)  
library(ggplot2)  
library(ggrepel)  
library(forcats)  
library(ggtext)  
library(glue)  
library(stringr) 
library(scales)
library(patchwork)

2. Load the data

# Update the file path to wherever your copy of the example data is saved
dfw_raw <- read.csv("C:/Users/tfulton9/OneDrive - Georgia State University/R Projects/incubator_tutorials/dfw_tutorial/tutorial/dfw_tutorial_data.csv", na.strings = "") %>% clean_names()

head(dfw_raw, 10)
   year      course   crn total_enroll dfw df  w modality
1  2019 Course ADAD 81910           21   0  0  0     face
2  2019 Course AGAB 56634           12   4  4  0   on_hyb
3  2019 Course BCEJ 90732           11   1  1  0     face
4  2019 Course ADHF 95693           21   4  2  2     face
5  2019 Course AGGB 95823           40   1  1  0     face
6  2019 Course BEHA 88566           45  16 13  3     face
7  2019  Course EBC 22924          304  58 33 25   on_hyb
8  2019 Course BBAH 90819           26   5  4  1     face
9  2019  Course GAE 52466          102   2  2  0   on_hyb
10 2019 Course AFED 16499           21   1  1  0   on_hyb

The data set contains 10,000 (hypothetical) course sections from 2019 to 2024. Each row of the data represents one section and includes a value for the following variables:

  • year: The year the course/section was offered.
  • course: The course name.
  • crn: The course registration number (i.e., the section number).
  • total_enroll: The number of students that enrolled in, completed, and received a grade for the course.
  • dfw: The number of students that received a D, F, or W grade in the section.
  • df: The number of students that received a D or F grade in the section.
  • w: The number of students that received a W grade in the section.
  • modality: A categorical variable indicating the instruction modality for the section. (face = face-to-face; on_hyb = online or hybrid)
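
Before aggregating, it can be worth confirming that the counts are internally consistent. The sketch below assumes, based on the variable descriptions above, that dfw should equal df + w and should never exceed total_enroll. The toy_sections data frame is a small stand-in built from three rows of the preview above, and inconsistent_sections is a name introduced here for illustration.

```r
library(dplyr)

# Stand-in for dfw_raw, using three rows copied from the preview above
toy_sections <- data.frame(
  crn = c(56634, 95693, 88566),
  total_enroll = c(12, 21, 45),
  dfw = c(4, 4, 16),
  df = c(4, 2, 13),
  w = c(0, 2, 3)
)

# Keep only sections whose counts do not add up as assumed
inconsistent_sections <- toy_sections %>%
  filter(dfw != df + w | dfw > total_enroll)

# Number of sections that fail the check
nrow(inconsistent_sections)
```

Running the same filter on dfw_raw would list any sections that need a closer look before they are summed by course.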

Identifying the High-Enrollment, High-DFW Courses

1. Analyze the data

The high-enrollment, high-DFW courses need to be identified before making any visualizations. The metric that will be used to identify the courses is the total number of DFW grades. First, use group_by() to aggregate by course and summarise() to calculate the DFW counts and a few other key metrics that will be used later in the tutorial. Next, use arrange() to order the courses from the highest to the lowest number of DFW grades. Last, use slice() to select the top 20 courses.

dfw_by_course_top20 <- dfw_raw %>%
  select(course, total_enroll, dfw, df, w) %>% 
  group_by(course) %>% 
  summarise(total_enroll = sum(total_enroll), 
            dfw_count = sum(dfw),
            df_count = sum(df),
            w_count = sum(w)) %>%
  arrange(desc(dfw_count)) %>% 
  slice(1:20)
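
To see why the ranking uses DFW counts rather than DFW rates, the following sketch applies the same group_by(), summarise(), arrange(), and slice() steps to a small made-up data set. The course names, the counts, and the extra dfw_rate column are assumptions added for illustration only.

```r
library(dplyr)

# Hypothetical sections: names and counts are made up for illustration
toy_sections <- data.frame(
  course = c("Course A", "Course A", "Course B", "Course C", "Course C"),
  total_enroll = c(30, 40, 400, 20, 25),
  dfw = c(6, 10, 40, 9, 11)
)

toy_top2 <- toy_sections %>%
  group_by(course) %>%
  summarise(total_enroll = sum(total_enroll),
            dfw_count = sum(dfw)) %>%
  # Pooled DFW rate, shown only to contrast with the count-based ranking
  mutate(dfw_rate = round(dfw_count / total_enroll * 100, 1)) %>%
  arrange(desc(dfw_count)) %>%
  slice(1:2)

toy_top2
```

In this toy example, Course B ranks first because it produces the most DFW grades, even though its DFW rate is the lowest of the three courses. Ranking by count keeps large courses that affect many students in view.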

2. Render the bar chart

Next, display the top 20 courses along with the DFW counts using a bar chart. Use the ggplot() function along with various geom layers to create the chart, as outlined in the steps below the code chunk.

dfw_by_course_top20 %>% 
  ggplot(mapping = aes(x = dfw_count, y = reorder(course, dfw_count), label = prettyNum(dfw_count, big.mark = ","))) + # (1)
  theme_classic() +
  theme( # (2)
    plot.title = element_markdown(size = 15, color = "black", face = "bold", hjust = 0.5, margin = margin(b = 15)),
    axis.title = element_blank(), 
    axis.text.x = element_blank(),
    axis.text.y = element_text(size = 12, color = "black", margin = margin(r = -15)),
    axis.line = element_blank(),
    axis.ticks = element_blank(),
    legend.position = "none"
  ) +
  labs(title = "Courses With the Most DFW Grades Since 2019") + # (3)
  geom_bar( # (4)
    stat = "identity", 
    fill = "#2B3555"
  ) +
  geom_text( # (5)
    color = "white", 
    size = 5,
    nudge_x = -65
  )
(1) Initialize the ggplot object using the dfw_by_course_top20 data and set the mapping aesthetics. Use y = reorder(course, dfw_count) to order the courses from top to bottom based on the DFW counts, and set the label argument to prettyNum(dfw_count, big.mark = ",") to display the DFW counts.
(2) Customize the theme for a clean look and feel.
(3) Add a descriptive title using the title argument inside the labs() function.
(4) Add bars with a specific fill color.
(5) Use geom_text() to add the DFW count label to each bar. Use nudge_x = -65 to move the labels inside the bar.

Creating a Stacked Bar Chart

In this section, we will create a stacked bar chart that shows the percentage of DF grades and W grades relative to the total number of DFW grades in each course. This will help determine whether DF grades or W grades contribute more to the overall number of DFW grades.

1. Prepare the data for the chart

First, use select() to choose only the course, df_count, and w_count columns from the dfw_by_course_top20 data frame. Then use pivot_longer() to transform the data from wide to long form.

dfw_by_course_top20_long <- dfw_by_course_top20 %>% 
  select(course, df_count, w_count) %>% 
  pivot_longer(cols = c(df_count, w_count),
               names_to = "count_type", 
               values_to = "count")

Next, group by course and use mutate() to create two new columns: one column for the percentage of DF grades or W grades and another for the text label that will be used in the figure.

dfw_by_course_top20_long <- dfw_by_course_top20_long %>% 
  group_by(course) %>% 
  mutate(percent = round((count/sum(count))*100),
         text_label = glue("{percent}%")) 
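
As a worked illustration of the two steps above, the sketch below reshapes a small made-up table and computes the within-course percentages. The toy_counts values are hypothetical, and the final ungroup() is an addition here so that later operations are not applied per course by accident.

```r
library(dplyr)
library(tidyr)

# Hypothetical course-level counts (not from the tutorial data)
toy_counts <- data.frame(
  course = c("Course A", "Course B"),
  df_count = c(30, 5),
  w_count = c(10, 15)
)

toy_long <- toy_counts %>%
  # One row per course and grade category
  pivot_longer(cols = c(df_count, w_count),
               names_to = "count_type",
               values_to = "count") %>%
  # Percentages are computed within each course, so each course sums to 100
  group_by(course) %>%
  mutate(percent = round(count / sum(count) * 100)) %>%
  ungroup()

toy_long
```

Course A has 30 DF grades out of 40 DFW grades, so its DF share is 75% and its W share is 25%; Course B is the reverse.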

2. Render the stacked bar chart

Follow the steps below to create the chart. Note that a narrative title is used with color highlighting that matches the colors in the chart.

dfw_by_course_top20_long %>% 
  ggplot(mapping = aes(x = percent, y = reorder(course, count), fill = fct_rev(count_type), label = text_label)) + # (1)
  theme_classic() +
  theme( # (2)
    plot.title = element_markdown(size = 15, color = "black", face = "bold", margin = margin(b = 15)),
    axis.title = element_blank(), 
    axis.text.x = element_blank(),
    axis.text.y = element_text(size = 12, color = "black", margin = margin(r = -15)),
    axis.line = element_blank(),
    axis.ticks = element_blank(),
    legend.position = "top"
  ) +
  labs( # (3)
    title = "High DFWs are driven by a high percentage of 
            <span style = 'color:#0554A3;'>DFs</span> rather than 
            <span style = 'color: #26A5CA;'>Ws</span>"
  ) + 
  geom_bar( # (4)
    position = "stack", 
    stat = "identity"
  ) +
  geom_text( # (5)
    position = position_stack(vjust = 0.8), 
    color = "white", 
    size = 5
  ) +
  scale_fill_manual( # (6)
    values = c("#26A5CA", "#0554A3"), 
    labels = c("w_count" = "W", "df_count" = "DF")
  ) + 
  guides(fill = guide_legend(title = NULL, reverse = TRUE)) # (7)
(1) Define the data for the plot and set the mapping. Set the fill argument to fct_rev(count_type) so that the DF and W bars have different fill colors, and reverse the plotting order so that DFs are on the left and Ws are on the right.
(2) Customize the theme. Set plot.title to element_markdown() to enable markdown formatting in the title.
(3) Customize the title by highlighting specific text with a color. Use the format <span style = 'color:HEXCOLOR;'>TEXT</span>, replacing HEXCOLOR with the desired color's hex code and TEXT with the content to highlight.
(4) Add the geom_bar layer. Set the position argument to "stack" to create a stacked bar plot.
(5) Add text labels and set vjust to 0.8 to position the text toward the right side of each bar segment.
(6) Specify the fill colors and adjust the legend labels for better readability.
(7) Remove the legend title using title = NULL and reverse the legend order to match the plot, with DF on the left and W on the right.

Creating a Scatter Plot

The courses with the most DFW grades consist of multiple sections, each of which has its own DFW rate. To explore the variability of DFW rates across sections within a course, we will calculate the average DFW rate and the standard deviation for each course and display these metrics in a scatter plot. Standard deviation measures how spread out, or variable, the section-level rates are around the course average.

1. Prepare the data for the plot

Begin with the dfw_raw data frame which includes total enrollment and DFW counts for every section. Filter this data frame to include only the courses in the dfw_by_course_top20 data frame. Next, calculate the DFW rate for each section. Finally, group by course and calculate both the average DFW rate and the standard deviation.

dfw_top20_rates <- dfw_raw %>% 
  filter(course %in% dfw_by_course_top20$course) %>% 
  mutate(dfw_rate = dfw/total_enroll*100) %>% 
  group_by(course) %>% 
  summarise(dfw_rate_avg = round(mean(dfw_rate), 1),
            dfw_rate_sd = round(sd(dfw_rate), 1)) %>% 
  arrange(course)
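
Note that mean(dfw_rate) averages the section-level rates, so every section counts equally regardless of its size. The sketch below, using two made-up sections, contrasts that average with a pooled rate that weights every student equally. The dfw_rate_pooled column is an assumption added for comparison and is not part of the tutorial's data frame.

```r
library(dplyr)

# Hypothetical course with one small and one large section
toy_sections <- data.frame(
  course = "Course A",
  total_enroll = c(10, 90),
  dfw = c(5, 9)
)

toy_rates <- toy_sections %>%
  mutate(dfw_rate = dfw / total_enroll * 100) %>%
  group_by(course) %>%
  summarise(
    dfw_rate_avg = mean(dfw_rate),                         # average of section rates
    dfw_rate_pooled = sum(dfw) / sum(total_enroll) * 100,  # all students pooled
    dfw_rate_sd = sd(dfw_rate)                             # spread across sections
  )

toy_rates
```

Here the section rates are 50% and 10%, so the average of section rates is 30%, while the pooled rate is 14% because most students are in the larger section. The section-level average is the natural companion to the standard deviation used in the quadrant plot, since both describe how sections behave rather than how students are distributed.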

2. Prepare the scales for the plot

The plot should have quadrants of equal sizes, so we’ll center the x and y axes around the averages of the x and y variables. To accomplish this, it will be helpful to create a few new variables that will be used within the scaling and annotation functions in the next step.

# Assigning the X axis mean, max, and min to new variables. 
# Will use these in the ifelse statement below and in a ggplot layer to set 
# the X axis scale limits
x_mean <- mean(dfw_top20_rates$dfw_rate_sd)
x_max <- max(dfw_top20_rates$dfw_rate_sd) + 1.5
x_min <- min(dfw_top20_rates$dfw_rate_sd) - 1.5

# Use ifelse statement to assign the range. 
# This lets the entire range of values show on the X axis and also 
# keeps the mean X value as the center of the X axis
x_range <- ifelse((x_max-x_mean) > (x_mean-x_min), (x_max-x_mean), (x_mean-x_min))


# Assigning the Y axis mean, max, and min to new variables. 
# Will use these in the ifelse statement below and in a ggplot layer to set 
# the Y axis scale limits
y_mean <- mean(dfw_top20_rates$dfw_rate_avg)
y_max <- max(dfw_top20_rates$dfw_rate_avg)
y_min <- min(dfw_top20_rates$dfw_rate_avg)

# Use ifelse statement to assign the range. 
# This lets the entire range of values show on the Y axis and also 
# keeps the mean Y value as the center of the Y axis
y_range <- ifelse((y_max-y_mean) > (y_mean-y_min), (y_max-y_mean), (y_mean-y_min))
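
The centering logic above can also be wrapped in a small helper, sketched below. The function name centered_limits() and its pad argument are hypothetical and introduced only for illustration. The helper takes the larger of the two distances from the mean and applies it on both sides, which is what the ifelse() statements do.

```r
# Hypothetical helper: returns axis limits centered on the mean of x.
# pad widens the observed range before centering (the tutorial uses 1.5 on the x-axis).
centered_limits <- function(x, pad = 0) {
  center <- mean(x)
  half_width <- max((max(x) + pad) - center, center - (min(x) - pad))
  c(center - half_width, center + half_width)
}

# Made-up standard deviations for four courses
toy_sd <- c(4, 5, 6, 13)

centered_limits(toy_sd)             # limits without padding
centered_limits(toy_sd, pad = 1.5)  # limits with the padding used for the x-axis
```

With these toy values the mean is 7 and the farthest point is 6 units away, so the unpadded limits run from 1 to 13 and the mean sits exactly in the middle of the axis.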

3. Render the quadrant plot

Finally, create the plot using the steps outlined below the code chunk.

dfw_top20_rates %>% 
  ggplot(mapping = aes(x = dfw_rate_sd, y = dfw_rate_avg)) + # (1)
  theme_classic() +
  theme( # (2)
    plot.title = element_text(size = 14, color = "black", face = "bold", hjust = 0.5, margin = margin(b = 15)),
    plot.subtitle = element_text(size = 12, color = "black", hjust = 0.5, margin = margin(b = 15)),
    axis.title.x = element_text(size = 12, color = "black", face = "bold", margin = margin(t = 10)),
    axis.title.y = element_text(size = 12, color = "black", face = "bold", margin = margin(r = 10)),
    axis.text = element_text(size = 11, color = "black"),
    axis.line = element_line(linewidth = 1),
    axis.ticks = element_line(linewidth = 1)
  ) +
  labs( # (3)
    title = "Average DFW Rate and Standard Deviation, 2019-2024",
    subtitle = "Label: Course (Average DFW Rate; Standard Deviation)",
    x = "Variation (Standard Deviation Between Sections; %)", 
    y = "Average DFW Rate (%)"
  ) +
  geom_hline( # (4)
    aes(yintercept = mean(dfw_rate_avg)), 
    color = "#58595B", 
    linetype = "dashed"
  ) +
  annotate(
    "text", 
    x = 7, 
    y = y_mean, 
    label = glue("Avg: {round(y_mean, 1)}%"),
    vjust = -0.5, 
    hjust = 0.6
  ) +
  geom_vline( # (5)
    aes(xintercept = mean(dfw_rate_sd)), 
    color = "#58595B", 
    linetype = "dashed"
  ) + 
  annotate(
    "text", 
    x = x_mean, 
    y = 12, 
    label = glue("Avg: {round(x_mean, 1)}%"),
    angle = 90, 
    vjust = -0.5, 
    hjust = 0.5
  ) +
  geom_point( # (6)
    color = "#0554A3",
    size = 6
  ) +
  geom_label_repel( # (7)
    aes(label = glue("{course} ({round(dfw_rate_avg, 1)}%; {round(dfw_rate_sd, 1)}%)")),
    max.overlaps = 100,
    min.segment.length = 0, 
    size = 2.5,
    fontface = "bold"
  ) +
  scale_x_continuous( # (8)
    limits = c((x_mean - x_range), (x_mean + x_range)),
    labels = percent_format(scale = 1)
  ) +
  scale_y_continuous(
    limits = c((y_mean - y_range), (y_mean + y_range)),
    labels = percent_format(scale = 1)
  )
(1) Define the data used for the plot and set the mapping aesthetics: standard deviation on the x-axis and average DFW rate on the y-axis.
(2) Customize the theme.
(3) Add the title, subtitle, and axis labels.
(4) Add a horizontal dashed line with a y-axis intercept equal to the average DFW rate and include a text annotation of the average.
(5) Add a vertical dashed line with an x-axis intercept equal to the average standard deviation and include a text annotation of the average.
(6) Add the points, with a specific color and size.
(7) Add labels to the points that include the course, the average DFW rate, and the standard deviation.
(8) Set the x and y scale limits using the centered ranges calculated in the previous step, and format the axis labels as percentages.

Creating a Dumbbell Plot

The next visualization is a dumbbell plot, which will show the difference in DFW rates between face-to-face and online/hybrid instruction modality within the same course.

1. Prepare the data

First, filter the dfw_raw data frame to include only the courses in the dfw_by_course_top20 data frame and calculate the DFW rate for each section. Next, group the data by course and modality, and calculate the total enrollment, DFW count, and average DFW rate for each group. The resulting data frame provides a summary of the average DFW rate for each instruction modality for each course.

dfw_by_modality_top20 <- dfw_raw %>% 
  filter(course %in% dfw_by_course_top20$course) %>% 
  mutate(dfw_rate = dfw/total_enroll*100) %>% 
  group_by(course, modality) %>% 
  summarise(total_enroll = sum(total_enroll),
            dfw_count = sum(dfw),
            dfw_rate = round(mean(dfw_rate), 1))

We are actually going to make two dumbbell plots and then combine them into a single graphic. One plot will display the courses where face-to-face instruction has a lower DFW rate compared to online/hybrid instruction, while the other will display courses where face-to-face instruction has a higher DFW rate. To achieve this, group by course and then use case_when() inside of mutate() to assign each course to one of the two categories. Additionally, we will create another variable, abc_count, which represents the total number of A, B, and C grades. This will be used in the next section for generating the lollipop plot.

dfw_by_modality_top20 <- dfw_by_modality_top20 %>%
  group_by(course) %>% 
  mutate(lower_dfw = case_when(
    dfw_rate[modality == "face"] < dfw_rate[modality == "on_hyb"] ~ "face",
    dfw_rate[modality == "face"] > dfw_rate[modality == "on_hyb"] ~ "on_hyb",
    TRUE ~ "equal"),
    abc_count = total_enroll - dfw_count) %>%
  ungroup()

Next, create two separate data frames that will be used to create the two dumbbell plots. The first data frame, dfw_by_modality_top20_face, will include courses where face-to-face instruction has a lower or equal DFW rate compared to online/hybrid instruction. The second data frame, dfw_by_modality_top20_on_hyb, will contain courses where online/hybrid instruction has a lower DFW rate than face-to-face instruction.

dfw_by_modality_top20_face <- filter(dfw_by_modality_top20, lower_dfw == "face" | lower_dfw == "equal")
dfw_by_modality_top20_on_hyb <- filter(dfw_by_modality_top20, lower_dfw == "on_hyb")   
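
The grouped case_when() above relies on every course having exactly one row for each modality. As a hedged alternative, the sketch below reshapes a small made-up data set with pivot_wider() so the two rates sit side by side, which makes the comparison explicit and lets a course offered in only one modality be flagged as NA. The toy_modality values are hypothetical, and this is not the approach used in the tutorial.

```r
library(dplyr)
library(tidyr)

# Hypothetical average DFW rates; Course C is offered face-to-face only
toy_modality <- data.frame(
  course = c("Course A", "Course A", "Course B", "Course B", "Course C"),
  modality = c("face", "on_hyb", "face", "on_hyb", "face"),
  dfw_rate = c(12.0, 18.5, 25.0, 20.0, 15.0)
)

toy_wide <- toy_modality %>%
  # One row per course, with a column of rates for each modality
  pivot_wider(names_from = modality, values_from = dfw_rate) %>%
  mutate(lower_dfw = case_when(
    is.na(face) | is.na(on_hyb) ~ NA_character_,  # only one modality offered
    face < on_hyb ~ "face",
    face > on_hyb ~ "on_hyb",
    TRUE ~ "equal"
  ))

toy_wide
```

In this toy example, Course A is classified as "face", Course B as "on_hyb", and Course C is left unclassified because there is no online/hybrid rate to compare against.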

2. Prepare the scale and titles

Before creating the plots, define a variable to ensure the x-axis scale is consistent across both plots. This ensures that visual comparisons between the two figures are accurate. Additionally, create narrative titles for each figure that include color highlights that match the plot’s colors.

# Set the X scale max 
x_scale_max <- ifelse(max(dfw_by_modality_top20_face$dfw_rate) > max(dfw_by_modality_top20_on_hyb$dfw_rate), 
                      max(dfw_by_modality_top20_face$dfw_rate) + 5, 
                      max(dfw_by_modality_top20_on_hyb$dfw_rate) + 5)


# Create figure title text
face_title <- str_c("Courses with *lower* DFW Rates for <span style = 'color:", "#0554A3", ";'>Face-to-Face</span> instruction 
       compared <br> with <span style = 'color:", "#26A5CA", ";'>Online/Hybrid</span> instruction")

on_hyb_title <- str_c("Courses with *higher* DFW Rates for <span style = 'color:", "#0554A3", ";'>Face-to-Face</span> instruction 
       compared <br> with <span style = 'color:", "#26A5CA", ";'>Online/Hybrid</span> instruction")

3. Create each plot

The code below creates each plot and assigns each to a ggplot object. Since the only differences between the two plots are the data and the titles, the detailed steps are provided only for the face-to-face figure.

face_to_face_figure <- dfw_by_modality_top20_face %>% 
  ggplot(mapping = aes(x = dfw_rate, y = reorder(course, dfw_rate), color = modality, group = course)) + # (1)
  theme_classic() +
  theme( # (2)
    plot.title = element_markdown(size = 15, color = "black", face = "bold", margin = margin(b = 15)),
    axis.title.x = element_text(size = 12, color = "black", face = "bold", margin = margin(t = 10)),
    axis.title.y = element_blank(),
    axis.text = element_text(size = 11, color = "black"),
    axis.line = element_line(linewidth = 1),
    axis.ticks = element_line(linewidth = 1)
  ) +
  labs( # (3)
    title = face_title,
    x = "DFW Rate (%)"
  ) +
  geom_line( # (4)
    linewidth = 2,  # linewidth replaces size for lines in ggplot2 3.4.0 and later
    color = "gray"
  ) +
  geom_point(size = 7) + # (5)
  scale_color_manual(values = c("#0554A3", "#26A5CA")) + # (6)
  scale_x_continuous( # (7)
    limits = c(0, x_scale_max),
    labels = percent_format(scale = 1)
  ) +
  guides(color = "none") # (8)


online_hybrid_figure <- dfw_by_modality_top20_on_hyb %>% 
  ggplot(mapping = aes(x = dfw_rate, y = reorder(course, dfw_rate), color = modality, group = course)) +
  theme_classic() +
  theme(
    plot.title = element_markdown(size = 15, color = "black", face = "bold", margin = margin(t = 20, b = 15)),
    axis.title.x = element_text(size = 12, color = "black", face = "bold", margin = margin(t = 10)),
    axis.title.y = element_blank(),
    axis.text = element_text(size = 11, color = "black"),
    axis.line = element_line(linewidth = 1),
    axis.ticks = element_line(linewidth = 1)
  ) +
  labs(
    x = "DFW Rate (%)", 
    title = on_hyb_title
  ) +
  geom_line(
    linewidth = 2, 
    color = "gray"
  ) +
  geom_point(size = 7) +
  scale_color_manual(values = c("#0554A3", "#26A5CA")) +
  scale_x_continuous(
    limits = c(0, x_scale_max),
    labels = percent_format(scale = 1)
  ) +
  guides(color = "none")
(1) Define the data and the mapping aesthetics for the plot. Use y = reorder(course, dfw_rate) to order the courses from highest to lowest DFW rate, color = modality to give distinct colors to each modality, and group = course to connect the DFW rate points for each course with a line.
(2) Customize the theme.
(3) Set the plot title and label the x-axis.
(4) Add the geom_line layer with a specific line width and color.
(5) Add the geom_point layer.
(6) Set the colors for the points.
(7) Customize the x-axis scale using the predefined variable.
(8) Remove the legend to reduce clutter.

4. Render the dumbbell plot

Finally, we'll render the plots by adding them together and using plot_layout() from the {patchwork} package. Set ncol to 1 and nrow to 2 to arrange the plots in one column with two rows. The widths argument is set to 4 for each plot, although with a single column it should have no visible effect on the layout. The heights argument scales each plot in proportion to the number of courses in its data frame (each course contributes two rows, one per modality), so the courses are spaced evenly across both figures.

face_to_face_figure + online_hybrid_figure + plot_layout(ncol = 1, 
                                                         nrow = 2, 
                                                         widths = c(4, 4), 
                                                         heights = c((nrow(dfw_by_modality_top20_face)/2), (nrow(dfw_by_modality_top20_on_hyb)/2)))

Creating a Lollipop Plot

In this section, we’ll assess whether there is a difference in the likelihood of receiving a DFW grade based on instruction modality. This is accomplished by calculating the odds ratios and p-values that compare face-to-face versus online/hybrid instruction modalities. We’ll also create a lollipop plot to display the results.

1. Analyze the data

First, initialize the odds_df data frame, which will store the odds ratio and p-value for each course. Then initialize another data frame, temp_df, which is used to store the counts of ABC and DFW grades for each instruction modality. Additionally, initialize a variable, i, to access the correct row in odds_df.

# Initialize a data frame to store the odds ratios and p-values
odds_df <- data.frame(course = unique(dfw_by_modality_top20$course), 
                      odds_ratio = NA,
                      p_value = NA)


# Initialize a data frame to store the frequencies, which will be added in the loop
temp_df <- data.frame(modality = as.factor(c("face", "face", "on_hyb", "on_hyb")),
                      result = as.factor(c("abc", "dfw", "abc", "dfw")),
                      freq = NA)

# Initialize i to start at 1 and use in the loop to access the row number of the odds_df data frame
i = 1

Use a for loop to iterate through each of the top 20 courses and calculate the odds ratios and p-values. For each iteration, filter dfw_by_modality_top20 to include data only for the current course. Once filtered, assign the grade frequencies for both modalities to the temp_df data frame. Next, fit the logistic regression with result (ABC vs. DFW) as the dependent variable and modality (face-to-face vs. online/hybrid) as the independent variable, with weights set to the freq variable. Because "abc" and "face" are the first levels of their factors, the model estimates the odds of a DFW grade in online/hybrid sections relative to face-to-face sections. After running the model, extract the odds ratio and p-value from the summary and store them in the odds_df data frame. Note that the coefficient is on the log-odds scale, so it is converted to an odds ratio by exponentiating it. Finally, increment the variable i by 1 to move to the next row of odds_df.

for (name in odds_df$course){
  
  # Keep only the rows for the current course (one row per modality)
  filtered_df <- filter(dfw_by_modality_top20, course == name)
  face_row <- filter(filtered_df, modality == "face")
  on_hyb_row <- filter(filtered_df, modality == "on_hyb")
  
  # Add the frequencies from filtered_df to temp_df, matching its row order:
  # face/abc, face/dfw, on_hyb/abc, on_hyb/dfw
  temp_df$freq <- c(face_row$abc_count, face_row$dfw_count, 
                    on_hyb_row$abc_count, on_hyb_row$dfw_count)
  
  # Run the glm
  temp_glm <- glm(result ~ modality, weights = freq, data = temp_df, family = binomial(logit))
  
  # Assign the summary of the glm to a variable that can be used in the next step to extract coefficients
  temp_summary <- summary(temp_glm)
  
  # Extract coefficients from the summary, convert from log form by taking the exponent,
  # round them to 2 decimals, and add to the odds_df data frame
  odds_df[i, 2] <- round(exp(temp_summary$coefficients[2, 1]), 2)  # odds ratio
  
  odds_df[i, 3] <- round(temp_summary$coefficients[2, 4], 3)  # p-value

  i <- i + 1
}
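
As a sanity check on what the model returns, the sketch below compares the odds ratio from a weighted logistic regression with the same quantity calculated by hand from a 2x2 table. The grade counts are made up for illustration, and the names toy_df, odds_ratio_by_hand, and odds_ratio_glm are introduced here and are not part of the tutorial.

```r
# Hypothetical grade counts for one course (not from the tutorial data)
toy_df <- data.frame(
  modality = factor(c("face", "face", "on_hyb", "on_hyb")),
  result = factor(c("abc", "dfw", "abc", "dfw")),
  freq = c(80, 20, 60, 40)
)

# Odds ratio by hand: odds of DFW in online/hybrid divided by odds of DFW in face-to-face
odds_face <- 20 / 80
odds_on_hyb <- 40 / 60
odds_ratio_by_hand <- odds_on_hyb / odds_face

# The same quantity from the weighted logistic regression.
# "abc" is the first level of result, so the model describes the odds of "dfw";
# "face" is the first level of modality, so it is the reference group.
toy_glm <- glm(result ~ modality, weights = freq, data = toy_df, family = binomial(logit))
odds_ratio_glm <- exp(coef(toy_glm)[["modalityon_hyb"]])

round(c(by_hand = odds_ratio_by_hand, glm = odds_ratio_glm), 2)
```

Both calculations give an odds ratio of about 2.67 for these toy counts, meaning the odds of a DFW grade are roughly 2.7 times higher in the online/hybrid sections. The regression adds the p-value, which the hand calculation does not provide.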

2. Prepare the data for the plot

The next step is to add a few variables to the odds_df data frame that will aid in visualizing the results. First, create a new variable, odds_ratio_category, to categorize the courses based on their p-value and odds ratio. Next, assign colors to each category and format p-values for better readability. Last, generate two variables that will be used to label the plot.

odds_df <- odds_df %>% 
  mutate(odds_ratio_category = case_when(p_value < 0.05 & odds_ratio >= 1 ~ "above1",
                                         p_value < 0.05 & odds_ratio < 1 ~ "below1",
                                         p_value >= 0.05 ~ "neither"),
         color_id = case_when(odds_ratio_category == "above1" ~ "#0554A3",
                              odds_ratio_category == "below1" ~ "#26A5CA",
                              odds_ratio_category == "neither" ~ "gray"),
         # p-values were rounded to 3 decimals, so only values that rounded to 0 are below 0.001
         p_value_label = case_when(p_value < 0.001 ~ "<0.001",
                                  .default = as.character(p_value)),
         above1_label = case_when(odds_ratio >= 1 ~ glue("{odds_ratio} ({p_value_label})"),
                                  TRUE ~ ""),
         below1_label = case_when(odds_ratio < 1 ~ glue("{odds_ratio} ({p_value_label})"),
                                  TRUE ~ "")) 

3. Render the lollipop plot

Finally, create the plot.

odds_df %>% 
  ggplot(mapping = aes(x = odds_ratio, y = reorder(course, odds_ratio), color = color_id)) + # (1)
  theme_classic() +
  theme( # (2)
    plot.title = element_markdown(size = 15, color = "black", face = "bold", margin = margin(b = 15)),
    plot.subtitle = element_markdown(size = 13, color = "black", margin = margin(b = 15)),
    plot.caption = element_text(size = 10, color = "black", hjust = 0.5, margin = margin(t = 10)),
    axis.title.x = element_text(size = 12, color = "black", face = "bold", margin = margin(t = 10)),
    axis.title.y = element_blank(),
    axis.text = element_text(size = 11, color = "black"),
    axis.line = element_line(linewidth = 1),
    axis.ticks = element_line(linewidth = 1)
  ) +
  labs( # (3)
    title = "In most courses, students in Online/Hybrid sections are <span style = 'color: #0554A3;'>**more likely to DFW**</span> 
    <br> rather than <span style = 'color: #26A5CA;'>less likely to DFW</span> compared with Face-to-Face sections",
    subtitle = "In some courses, there is <span style = 'color: gray50;'>**no difference in DFW likelihood**</span>", 
    caption = "The text next to each point displays the Odds Ratio with the P-Value in parentheses",
    x = "Odds Ratio"
  ) +
  geom_vline( # (4)
    aes(xintercept = 1), 
    color = "#58595B", 
    linetype = "dashed"
  ) +
  geom_segment( # (5)
    aes(xend = 1, yend = course), 
    linewidth = 1.5  # linewidth replaces size for lines in ggplot2 3.4.0 and later
  ) +
  geom_point(size = 7) + # (6)
  geom_text( # (7)
    aes(label = above1_label), 
    nudge_x = 0.35, 
    size = 4,
    na.rm = TRUE
  ) +
  geom_text( # (8)
    aes(label = below1_label), 
    nudge_x = -0.35,
    size = 4
  ) +
  scale_color_identity() + # (9)
  scale_x_continuous(limits = c(0, (max(odds_df$odds_ratio) + 0.5))) # (10)
(1) Define the mapping to plot the odds ratio on the x-axis and reorder courses by the odds ratio on the y-axis.
(2) Customize the theme.
(3) Set the plot title, subtitle, caption, and x-axis label, with custom styling in the title and subtitle for color highlighting.
(4) Insert a dashed vertical line at x = 1 to visually separate odds ratios above and below 1.
(5) Use geom_segment() to connect each point to the vertical reference line at x = 1, with a specified line width.
(6) Add the data points with a specified size.
(7) Include text labels for points where the odds ratio is 1 or above, with a horizontal adjustment (nudge_x) and specified text size.
(8) Include text labels for points where the odds ratio is below 1, with a horizontal adjustment (nudge_x) and specified text size.
(9) Use scale_color_identity() to apply colors directly from the color_id variable in the data frame.
(10) Set the limits of the x-axis to range from 0 to slightly beyond the maximum odds ratio, adding extra space for the text labels.