Time to Degree Diagnostic Tutorial

Identifying Student Success Insights Through Data Analysis

Part 1: Time to Degree Diagnostic Tutorial

This tutorial will guide you through an example of how to operationalize data for student success. Hypothetical institutional data of first-time, full-time students will be used to create a set of easily digestible data visualizations that help identify majors with student progression challenges. An example report with the visualizations is available, as are the example data if you want to follow along.

Note

This article assumes some familiarity with R and the {tidyverse} packages used to clean, analyze, and visualize the data.

Determining the Top 20 Majors

Our initial task is to determine the 20 majors that have the most students. The steps below outline that process.

1. Load packages

First, load the necessary packages.

packages <- c("readr", "janitor", "dplyr", "tidyr", "forcats", "glue", 
              "stringr", "gt", "ggplot2", "ggrepel", "scales")

# Attach each package; invisible() suppresses the list that lapply() would print
invisible(lapply(packages, library, character.only = TRUE))

2. Read in the data

Next, read in the data file and examine the data/variables.

ttd_data <- read_csv("data/time_to_degree_tutorial_data.csv", 
                      show_col_types = FALSE)

head(ttd_data, 10)
# A tibble: 10 × 9
   cohort_year major    grad_year  grad terms_to_degree seven_or_fewer eight
         <dbl> <chr>        <dbl> <dbl>           <dbl>          <dbl> <dbl>
 1        2017 Major 44      2023     1               8              0     1
 2        2014 Major 64      2020     1               8              0     1
 3        2017 Major 47      2023     1               9              0     0
 4        2017 Major 12      2023     1               8              0     1
 5        2017 Major 60      2023     1               8              0     1
 6        2018 Major 30      2024     0              NA              0     0
 7        2016 Major 43      2022     0              NA              0     0
 8        2017 Major 8       2023     0              NA              0     0
 9        2018 Major 43      2024     0              NA              0     0
10        2016 Major 57      2022     0              NA              0     0
# ℹ 2 more variables: nine_or_ten <dbl>, eleven_or_twelve <dbl>

The data set contains 10,000 first-time, full-time (non-transfer) students from 75 majors. Each row of the data represents one student and includes a value for the following variables:

  • cohort_year: The year the student first enrolled at GSU.
  • major: The student’s first-declared major. Note that in this hypothetical data set, the majors are given a generic name such as Major 44 or Major 57.
  • grad_year: The year a student would need to graduate by to be classified as graduating within 6 years.
  • grad: A binary value indicating if a student graduated within 6 years (1 = yes; 0 = no).
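  • terms_to_degree: The number of terms a student was enrolled before graduating. In the example data this value is missing (NA) for students who did not graduate within 6 years.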
  • seven_or_fewer: A binary value indicating if a student graduated in seven or fewer terms (1 = yes; 0 = no).
  • eight: A binary value indicating if a student graduated in eight terms (1 = yes; 0 = no).
  • nine_or_ten: A binary value indicating if a student graduated in nine or ten terms (1 = yes; 0 = no).
  • eleven_or_twelve: A binary value indicating if a student graduated in eleven or twelve terms (1 = yes; 0 = no).
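
Before aggregating, a quick consistency check can confirm that the indicator columns agree with one another. The sketch below uses a small hand-made data frame (hypothetical values, not the tutorial file) and assumes that every graduate falls into exactly one of the four term categories and that non-graduates have a missing terms_to_degree. Neither rule is stated in the data description, so treat them as assumptions to verify against your own data.

```r
# Hypothetical rows that mimic the layout of the tutorial data
toy <- data.frame(
  grad             = c(1, 1, 0, 1),
  terms_to_degree  = c(8, 9, NA, 7),
  seven_or_fewer   = c(0, 0, 0, 1),
  eight            = c(1, 0, 0, 0),
  nine_or_ten      = c(0, 1, 0, 0),
  eleven_or_twelve = c(0, 0, 0, 0)
)

# Number of term categories flagged for each student
flag_cols <- c("seven_or_fewer", "eight", "nine_or_ten", "eleven_or_twelve")
n_flags <- rowSums(toy[, flag_cols])

# Assumed rule: graduates carry exactly one flag, non-graduates carry none
flags_match_grad <- all(n_flags == toy$grad)

# Assumed rule: terms_to_degree is missing only for non-graduates
na_matches_grad <- all(is.na(toy$terms_to_degree) == (toy$grad == 0))

flags_match_grad
na_matches_grad
```

If either check returns FALSE on real data, the percentages calculated later may not add up to 100 within a major.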

3. Determine the top 20 majors

The raw data are at the student level, but we want to aggregate the outcome variables, such as time to degree, at the major level. This is done with group_by(), summarise(), and mutate() from the {dplyr} package. Note that the percent_* variables use the number of graduates (count_grad), not the total number of students, as the denominator. The final function in the pipe, arrange(), orders the aggregated data by the number of students in each first-declared major, from most to least. Next, the data frame is subset to the 20 majors with the most students, which together account for 8,087 students. The other 55 majors have a total of only 1,913 students, so the top 20 majors enroll more than four times as many students as the bottom 55. Focusing evidence-based practices on this large group is therefore likely to have the greatest effect on time to degree and graduation rates.

# Create a new data frame grouped by major and calculate new variables 
ttd_by_major <- ttd_data %>% 
  group_by(major) %>% 
  summarise(grad_rate = mean(grad)*100,
            terms_to_degree = mean(terms_to_degree, na.rm = TRUE),
            count = n(),
            count_grad = sum(grad),
            count_seven_or_fewer = sum(seven_or_fewer, na.rm = TRUE),
            count_eight = sum(eight, na.rm = TRUE),
            count_nine_or_ten = sum(nine_or_ten, na.rm = TRUE),
            count_eleven_or_twelve = sum(eleven_or_twelve, na.rm = TRUE)) %>%
  mutate(percent_seven_or_fewer = round(count_seven_or_fewer/count_grad*100, 2),
         percent_eight = round(count_eight/count_grad*100, 2),
         percent_nine_or_ten = round(count_nine_or_ten/count_grad*100, 2),
         percent_eleven_or_twelve = round(count_eleven_or_twelve/count_grad*100, 2)) %>% 
  arrange(-count)

# Subset the data frame to only include the top 20 and assign to a new variable
ttd_by_major_top20 <- ttd_by_major[1:20,]

# View the data frame
ttd_by_major_top20
# A tibble: 20 × 13
   major    grad_rate terms_to_degree count count_grad count_seven_or_fewer
   <chr>        <dbl>           <dbl> <int>      <dbl>                <dbl>
 1 Major 8       55.7            8.59  1272        708                   96
 2 Major 12      50.6            8.75  1173        594                   46
 3 Major 62      49.7            8.95   503        250                   36
 4 Major 46      52.8            8.57   487        257                   20
 5 Major 13      42.2            8.65   434        183                   18
 6 Major 19      48.5            8.40   431        209                   31
 7 Major 64      60.2            8.10   394        237                   56
 8 Major 60      56.2            8.48   384        216                   32
 9 Major 53      56.4            8.32   376        212                   40
10 Major 31      53.5            8.48   357        191                   23
11 Major 49      47.4            8.59   327        155                   25
12 Major 11      46.6            8.70   262        122                   22
13 Major 44      60.4            8.38   245        148                   23
14 Major 6       48.5            8.71   229        111                   10
15 Major 14      66.5            7.92   215        143                   32
16 Major 47      45.1            8.58   215         97                   16
17 Major 58      61.9            8.34   215        133                   18
18 Major 54      48.8            8.10   209        102                   22
19 Major 43      53.9            8.02   180         97                   25
20 Major 48      46.9            8.30   179         84                   13
# ℹ 7 more variables: count_eight <dbl>, count_nine_or_ten <dbl>,
#   count_eleven_or_twelve <dbl>, percent_seven_or_fewer <dbl>,
#   percent_eight <dbl>, percent_nine_or_ten <dbl>,
#   percent_eleven_or_twelve <dbl>
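
The comparison between the top 20 majors and the remaining 55 is simple arithmetic on the counts quoted earlier. A minimal sketch, using the totals from the text rather than recomputing them from the data (the variable names are illustrative):

```r
# Totals quoted in the text
top20_students <- 8087
other_students <- 1913

# Share of all students who start in a top-20 major
top20_share <- top20_students / (top20_students + other_students)

# How many times larger the top-20 group is than the remaining 55 majors
size_ratio <- top20_students / other_students

round(top20_share * 100, 1)  # 80.9
round(size_ratio, 2)         # 4.23
```

With the full data loaded, sum(ttd_by_major_top20$count) and sum(ttd_by_major$count) should reproduce the 8,087 and 10,000 totals.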

Creating a Table

1. Prep the data for the table

In the previous section, we created a data frame of the top 20 first declared majors. Next, a table is made that displays the majors along with their student count, average terms to degree, and graduation rate. Since only specific variables from the data frame will be in the table, the select() function is used to get the variables needed. Additionally, the round() function is used inside mutate() to round the values for terms to degree and graduation rate, making them easier to read. We’ll also update the column names to be easier to read in the table.

# Create the data frame for the table
ttd_table <- ttd_by_major_top20 %>% 
  select(major, count, terms_to_degree, grad_rate) %>% 
  mutate(terms_to_degree = round(terms_to_degree, 1),
         grad_rate = round(grad_rate, 1)) %>% 
  arrange(-count)

# Update names to look nice in the table header
names(ttd_table) <- c("Major", "Number of Students", "Terms to Degree", "Graduation Rate (%)")

2. Render the table

Next, the gt() function from the {gt} package is used to create the table. The majors will be arranged from top to bottom based on the number of students, but we also want to draw attention to the values for terms to degree. To accomplish that, color-coding is added using the data_color() function. Cells with longer terms to degree have a darker blue color, while cells with shorter terms to degree have a lighter blue color.

ttd_table %>%
  gt() %>%
  cols_align(
    align = "center", 
    columns = c("Number of Students", "Terms to Degree", "Graduation Rate (%)")
  ) %>% 
  tab_header(
    title = "Top 20 First Declared Majors by Number of Students, 2019-2024"
  ) %>%
  data_color(
    columns = c("Terms to Degree"),
    colors = scales::col_numeric(
      c("white", "#0554A3", "#2B3555"), 
      domain = c(7.7, 9.1)
    )
  ) %>% 
  tab_footnote("NOTE: Graduation Rate is the 6-year rate.")

Top 20 First Declared Majors by Number of Students, 2019-2024

Major      Number of Students   Terms to Degree   Graduation Rate (%)
Major 8                  1272               8.6                  55.7
Major 12                 1173               8.8                  50.6
Major 62                  503               9.0                  49.7
Major 46                  487               8.6                  52.8
Major 13                  434               8.7                  42.2
Major 19                  431               8.4                  48.5
Major 64                  394               8.1                  60.2
Major 60                  384               8.5                  56.2
Major 53                  376               8.3                  56.4
Major 31                  357               8.5                  53.5
Major 49                  327               8.6                  47.4
Major 11                  262               8.7                  46.6
Major 44                  245               8.4                  60.4
Major 6                   229               8.7                  48.5
Major 14                  215               7.9                  66.5
Major 47                  215               8.6                  45.1
Major 58                  215               8.3                  61.9
Major 54                  209               8.1                  48.8
Major 43                  180               8.0                  53.9
Major 48                  179               8.3                  46.9

NOTE: Graduation Rate is the 6-year rate.

Visualizations

So far, we’ve identified the 20 most popular first-declared majors and determined two key performance indicators for each major: (1) the average terms to degree and (2) the graduation rate. Next, we will create three figures to help uncover patterns in the data and effectively communicate these insights to leadership.

  1. The first figure is a bubble plot with quadrants. A bubble plot is a type of scatter plot that displays three variables (one on the x-axis, one on the y-axis, and one mapped to the size of the points/bubbles). Quadrants are used to help identify which of the first-declared majors have a relatively longer time to degree and a lower graduation rate.

  2. The second figure is a stacked bar chart, which is used for displaying two categorical variables. In our case, one variable is the major and the other is the time to degree category. This chart will help determine how many students within each major graduate within different time to degree categories.

  3. The third figure is a dumbbell plot, which we will use to show the change in terms to degree between two different graduation years. It will help determine which majors are trending toward a faster time to degree and which are trending toward a slower time to degree.

Creating a Bubble Plot

1. Prep the data for the plot

The plot should have quadrants of equal sizes, so we’ll center the x and y axes around the averages of the x and y variables. To accomplish this, it will be helpful to create a few new variables that will be used within the scaling and annotation functions in the next step.

# Assigning the X axis mean, max, and min to new variables. 
# Will use these in the ifelse statement below and in a ggplot layer to set 
# the X axis scale limits
x_mean <- mean(ttd_by_major_top20$terms_to_degree)
x_max <- max(ttd_by_major_top20$terms_to_degree)
x_min <- min(ttd_by_major_top20$terms_to_degree)

# Use ifelse statement to assign the range. 
# This lets the entire range of values show on the X axis and also 
# keeps the mean X value as the center of the X axis
x_range <- ifelse((x_max-x_mean) > (x_mean-x_min), (x_max-x_mean), (x_mean-x_min))


# Assigning the Y axis mean, max, and min to new variables. 
# Will use these in the ifelse statement below and in a ggplot layer to set 
# the Y axis scale limits
y_mean <- mean(ttd_by_major_top20$grad_rate)
y_max <- max(ttd_by_major_top20$grad_rate)
y_min <- min(ttd_by_major_top20$grad_rate)

# Use ifelse statement to assign the range. 
# This lets the entire range of values show on the Y axis and also 
# keeps the mean Y value as the center of the Y axis
y_range <- ifelse((y_max-y_mean) > (y_mean-y_min), (y_max-y_mean), (y_mean-y_min))
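
The same half-width can be computed without ifelse() by taking the largest absolute distance from the mean. The sketch below uses a handful of terms-to-degree values from the top-20 table (not all 20 majors, so the mean differs from the real one) and illustrative variable names, to show that the resulting limits are symmetric around the mean and still contain every point.

```r
# A handful of terms-to-degree values taken from the top-20 table
terms <- c(8.59, 8.75, 8.95, 8.10, 7.92)

terms_mean <- mean(terms)

# Largest distance from the mean in either direction;
# equivalent to comparing (max - mean) with (mean - min) and keeping the larger
half_width <- max(abs(range(terms) - terms_mean))

# Symmetric axis limits centred on the mean
axis_limits <- c(terms_mean - half_width, terms_mean + half_width)
axis_limits
```

Because the limits extend the same distance on both sides, the dashed line drawn at the mean falls at the center of the axis, which is what keeps the quadrants equal in size.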

2. Render the bubble plot

Use the ggplot() function along with various geom layers to create the quadrant plot, as outlined in the steps below the code chunk.

# (1)
ggplot(data = ttd_by_major_top20, mapping = aes(x = terms_to_degree, y = grad_rate)) +
  # (2)
  theme_classic() +
  theme(
    plot.title = element_text(size = 16, face = "bold", color = "black", margin = margin(b = 15), hjust = 0.5),
    plot.subtitle = element_text(size = 12, color = "black", margin = margin(b = 15), hjust = 0.5),
    axis.title.x = element_text(size = 14, face = "bold", color = "black", margin = margin(t = 10)),
    axis.title.y = element_text(size = 14, face = "bold", color = "black", margin = margin(r = 15)),
    axis.text = element_text(size = 12, color = "black"),
    axis.line = element_line(linewidth = 1),
    axis.ticks = element_line(linewidth = 1),
    legend.position = "top",
    legend.title = element_text(face = "bold")
  ) +
  # (3)
  labs(
    title = "Terms to Degree and Graduation Rate\nfor the Top 20 First Declared Majors, 2019-2024",
    subtitle = "Label: Major (Average Graduation Rate; Average Terms to Degree)",
    x = "Average Terms to Degree",
    y = "Average 6-Year Graduation Rate",
    size = "Number of Students:"
  ) +
  # (4)
  geom_vline(
    aes(xintercept = mean(terms_to_degree)),
    color = "#58595B",
    linetype = "dashed",
    linewidth = 0.65  # line widths use linewidth (size is deprecated for lines in ggplot2 3.4.0+)
  ) +
  annotate(
    "text",
    x = x_mean,
    y = 40,
    label = glue("Avg: {round(x_mean, 2)}"),
    angle = 90,
    vjust = -0.5,
    hjust = 0.5
  ) +
  # (5)
  geom_hline(
    aes(yintercept = mean(grad_rate)),
    color = "#58595B",
    linetype = "dashed",
    linewidth = 0.65
  ) +
  annotate(
    "text",
    x = 7.89,
    y = y_mean,
    label = glue("Avg: {round(y_mean, 1)}%"),
    vjust = -0.5,
    hjust = 0.6
  ) +
  # (6)
  geom_point(
    aes(size = count),
    shape = 21,
    color = "white",
    fill = "#0554A3"
  ) +
  # (7)
  geom_label_repel(
    aes(label = glue("{major} ({round(grad_rate, 1)}%; {round(terms_to_degree, 2)})")),
    max.overlaps = 50,
    min.segment.length = 0,
    size = 3.5
  ) +
  scale_size(range = c(1, 18)) +
  # (8)
  scale_x_continuous(
    limits = c((x_mean - x_range) - 0.1, (x_mean + x_range) + 0.1),
    breaks = seq(8, 9.5, 0.5)
  ) +
  scale_y_continuous(
    labels = percent_format(scale = 1),
    limits = c((y_mean - y_range), (y_mean + y_range))
  )

  1. Define the data used for the plot and set the mapping aesthetics (i.e., terms to degree on the x-axis and graduation rate on the y-axis).
  2. Customize the theme.
  3. Add the title and axis labels.
  4. Add a vertical dashed line with an x-axis intercept equal to the average terms to degree and include a text annotation of the average.
  5. Add a horizontal dashed line with a y-axis intercept equal to the average graduation rate and include a text annotation of the average.
  6. Add the points, with their size set equal to the count variable, so majors with more students will have larger points.
  7. Add labels to the points that include the major, the graduation rate, and the average terms to degree.
  8. Adjust the x and y scales using the variables we created in the previous step.
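
The quadrants can also be read off numerically. The sketch below is one possible way to do that, not part of the original workflow. It takes four majors from the top-20 table and labels each by its position relative to the two averages. The averages here are computed from this four-row subset only, so they differ from the real top-20 averages, and the quadrant column is a hypothetical name.

```r
# Four majors copied from the top-20 table
majors <- data.frame(
  major           = c("Major 8", "Major 13", "Major 14", "Major 54"),
  grad_rate       = c(55.7, 42.2, 66.5, 48.8),
  terms_to_degree = c(8.59, 8.65, 7.92, 8.10)
)

# Means of this four-row subset only (not the real top-20 averages)
terms_avg <- mean(majors$terms_to_degree)
grad_avg  <- mean(majors$grad_rate)

# Describe each major relative to the two dashed reference lines
time_side <- ifelse(majors$terms_to_degree > terms_avg, "longer time", "shorter time")
rate_side <- ifelse(majors$grad_rate < grad_avg, "lower grad rate", "higher grad rate")
majors$quadrant <- paste(time_side, rate_side, sep = ", ")

majors[, c("major", "quadrant")]
```

Majors labeled "longer time, lower grad rate" correspond to the lower-right quadrant of the bubble plot, which is the group the plot is designed to surface.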

Creating a Stacked Bar Plot

In this section, we will create a stacked bar plot to show the distribution of the time to degree categories for each major.

1. Prep the data for the bar plot

First, start with the ‘ttd_by_major_top20’ data frame, select the relevant columns, and pivot the data from wide to long format using the pivot_longer() function from the {tidyr} package. The resulting ‘long’ form data frame will now have a column called ‘term_percent’ containing all the percentage data, along with a corresponding column called ‘term_cat’ that indicates the term category (e.g., percent_eight).

ttd_by_major_top20_long <- ttd_by_major_top20 %>% 
  select(major, percent_seven_or_fewer, percent_eight, percent_nine_or_ten, percent_eleven_or_twelve) %>% 
  pivot_longer(cols = starts_with("percent"), names_to = "term_cat", values_to = "term_percent") 

Next, because we want the majors plotted in a specific order, based on the percentage of students graduating in seven or fewer terms, create a character vector with the majors arranged by this percentage. Then pass that vector to the ‘levels’ argument of the factor() function and assign the resulting factor to the major variable in the data frame created in the last step.

order <- ttd_by_major_top20_long %>%
  filter(term_cat == "percent_seven_or_fewer") %>%
  arrange(term_percent) %>% 
  pull(major)

ttd_by_major_top20_long$major <- factor(ttd_by_major_top20_long$major, levels = order)
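
To see what the pivot and the factor ordering do, here is a minimal sketch on a toy data frame. The majors, percentages, and the names toy_wide, toy_long, and major_order are all hypothetical; only the functions mirror the steps above.

```r
library(dplyr)
library(tidyr)

# Hypothetical percentages for three majors and two categories
toy_wide <- tibble(
  major                  = c("Major A", "Major B", "Major C"),
  percent_seven_or_fewer = c(20, 10, 15),
  percent_eight          = c(50, 60, 55)
)

# One row per major and category after pivoting
toy_long <- toy_wide %>%
  pivot_longer(cols = starts_with("percent"),
               names_to = "term_cat", values_to = "term_percent")

# Order majors by the share finishing in seven or fewer terms (ascending)
major_order <- toy_long %>%
  filter(term_cat == "percent_seven_or_fewer") %>%
  arrange(term_percent) %>%
  pull(major)

toy_long$major <- factor(toy_long$major, levels = major_order)

nrow(toy_long)          # 3 majors x 2 categories = 6 rows
levels(toy_long$major)  # "Major B" "Major C" "Major A"
```

Because ggplot2 draws the first factor level at the bottom of a discrete y-axis, ascending order places the major with the largest share at the top of the chart.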

Finally, assign a specific order to the ‘term_cat’ factor levels to ensure the categories will plot from the shortest to the longest from left to right.

ttd_by_major_top20_long$term_cat <- factor(ttd_by_major_top20_long$term_cat, levels = c("percent_seven_or_fewer", 
                                                                                        "percent_eight",  
                                                                                        "percent_nine_or_ten",  
                                                                                        "percent_eleven_or_twelve"))

2. Render the stacked bar plot

Now that the data are in the right format, create the plot.

# (1)
ggplot(
  data = ttd_by_major_top20_long,
  mapping = aes(x = term_percent, y = major, fill = fct_rev(term_cat), label = round(term_percent))
  ) +
  theme_classic() +
  # (2)
  theme(
    plot.title = element_text(size = 16, face = "bold", color = "black", margin = margin(b = 15), hjust = 0.5),
    plot.caption = element_text(size = 10, color = "black", margin = margin(t = 15), hjust = 0.5),
    axis.title = element_blank(),
    axis.line = element_blank(),
    axis.ticks = element_blank(),
    axis.text.x = element_blank(),
    axis.text.y = element_text(size = 12, color = "black", margin = margin(r = -20)),
    legend.position = "top",
    legend.title = element_text(size = 11, face = "bold", color = "black", margin = margin(r = 15))
  ) +
  # (3)
  labs(
    title = "Percentage of Students Within Each\nTerms to Degree Category, 2019-2024",
    fill = "Terms to Degree:",
    caption = "NOTE: The sum of the percents within a major may be greater than 100 due to rounding"
  ) +
  # (4)
  guides(fill = guide_legend(reverse = TRUE)) +
  # (5)
  geom_bar(
    stat = "identity",
    position = "stack"
  ) +
  # (6)
  geom_text(
    position = position_stack(vjust = 0.5),
    color = "white",
    size = 5
  ) +
  # (7)
  scale_fill_manual(
    values = c("#2B3555", "#0554A3", "#26A5CA", "#58595B"),
    labels = c("percent_seven_or_fewer" = "7 or Fewer Terms ",
               "percent_eight" = "8 Terms ",
               "percent_nine_or_ten" = "9 or 10 Terms ",
               "percent_eleven_or_twelve" = "11 or 12 Terms ")
  )

  1. Define the data for the plot and set the mapping aesthetics. Set the fill argument to the term_cat variable so each category has a different color, and the label argument to the term_percent value rounded to the nearest whole number.
  2. Customize the theme.
  3. Customize the title, caption, and the legend title (using the fill argument inside the labs() function).
  4. Reverse the order of the legend to match the order of the bars in the plot.
  5. Add the geom_bar layer, setting the position argument to stack to create a stacked bar plot.
  6. Add the text labels and set vjust to 0.5 so the labels appear in the middle of each bar segment.
  7. Specify the colors for the fill and update the legend label text to be more readable.
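
The caption's rounding caveat is easy to reproduce. In the sketch below, the four percentages are hypothetical values chosen to sum to exactly 100, yet the rounded labels shown on the bars sum to 101.

```r
# Hypothetical category percentages for one major; they sum to 100
term_percent <- c(13.6, 40.6, 30.6, 15.2)

# Labels on the bars are rounded to whole numbers
bar_labels <- round(term_percent)

sum(term_percent)  # 100
sum(bar_labels)    # 101
```

Only the labels are rounded. The bar lengths still use the unrounded term_percent values.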

Creating a Dumbbell Plot

The last plot we will create is a dumbbell plot showing the change in average terms to degree from 2019 to 2024 for each major.

1. Prep the data

The first step is to create a new data frame with an average terms to degree for each year and major. To do that, group by the major and year, and then calculate the terms to degree inside the summarise() function. Since only the top 20 majors and the years 2019 and 2024 are needed, filter the data frame to only include those majors and years. Finally, select the appropriate columns.

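ttd_by_major_year_top20 <- ttd_data %>% 
  group_by(major, grad_year) %>% 
  # .groups = "drop" returns an ungrouped data frame and silences the grouping message
  summarise(terms_to_degree = mean(terms_to_degree, na.rm = TRUE), .groups = "drop") %>% 
  # grad_year is numeric, so compare against numbers rather than strings
  filter(major %in% ttd_by_major_top20$major,
         grad_year %in% c(2019, 2024)) %>% 
  select(major, grad_year, terms_to_degree)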

Next, pivot the data frame to wide format using the pivot_wider() function and rename the years. There are a few majors that don’t have data for 2019 or 2024, so filter those out and then calculate the difference, or change, from 2019 to 2024.

ttd_clean_by_major_year_wide <- ttd_by_major_year_top20 %>% 
  pivot_wider(names_from = grad_year, values_from = terms_to_degree) %>% 
  rename(year_2019 = "2019",
         year_2024 = "2024") %>% 
  filter(!is.na(year_2019),
         !is.na(year_2024)) %>% 
  mutate(ttd_diff = round(year_2024 - year_2019, 2))
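
The reshaping step can be tried on a toy example first. The sketch below uses hypothetical majors and values. It also uses the names_prefix argument of pivot_wider() to create the year_2019 and year_2024 columns directly, which is an alternative to the rename() call above rather than the approach this tutorial takes.

```r
library(dplyr)
library(tidyr)

# Hypothetical yearly averages; "Major B" has no 2019 value
toy_years <- tibble(
  major           = c("Major A", "Major A", "Major B"),
  grad_year       = c(2019, 2024, 2024),
  terms_to_degree = c(8.6, 8.1, 8.4)
)

toy_years_wide <- toy_years %>%
  # One column per year, prefixed so the names are valid without backticks
  pivot_wider(names_from = grad_year, values_from = terms_to_degree,
              names_prefix = "year_") %>%
  # Drop majors that are missing either year
  filter(!is.na(year_2019), !is.na(year_2024)) %>%
  # Negative values mean fewer terms to degree in 2024 than in 2019
  mutate(ttd_diff = round(year_2024 - year_2019, 2))

toy_years_wide
```

Only "Major A" survives the filter, with a ttd_diff of -0.5, meaning its graduates finished half a term faster on average in this made-up example.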

2. Render the plot

Finally, create the plot.

# (1)
ggplot(
  data = ttd_clean_by_major_year_wide,
  mapping = aes(x = year_2019, y = reorder(major, -year_2024), xend = year_2024 + 0.04, yend = major)
  ) +
  theme_classic() +
  # (2)
  theme(
    plot.title = element_text(size = 16, face = "bold", color = "black", margin = margin(b = 15), hjust = 0.5),
    axis.title.x = element_text(size = 14, face = "bold", color = "black", margin = margin(t = 10)),
    axis.title.y = element_blank(),
    axis.text = element_text(size = 12, color = "black"),
    axis.line = element_line(linewidth = 1),   # linewidth replaces size for lines in ggplot2 3.4.0+
    axis.ticks = element_line(linewidth = 1)
  ) +
  # (3)
  labs(
    title = "Change in Terms to Degree from 2019 to 2024",
    x = "Average Terms to Degree"
  ) +
  # (4)
  geom_segment(
    color = "gray",
    linewidth = 1,
    arrow = arrow(type = "closed", length = unit(0.2, "cm"))
  ) +
  # (5)
  geom_point(
    color = "#2B3555",
    size = 6
  ) +
  # (6)
  geom_point(
    inherit.aes = FALSE,
    aes(x = year_2024, y = major),
    color = "#26A5CA",
    size = 6
  ) +
  # (7)
  geom_text(
    inherit.aes = FALSE,
    aes(x = year_2024, y = reorder(major, -year_2024), label = round(ttd_diff, 1)),
    color = "#26A5CA",
    fontface = "bold",
    size = 3.5,
    nudge_x = -0.1
  ) +
  # (8)
  annotate(
    "text",
    x = 7.36,
    y = "Major 43",
    label = "2024",
    vjust = -1.5,
    color = "#26A5CA",
    size = 4,
    fontface = "bold"
  ) +
  annotate(
    "text",
    x = 8.31,
    y = "Major 43",
    label = "2019",
    vjust = -1.5,
    color = "#2B3555",
    size = 4,
    fontface = "bold"
  ) +
  # (9)
  scale_x_continuous(
    limits = c(7.25, 9.6),
    breaks = seq(7.5, 9.5, 0.5)
  ) +
  scale_y_discrete(expand = expansion(mult = c(0.05, 0.1)))

  1. Define the data and the mapping aesthetics for the plot. Specify the xend and yend of the segment that connects the 2019 and 2024 points. Because the segment ends in an arrowhead that should stop just short of the 2024 point, 0.04 is added to the 2024 value (the right offset depends on the data, so expect some trial and error until it looks good). If any majors had an increase in terms to degree from 2019 to 2024, they would need to be plotted in separate geom layers with separate aesthetics.
  2. Customize the theme.
  3. Set the plot title and the x-axis title.
  4. Add the geom_segment layer, using the arrow argument to create an arrow at the end of the segment.
  5. Add the geom_point layer for the 2019 points.
  6. Add the geom_point layer for the 2024 points.
  7. Add text labels, positioning them to the left of the 2024 points by setting nudge_x to -0.1.
  8. Add the text annotations.
  9. Customize the x and y scales.
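
The first annotation notes that majors whose terms to degree increased would need their own layers. One way to prepare for that, sketched here with hypothetical values and a hypothetical direction column, is to label each major by the sign of its change and split the data before plotting.

```r
# Hypothetical wide data in the same shape as ttd_clean_by_major_year_wide
toy_change <- data.frame(
  major     = c("Major A", "Major B", "Major C"),
  year_2019 = c(8.6, 8.2, 8.4),
  year_2024 = c(8.1, 8.5, 8.4)
)
toy_change$ttd_diff <- round(toy_change$year_2024 - toy_change$year_2019, 2)

# Label the direction of change from the sign of the difference
toy_change$direction <- ifelse(toy_change$ttd_diff < 0, "faster",
                        ifelse(toy_change$ttd_diff > 0, "slower", "no change"))

# Separate data frames that could feed separate geom_segment() layers,
# each with its own arrow offset (e.g., +0.04 for "faster", -0.04 for "slower")
faster_majors <- toy_change[toy_change$direction == "faster", ]
slower_majors <- toy_change[toy_change$direction == "slower", ]

toy_change[, c("major", "ttd_diff", "direction")]
```

Each subset can then be passed to its own layer through the data argument, so the arrowhead offset points the right way for both groups.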