Time to Degree Diagnostic Tutorial
Identifying Student Success Insights Through Data Analysis
Part 1: Time to Degree Diagnostic Tutorial
This tutorial will guide you through an example of how to operationalize data for student success. Hypothetical institutional data of first-time, full-time students will be used to create a set of easily digestible data visualizations that help identify majors with student progression challenges. An example report with the visualizations is available, as are the example data if you want to follow along.
Determining the Top 20 Majors
Our initial task is to determine the 20 majors that have the most students. The steps below outline that process.
1. Load packages
First, load the necessary packages.
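The package-loading chunk is not shown in this version of the tutorial. Judging from the functions called later (read_csv(), the {dplyr} verbs, pivot_longer(), ggplot(), fct_rev(), gt(), glue(), geom_label_repel(), and percent_format()), a setup along the following lines should be sufficient. The exact list is an assumption inferred from those calls, not the original chunk.

```r
# Assumed package list, inferred from the functions used later in this tutorial
library(tidyverse) # readr, dplyr, tidyr, ggplot2, forcats
library(gt)        # gt(), cols_align(), tab_header(), data_color(), tab_footnote()
library(glue)      # glue() for building label strings
library(ggrepel)   # geom_label_repel()
library(scales)    # percent_format(), col_numeric()
```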
2. Read in the data
Next, read in the data file and examine the data/variables.
ttd_data <- read_csv("data/time_to_degree_tutorial_data.csv",
show_col_types = FALSE)
head(ttd_data, 10)
# A tibble: 10 × 9
cohort_year major grad_year grad terms_to_degree seven_or_fewer eight
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2017 Major 44 2023 1 8 0 1
2 2014 Major 64 2020 1 8 0 1
3 2017 Major 47 2023 1 9 0 0
4 2017 Major 12 2023 1 8 0 1
5 2017 Major 60 2023 1 8 0 1
6 2018 Major 30 2024 0 NA 0 0
7 2016 Major 43 2022 0 NA 0 0
8 2017 Major 8 2023 0 NA 0 0
9 2018 Major 43 2024 0 NA 0 0
10 2016 Major 57 2022 0 NA 0 0
# ℹ 2 more variables: nine_or_ten <dbl>, eleven_or_twelve <dbl>
The data set contains 10,000 first-time, full-time (non-transfer) students from 75 majors. Each row of the data represents one student and includes a value for the following variables:
- cohort_year: The year the student first enrolled at GSU.
- major: The student’s first-declared major. Note that in this hypothetical data set, the majors are given a generic name such as Major 44 or Major 57.
- grad_year: The year a student would need to graduate by to be classified as graduating within 6 years.
- grad: A binary value indicating if a student graduated within 6 years (1 = yes; 0 = no).
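- terms_to_degree: The number of terms the student was enrolled before graduating. In the rows shown above, this is NA for students who did not graduate within 6 years.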
- seven_or_fewer: A binary value indicating if a student graduated in seven or fewer terms (1 = yes; 0 = no).
- eight: A binary value indicating if a student graduated in eight terms (1 = yes; 0 = no).
- nine_or_ten: A binary value indicating if a student graduated in nine or ten terms (1 = yes; 0 = no).
- eleven_or_twelve: A binary value indicating if a student graduated in eleven or twelve terms (1 = yes; 0 = no).
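To make the relationship between terms_to_degree and the four indicator columns concrete, the sketch below rebuilds the indicators from a toy terms_to_degree vector. How the indicators were actually derived in the example data is not stated, so this is an assumption that is merely consistent with the rows printed above; toy_students and toy_flags are hypothetical names.

```r
library(dplyr)

# Toy data: four graduates and one non-graduate (NA terms to degree)
toy_students <- tibble(terms_to_degree = c(7, 8, 9, 12, NA))

# Assumed derivation: each flag is 1 when terms_to_degree falls in its category,
# and 0 otherwise (including for non-graduates, whose terms_to_degree is NA)
toy_flags <- toy_students %>%
  mutate(
    seven_or_fewer   = coalesce(as.numeric(terms_to_degree <= 7), 0),
    eight            = coalesce(as.numeric(terms_to_degree == 8), 0),
    nine_or_ten      = coalesce(as.numeric(terms_to_degree %in% 9:10), 0),
    eleven_or_twelve = coalesce(as.numeric(terms_to_degree %in% 11:12), 0)
  )

toy_flags
```

Under this reading, each graduate has exactly one indicator set to 1, which is why the category counts can later be divided by the number of graduates to get percentages.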
3. Determine the top 20 majors
The raw data are at the student level, but we want to aggregate the outcome variables, such as time to degree, at the major level. This is achieved using the group_by() function along with summarise() and mutate() from the {dplyr} package. The final function in the pipe, arrange(), orders the aggregated data by the number of students in each first-declared major, from most to least. Next, the data frame is subset to include only the 20 majors with the most students, which together account for 8,087 students. The other 55 majors have a total of only 1,913 students, so the top 20 majors have over four times as many students as the bottom 55. Focusing evidence-based practices on the time to degree and graduation rate of this large group of students is therefore likely to have the greatest effect on student success.
# Create a new data frame grouped by major and calculate new variables
ttd_by_major <- ttd_data %>%
group_by(major) %>%
summarise(grad_rate = mean(grad)*100,
terms_to_degree = mean(terms_to_degree, na.rm = TRUE),
count = n(),
count_grad = sum(grad),
count_seven_or_fewer = sum(seven_or_fewer, na.rm = TRUE),
count_eight = sum(eight, na.rm = TRUE),
count_nine_or_ten = sum(nine_or_ten, na.rm = TRUE),
count_eleven_or_twelve = sum(eleven_or_twelve, na.rm = TRUE)) %>%
mutate(percent_seven_or_fewer = round(count_seven_or_fewer/count_grad*100, 2),
percent_eight = round(count_eight/count_grad*100, 2),
percent_nine_or_ten = round(count_nine_or_ten/count_grad*100, 2),
percent_eleven_or_twelve = round(count_eleven_or_twelve/count_grad*100, 2)) %>%
arrange(-count)
# Subset the data frame to only include the top 20 and assign to a new variable
ttd_by_major_top20 <- ttd_by_major[1:20,]
# View the data frame
ttd_by_major_top20
# A tibble: 20 × 13
major grad_rate terms_to_degree count count_grad count_seven_or_fewer
<chr> <dbl> <dbl> <int> <dbl> <dbl>
1 Major 8 55.7 8.59 1272 708 96
2 Major 12 50.6 8.75 1173 594 46
3 Major 62 49.7 8.95 503 250 36
4 Major 46 52.8 8.57 487 257 20
5 Major 13 42.2 8.65 434 183 18
6 Major 19 48.5 8.40 431 209 31
7 Major 64 60.2 8.10 394 237 56
8 Major 60 56.2 8.48 384 216 32
9 Major 53 56.4 8.32 376 212 40
10 Major 31 53.5 8.48 357 191 23
11 Major 49 47.4 8.59 327 155 25
12 Major 11 46.6 8.70 262 122 22
13 Major 44 60.4 8.38 245 148 23
14 Major 6 48.5 8.71 229 111 10
15 Major 14 66.5 7.92 215 143 32
16 Major 47 45.1 8.58 215 97 16
17 Major 58 61.9 8.34 215 133 18
18 Major 54 48.8 8.10 209 102 22
19 Major 43 53.9 8.02 180 97 25
20 Major 48 46.9 8.30 179 84 13
# ℹ 7 more variables: count_eight <dbl>, count_nine_or_ten <dbl>,
# count_eleven_or_twelve <dbl>, percent_seven_or_fewer <dbl>,
# percent_eight <dbl>, percent_nine_or_ten <dbl>,
# percent_eleven_or_twelve <dbl>
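The student totals quoted earlier can be checked directly against the count column printed above. The sketch below copies those 20 counts by hand into a vector (top20_counts is a hypothetical name) and uses the stated total of 10,000 students.

```r
# Counts copied from the `count` column of the printed top-20 table
top20_counts <- c(1272, 1173, 503, 487, 434, 431, 394, 384, 376, 357,
                  327, 262, 245, 229, 215, 215, 215, 209, 180, 179)

total_students <- 10000                       # stated size of the data set
top20_total <- sum(top20_counts)              # students in the top 20 majors
other_total <- total_students - top20_total   # students in the other 55 majors
ratio <- top20_total / other_total            # top 20 relative to the rest

c(top20_total = top20_total, other_total = other_total, ratio = round(ratio, 2))
```

With the full data frame loaded, the same check would be sum(ttd_by_major_top20$count).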
Creating a Table
1. Prep the data for the table
In the previous section, we created a data frame of the top 20 first-declared majors. Next, a table is made that displays the majors along with their student count, average terms to degree, and graduation rate. Since only specific variables from the data frame will be in the table, the select() function is used to get the variables needed. Additionally, the round() function is used inside mutate() to round the values for terms to degree and graduation rate, making them easier to read. We’ll also update the column names to be easier to read in the table.
# Create the data frame for the table
ttd_table <- ttd_by_major_top20 %>%
select(major, count, terms_to_degree, grad_rate) %>%
mutate(terms_to_degree = round(terms_to_degree, 1),
grad_rate = round(grad_rate, 1)) %>%
arrange(-count)
# Update names to look nice in the table header
names(ttd_table) <- c("Major", "Number of Students", "Terms to Degree", "Graduation Rate (%)")
2. Render the table
Next, the gt() function from the {gt} package is used to create the table. The majors will be arranged from top to bottom based on the number of students, but we also want to draw attention to the values for terms to degree. To accomplish that, color-coding is added using the data_color() function. Cells with longer terms to degree have a darker blue color, while cells with shorter terms to degree have a lighter blue color.
ttd_table %>%
gt() %>%
cols_align(
align = "center",
columns = c("Number of Students", "Terms to Degree", "Graduation Rate (%)")
) %>%
tab_header(
title = "Top 20 First Declared Majors by Number of Students, 2019-2024"
) %>%
data_color(
columns = c("Terms to Degree"),
colors = scales::col_numeric(
c("white", "#0554A3", "#2B3555"),
domain = c(7.7, 9.1)
)
) %>%
tab_footnote("NOTE: Graduation Rate is the 6-year rate.")
Top 20 First Declared Majors by Number of Students, 2019-2024
Major | Number of Students | Terms to Degree | Graduation Rate (%) |
---|---|---|---|
Major 8 | 1272 | 8.6 | 55.7 |
Major 12 | 1173 | 8.8 | 50.6 |
Major 62 | 503 | 9.0 | 49.7 |
Major 46 | 487 | 8.6 | 52.8 |
Major 13 | 434 | 8.7 | 42.2 |
Major 19 | 431 | 8.4 | 48.5 |
Major 64 | 394 | 8.1 | 60.2 |
Major 60 | 384 | 8.5 | 56.2 |
Major 53 | 376 | 8.3 | 56.4 |
Major 31 | 357 | 8.5 | 53.5 |
Major 49 | 327 | 8.6 | 47.4 |
Major 11 | 262 | 8.7 | 46.6 |
Major 44 | 245 | 8.4 | 60.4 |
Major 6 | 229 | 8.7 | 48.5 |
Major 14 | 215 | 7.9 | 66.5 |
Major 47 | 215 | 8.6 | 45.1 |
Major 58 | 215 | 8.3 | 61.9 |
Major 54 | 209 | 8.1 | 48.8 |
Major 43 | 180 | 8.0 | 53.9 |
Major 48 | 179 | 8.3 | 46.9 |
NOTE: Graduation Rate is the 6-year rate.
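To see how the color scale turns a terms-to-degree value into a cell color, the palette function can be called on its own. This is a minimal sketch that reuses the palette and domain from the data_color() call; blue_scale is a hypothetical name.

```r
library(scales)

# Same palette and domain as in the data_color() call
blue_scale <- col_numeric(
  palette = c("white", "#0554A3", "#2B3555"),
  domain = c(7.7, 9.1)
)

# Hex colors for the low end, a mid-range value, and the high end of the domain
blue_scale(c(7.7, 8.4, 9.1))
```

Values outside the domain are not mapped to a color, so the domain should cover the full range of the Terms to Degree column (7.9 to 9.0 in the table above).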
Visualizations
So far, we’ve identified the 20 most popular first-declared majors and determined two key performance indicators for each major: (1) the average terms to degree and (2) the graduation rate. Next, we will create three figures to help uncover patterns in the data and effectively communicate these insights to leadership.
The first figure is a bubble plot with quadrants. A bubble plot is a type of scatter plot that is used to display 3 variables (one variable on the x-axis, one variable on the y-axis, and one variable that corresponds to the size of the points/bubbles). Quadrants are used to help identify which of the first-declared majors have a relatively longer time to degree and a lower graduation rate.
The second figure is a stacked bar chart, which is used for displaying two categorical variables. In our case, one variable is the major and the other is the time to degree category. This chart will help determine how many students within each major graduate within different time to degree categories.
The third figure is a dumbbell plot, which we will use to show the change in terms to degree between two different graduation years. It will help determine which majors are trending toward a faster time to degree and which are trending toward a slower time to degree.
Creating a Bubble Plot
1. Prep the data for the plot
The plot should have quadrants of equal sizes, so we’ll center the x and y axes around the averages of the x and y variables. To accomplish this, it will be helpful to create a few new variables that will be used within the scaling and annotation functions in the next step.
# Assigning the X axis mean, max, and min to new variables.
# Will use these in the ifelse statement below and in a ggplot layer to set
# the X axis scale limits
x_mean <- mean(ttd_by_major_top20$terms_to_degree)
x_max <- max(ttd_by_major_top20$terms_to_degree)
x_min <- min(ttd_by_major_top20$terms_to_degree)
# Use ifelse statement to assign the range.
# This lets the entire range of values show on the X axis and also
# keeps the mean X value as the center of the X axis
x_range <- ifelse((x_max-x_mean) > (x_mean-x_min), (x_max-x_mean), (x_mean-x_min))
# Assigning the Y axis mean, max, and min to new variables.
# Will use these in the ifelse statement below and in a ggplot layer to set
# the Y axis scale limits
y_mean <- mean(ttd_by_major_top20$grad_rate)
y_max <- max(ttd_by_major_top20$grad_rate)
y_min <- min(ttd_by_major_top20$grad_rate)
# Use ifelse statement to assign the range.
# This lets the entire range of values show on the Y axis and also
# keeps the mean Y value as the center of the Y axis
y_range <- ifelse((y_max-y_mean) > (y_mean-y_min), (y_max-y_mean), (y_mean-y_min))
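The centering logic above can also be expressed as a small helper, which may be easier to reuse for both axes. This is only a sketch: centered_limits() is a hypothetical function, not part of the tutorial or of any package, and the pad argument stands in for the extra 0.1 added to the x-axis limits later.

```r
# Hypothetical helper: axis limits that are symmetric around the mean of x
centered_limits <- function(x, pad = 0) {
  center <- mean(x, na.rm = TRUE)
  # Larger of the two distances from the mean to the min and to the max
  half_width <- max(abs(range(x, na.rm = TRUE) - center))
  c(center - half_width - pad, center + half_width + pad)
}

# Toy example: the mean is 8.7 and the farthest value (9.6) is 0.9 away,
# so the limits run from 8.7 - 0.9 to 8.7 + 0.9
centered_limits(c(8, 8.5, 9.6))
```

Because the half-width is the larger of the two distances, every point stays inside the limits while the mean sits exactly in the middle, which is what makes the four quadrants equal in size.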
2. Render the bubble plot
Use the ggplot() function along with various geom layers to create the quadrant plot, as outlined in the steps below the code chunk.
ggplot(data = ttd_by_major_top20, mapping = aes(x = terms_to_degree, y = grad_rate)) + # (1)
theme_classic() + # (2)
theme(
plot.title = element_text(size = 16, face = "bold", color = "black", margin = margin(b = 15), hjust = 0.5),
plot.subtitle = element_text(size = 12, color = "black", margin = margin(b = 15), hjust = 0.5),
axis.title.x = element_text(size = 14, face = "bold", color = "black", margin = margin(t = 10)),
axis.title.y = element_text(size = 14, face = "bold", color = "black", margin = margin(r = 15)),
axis.text = element_text(size = 12, color = "black"),
axis.line = element_line(linewidth = 1),
axis.ticks = element_line(linewidth = 1),
legend.position = "top",
legend.title = element_text(face = "bold")
) +
labs( # (3)
title = "Terms to Degree and Graduation Rate\nfor the Top 20 First Declared Majors, 2019-2024",
subtitle = "Label: Major (Average Graduation Rate; Average Terms to Degree)",
x = "Average Terms to Degree",
y = "Average 6-Year Graduation Rate",
size = "Number of Students:"
) +
geom_vline( # (4)
aes(xintercept = mean(terms_to_degree)),
color = "#58595B",
linetype = "dashed",
linewidth = 0.65
) +
annotate(
"text",
x = x_mean,
y = 40,
label = glue("Avg: {round(x_mean, 2)}"),
angle = 90,
vjust = -0.5,
hjust = 0.5
) +
geom_hline( # (5)
aes(yintercept = mean(grad_rate)),
color = "#58595B",
linetype = "dashed",
linewidth = 0.65
) +
annotate(
"text",
x = 7.89,
y = y_mean,
label = glue("Avg: {round(y_mean, 1)}%"),
vjust = -0.5,
hjust = 0.6
) +
geom_point( # (6)
aes(size = count),
shape = 21,
color = "white",
fill = "#0554A3"
) +
geom_label_repel( # (7)
aes(label = glue("{major} ({round(grad_rate, 1)}%; {round(terms_to_degree, 2)})")),
max.overlaps = 50,
min.segment.length = 0,
size = 3.5
) +
scale_size(range = c(1, 18)) +
scale_x_continuous( # (8)
limits = c((x_mean - x_range) - 0.1, (x_mean + x_range) + 0.1),
breaks = seq(8, 9.5, 0.5)
) +
scale_y_continuous(
labels = percent_format(scale = 1),
limits = c((y_mean - y_range), (y_mean + y_range))
)
- (1) Define the data used for the plot and set the mapping aesthetics (i.e., terms to degree on the x-axis and graduation rate on the y-axis).
- (2) Customize the theme.
- (3) Add the title and axis labels.
- (4) Add a vertical dashed line with an x-axis intercept equal to the average terms to degree and include a text annotation of the average.
- (5) Add a horizontal dashed line with a y-axis intercept equal to the average graduation rate and include a text annotation of the average.
- (6) Add the points, with their size set equal to the count variable, so majors with more students will have larger points.
- (7) Add labels to the points that include the major, the graduation rate, and the average terms to degree.
- (8) Adjust the x and y scales using the variables we created in the previous step.
Creating a Stacked Bar Plot
In this section, we will create a stacked bar plot to show the distribution of the time to degree categories for each major.
1. Prep the data for the bar plot
First, start with the ‘ttd_by_major_top20’ data frame, select the relevant columns, and pivot the data from wide to long format using the pivot_longer() function from the {tidyr} package. The resulting ‘long’ form data frame will now have a column called ‘term_percent’ containing all the percentage data, along with a corresponding column called ‘term_cat’ that indicates the term category (e.g., percent_eight).
Next, because we want the majors to be plotted in a certain order, based on the percentage of students graduating in seven or fewer terms, create a character vector with the majors arranged by this percentage. Then use the factor() function with the ‘levels’ argument set to the order we just defined, and assign these factor levels to the major variable in the data frame created in the last step.
Finally, assign a specific order to the ‘term_cat’ factor levels to ensure the categories will plot from the shortest to the longest from left to right.
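The code for these prep steps is not shown here, so the following is a reconstruction from the description rather than the original chunk. It runs on a small made-up stand-in for ‘ttd_by_major_top20’; major_order is a hypothetical name, and the direction of the major ordering (ascending) is an assumption.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for ttd_by_major_top20 (values are made up for illustration)
ttd_by_major_top20 <- tibble(
  major = c("Major A", "Major B", "Major C"),
  percent_seven_or_fewer = c(10, 25, 15),
  percent_eight = c(50, 45, 40),
  percent_nine_or_ten = c(30, 20, 35),
  percent_eleven_or_twelve = c(10, 10, 10)
)

# Select the relevant columns and pivot the percentages from wide to long
ttd_by_major_top20_long <- ttd_by_major_top20 %>%
  select(major, starts_with("percent_")) %>%
  pivot_longer(
    cols = starts_with("percent_"),
    names_to = "term_cat",
    values_to = "term_percent"
  )

# Character vector of majors ordered by the share graduating in seven or fewer terms
major_order <- ttd_by_major_top20 %>%
  arrange(percent_seven_or_fewer) %>%
  pull(major)

# Apply the major order, then order the categories from shortest to longest
ttd_by_major_top20_long <- ttd_by_major_top20_long %>%
  mutate(
    major = factor(major, levels = major_order),
    term_cat = factor(
      term_cat,
      levels = c("percent_seven_or_fewer", "percent_eight",
                 "percent_nine_or_ten", "percent_eleven_or_twelve")
    )
  )
```

With the real data frame, the toy tibble would simply be dropped and the rest of the pipeline applied to the existing ‘ttd_by_major_top20’ object.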
2. Render the stacked bar plot
Now that the data are in the right format, create the plot.
ggplot( # (1)
data = ttd_by_major_top20_long,
mapping = aes(x = term_percent, y = major, fill = fct_rev(term_cat), label = round(term_percent))
) +
theme_classic() +
theme( # (2)
plot.title = element_text(size = 16, face = "bold", color = "black", margin = margin(b = 15), hjust = 0.5),
plot.caption = element_text(size = 10, color = "black", margin = margin(t = 15), hjust = 0.5),
axis.title = element_blank(),
axis.line = element_blank(),
axis.ticks = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_text(size = 12, color = "black", margin = margin(r = -20)),
legend.position = "top",
legend.title = element_text(size = 11, face = "bold", color = "black", margin = margin(r = 15))
) +
labs( # (3)
title = "Percentage of Students Within Each\nTerms to Degree Category, 2019-2024",
fill = "Terms to Degree:",
caption = "NOTE: The sum of the percents within a major may be greater than 100 due to rounding"
) +
guides(fill = guide_legend(reverse = TRUE)) + # (4)
geom_bar( # (5)
stat = "identity",
position = "stack"
) +
geom_text( # (6)
position = position_stack(vjust = 0.5),
color = "white",
size = 5
) +
scale_fill_manual(
values = c("#2B3555", "#0554A3", "#26A5CA", "#58595B"),
labels = c("percent_seven_or_fewer" = "7 or Fewer Terms ",
"percent_eight" = "8 Terms ",
"percent_nine_or_ten" = "9 or 10 Terms ",
"percent_eleven_or_twelve" = "11 or 12 Terms ") # (7)
)
- (1) Define the data for the plot and set the mapping aesthetics. Set the fill argument to the term_cat variable so each category has a different color, and the label argument to the term_percent value rounded to the nearest whole number.
- (2) Customize the theme.
- (3) Customize the title, caption, and legend title (using the fill argument inside the labs() function).
- (4) Reverse the order of the legend to match the order of the bars in the plot.
- (5) Add the geom_bar layer, setting the position argument to stack to create a stacked bar plot.
- (6) Add the text labels and set vjust to 0.5 so the labels appear in the middle of each bar.
- (7) Specify the colors for the fill and update the legend label text to be more readable.
Creating a Dumbbell Plot
The last plot we will create is a dumbbell plot showing the change in average terms to degree from 2019 to 2024 for each major.
1. Prep the data
The first step is to create a new data frame with an average terms to degree for each year and major. To do that, group by the major and year, and then calculate the terms to degree inside the summarise() function. Since only the top 20 majors and the years 2019 and 2024 are needed, filter the data frame to only include those majors and years. Finally, select the appropriate columns.
Next, pivot the data frame to wide format using the pivot_wider() function and rename the years. There are a few majors that don’t have data for 2019 or 2024, so filter those out and then calculate the difference, or change, from 2019 to 2024.
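The prep code for this step is not shown here, so the sketch below is a reconstruction from the description, run on made-up data. It assumes that "year" refers to the grad_year column and that the year columns are renamed through the names_prefix argument of pivot_wider(); top_majors is a hypothetical stand-in for the top-20 major names.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for ttd_data (values are made up for illustration)
ttd_data <- tibble(
  major = c("Major A", "Major A", "Major A", "Major A", "Major B", "Major B", "Major C"),
  grad_year = c(2019, 2019, 2024, 2024, 2019, 2024, 2019),
  terms_to_degree = c(9, 8, 8, NA, 10, 8, 9)
)

# In the tutorial this would be ttd_by_major_top20$major
top_majors <- c("Major A", "Major B", "Major C")

ttd_clean_by_major_year_wide <- ttd_data %>%
  # Average terms to degree for each major and year
  group_by(major, grad_year) %>%
  summarise(terms_to_degree = mean(terms_to_degree, na.rm = TRUE), .groups = "drop") %>%
  # Keep only the top majors and the two years being compared
  filter(major %in% top_majors, grad_year %in% c(2019, 2024)) %>%
  select(major, grad_year, terms_to_degree) %>%
  # One column per year, named year_2019 and year_2024
  pivot_wider(
    names_from = grad_year,
    values_from = terms_to_degree,
    names_prefix = "year_"
  ) %>%
  # Drop majors missing either year, then compute the change
  filter(!is.na(year_2019), !is.na(year_2024)) %>%
  mutate(ttd_diff = year_2024 - year_2019)

ttd_clean_by_major_year_wide
```

In this toy example "Major C" has no 2024 data and is dropped, and a negative ttd_diff means the average terms to degree fell between 2019 and 2024.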
2. Render the plot
Finally, create the plot.
ggplot( # (1)
data = ttd_clean_by_major_year_wide,
mapping = aes(x = year_2019, y = reorder(major, -year_2024), xend = year_2024+0.04, yend = major)
) +
theme_classic() +
theme( # (2)
plot.title = element_text(size = 16, face = "bold", color = "black", margin = margin(b = 15), hjust = 0.5),
axis.title.x = element_text(size = 14, face = "bold", color = "black", margin = margin(t = 10)),
axis.title.y = element_blank(),
axis.text = element_text(size = 12, color = "black"),
axis.line = element_line(linewidth = 1),
axis.ticks = element_line(linewidth = 1)
) +
labs( # (3)
title = "Change in Terms to Degree from 2019 to 2024",
x = "Average Terms to Degree"
) +
geom_segment( # (4)
color = "gray",
linewidth = 1,
arrow = arrow(type = "closed", length = unit(0.2, "cm"))
) +
geom_point( # (5)
color = "#2B3555",
size = 6
) +
geom_point( # (6)
inherit.aes = FALSE,
aes(x = year_2024, y = major),
color = "#26A5CA",
size = 6
) +
geom_text( # (7)
inherit.aes = FALSE,
aes(x = year_2024, y = reorder(major, -year_2024), label = round(ttd_diff, 1)),
color = "#26A5CA",
fontface = "bold",
size = 3.5,
nudge_x = -0.1
) +
annotate( # (8)
"text",
x = 7.36,
y = "Major 43",
label = "2024",
vjust = -1.5,
color = "#26A5CA",
size = 4,
fontface = "bold"
) +
annotate(
"text",
x = 8.31,
y = "Major 43",
label = "2019",
vjust = -1.5,
color = "#2B3555",
size = 4,
fontface = "bold"
) +
scale_x_continuous( # (9)
limits = c(7.25, 9.6),
breaks = seq(7.5, 9.5, 0.5)
) +
scale_y_discrete(expand = expansion(mult = c(0.05, 0.1)))
- (1) Define the data and the mapping aesthetics for the plot. Specify the xend and yend of the segment line that connects the 2019 and 2024 points. Because the segment ends in an arrow that should terminate just before the 2024 point, add 0.04 to the 2024 value (how much to add or subtract depends on the specific data, so this takes trial and error until it looks good). If any majors had an increase in terms to degree from 2019 to 2024, they would need to be plotted in separate geom layers with separate aesthetics.
- (2) Customize the theme.
- (3) Customize the plot title and x-axis title.
- (4) Add the geom_segment layer, using the arrow argument to create an arrow at the end of the segment.
- (5) Add the geom_point layer for the 2019 points.
- (6) Add the geom_point layer for the 2024 points.
- (7) Add text labels, positioning them to the left of the 2024 points by setting nudge_x to -0.1.
- (8) Add the text annotations.
- (9) Customize the x and y scales.