HOPE Tutorial
Identifying Student Success Insights Through Data Analysis
Part 2: HOPE Scholarship Diagnostic Tutorial
Introduction
The HOPE scholarship is awarded to Georgia residents who demonstrate academic excellence and offers financial support to students by covering part of their tuition at eligible higher education institutions. However, many students who begin their studies with the HOPE scholarship eventually lose it. Moreover, students who lose the scholarship graduate at much lower rates compared to students who retain the scholarship. This pattern highlights the importance of the HOPE scholarship as an “access” scholarship, providing key financial support that helps students stay enrolled and graduate. Analyzing data to determine (1) how many students lose the scholarship at different credit hour checkpoints, (2) gaps in scholarship retention between different student groups, and (3) differences in graduation rates between HOPE and non-HOPE students can provide valuable insights into this key student success area.
This tutorial will guide you through the creation of multiple data visualizations that provide an overview of outcomes for HOPE scholarship students. An example report featuring these visualizations is available, along with the corresponding example data. While this tutorial focuses on HOPE scholarship data, similar analyses can be applied to a wide range of scholarship programs.
Initial Steps
Before creating the visualizations, load the necessary packages and data.
1. Load packages
2. Load data
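The two steps above might look like the following minimal sketch. The package list is inferred from the functions used later in this tutorial, and the file name `hope_example_data.csv` is a hypothetical placeholder, not the actual name of the example data file. If the file is not found, the sketch falls back to only the ten preview rows printed below so that the later code can still be tried out.

```r
# Packages inferred from the functions used in this tutorial (an assumption).
library(dplyr)    # group_by(), summarise(), mutate(), across(), tally()
library(tidyr)    # pivot_longer()
library(ggplot2)  # plotting functions
library(glue)     # glue() for building text labels

# Hypothetical file name: replace with the path to your copy of the data.
data_path <- "hope_example_data.csv"

if (file.exists(data_path)) {
  hope_raw <- read.csv(data_path)
} else {
  # Fallback: the ten preview rows shown in this tutorial, not the full data.
  hope_raw <- data.frame(
    cohort = c("Fall 2016", "Fall 2015", "Fall 2015", "Fall 2016", "Fall 2014",
               "Fall 2015", "Fall 2017", "Fall 2018", "Fall 2014", "Fall 2013"),
    pell       = c(0, 1, 1, 0, 0, 1, 0, 1, 1, 0),
    hope_start = c(1, 1, 1, 0, 1, 1, 0, 0, 0, 1),
    hope_30    = c(0, 1, 0, 0, 1, 1, 0, 0, 0, 1),
    hope_60    = c(0, 0, 1, 0, 1, 1, 0, 0, NA, 1),
    hope_90    = c(NA, 0, 1, 0, 1, 1, NA, 1, NA, 1),
    grad       = c(0, 1, 1, 0, 1, 1, 0, 1, 0, 1)
  )
}

# Preview the first ten rows.
head(hope_raw, 10)
```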
      cohort pell hope_start hope_30 hope_60 hope_90 grad
1  Fall 2016    0          1       0       0      NA    0
2  Fall 2015    1          1       1       0       0    1
3  Fall 2015    1          1       0       1       1    1
4  Fall 2016    0          0       0       0       0    0
5  Fall 2014    0          1       1       1       1    1
6  Fall 2015    1          1       1       1       1    1
7  Fall 2017    0          0       0       0      NA    0
8  Fall 2018    1          0       0       0       1    1
9  Fall 2014    1          0       0      NA      NA    0
10 Fall 2013    0          1       1       1       1    1
The data set contains 10,000 first-time freshmen who enrolled in Fall terms from 2013 to 2018 (6-year graduation years of 2019 to 2024). Each row of the data represents one student and includes a value for the following variables:
- cohort: The fall term the student first enrolled at the institution.
- pell: A binary value indicating if the student was eligible to receive a Pell Award at the time of enrollment (1 = yes; 0 = no).
- hope_start: A binary value indicating if the student had the HOPE scholarship at the time of enrollment (1 = yes; 0 = no).
- hope_30: A binary value indicating if the student had the HOPE scholarship at the 30 credit hour checkpoint (1 = yes; 0 = no; NA = not enrolled).
- hope_60: A binary value indicating if the student had the HOPE scholarship at the 60 credit hour checkpoint (1 = yes; 0 = no; NA = not enrolled).
- hope_90: A binary value indicating if the student had the HOPE scholarship at the 90 credit hour checkpoint (1 = yes; 0 = no; NA = not enrolled).
- grad: A binary value indicating if a student graduated within 6 years (1 = yes; 0 = no).
Visualizations
We will create four figures that help identify patterns in the data and effectively communicate the insights to leadership.
The first figure is a donut plot. This type of figure is used to display the relative proportions of different categories. In our case, we will use it to show the number of students who enroll with the HOPE scholarship compared to those who enroll without it.
The second figure is a bar chart. We will use it to compare the number of students with the HOPE scholarship at different checkpoints to determine how many students maintain the scholarship over time.
The third figure is a line chart, which is useful for identifying trends over time. We will use it to examine potential equity gaps in HOPE scholarship retention across the credit hour checkpoints.
The fourth figure is a dumbbell plot, which will illustrate the difference in graduation rates between HOPE and non-HOPE students for each cohort.
Creating the Donut Plot
Clean the data
The first figure will show the proportion of students who enroll with the HOPE scholarship. To calculate those numbers, use the logic in the following code chunk. First, group the data by ‘hope_start’ and use tally() to count the number of students in each ‘hope_start’ group. Next, inside mutate(), format the counts with commas, calculate the percentage of each group relative to the total count, create a text label for each group that includes the formatted count and percentage, and calculate the vertical position of each label.
hope_donut <- hope_raw %>%
  group_by(hope_start) %>%
  tally() %>%
  mutate(n_pretty = prettyNum(n, big.mark = ",", scientific = FALSE),
         percent = round(n/sum(n)*100, 0),
         label = ifelse(hope_start == "1",
                        glue("HOPE\n{n_pretty}\n({percent}%)"),
                        glue("Non-HOPE\n{n_pretty}\n({percent}%)")),
         label_y_location = ifelse(hope_start == '0', sum(n)-(n/2), n/2))
Render the donut plot
Next, follow the steps outlined below to create the plot. An important component is coord_polar(), where setting theta = "y" will apply polar coordinates to the y-axis and transform the standard bar chart into a donut plot.
hope_donut %>%
  ggplot(aes(x = 1.5, y = n, fill = as.factor(hope_start))) +  # (1)
  theme_void() +  # (2)
  theme(plot.title = element_text(size = 13, color = "black", face = "bold",
                                  hjust = 0.5, margin = margin(t = 10, b = 5))) +
  labs(title = "HOPE Status Distribution in First Fall Term\nAmong First-Time Freshmen, 2013-2018 Entering Cohorts") +  # (3)
  geom_bar(stat = "identity", color = "white", width = 0.8) +  # (4)
  coord_polar(theta = "y") +  # (5)
  xlim(c(0, 2.5)) +  # (6)
  scale_fill_manual(values = c("#58595B", "#0554A3")) +  # (7)
  guides(fill = "none") +
  geom_text(  # (8)
    aes(x = 2.5, y = label_y_location, label = label),
    fontface = "bold",
    color = c("#58595B", "#0554A3"),
    size = 4
  )
1. Initialize the ggplot object using the ‘hope_donut’ data and set the mapping aesthetics.
2. Use theme_void() to remove most of the non-data figure pieces for a clean look, and then customize the plot title with theme().
3. Add the title text.
4. Add the bar with white borders and a specific width. Use stat = "identity" to make the bar height proportional to the value of the y aesthetic (in this case the value of ‘n’).
5. Convert the bar chart into a donut plot by applying polar coordinates with the y-axis as the angular axis.
6. Set the x-axis limits.
7. Manually set the fill colors for the two ‘hope_start’ categories.
8. Add text labels to the plot using the y location we calculated in the previous step and with colors that match the fill colors of the bar.
Creating the Bar Chart
Clean the data
The next figure will show the number of students with the HOPE scholarship at different checkpoints. Inside summarise(), use across() to sum the occurrences where a column value equals 1 for each column that starts with ‘hope_’, and then transform the data frame to long format. The resulting data frame will have one column with four time points and one column with counts indicating the number of students with the HOPE scholarship at each time point.
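As an illustration, the summarise-and-pivot step described above could look like the sketch below. The column names ‘time’ and ‘count’ match those used in the plotting code later in this section, but the exact calls (including na.rm = TRUE to skip students who are no longer enrolled) are assumptions, and the toy ‘hope_raw’ here contains only the ten preview rows shown earlier, not the full data set.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for hope_raw: the ten preview rows from the tutorial.
hope_raw <- data.frame(
  pell       = c(0, 1, 1, 0, 0, 1, 0, 1, 1, 0),
  hope_start = c(1, 1, 1, 0, 1, 1, 0, 0, 0, 1),
  hope_30    = c(0, 1, 0, 0, 1, 1, 0, 0, 0, 1),
  hope_60    = c(0, 0, 1, 0, 1, 1, 0, 0, NA, 1),
  hope_90    = c(NA, 0, 1, 0, 1, 1, NA, 1, NA, 1)
)

hope_bar <- hope_raw %>%
  # Count the rows equal to 1 in every column whose name starts with "hope_";
  # na.rm = TRUE ignores NA values (students not enrolled at that checkpoint).
  summarise(across(starts_with("hope_"), ~ sum(.x == 1, na.rm = TRUE))) %>%
  # Reshape from one wide row to one row per time point.
  pivot_longer(cols = everything(), names_to = "time", values_to = "count")

hope_bar
```

The result has one row per time point, in the same order as the original columns, which is what the positional labels assigned in the next step rely on.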
Next, inside mutate(), calculate the number and the percentage of students who have lost the scholarship at each checkpoint (relative to the starting count), and create a new factor column with ordered labels for each time point.
hope_bar <- hope_bar %>%
  mutate(running_loss = count[1] - count,
         running_loss_percent = round(running_loss/count[1]*100),
         checkpoint = factor(c("Start with\n HOPE", "HOPE at\n30 Credits", "HOPE at\n60 Credits", "HOPE at\n90 Credits"),
                             levels = c("Start with\n HOPE", "HOPE at\n30 Credits", "HOPE at\n60 Credits", "HOPE at\n90 Credits")))
Render the bar chart
Follow the steps below to create the plot.
hope_bar %>%
  ggplot(mapping = aes(x = checkpoint, y = count, label = prettyNum(count, big.mark = ",", scientific = FALSE))) +  # (1)
  theme_void() +  # (2)
  theme(
    plot.title = element_text(size = 13, color = "black", face = "bold", hjust = 0.5, margin = margin(t = 10, b = 10)),
    axis.text.x = element_text(size = 12, face = "bold", color = "black", margin = margin(t = -5))
  ) +
  labs(title = "HOPE Counts at Checkpoints Among First-Time Freshmen,\n 2013-2018 Entering Cohorts") +  # (3)
  geom_bar(  # (4)
    fill = "#0554A3",
    stat = "identity",
    width = 0.8
  ) +
  geom_text(  # (5)
    color = "white",
    vjust = 2,
    size = 5,
    fontface = "bold"
  ) +
  geom_text(  # (6)
    data = hope_bar %>% filter(time != "hope_start"),
    aes(x = checkpoint, y = count, label = glue("(-{running_loss_percent}%)")),
    color = "white",
    vjust = 4.5,
    size = 4,
    fontface = "bold"
  )
1. Initialize the ggplot object using the ‘hope_bar’ data and set the mapping aesthetics. In the donut plot example above, the count labels were formatted with commas inside the data frame. Here, we’ll format the labels directly in the aesthetics by setting the label equal to prettyNum(count, big.mark = ",", scientific = FALSE).
2. Apply a clean theme using theme_void() and then customize the plot’s title and x-axis text using theme().
3. Add the plot title text.
4. Add bars with a specific fill color and width.
5. Add text labels that show the counts. Use vjust = 2 to position the labels inside the bars.
6. Add text labels that show the percent loss. We don’t want to display a percent loss at the first time point, so filter out the ‘hope_start’ row. Use glue() to concatenate the label components (parentheses, negative sign, percent, and percent symbol).
Creating the Line Chart
The third figure will show HOPE scholarship retention across the three credit hour checkpoints among two student groups: Pell-eligible and non-Pell eligible students. If additional data are available at your institution, comparisons can also be made for other groups, such as first-generation status, race, ethnicity, gender, and so forth.
Clean the data
First, group the data by ‘pell’ and then use across() with sum() to count the number of students with HOPE at each time point (i.e., each column that starts with ‘hope_’). Since percentages are more effective for comparing the two groups, the next step is to calculate the percentage of students with HOPE at each time point. Use across() inside of mutate() to do this, and create new column names that start with ‘percent_’ using the ‘names’ argument. Because we initially grouped by ‘pell’, the resulting data frame will contain both the counts and the percentages for Pell and non-Pell students at each of the four time points.
Second, select the ‘pell’ column along with all columns that start with ‘percent’ from the ‘hope_by_pell’ data frame. Then, use pivot_longer() to transform the data frame from a wide to a long format.
Third, convert the ‘checkpoint’ and ‘pell’ columns to factors with specific levels and labels, which will be used in the figure.
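The first two steps above could be sketched as follows. The data frame names ‘hope_by_pell’ and ‘hope_by_pell_long’ and the columns ‘checkpoint’ and ‘percent’ come from this tutorial, but the exact calls are assumptions. In particular, this sketch assumes each percentage is taken relative to the group’s starting HOPE count (so the first time point is 100), and it uses only the ten preview rows as a toy ‘hope_raw’.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for hope_raw: the ten preview rows from the tutorial.
hope_raw <- data.frame(
  pell       = c(0, 1, 1, 0, 0, 1, 0, 1, 1, 0),
  hope_start = c(1, 1, 1, 0, 1, 1, 0, 0, 0, 1),
  hope_30    = c(0, 1, 0, 0, 1, 1, 0, 0, 0, 1),
  hope_60    = c(0, 0, 1, 0, 1, 1, 0, 0, NA, 1),
  hope_90    = c(NA, 0, 1, 0, 1, 1, NA, 1, NA, 1)
)

# Step 1: counts per Pell group, then percentages relative to the starting count.
hope_by_pell <- hope_raw %>%
  group_by(pell) %>%
  summarise(across(starts_with("hope_"), ~ sum(.x, na.rm = TRUE))) %>%
  mutate(across(starts_with("hope_"),
                ~ round(.x / hope_start * 100),
                .names = "percent_{.col}"))

# Step 2: keep the group and percent columns, then reshape to long format.
hope_by_pell_long <- hope_by_pell %>%
  select(pell, starts_with("percent")) %>%
  pivot_longer(cols = starts_with("percent"),
               names_to = "checkpoint",
               values_to = "percent")

hope_by_pell_long
```

Because `.names` writes the percentages to new columns, the original ‘hope_start’ count stays intact and can serve as the denominator for every checkpoint.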
hope_by_pell_long$checkpoint <- factor(hope_by_pell_long$checkpoint,
                                       levels = c("percent_hope_start", "percent_hope_30", "percent_hope_60", "percent_hope_90"),
                                       labels = c("All HOPE\nAwardees at\nFirst Term", "HOPE at\n30 Credits", "HOPE at\n60 Credits", "HOPE at\n90 Credits"))
hope_by_pell_long$pell <- factor(hope_by_pell_long$pell,
                                 levels = c(1, 0),
                                 labels = c("Pell", "non-Pell"))
Lastly, we’ll create a custom theme to use in the plot. This approach can be helpful if a figure type will be used multiple times, as it avoids repeating the theme code each time the figure is rendered. For example, if data for other student groups are available to examine additional equity gaps in HOPE retention rates, the same theme can be reused across all plots.
custom_plot_theme <- function() {
  theme_classic() +
    theme(
      plot.title = element_text(size = 14, face = "bold", color = "black", hjust = 0.5),
      axis.title.x = element_blank(),
      axis.title.y = element_text(size = 12, face = "bold", color = "black", margin = margin(r = 15)),
      axis.text.x = element_text(size = 12, color = "black", margin = margin(t = 5)),
      axis.text.y = element_text(size = 12, color = "black"),
      axis.line = element_line(linewidth = 1),
      axis.ticks.x = element_blank(),
      axis.ticks.y = element_line(linewidth = 1),
      legend.title = element_blank(),
      legend.position = "top",
      legend.text = element_text(size = 10)
    )
}
Render the line plot
Now we are ready to create the plot using the steps below.
hope_by_pell_long %>%
  ggplot(aes(x = checkpoint, y = percent, label = percent, group = pell, fill = pell, color = pell)) +  # (1)
  custom_plot_theme() +  # (2)
  labs(  # (3)
    title = "Percentage of Pell and non-Pell Students who Retain HOPE,\n2013-2018 Entering Cohorts",
    y = "Percent of Students"
  ) +
  geom_line(linewidth = 1) +  # (4)
  geom_label(  # (5)
    color = "white",
    size = 5
  ) +
  scale_y_continuous(  # (6)
    limits = c(40, 102),
    breaks = seq(40, 100, 20),
    labels = function(x) paste0(x, "%")
  ) +
  scale_fill_manual(values = c("#0554A3", "#58595B")) +  # (7)
  scale_color_manual(values = c("#0554A3", "#58595B")) +  # (8)
  guides(fill = "none", color = guide_legend(override.aes = list(linewidth = 3)))  # (9)
1. Initialize the ggplot object using the ‘hope_by_pell_long’ data and set the mapping aesthetics. Be sure to set the group, fill, and color arguments all equal to ‘pell’.
2. Apply the custom theme function that was defined in the last section.
3. Add the plot title and y-axis text.
4. Add lines to the plot with a specific width.
5. Add text labels that show the percentages.
6. Customize the y-axis scale. Because the label box at the first time point has a y-value of 100, set the max limit to 102 to make sure the label box doesn’t get cut off. Additionally, breaks = seq(40, 100, 20) defines where tick marks will appear, and labels = function(x) paste0(x, "%") formats the labels as percentages.
7. Manually set the fill colors for the pell groups.
8. Manually set the line colors for the pell groups.
9. Customize the legend. Since both color and fill are set to the same colors, we don’t need both in the legend, so use fill = "none" to remove the fill legend. Then use color = guide_legend(override.aes = list(linewidth = 3)) to increase the line width in the legend for better visibility.
Creating the Dumbbell Chart
The final figure will show the difference in graduation rates between HOPE and non-HOPE students for each cohort.
Clean the data
First, filter the data to include only students who are enrolled at the 60 credit hour checkpoint (column ‘hope_60’). This is done by excluding any rows where ‘hope_60’ is NA, because these represent students who are no longer enrolled. Next, group the data by ‘cohort’ and ‘hope_60’ and calculate the total number of students enrolled at this checkpoint, the number of these students who graduate, and the percentage of students who graduate (i.e., the graduation rate).
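The filtering and grouping described above could be sketched as follows. The names ‘hope_grad_60’ and ‘grad_rate’ match the plotting code in this section, while ‘n_students’ and ‘n_grad’ are hypothetical column names chosen for illustration. The toy ‘hope_raw’ again holds only the ten preview rows, so the resulting rates are not meaningful.

```r
library(dplyr)

# Toy stand-in for hope_raw: the ten preview rows from the tutorial.
hope_raw <- data.frame(
  cohort = c("Fall 2016", "Fall 2015", "Fall 2015", "Fall 2016", "Fall 2014",
             "Fall 2015", "Fall 2017", "Fall 2018", "Fall 2014", "Fall 2013"),
  hope_60 = c(0, 0, 1, 0, 1, 1, 0, 0, NA, 1),
  grad    = c(0, 1, 1, 0, 1, 1, 0, 1, 0, 1)
)

hope_grad_60 <- hope_raw %>%
  # Keep only students still enrolled at the 60 credit hour checkpoint.
  filter(!is.na(hope_60)) %>%
  group_by(cohort, hope_60) %>%
  summarise(
    n_students = n(),                              # enrolled at the checkpoint
    n_grad     = sum(grad),                        # of those, how many graduated
    grad_rate  = round(n_grad / n_students * 100), # graduation rate in percent
    .groups = "drop"
  )

hope_grad_60
```

With the full data set, each cohort should end up with two rows (HOPE and non-HOPE at 60 credits), which become the two ends of that cohort’s dumbbell.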
Render the dumbbell plot
Follow the steps below to create the final plot.
hope_grad_60 %>%
  ggplot(mapping = aes(x = cohort, y = grad_rate, group = cohort, color = factor(hope_60), label = grad_rate)) +  # (1)
  custom_plot_theme() +  # (2)
  theme(
    axis.title.x = element_text(size = 12, face = "bold", color = "black", margin = margin(t = 10))
  ) +
  labs(  # (3)
    title = "Six-Year Graduation Rates by HOPE Status at 60 Credit Hours",
    x = "Graduation Year",
    y = "Graduation Rate"
  ) +
  geom_line(  # (4)
    color = "gray70",
    linewidth = 1
  ) +
  geom_point(size = 9) +  # (5)
  geom_text(  # (6)
    color = "white",
    size = 4
  ) +
  scale_color_manual(  # (7)
    values = c("#58595B", "#0554A3"),
    labels = c("Non-HOPE", "HOPE")
  ) +
  scale_x_discrete(labels = seq(2019, 2024, by = 1)) +  # (8)
  scale_y_continuous(  # (9)
    limits = c(20, 100),
    breaks = seq(20, 100, 20),
    labels = function(x) paste0(x, "%")
  ) +
  guides(color = guide_legend(reverse = TRUE, override.aes = list(size = 4)))  # (10)
1. Initialize the ggplot object using the ‘hope_grad_60’ data and set the mapping aesthetics. Because we are creating a dumbbell for each cohort, setting group = cohort will ensure each line connects the two graduation rate points within a cohort.
2. Apply the custom theme function and adjust the formatting of the x-axis title.
3. Add text for the plot title and axis titles.
4. Add lines with a light gray color and a specific width.
5. Add points to represent the graduation rates.
6. Add text labels to show the graduation rates.
7. Manually set the colors for the points.
8. Update the x-axis text labels to reflect the 6-year graduation year.
9. Customize the y-axis scale. Use breaks = seq(20, 100, 20) to display tick marks every 20 percent from 20 to 100, and use labels = function(x) paste0(x, "%") to format the labels as percentages.
10. Reverse the legend order to list the HOPE group first, and increase the size of the legend points for better visibility.