Creating data visualisation beyond default
The purpose of this take-home exercise, is to apply skills that were acquired in Lesson 1 and Hands-on Exercise 1 to reveal the demographic of the city of Engagement, Ohio USA by using appropriate static statistical graphics methods. The data should be processed by using appropriate tidyverse family of packages and the statistical graphics must be prepared using ggplot2 and its extensions.
For the analysis, we will explore both the Participants attributes and their financial background. We will first extract required dataset from the financialjournal.csv and subsequently perform a join to merge the two csv files.
We will explore the following analysis:
We need to install any new packages we plan to load in an R markdown document. The following libraries will be incorporated to construct the required interactive charts:
library(tidyverse) #for plotting and summarizing
library(ggiraph) #dynamic ggplot graphs.
library(patchwork) #combine separate ggplots into the same graphic
library(ggrepel) #provides geoms for ggplot2 to repel overlapping text labels
library(plotly) #create interactive web-based graphs via plotly's JavaScript graphing library, plotly.js
To install and load the new packages every time we load the R project, we will execute the following code before commencing on the analysis and constructing of graphs.
packages = c('tidyverse', 'ggrepel', 'ggiraph', 'patchwork', 'plotly')
for(p in packages){
if(!require(p,character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
The dataset provided was retrieved from Vast Challenge 2022 and it
consist of multiple csv files. For this exercise, we will only focus on
two csv files mainly: Participants.csv
and
FinancialJournal.csv
.
To import the csv files, we will use the read_csv
command instead of read.csv
command to safeguard the
integrity of the data.
Since we are only interested in yearly Wages and Expenses per participants, we will perform sum the total amount earn/spent based on the different categories and subsequently sum the different expenses together.
financeJ_1 <- financeJ %>% #load the financeJ data table
select(c("participantId","category", "amount")) %>% #choose the columns to subset
group_by(participantId,category)%>%
summarise(amount = round(sum(amount),2)) %>% #sum all the amount based on their ID and Category rounding off to 2 decimal place
pivot_wider(names_from = "category",values_from = "amount") #pivot the table to have the categories in columns instead of rows
financeJ_1[is.na(financeJ_1)] = 0 #input a 0 value to all N.A field in the data table
financeJ_2 <- financeJ_1 %>%
mutate(Expenses = Education + Food + Recreation + Shelter + RentAdjustment) %>% #create new column to sum all the different categories of expenses
select(participantId, Wage, Expenses)
Once both csv files has been cleaned, we will combine the two csv
files using the left_join
function found in
dpylr library
.
final_list <- left_join(financeJ_2,participantsD, by = "participantId") #join the finance dataset with the participant details referencing the participantID
We will input a new column named agegroup which reference the participant age and insert a string based on their age.
age_group1 <- 18:25 #create a vector of age
age_group2 <- 26:30
age_group3 <- 31:35
age_group4 <- 36:40
age_group5 <- 41:45
age_group6 <- 46:50
age_group7 <- 51:55
age_group8 <- 56:60
final_list_1 <- final_list %>% #We will check the age of the participant and input the relevant string
mutate(agegroup= case_when(
age %in% age_group1 ~ "18-25",
age %in% age_group2 ~ "26-30",
age %in% age_group3 ~ "31-35",
age %in% age_group4 ~ "36-40",
age %in% age_group5 ~ "41-45",
age %in% age_group6 ~ "46-50",
age %in% age_group7 ~ "51-55",
age %in% age_group8 ~ "56-60")) %>%
select(-age)
An interactive bar chart will be plotted to show the average wage and expenses by the different age group.
We will first extract the average wage and expenses from the data
table based on age group, and then mutate the tooltip_text
to the data table that will be shown when hovering over the bar
chart.
final_list_bar <- final_list_1 %>%
group_by(agegroup) %>%
summarise(Wage = mean(Wage), Expenses = mean(Expenses)) %>% #reduces multiple values down to a single summary.
mutate(
tooltip_text = paste0(toupper(agegroup), "\n", #tooltip for Wage bar chart
"$", round(Wage,2) )
) %>%
mutate(
tooltip_text1 = paste0(toupper(agegroup), "\n", #tooltip for Expenses bar chart
"$", round(-Expenses,2) )
)
Once the data is ready, we will use ggplot
and
geom_col_interactive
to plot the respective graph.
geom_col_interactive
is use to construct interactive bar
chart. The ggiraph
library will be use to use the
girafe
function. We will reorder the Wage in desecending
order using the reorder function.
Wage_bar <- ggplot(final_list_bar,
aes(x = reorder(agegroup, Wage),
y = Wage,
tooltip = tooltip_text, data_id = agegroup #<<
)) +
geom_col_interactive(color = "black", fill="#0072B2", size = 0.5) + #<<
theme_minimal() +
theme(axis.text=element_text(size = 6)) + #<<
labs(title = "Average Wage (USD) of Ohio Population by Age Group",
subtitle = "Data from Vast Challenge 2022"
) +
ylab("") +
xlab("") +
coord_flip()
Expenses_bar <- ggplot(final_list_bar, aes(x = reorder(agegroup, -Expenses), y = -Expenses, tooltip = tooltip_text1, data_id = agegroup)) +
geom_col_interactive(color = "black", fill="#0072B2", size = 0.5) +
theme_minimal() +
theme(axis.text=element_text(size = 6)) +
labs(title = "Average Expenses (USD) of Ohio Population by Age Group",
subtitle = "Data from Vast Challenge 2022"
) +
ylab("") +
xlab("") +
coord_flip()
Once the bar graph has been created, we will use the
giraph
function to display the graph. We will stack the
graph on top of one another using the /
instead of
+
inside the print
function.
girafe(code = print(Wage_bar / Expenses_bar),
width_svg = 10, height_svg = 6) %>%
girafe_options(opts_hover(css = "fill:cyan;"))
We will use the expenses pyramid to visualise the distribution of expenses by the age group and whether the participants have kids.
To be able to illustrate the pyramid, we will need to have negative values and use breaks and labels to redefine the axis. We will first create another data table that consist of the average wages and expenses
We will now plot the pyramid chart using ggplot
and
geom_col_interactive()
. Girafe
will be use to
stack the two charts above one another.
expenses_chart <- ggplot(final_pyramid, aes(x = Expenses, y = agegroup, fill = haveKids, tooltip = ifelse(haveKids == FALSE,round(Expenses,2),round(-Expenses,2) ))) +
geom_col_interactive() +
scale_x_continuous(breaks = seq(-30000, 30000, 10000),
labels = paste0("$",as.character(c(seq(30, 0, -10), seq(10, 30, 10))),"k")) +
labs (x = "Annual Average Expenses (USD)", y = "Age Group", title='Ohio Annual Average Expenses based on Age Group') +
theme_bw() +
theme(axis.ticks.y = element_blank()) +
scale_fill_manual(values = c("TRUE" = "lightblue", "FALSE" = "lightpink"))
wages_chart <- ggplot(final_pyramid, aes(x = Wage, y = agegroup, fill = haveKids, tooltip = ifelse(haveKids == FALSE,round(Wage,2),round(-Wage,2) ))) +
geom_col_interactive() +
scale_x_continuous(breaks = seq(-60000, 60000, 30000),
labels = paste0("$",as.character(c(seq(60, 0, -30), seq(30, 60, 30))),"k")) +
labs (x = "Annual Average Wage (USD)", y = "Age Group", title='Ohio Annual Average Wages based on Age Group') +
theme_bw() +
theme(axis.ticks.y = element_blank()) +
scale_fill_manual(values = c("TRUE" = "lightblue", "FALSE" = "lightpink"))
girafe(code = print(wages_chart/expenses_chart),
width_svg = 12, height_svg = 7)
We will be using the scatter plot to examine the relationship between
the participants which have kids on their expenses and wages in relation
to their joviality. plotly
will be used to construct the
interactive scatter plot. We will also use
scale_x_continuous
to to set the scale limit of the x
axis.
wages_scatter <- final_list_1 %>%
filter (haveKids == TRUE) %>%
ggplot(aes(Wage,joviality, color = interestGroup)) +
geom_point() +
facet_wrap(~ educationLevel) +
theme(
strip.text.x = element_text(margin = margin(2, 0, 2, 0))
) +
scale_x_continuous(name ="Average Wage (USD)",
limits =c(0,60000))
ggplotly(wages_scatter)
expenses_scatter <- final_list_1 %>%
filter (haveKids == TRUE) %>%
ggplot(aes(-Expenses,joviality, color = interestGroup)) +
geom_point() +
labs (title = 'Participants Expenses by Jovaility', y = 'Joviality', x = 'Expenses (USD)') +
facet_wrap(~ educationLevel) +
theme(
strip.text.x = element_text(margin = margin(2, 0, 2, 0))
)
ggplotly(expenses_scatter)
A violin plot combines the same visualisation a box plot and a density plot. Therefore it allows to understand with a blink of an eye the minimum, max, 1st quartile, mean, median, 3rd quartile and frequency of multiple variables in a data frame. The shape of the violin density plot expresses the frequency of the observations. We will use the violin plot to analyse the relationship between joviality and the various categorical data.
plotly
will be used to construct the interactive violin
plot.
We will first plot the violin plot of Education Level against Joviality.
Next, we will plot the violin plot of householdSize against Joviality.
Lastly, we will plot the haveKids against joviality.
We observed that the older age group tend to have the highest average expenses but lower average wage. This could be due the need to spend more on recreational activities and/or shelter. From the scatter plot, we observed that there is no relation between joviality against expenses and wages. This could mean that joviality might be affected by other factors. Lastly, from the violin plot we observed that there are more Graduates that falls in the higher Joviality zone compared to the other educational level. Participants who have kids also express more joviality despite having almost equal mean and median in their joviality.