Take-Home Exercise 6

Visualising Social Network.

Ong Zhi Rong Jordan https://example.com/norajones (Spacely Sprockets)https://example.com/spacelysprokets
2022-06-05

OVERVIEW

Introduction

With reference to bullet point 2 of Challenge 1 of VAST Challenge 2022, I will be revealing the:

Methodology

To analyse the social interaction of the city of Engagement, I will be using the data from the Social Network journal to identify the interaction based on the different types of days and to use the Participant attributes to identify if there are any relation between their interactions and attributes.

Data Preparation

Installing and Loading of Packages

The following code chunks will install and load the required packages.

Show code
packages = c('igraph', 'tidygraph', 
             'ggraph','lubridate', 'clock',
             'tidyverse', 'ggmap', 'ggstatsplot', 'ggside', 'ggdist', 'patchwork')
for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

Loading Raw Data Set

The read_csv function is used to imports data into R as a tibble instead of read.csv that imports a regular old R data frame instead.

Show code
Part_nodes <- read_csv("raw_data/Participants.csv")
Social_edge <- read_csv("raw_data/SocialNetwork.csv")

Data Wrangling

The columns and values of participants file are renamed with below code chunk for better formatting and ease of reading. Credits to Leslie for the code chunk below!

Show code
# rename columns
Part_nodes <- Part_nodes %>%
  rename('Participant_ID' = 'participantId', 
         'Household_Size' = 'householdSize', 
         'Have_Kids' = 'haveKids', 
         'Age' = 'age', 
         'Education_Level' = 'educationLevel', 
         'Interest_Group' = 'interestGroup', 
         'Joviality' = 'joviality')

#rename value 
Part_nodes$Education_Level <- sub('HighSchoolOrCollege', 
                                    'High School or College',
                                    Part_nodes$Education_Level)

brks <- c(17, 20, 25, 30, 35, 40, 45, 50, 55, 60)
grps <- c('20 & Below', '21-25', '26-30', '31-35', '36-40', '41-45', 
          '46-50', '51-55', '56-60')

Part_nodes$Age_Group <- cut(Part_nodes$Age, breaks=brks, labels = grps)

To identify the different day types for the analysis, the lubridate package is used. the wday function with label = TRUE and abbr = FALSE return the full day name (etc. Monday, Tuesday), the month function return the numeric number of the month and the week function return the numeric week number of the year.

Once we find the weekday type, we will classify the different days by Working Days or Non-Working Days using the case_when function.

Show code
work_day <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

Social_edge_selected <- Social_edge %>%
  mutate (Weekday = wday(timestamp,
                         label = TRUE,
                         abbr = FALSE)) %>%
  mutate (month = month(timestamp,
                        label = FALSE)) %>%
   mutate (week = lubridate::week(timestamp)) %>%
  mutate (work_day = case_when(
    Weekday %in% work_day ~ "Working Days",
    TRUE ~ "Non-Working Days"
  ))

Preparation of Edges

Since we are plotting two different social network graph, two different edges data table must be created. From the above edge file created, the first edge file will consist of Non-Working Days in the 2nd week of March, and using the summarise function to find out the number of times the interaction takes place. Using the filter function to only retrieve edges that has more than 1 interaction. The same preparation will be done for the edges that consists of Working Days

Show code
Social_edge_aggregated <- Social_edge_selected %>% 
  filter (work_day == "Non-Working Days", month == 3, week == 10) %>%
  group_by(participantIdFrom,participantIdTo) %>%
  summarise(Weight = n()) %>%
  filter (participantIdFrom != participantIdTo) %>%
  filter (Weight > 1) %>%
  ungroup

Social_edge_workday <- Social_edge_selected %>% 
  filter (work_day == "Working Days", month == 3, week == 10) %>%
  group_by(participantIdFrom,participantIdTo) %>%
  summarise(Weight = n()) %>%
  filter (participantIdFrom != participantIdTo) %>%
  filter (Weight > 1) %>%
  ungroup

Preparation of Nodes

Since we are plotting two different social network graph, two different nodes data table must be created. The nodes within the node file must be unique and therefore the filter function is used to extract only the nodes that appear based on the source and target.

Show code
Part_nodes_aggregated <- Part_nodes %>%
  filter (Participant_ID  %in% c(Social_edge_aggregated$participantIdFrom, Social_edge_aggregated$participantIdTo))

Part_nodes_workday <- Part_nodes %>%
  filter (Participant_ID  %in% c(Social_edge_workday$participantIdFrom, Social_edge_workday$participantIdTo))

saveRDS(Part_nodes_aggregated, "Part_nodes_aggregated.rds")
saveRDS(Part_nodes_workday, "Part_nodes_workday.rds")

Distribution of Nodes

From the plots below, there is no significant difference in the participants Age Group, Joviality and Education Level. We can also identify that most participants attained a High School or College in Education Level. This might skew the overall outcome subsequently.

Show code
weekend_p1 <- Part_nodes_aggregated %>%
  mutate(Education= fct_infreq(Education_Level)) %>%
  ggplot(aes(x= Education)) +
  geom_bar(fill= '#808de8') +
  labs(y= 'No. of\nResidents', subtitle = "Distribution of Residents' Education Level") +
  theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
        panel.background= element_blank(), axis.line= element_line(color= 'bisque3'),
        plot.subtitle = element_text(color = "dimgrey", size = 12, face = "bold", hjust=0.5))

weekend_p2 <- ggplot(data= Part_nodes_aggregated, 
       aes(x= Age_Group)) +
  geom_bar(fill= '#6eba6a') +
  ylim(0, 150)  +
  labs(y= 'No. of\nResidents', x= 'Age Group',
       subtitle = "Distribution of Residents' Age") +
  theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
        panel.background= element_blank(), axis.line= element_line(color= 'bisque3'),
        plot.subtitle = element_text(color = "dimgrey", size = 12, face = "bold", hjust=0.5))

weekend_p3 <-  ggplot(data= Part_nodes_aggregated, 
       aes(x= Joviality)) +
  geom_histogram (binwidth=0.1, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
  labs(y= 'No. of\nResidents', x= 'Joviality',
       subtitle = "Distribution of Residents' Joviality") +
  theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
        panel.background= element_blank(), axis.line= element_line(color= 'bisque3'),
        plot.subtitle = element_text(color = "dimgrey", size = 12, face = "bold", hjust=0.5))

(weekend_p1+weekend_p2)/weekend_p3 + plot_annotation(
    title ="Distribution of Weekend Participant's Attribute",
    caption = 'Vast Challenge 2022'
  ) &
    theme(plot.title = element_text(size = 14, face = 'bold'))

Show code
weekday_p1 <- Part_nodes_workday %>%
  mutate(Education= fct_infreq(Education_Level)) %>%
  ggplot(aes(x= Education)) +
  geom_bar(fill= '#808de8') +
  labs(y= 'No. of\nResidents', subtitle = "Distribution of Residents' Education Level") +
  theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
        panel.background= element_blank(), axis.line= element_line(color= 'bisque3'),
        plot.subtitle = element_text(color = "dimgrey", size = 12, face = "bold", hjust=0.5))

weekday_p2 <- ggplot(data= Part_nodes_workday, 
       aes(x= Age_Group)) +
  geom_bar(fill= '#6eba6a') +
  ylim(0, 150)  +
  labs(y= 'No. of\nResidents', x= 'Age Group',
       subtitle = "Distribution of Residents' Age") +
  theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
        panel.background= element_blank(), axis.line= element_line(color= 'bisque3'),
        plot.subtitle = element_text(color = "dimgrey", size = 12, face = "bold", hjust=0.5))

weekday_p3 <-  ggplot(data= Part_nodes_workday, 
       aes(x= Joviality)) +
  geom_histogram (binwidth=0.1, fill="#69b3a2", color="#e9ecef", alpha=0.9) +
  labs(y= 'No. of\nResidents', x= 'Joviality',
       subtitle = "Distribution of Residents' Joviality") +
  theme(axis.title.y= element_text(angle=0), axis.ticks.x= element_blank(),
        panel.background= element_blank(), axis.line= element_line(color= 'bisque3'),
        plot.subtitle = element_text(color = "dimgrey", size = 12, face = "bold", hjust=0.5))

(weekday_p1+weekday_p2)/weekday_p3 + plot_annotation(
    title ="Distribution of Weekday Participant's Attribute",
    caption = 'Vast Challenge 2022'
  ) &
    theme(plot.title = element_text(size = 14, face = 'bold'))

Creating the graph dataframe

Before plotting the graph, we need to create a graph data frame using the nodes and edges data frame. The graph_from_data_frame function creates an igraph graph from one or two data frames containing the (symbolic) edge list and edge/vertex attributes.

Show code
cgraph <- graph_from_data_frame (Social_edge_aggregated,
                           vertices = Part_nodes_aggregated) %>%
  as_tbl_graph()


cgraph_workday <- graph_from_data_frame (Social_edge_workday,
                           vertices = Part_nodes_workday) %>%
  as_tbl_graph()

Visulisation and Analysis

Plotting of Social Network using Eigenvector Centrality

Social Interaction on a Non-Working Day

Due to the vast amount of edges within the interaction, we will focus only the top 10% vertices based on their eigenvector centrality score (top 10% influential people). The quantile function is used to identify the quantile based on the eigen_centrality of the graph. Once we identify the vertices, we will remove the other vertices using the delete_vertices function.

For the graph, we will use the The Fruchterman-Reingold layout algorithm and set.seed function to ensure the plot remains the same.

Show code
quantile_graph <- quantile(eigen_centrality(cgraph)$vector,
         probs = seq(0, 1, 1/10)
         )
V(cgraph)$size = eigen_centrality(cgraph)$vector

cgraph_aggregated <- delete_vertices(cgraph, V(cgraph)[size < quantile_graph[10]])


set.seed (1234)
layout1 <- layout_with_fr(cgraph_aggregated)

quantile_graph_aggregated <- quantile(V(cgraph_aggregated)$size, #identify top 10% of the new vertices
         probs = seq(0, 1, 1/10)
         )


V(cgraph_aggregated)$color <- ifelse (V(cgraph_aggregated)$size > quantile_graph_aggregated[10], "darkgoldenrod3", "azure3") #color yellow if vertices is top 10%
E(cgraph_aggregated)$color <- "grey"
V(cgraph_aggregated)$size <- V(cgraph_aggregated)$size/0.065 
#Increase the size of nodes based on their centrality score, only those with high score will be visible

V(cgraph_aggregated)$label <- ifelse (V(cgraph_aggregated)$size*0.065 > quantile_graph_aggregated[10],V(cgraph_aggregated)$name,NA)
#label the vertices if vertices belongs to the top 10%

plot(cgraph_aggregated, edge.arrow.size=0.25,edge.arrow.mode = "-", vertex.label = V(cgraph_aggregated)$label, vertex.label.cex = 0.65, vertex.label.font = 2, main = "Which Participant has the most influence on the weekends?" ) 
Show code
#Font size of 0.65 with Bond font using vertex.label cex and font

Vertices Attributes

Below, we examine the attributes of the top 10% vertices based on the updated EigenVector Centrality. We will only display the index with the eigenvalue more than the 90th percentile.

Show code
vertex_attr(cgraph_aggregated, index = V(cgraph_aggregated)$size*0.065 > quantile_graph_aggregated[10]) 
$name
[1] "255" "292" "330" "374" "378" "417" "484" "503" "640"

$Household_Size
[1] 3 3 2 2 2 2 1 1 2

$Have_Kids
[1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

$Age
[1] 59 27 25 57 60 38 42 24 24

$Education_Level
[1] "High School or College" "Bachelors"             
[3] "Graduate"               "High School or College"
[5] "High School or College" "High School or College"
[7] "Bachelors"              "High School or College"
[9] "Bachelors"             

$Interest_Group
[1] "H" "B" "G" "F" "C" "G" "B" "A" "D"

$Joviality
[1] 0.8521878 0.6863749 0.7738118 0.8541722 0.9918102 0.6958947
[7] 0.9365472 0.8924825 0.6869823

$Age_Group
[1] "56-60" "26-30" "21-25" "56-60" "56-60" "36-40" "41-45" "21-25"
[9] "21-25"

$size
[1] 11.18398 11.96378 12.73078 12.54330 15.38462 12.19438 12.58637
[8] 10.60798 12.00635

$color
[1] "darkgoldenrod3" "darkgoldenrod3" "darkgoldenrod3"
[4] "darkgoldenrod3" "darkgoldenrod3" "darkgoldenrod3"
[7] "darkgoldenrod3" "darkgoldenrod3" "darkgoldenrod3"

$label
[1] "255" "292" "330" "374" "378" "417" "484" "503" "640"
Show code
#Check vertice attributes based on the index condition

Statistical Plot of Eigenvector Score against Participant’s attributes

To plot the statistical plot, we first need to convert the graph data frame to the normal data frame. The as.data.frame function is used to perform the conversion.

Through the vertices attributes of the top 10% of Eigenvector Centrality Score, we see that most participants who have high influence over the weekends attained an Education Level of High School or College" "Bachelors and the half of the participants fall within the age range of 56-60 and 21-25. We will now analyse to see if these 2 factors affect how influential a participant is during the weekend. A non-parametric test is conducted to provided statistical evidence whether to reject or accept the null hypothesis.

To combine the two plots together, we will use the patchwork library that expands the API to allow for arbitrarily complex composition of plots by, among others, providing mathematical operators for combining multiple plots.

Show code
c_graph_analysis <- as.data.frame(cgraph)

p1 <- ggbetweenstats(
  data    = c_graph_analysis,
  x       = Education_Level,
  y       = size,
  xlab = "Education Level",
  ylab = "EV Centrality \nScore",
  title = "Will your Influence increase based on Education Level?",
  type = "np", #conduct non-parametric test
  conf.level = 0.95,
  mean.ci = TRUE,
  package = "ggsci",
  palette = "default_jco"
) +
  ggplot2::theme(axis.title.y= element_text(angle=0, size = 9),axis.title.x= element_text(size = 9),
                 plot.title = element_text(color = "dimgrey", size = 12, hjust=0.5))

p2<-  ggbetweenstats(
  data    = c_graph_analysis,
  x       = Age_Group,
  y       = size,
  xlab = "Age Group",
  ylab = "EV Centrality \nScore",
  title = "Will your Influence increase based on Age?",
  type = "np", #conduct non-parametric test
  conf.level = 0.95,
  mean.ci = TRUE,
  package = "ggsci",
  palette = "nrc_npg"
)+
  ggplot2::theme(axis.title.y= element_text(angle=0, size = 9),axis.title.x= element_text(size = 9),
                 plot.title = element_text(color = "dimgrey", size = 12, hjust=0.5)) 


p1/plot_spacer()/p2 + plot_annotation(
    title ="Higher Qualification or Age have better influence?",
    caption = 'Vast Challenge 2022',
    theme = theme(plot.title = element_text(size = 14))
  ) &
    theme(plot.title = element_text(face = 'bold')) #apply bold to all plot.title

Cluster Detection on a Non-Working Day

By using the cluster_edge_betweenness function, we will be able to detect any possible cluster based on the betweeness score of the vertices. From the graph below, we can see that there are 11 clusters found within the interaction.

Show code
GNC <- cluster_edge_betweenness(cgraph_aggregated, weights = NULL)
V(cgraph_aggregated)$color <-membership(GNC)              #Plot setting specifying the coloring of vertices by community
cgraph_aggregated$palette <- diverging_pal(length(GNC)) 
plot(cgraph_aggregated, edge.arrow.size=0.25,edge.arrow.mode = "-", vertex.label = NA, main = "How many Social Groups within the community?")

Analysis of Plot

From the above plots, we can infer that the 9 participants who are deemed influential based on their EigenVector Score have no relationship based on their attributes. From the statisical plot, both plots shows a P value of more than 0.05 and therefore we fail to reject the null hypothesis that there is no relationship between influential and participants attributes.

We can also infer that these participants might be working in pubs/restaurants and therefore have high interaction with people since the previous exercise concludes the high social activity in pubs and restaurants during the weekend. With the clusters, we can possibly infer that these people might be working in different pubs/restaurants or are in different social circles.

Social Interaction on a Working Day (Mon-Fri)

The preparation of graph will be similar to the preparation of the Non-Working Day graph.

Show code
quantile_graph_workday <- quantile(eigen_centrality(cgraph_workday)$vector,
         probs = seq(0, 1, 1/10)
         )

V(cgraph_workday)$size = eigen_centrality(cgraph_workday)$vector

cgraph_aggregated_workday <- delete_vertices(cgraph_workday, V(cgraph_workday)[size < quantile_graph_workday[10]])


set.seed (1234)
layout1 <- layout_with_fr(cgraph_workday)

quantile_graph_aggregated_workday <- quantile(V(cgraph_aggregated_workday)$size,
         probs = seq(0, 1, 1/10)
         )


V(cgraph_aggregated_workday)$color <- ifelse (V(cgraph_aggregated_workday)$size > quantile_graph_aggregated_workday[10], "darkgoldenrod3", "azure3") 
E(cgraph_aggregated_workday)$color <- "grey" #all Edges to be grey
V(cgraph_aggregated_workday)$size <- V(cgraph_aggregated_workday)$size/0.075
V(cgraph_aggregated_workday)$label <- ifelse (V(cgraph_aggregated_workday)$size*0.075 > quantile_graph_aggregated_workday[10],V(cgraph_aggregated_workday)$name,NA)

plot(cgraph_aggregated_workday, edge.arrow.size=0.25,edge.arrow.mode = "-", vertex.label = V(cgraph_aggregated_workday)$label, vertex.label.cex = 0.65, vertex.label.font = 2 )

Vertices Attributes

Below, we examine the attributes of the top 10% vertices based on the updated EigenVector Centrality. We will only display the index with the eigenvalue more than the 90th percentile.

Show code
vertex_attr(cgraph_aggregated_workday, index = V(cgraph_aggregated_workday)$size*0.075 > quantile_graph_aggregated_workday[10])
$name
[1] "266" "374" "378" "417" "484" "503" "624" "659" "991"

$Household_Size
[1] 3 2 2 2 1 1 2 2 1

$Have_Kids
[1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

$Age
[1] 32 57 60 38 42 24 31 49 21

$Education_Level
[1] "Graduate"               "High School or College"
[3] "High School or College" "High School or College"
[5] "Bachelors"              "High School or College"
[7] "Bachelors"              "Graduate"              
[9] "Low"                   

$Interest_Group
[1] "F" "F" "C" "G" "B" "A" "C" "H" "C"

$Joviality
[1] 0.9726299 0.8541722 0.9918102 0.6958947 0.9365472 0.8924825
[7] 0.9161637 0.7418347 0.9704926

$Age_Group
[1] "31-35" "56-60" "56-60" "36-40" "41-45" "21-25" "31-35" "46-50"
[9] "21-25"

$size
[1] 10.43792 11.11102 13.33333 10.97837 11.27556 11.83022 11.48375
[8] 10.63215 11.02669

$color
[1] "darkgoldenrod3" "darkgoldenrod3" "darkgoldenrod3"
[4] "darkgoldenrod3" "darkgoldenrod3" "darkgoldenrod3"
[7] "darkgoldenrod3" "darkgoldenrod3" "darkgoldenrod3"

$label
[1] "266" "374" "378" "417" "484" "503" "624" "659" "991"

Statistical Plot of Eigenvector Score against Participant’s attributes (Working Day)

To plot the statistical plot, we first need to convert the graph data frame to the normal data frame. The as.data.frame function is used to perform the conversion.

Through the vertices attributes of the top 10% of Eigenvector Centrality Score, we see that most participants who have high influence over the working dats do not have kids and most of them have high joviality. We will now analyse to see if these 2 factors affect how influential a participant is during working days. To combine the two plots together, we will use the patchwork library that expands the API to allow for arbitrarily complex composition of plots by, among others, providing mathematical operators for combining multiple plots.

Show code
c_graph_analysis_workday <- as.data.frame(cgraph_workday)

p1 <- ggbetweenstats(
  data    = c_graph_analysis_workday,
  x       = Have_Kids,
  y       = size,
  xlab = "Have Kids",
  ylab = "EV Centrality \nScore",
  title = "Will Kids affect your Influence?",
  type = "np",
  conf.level = 0.95,
  mean.ci = TRUE,
  package = "ggsci",
  palette = "default_jco"
) +
  ggplot2::theme(axis.title.y= element_text(angle=0, size = 9),axis.title.x= element_text(size = 9),
                 plot.title = element_text(color = "dimgrey", size = 12, hjust=0.5))

p2<-  ggscatterstats(
  data    = c_graph_analysis_workday,
  x       = Joviality,
  y       = size,
  title = "Does happiness affect your Influence?",
  xlab = "Joviality",
  ylab = "EV Centrality \nScore",
)+
  ggplot2::theme(axis.title.y= element_text(angle=0, size = 9),axis.title.x= element_text(size = 9),
                 plot.title = element_text(color = "dimgrey", size = 12, hjust=0.5))

p1/plot_spacer()/p2 + plot_annotation(
    title ="Does having Kids or Being Happy affect your Influence?",
    caption = 'Vast Challenge 2022',
    theme = theme(plot.title = element_text(size = 14))
  ) &
    theme(plot.title = element_text(face = 'bold')) #Apply Bold to all plot.title

Cluster Detection on a Working Day

By using the cluster_edge_betweenness function, we will be able to detect any possible cluster based on the betweeness score of the vertices. From the graph below, we can see that there are 3 clusters found within the interaction.

Show code
GNC1 <- cluster_edge_betweenness(cgraph_aggregated_workday, weights = NULL)
V(cgraph_aggregated_workday)$color <-membership(GNC1)              #Plot setting specifying the coloring of vertices by community
cgraph_aggregated_workday$palette <- diverging_pal(length(GNC1))   
plot(cgraph_aggregated_workday, edge.arrow.size=0.25,edge.arrow.mode = "-", vertex.label = NA, main = "How many influential industries within \n City of Engagement?")

Analysis of Plot

From the above plots, we can infer that the 9 participants who are deemed influential based on their EigenVector Score have high joviality and without kids. From the statistical plot, the plot of have_kids shows a p-value of more than 0.05 and therefore we fail to reject the null hypothesis that there is no relationship between this two variables. For the plot on Joviality since the p-value is less than 0.05, we reject the null hypothesis and conclude that there is a relationship between Joviality and Eigenvector Centrality score. The higher the joviality, the higher your influence is on the weekday.

Influence on weekdays can also highlight the person influence in their workplace. There is a possibility that these participants are holding on the a management level position within the workplace to affect certain level of influence.

From the cluster detection, we can also possibly identify 3 different industries that exist within the City that hold the top 10% Eigenvector Centrality Score. These industries might play an influential role or key businesses within the City of Engagement due to how much influence they have during the Working Days.

Learning Points

Throughout this exercise, I was able to utilize the many different plots (network and statistic) to identify trends/pattern through the social network of the Participants. Instead of using the full vertices and edges, one must understand the outcome of the social network plot and reduce the network size based on the required analysis.

To further improve the analysis, income can be considered to validate if the influential participants are high-earners and therefore have a higher chance that they are holding on to a managerial position in the company.