Take-Home Exercise 5

Putting Visual Analytics into Practical Use.

Ong Zhi Rong Jordan https://example.com/norajones (Spacely Sprockets)https://example.com/spacelysprokets
2022-05-29

OVERVIEW

Introduction

With reference to bullet point 1 and 2 of Challenge 2 of VAST Challenge 2022, I will be revealing the:

Methodology

To analyse the social areas of the city of Engagement, I will be using the data from the Check In journal to analyse the frequency of visit to the different Venue Type (Pub or Restaurants) and the Pubs and Restaurants Attribute data to identify the location of these venues to identify a pattern in their establishments. Subsequently, I will use the monthly status log to identify the travel pattern of the participants to identify choke points (high intensity visit) on various location throughout the month.

Data Preparation

Installing and Loading of Packages

The following code chunks will install and load the required packages.

Show code
packages = c('sf', 'tmap', 'tidyverse', 
             'lubridate', 'clock', 
             'sftime', 'rmarkdown', "binr")
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

Loading Raw Data Set

The read_sf function will be used instead of read_csv for the various data set to read simple features and retrieve their geometry types.

Show code
pubs <- read_sf("rawdata/Pubs.csv", 
                   options = "GEOM_POSSIBLE_NAMES=location")
buildings <- read_sf("rawdata/Buildings.csv", 
                   options = "GEOM_POSSIBLE_NAMES=location")
restaurants <- read_sf("rawdata/Restaurants.csv", 
                   options = "GEOM_POSSIBLE_NAMES=location")
schools <- read_sf("rawdata/Schools.csv", 
                   options = "GEOM_POSSIBLE_NAMES=location")
apartments <- read_sf("rawdata/Apartments.csv", 
                   options = "GEOM_POSSIBLE_NAMES=location")
employers <- read_sf("rawdata/Employers.csv", 
                   options = "GEOM_POSSIBLE_NAMES=location")
check_in <- read.csv("rawdata/CheckinJournal.csv")

Data Cleaning and Wrangling

To bind two different data frame into one, there is a need for a similar key and columns. For this analysis, we will use the Venue Id as the key. The rbind function is use to bind the the two different data frame rows together.

Show code
pubs <- pubs %>%
  select(c("pubId", "buildingId")) %>%
  rename("venueId" = "pubId") 

restaurants <- restaurants %>%
  select(c("restaurantId", "buildingId")) %>%
  rename("venueId" = "restaurantId") 

pubs$venueId <- as.numeric(as.character(pubs$venueId))
restaurants$venueId <- as.numeric(as.character(restaurants$venueId))

pubs$buildingtype <- "Pubs"
restaurants$buildingtype <- "Restaurants"


buildingType <- bind_rows(pubs,restaurants)
Show code
apartments$rentalCost <- as.numeric(as.character(apartments$rentalCost))

cuts <- bins(apartments$rentalCost, target.bins = 5, max.breaks = 5)
cuts$breaks <- bins.getvals(cuts)
cuts$binct

lowRental <- as.double(c(300:600))
midRental <- as.double(c(601:800))
highRental <- as.double(c(801:1000))
suhighRental <- as.double(c(1001:1605))

apartments <- apartments %>%
  mutate (Rental_Type = case_when(
    round(rentalCost) %in% lowRental ~ "Low Rental",
    round(rentalCost) %in% midRental ~ "Mid Rental",
    round(rentalCost) %in% highRental ~ "High Rental",
    round(rentalCost) %in% suhighRental ~ "Super High Rental"
  )) 

Filter based on Venue

Next, we will filter the check-in data to only show pubs and restaurants by using the filter function. Since there is a missing building Id for the venue types, we will use a full_join the insert the required building Id to the check-in journal.

Show code
check_in <- check_in %>%
  filter(venueType == "Pub" | venueType == "Restaurant")

clean_data <- full_join(check_in,buildingType, by = "venueId")

Insert day of the week

To find out the day of the week, we will use the lubridate function wday to insert a new column. We will then use case_when to run through all the rows and identify the difference types of weekday. Visit count will be calculated using the n() function and filtering the duplicated rows using the unique function.

Show code
clean_data_selected <- clean_data %>%
   mutate (Timestamp = date_time_parse(timestamp,
                                      zone ="",
                                      format = "%Y-%m-%dT%H:%M:%S")) %>% 
  mutate (day = wday(Timestamp, label = TRUE))

weekday <- c("Mon", "Tue", "Wed", "Thu", "Fri")
weekend <- c("Sat", "Sun")

clean_data_selected <- clean_data_selected %>%
  mutate (Weektype = case_when(
    day %in% weekday ~ "Weekday",
    day %in% weekend ~ "Weekend"
  )) %>% 
  select (-day)


new_data_weekday <- clean_data_selected %>%
  select(c("venueType", "buildingId", "location","Weektype")) %>%
  group_by(buildingId, Weektype) %>%
  mutate(n = n()) 

#new_data_weekday <- unique(new_data_weekday)

Visulisation and Analysis

Plotting of Venue using tmap

Distribution of Building Type around the City

For the plot, we will use different colours of polygon to identify possible trends in the way the city was divided based on the distribution of building types. The first plot shows the overall city building type distribution using tm_polygons colors to seperate the different types.

Show code
tm_shape(buildings)+
tm_polygons(col = "buildingType",
           size = 1,
           title = "Building Type",
           border.col = "black",
           border.lwd = 1) +
  tm_layout(main.title= 'Distribution of Building Type in Ohio', 
            main.title.position = c('left'),
            main.title.size = 1.3, legend.outside = TRUE,)

Through the plot, we can identify the southern region is heavily commercialise while the central region of the city can be identified as a Heartlands due to the high volumn of residential area. The northern part of the city has a even distribution of residential and commerical area.

Distribution of Schools around the City

We will also look at the distribution of schools and their cost required to see if the area affect the school costs.

Show code
schools$monthlyCost <- round(as.numeric(as.character(schools$monthlyCost)),2)

tm_shape(buildings)+
tm_polygons(col = "buildingType",
           size = 1,
           title = "Building Type",
           border.col = "black",
           border.lwd = 1) +
tm_shape(schools) +
  tm_symbols(size = 0.5, col = "monthlyCost", style = "cont", title.col = "Monthly Cost"
             ) +
  tm_layout(main.title= 'Schools Location in Ohio', 
            main.title.position = c('left'),
            main.title.size = 1.5, legend.outside = TRUE,)
Show code
tmap_mode("plot")

From the plot, we observed that the school located within the commercialise area in the south has a lower monthly cost compared to the other schools which is located at the region with more residential area surronding it. This could be due to the high volume of demand for the schools within that area due to its vicinity around the residential area.

Distribution of Apartment in the City of Engagement

Next, we will analyse the distribution of Apartment based on the rental cost to identify any possible trends in the city.

We will plot two different apartment plots, one is for the number of rooms of the apartment, the other is the max occupancy of the apartment.

Plot of Apartment based on Max Occupancy

Show code
tm_shape(buildings)+
tm_polygons(col = "grey",
           size = 1,
           border.col = "black",
           border.lwd = 1) +
tm_shape(apartments) +
  tm_symbols(size = 0.2, col = "maxOccupancy", title.col = "Max Occupancy"
             ) +
  tm_facets(by = "Rental_Type", nrow = 2, free.coords = FALSE) +
  tm_layout(main.title= 'Apartment Location in Ohio \nbased on Rental Cost', 
            main.title.position = c('left'),
            main.title.size = 1)
Show code
tmap_mode("plot")

Plot of Apartment based on Number of Rooms

Show code
tm_shape(buildings)+
tm_polygons(col = "grey",
           size = 1,
           border.col = "black",
           border.lwd = 1) +
tm_shape(apartments) +
  tm_symbols(size = 0.2, col = "numberOfRooms", title.col = "Number of Rooms"
             ) +
  tm_facets(by = "Rental_Type", nrow = 2, free.coords = FALSE) +
  tm_layout(main.title= 'Apartment Location in Ohio \nbased on Rental Cost', 
            main.title.position = c('left'),
            main.title.size = 1)
Show code
tmap_mode("plot")

From the two plots, we observed that the rental cost is not affected by the number of rooms of the apartment but the occupancy number of the apartment. Occupancy rate of 3 and 4 tend to have higher rental cost and concentrated within the southern and central region. Whereas the low rental apartments can be found mostly at the northern region of the city.

Monthly Visit Count of Venue

For the plot, we use the shape to visualise different venue types and the color to identify the density of the visit count. The first plot shows one month visit count of the various venue.

Show code
tm_shape(buildings)+
tm_polygons(col = "grey60",
           size = 1,
           border.col = "black",
           border.lwd = 1) +
tm_shape(new_data) +
  tm_symbols(size = 0.5, shape = "venueType", shapes.labels = c("Pub", "Restaurant"), title.shape = "Venue Type",
             col = "n", style = "cont", title.col = "Visit Count"
            ) +
  tm_layout(title= 'Visit Count of Venue by \nResidents in Ohio', 
            title.position = c('right', 'top'),
            title.size = 1)
Show code
tmap_mode("plot")

Monthly Visit Count of Venue by Weekday

Next, by using tm_facet, we are able to segregate the data by the type of weekday. We now plot the visit based on Weekdays and Weekends to identify any patterns.

Show code
tm_shape(buildings)+
tm_polygons(col = "grey60",
           size = 1,
           border.col = "black",
           border.lwd = 1) +
tm_shape(new_data_weekday) +
  tm_symbols(size = 0.5, shape = "venueType", shapes.labels = c("Pub", "Restaurant"), title.shape = "Venue Type",
             col = "n", style = "cont", title.col = "Visit Count"
            )  +
  tm_facets(by = "Weektype", nrow = 1, free.coords = FALSE) +
  tm_layout(main.title= 'Visit Count of Venue by Resident in Ohio', 
            main.title.position = c('center'),
            main.title.size = 1, legend.outside.size = 0.2)
Show code
tmap_mode("plot")

Analysis of Plot

From our 1st plot, we observed that Pubs in the central region have a higher visit count than the pubs located in the outskirt of the city. Whereas the restaurants located away from the central region have a slightly higher visit count that the restaurants located within the central region. This could be due to the commercial region in the North and South of the city which would bring in more visits throughout the month. This also shows that residents would visit pubs that are located within the residential area instead of the commericalise area.

When we look at the different weekday types, we observed that in general, during the weekdays, restaurants have a higher visit count than pubs. Restaurants in the commercial areas (located in the north and south) have a slightly higher visit count that restaurants located in the central region. Whereas during the weekends, Pubs in the central region is definitely the social gathering point for the people in the city of Engagement.

Despite the even distribution of pubs and restaurants throughout the city, high social activities seems to be concentrated within the central region regardless of the weekday types.

Plotting of bottlenecks using Hexbin

We will combine 5 status log to gather 1 month worth of data for analysis. Similar to methods used to prepare the Pubs and Restaurant data.

Timings for Analysis

For this analysis, we will evaluate the Morning and Evening Rush Hour timings on Sat and Mon to identify bottlenecks/congestion within the city. Morning Peak hour is define as 0600hrs - 0900hrs where Evening Rush Hour is defined as 1700hrs - 2000hrs.

Show code
morningRH <- c(6:9)
eveningRH <- c(17:20)

logs_selected <- logs_selected %>%
  mutate (dayType = case_when(
    hour %in% morningRH ~ "Morning Peak Hour",
    hour %in% eveningRH ~ "Evening Peak Hour"
  )) %>% 
  select (-hour)

Cleaning of Data

Similar to the data cleaning of check-in journal, the additional steps taken for the status log is to convert the data frame to sf format using st_as_sf before we combine the dataframe with the hex dataframe.

Show code
logs_selected <- st_as_sf(logs_selected_mon_mornpeak)
logs_selected_weekday <- st_as_sf(logs_selected_mon_evenpeak)
logs_selected_weekend <- st_as_sf(logs_selected_sat_mornpeak)
logs_selected_weekend <- st_as_sf(logs_selected_sat_evenpeak)

Creating the Hex data

By using the polygon data from building, the hexagon data is created.

Show code
hex <- st_make_grid(buildings, 
                    cellsize=70, 
                    square=FALSE) %>%
  st_sf() %>%
  rowid_to_column('hex_id')
Show code
points_in_hex_mon_morn <- st_join(logs_selected_mon_mornpeak, 
                        hex) %>%
  st_set_geometry(NULL) %>%
  count(name='pointCount', hex_id)

points_in_hex_mon_even <- st_join(logs_selected_mon_evenpeak, 
                        hex) %>%
  st_set_geometry(NULL) %>%
  count(name='pointCount', hex_id)


points_in_hex_sat_morn <- st_join(logs_selected_sat_mornpeak, 
                        hex) %>%
  st_set_geometry(NULL) %>%
  count(name='pointCount', hex_id)

points_in_hex_sat_even <- st_join(logs_selected_sat_evenpeak, 
                        hex) %>%
  st_set_geometry(NULL) %>%
  count(name='pointCount', hex_id)
Show code
hex_combined_mon_morn <- hex %>%
  left_join(points_in_hex_mon_morn, 
            by = 'hex_id') %>%
  replace(is.na(.), 0)

hex_combined_mon_even <- hex %>%
  left_join(points_in_hex_mon_even, 
            by = 'hex_id') %>%
  replace(is.na(.), 0)

hex_combined_sat_morn <- hex %>%
  left_join(points_in_hex_sat_morn, 
            by = 'hex_id') %>%
  replace(is.na(.), 0)

hex_combined_sat_even <- hex %>%
  left_join(points_in_hex_sat_even, 
            by = 'hex_id') %>%
  replace(is.na(.), 0)

Plotting of bottleneck/congestion using tmap

Plotting and Analysis of Hexbin Plot

To plot the map, we will use the tm_shape function to call the hexagon shape and filter to filter all the hexagon which has no visit count. tm_fill is used to identify the density of visit based on quantile. The plot will consist of the traffic condition on a monthly basis and both weekday and weekend.

Show code
w1 <- tm_shape(buildings)+
tm_polygons(col = "buildingType",
           size = 1,
           title = "Building Type",
           border.col = "black",
           border.lwd = 1, legend.show = FALSE) +
  tm_shape(hex_combined_mon_morn %>%
           filter(pointCount > 0))+
  tm_fill("pointCount",
          n = 5,
          title = "Visit Count",
          style = "quantile", alpha = 0.8) +
  tm_borders(alpha = 0.1) +
  tm_layout(main.title= 'Traffic Condition on Mon \nduring Morning Rush Hour', 
            main.title.position = c('center'),
            main.title.size = 0.8)

w2 <- tm_shape(buildings) +
tm_polygons(col = "buildingType",
           size = 1,
           title = "Building Type",
           border.col = "black",
           border.lwd = 1, legend.show = FALSE) +
  tm_shape(hex_combined_sat_morn %>%
           filter(pointCount > 0),)+
  tm_fill("pointCount",
          n = 5,
          title = "Visit Count",
          style = "quantile", alpha = 0.8) +
  tm_borders(alpha = 0.1) +
  tm_layout(main.title= 'Traffic Condition on Sat \nduring Morning Rush Hour', 
            main.title.position = c('center'),
            main.title.size = 0.8)

tmap_arrange(w1, w2)
Show code
tmap_mode("plot")
Show code
w1 <- tm_shape(buildings)+
tm_polygons(col = "buildingType",
           size = 1,
           title = "Building Type",
           border.col = "black",
           border.lwd = 1, legend.show = FALSE) +
  tm_shape(hex_combined_mon_even %>%
           filter(pointCount > 0))+
  tm_fill("pointCount",
          n = 5,
          title = "Visit Count",
          style = "quantile", alpha = 0.8) +
  tm_borders(alpha = 0.1) +
  tm_layout(main.title= 'Traffic Condition on Mon \nduring Evening Rush Hour', 
            main.title.position = c('center'),
            main.title.size = 0.7)

w2 <- tm_shape(buildings)+
tm_polygons(col = "buildingType",
           size = 1,
           title = "Building Type",
           border.col = "black",
           border.lwd = 1, legend.show = FALSE) +
  tm_shape(hex_combined_sat_even %>%
           filter(pointCount > 0),)+
  tm_fill("pointCount",
          n = 5,
          title = "Visit Count",
          style = "quantile", alpha = 0.8) +
  tm_borders(alpha = 0.1) +
  tm_layout(main.title= 'Traffic Condition on Sat \nduring Evening Rush Hour', 
            main.title.position = c('center'),
            main.title.size = 0.7)

tmap_arrange(w1, w2)
Show code
tmap_mode("plot")

Analysis of Plot

From the plot, we can identify the congestion of the city based on the frequency of visit by the participants.

During the Morning Rush Hour, the narrow streets along the city are definitely congested regardless of the Weekday or Weekend. For Weekday, we observed that there is an increased congestion within the central region residential area compared to weekends.

For the evening Rush hour, it seems that there is no difference between Mon or Sat with a congested narrow path and high volume of traffic at the northern region.

Learning Points

Throughout this exercise, I was able to utilize the tmap function and customize various inputs such as the shape and fill colour. My own reflection is that tmap is similar to ggplot and therefore is user-friendly for those who are proficient in the layers of ggplot. Hexbin Maps provide a good visulisation of the density in an area instead of a point but sometimes without an overlay would be difficult to understand or provide good insights. Even if the plot were to overlay with the hexbin map, the plot would be unreadable due to the size and the structure of a hexagon.