The goal of this project is to apply Social Network Analysis techniques. The process is complex and consists of several steps, which I describe in this report.
For this project I use R (programming language), together with Gephi, an open-source tool for network visualization and analysis.
The first phase of Social Network Analysis is collecting and selecting datasets that are useful for further analysis. In my case, I took data frames from kaggle.com about the Marvel universe, covering the comic books and the heroes that appear in them.
Everything presented in this file can be reproduced with the R file (in the /code folder) and viewed in Gephi (all graphs are in the /gephi folder).
With this project, I wanted to analyze the network of heroes, that is, the encounters they had within each individual comic book. I analyzed the social network of the Marvel universe using 2 datasets (in the /data folder):
edges.csv: represents the heroes' paths and relationships within the social network, consisting of hero (hero) and comic (comic in which the hero appears).
hero-network.csv: represents the encounters (relationships) between heroes specifically, composed of hero1 and hero2.
I loaded the datasets:
edges_df <- read.csv(file = "~/path/edges.csv")
hero_df <- read.csv(file = "~/path/hero-network.csv")
Below I have loaded all the libraries I needed for the project:
library(ggthemes)
library(dplyr)
library(tidyverse)
library(igraph)
library(threejs)
I filtered the edges_df table by selecting the hero variable, grouping by it, and counting the rows in each group, which gives the number of times each hero appears across all comics. I sorted the result in descending order and converted it into a data frame because it was easier to study. I then took the top 20 heroes with the most appearances and created a graph from them.
edges_df_su <- edges_df %>%
select(hero) %>%
group_by(hero) %>%
summarize(count=n()) %>%
arrange(desc(count))
edges_df_su <- as.data.frame(edges_df_su)
edges_df_su1 <- as.data.frame(edges_df_su[1:20,])
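The counting step can be checked on a tiny table. The sketch below uses made-up heroes and comics (`toy_edges` is hypothetical, not the Kaggle data) and the `dplyr` helpers `count()` and `slice_head()`, which are one possible shorthand for the group/summarize/arrange chain above.

```r
library(dplyr)

# Hypothetical toy data: one row per (hero, comic) appearance.
toy_edges <- data.frame(
  hero  = c("A", "B", "A", "C", "A", "B"),
  comic = c("c1", "c1", "c2", "c2", "c3", "c3")
)

# Count appearances per hero, sort in descending order, keep the top 2.
top_heroes <- toy_edges %>%
  count(hero, name = "count", sort = TRUE) %>%
  slice_head(n = 2)

top_heroes  # hero A with 3 appearances, then hero B with 2
```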
The following plot shows the most featured heroes.
g <- ggplot(edges_df_su1, mapping = aes(x = reorder(hero, count), count, fill=hero))+
geom_bar(stat="identity")+
geom_text(aes(label=count), hjust=1.3, vjust=0.8, color="black", size=3.5)+
theme_minimal()+
coord_flip()
g
The heroes with the most appearances within the analyzed comic books are Spider-Man and Captain America.
For the construction of the social network I used the hero_df data frame. In R, I created a graph from the data frame with the graph_from_data_frame function from the igraph package (command below).
hero_n <- graph_from_data_frame(hero_df, directed = F)
In parallel I loaded the dataset into Gephi to visualize the graph. I set "Repulsion Strength" = 10000.0 in the Force Atlas section, because this spaces the nodes out and makes the graph easier to read.
At the same time, R let me summarize the graph in numbers: it has 6426 nodes and 574467 links.
V(hero_n)
## + 6426/6426 vertices, named, from fcea7df:
## [1] LITTLE, ABNER BLACK PANTHER/T'CHAL STEELE, SIMON/WOLFGA
## [4] RAVEN, SABBATH II/EL IRON MAN IV/JAMES R. IRON MAN/TONY STARK
## [7] ERWIN, CLYTEMNESTRA PRINCESS ZANDA CARNIVORE/COUNT ANDR
## [10] GHOST ZIMMER, ABE FU MANCHU
## [13] SHANG-CHI SMITH, SIR DENIS NAY STARSHINE II/BRANDY
## [16] MAN-THING/THEODORE T TARR, BLACK JACK WU, LEIKO
## [19] JACKSON, STEVE RESTON, CLIVE ROM, SPACEKNIGHT
## [22] DOCTOR DREDD MYSTIQUE/RAVEN DARKH HYBRID/JAMES JIMMY M
## [25] DESTINY II/IRENE ADL ROGUE / AVALANCHE/DOMINIC PE
## [28] PYRO/ALLERDYCE JOHNN TORPEDO III/BROCK JO CLARK, SARAH
## + ... omitted several vertices
E(hero_n)
## + 574467/574467 edges from fcea7df (vertex names):
## [1] LITTLE, ABNER --PRINCESS ZANDA
## [2] LITTLE, ABNER --BLACK PANTHER/T'CHAL
## [3] BLACK PANTHER/T'CHAL--PRINCESS ZANDA
## [4] LITTLE, ABNER --PRINCESS ZANDA
## [5] LITTLE, ABNER --BLACK PANTHER/T'CHAL
## [6] BLACK PANTHER/T'CHAL--PRINCESS ZANDA
## [7] STEELE, SIMON/WOLFGA--FORTUNE, DOMINIC
## [8] STEELE, SIMON/WOLFGA--ERWIN, CLYTEMNESTRA
## [9] STEELE, SIMON/WOLFGA--IRON MAN/TONY STARK
## [10] STEELE, SIMON/WOLFGA--IRON MAN IV/JAMES R.
## + ... omitted several edges
Since many characters meet repeatedly across comics, the network can be treated as a weighted network, where the weight of each link is very important. The data frame had no link weights, so I created them: I gave each relationship between characters (each row of the data frame) a weight = 1. The igraph package provides the simplify() function, which merges repeated links between the same pair of heroes and, with the settings used here, sums their weights.
E(hero_n)$weight <- 1
hero_n <- igraph::simplify(hero_n, remove.loops = TRUE, remove.multiple = TRUE, edge.attr.comb = list(weight="sum", "ignore"))
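To make the effect of the weighting step concrete, here is a minimal sketch on a hypothetical four-row edge list (the names `toy_df`, `toy_g` and `toy_s` are made up for illustration). Each row gets weight 1, and `simplify()` collapses repeated pairs while summing the weights.

```r
library(igraph)

# Hypothetical toy edge list: A and B meet three times, B and C once.
toy_df <- data.frame(
  hero1 = c("A", "B", "A", "B"),
  hero2 = c("B", "A", "B", "C")
)

toy_g <- graph_from_data_frame(toy_df, directed = FALSE)
E(toy_g)$weight <- 1  # one unit per recorded encounter

# Merge parallel edges and sum their weights; other edge attributes are dropped.
toy_s <- simplify(toy_g, remove.loops = TRUE, remove.multiple = TRUE,
                  edge.attr.comb = list(weight = "sum", "ignore"))

igraph::as_data_frame(toy_s)  # A-B with weight 3, B-C with weight 1
```

Because the graph is undirected, the rows A-B and B-A count as the same relationship.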
I visualized the graph in Gephi, using a color scale based on link weight, from blue (lower weight) to red (higher weight).
In R, I transformed the graph into a data frame with the as_data_frame() function (again from the igraph package). I found that the strongest link between two heroes is between "PATRIOT/JEFF MACE" and "MISS AMERICA/MADELIN", with a weight of 1894.
a <- igraph::as_data_frame(hero_n)
a$from[order(a$weight, decreasing = T)][1]
## [1] "PATRIOT/JEFF MACE"
a$to[order(a$weight, decreasing = T)][1]
## [1] "MISS AMERICA/MADELIN"
a$weight[order(a$weight, decreasing = T)][1]
## [1] 1894
With some igraph functions, I checked whether the graph had picked up my weights and whether it had been simplified by the simplify() function:
is.weighted(hero_n)
## [1] TRUE
is.simple(hero_n)
## [1] TRUE
With other functions, I computed the density and the average distance. Density is the number of links in the graph divided by the total number of links that could exist between its nodes. The average distance of a graph is the mean of the shortest-path lengths between all pairs of connected nodes, that is, the average of the finite off-diagonal entries of its graph distance matrix.
dens <- edge_density(hero_n)
dens
## [1] 0.008099731
dist <- mean_distance(hero_n, directed = FALSE)
dist
## [1] 2.638427
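Both quantities can be recomputed by hand on a small example. The sketch below uses a hypothetical 4-node path graph (not the Marvel network) and compares the manual formulas with `edge_density()` and `mean_distance()`.

```r
library(igraph)

# Hypothetical toy graph: a path 1 - 2 - 3 - 4.
toy_g <- make_graph(edges = c(1, 2, 2, 3, 3, 4), directed = FALSE)

n <- gorder(toy_g)  # number of nodes (4)
m <- gsize(toy_g)   # number of links (3)

# Density: existing links over possible links, n * (n - 1) / 2.
manual_density <- 2 * m / (n * (n - 1))

# Average distance: mean of the shortest-path lengths over all node pairs.
d <- distances(toy_g)
manual_mean_dist <- mean(d[upper.tri(d)])

c(manual_density, edge_density(toy_g))                        # both 0.5
c(manual_mean_dist, mean_distance(toy_g, directed = FALSE))   # both 10/6
```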
In graph theory, the Erdős-Rényi model is one of the standard models for generating random graphs. Here it is used to generate random graphs with the same number of nodes and the same density as the observed network, which serve as a baseline for comparing results.
n<-20
gl <- vector('list', n)
for(i in 1:n){
gl[[i]] <- erdos.renyi.game(n = gorder(hero_n), p.or.m = dens, type = "gnp")
}
I compare the observed mean distance with the distribution of the mean distances of the random graphs:
gl.apl <- lapply(gl, mean_distance, directed = FALSE)
gl.aplu <- unlist(gl.apl)
hist(gl.aplu)
abline(v = dist, lty = 3, lwd=2)
This part computes the fraction of random graphs whose average distance is shorter than the observed one. If most of the random graphs have an average distance greater than the observed one, the observed distance is shorter than expected by chance; otherwise, it is longer.
sum(gl.aplu < dist)/n
## [1] 0
From this result I conclude that the distance I observed is shorter than expected for a random graph of the same size and density.
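The same comparison can be tried end to end on a small graph. This is only a sketch under stated assumptions: the "observed" network is a hypothetical ring lattice built with `sample_smallworld()`, the seed is arbitrary, and `sample_gnp()` is used as the G(n, p) generator in place of the older name `erdos.renyi.game()`.

```r
library(igraph)

set.seed(1)  # arbitrary seed so the sketch is repeatable

# Hypothetical stand-in for the observed network: a ring lattice in which
# every node is linked to its 3 nearest neighbours on each side.
obs_g <- sample_smallworld(dim = 1, size = 60, nei = 3, p = 0)
obs_dist <- mean_distance(obs_g, directed = FALSE)

# Random graphs with the same number of nodes and the same density.
n_sim <- 20
rand_graphs <- lapply(seq_len(n_sim), function(i) {
  sample_gnp(n = gorder(obs_g), p = edge_density(obs_g))
})
rand_dist <- vapply(rand_graphs, mean_distance, numeric(1), directed = FALSE)

# Share of random graphs whose mean distance is below the observed one.
frac_shorter <- mean(rand_dist < obs_dist)
frac_shorter
```

A share close to 0 says the observed graph has shorter paths than the random baseline, and a share close to 1 says the opposite.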
Having finished the study of links, I move on to the study of nodes. First, I plotted a histogram of vertex strength to see how many relevant nodes there were. A few of them had a strong impact; most did not.
hist(graph.strength(hero_n),col="blue",xlab="VertexStrength",ylab="Frequency",main="", breaks = 1000, xlim=c(0, 1000))
After this "general" analysis, I go into more detail. Nodes can be considered relevant in several respects: by degree, betweenness, eigen centrality and closeness. I computed the different values in R, ranked the top 5 most important nodes (heroes) for each measure, and produced the corresponding graphs in Gephi.
Degree (degree centrality) is the simplest measure of centrality to calculate. A node's degree is simply a count of how many social connections (links) it has. Degree centrality for a node is simply its degree. A node with 10 social connections would have a degree centrality of 10. A node with 1 arc would have a degree centrality of 1.
In this case the most relevant hero is Captain America, with a degree of 1906.
all_deg <- degree(hero_n, mode = c("all"))
all_deg[order(all_deg, decreasing = T)][1:5]
## CAPTAIN AMERICA SPIDER-MAN/PETER PAR IRON MAN/TONY STARK
## 1906 1737 1522
## THING/BENJAMIN J. GR MR. FANTASTIC/REED R
## 1416 1379
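Since the network is weighted, it helps to keep degree apart from the vertex strength used in the histogram earlier. In the hypothetical toy graph below, `degree()` counts distinct neighbours, while `strength()` sums the weights of the incident links.

```r
library(igraph)

# Hypothetical toy graph: node 1 is linked to nodes 2, 3 and 4.
toy_g <- make_graph(edges = c(1, 2, 1, 3, 1, 4), directed = FALSE)
E(toy_g)$weight <- c(5, 1, 1)  # nodes 1 and 2 met five times

degree(toy_g)    # 3 1 1 1: number of distinct neighbours
strength(toy_g)  # 7 5 1 1: sum of the incident link weights
```

Node 2 has the same degree as nodes 3 and 4 but a much higher strength, because its single link is heavy.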
Betweenness measures how much influence a node has on the flow of information in a graph: it counts how many shortest paths between pairs of other nodes pass through that node.
betw <- betweenness(hero_n, directed = F)
betw[order(betw, decreasing = T)][1:5]
## SPIDER-MAN/PETER PAR CAPTAIN AMERICA IRON MAN/TONY STARK
## 768886.0 639944.8 530577.5
## HAVOK/ALEX SUMMERS WOLVERINE/LOGAN
## 517604.5 481495.9
Once again, Spider-Man stands out, with a betweenness of 768886.0.
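A small example shows why betweenness can rank nodes differently from degree. In this hypothetical graph, two triangles are joined through a single bridge node, which has only two links but lies on every shortest path between the two sides.

```r
library(igraph)

# Hypothetical toy graph: triangles {1, 2, 3} and {5, 6, 7}, joined via node 4.
toy_g <- make_graph(
  edges = c(1, 2, 2, 3, 1, 3,   # first triangle
            5, 6, 6, 7, 5, 7,   # second triangle
            3, 4, 4, 5),        # bridge through node 4
  directed = FALSE
)

degree(toy_g)                         # 2 2 3 2 3 2 2
betweenness(toy_g, directed = FALSE)  # 0 0 8 9 8 0 0
```

Node 4 has the lowest degree in the graph and the highest betweenness, since all 9 pairs with one node on each side must pass through it.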
Eigen centrality is a measure of the influence of a node in a network. Relative scores are assigned to all nodes in the network based on the concept that connections to high-scoring nodes contribute more to that node's score than equal connections to low-scoring nodes. A high eigenvector score means that a node is connected to many nodes that themselves have high scores.
g.ec <- eigen_centrality(hero_n)
g.ec$vector[order(g.ec$vector, decreasing = T)][1:5]
## CAPTAIN AMERICA THING/BENJAMIN J. GR HUMAN TORCH/JOHNNY S
## 1.0000000 0.8527414 0.8378977
## MR. FANTASTIC/REED R IRON MAN/TONY STARK
## 0.8232495
In this case Captain America ranks first, with the maximum score of 1.0000000.
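The idea that "who you are connected to" matters can be seen on a hypothetical toy graph in which two nodes have the same degree but different eigen centrality, because only one of them is attached to the hub.

```r
library(igraph)

# Hypothetical toy graph: node 1 is a hub linked to 2, 3, 4 and 5;
# a chain 5 - 6 - 7 hangs off node 5.
toy_g <- make_graph(
  edges = c(1, 2, 1, 3, 1, 4, 1, 5, 5, 6, 6, 7),
  directed = FALSE
)

ec <- eigen_centrality(toy_g)$vector

degree(toy_g)[c(5, 6)]  # both nodes have degree 2
ec[c(5, 6)]             # node 5 scores higher: one of its neighbours is the hub
which.max(ec)           # the hub (node 1) gets the maximum score of 1
```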
Closeness assigns a score to each node based on their "closeness" to all other nodes in the network. This measure calculates the shortest paths among all nodes, then assigns each node a score based on the sum of the shortest paths and is used to find the nodes that are in the best position to affect the entire network most quickly.
Because the graph is not connected, I computed closeness on its largest connected component. In this case, unexpectedly, after several first places for Captain America and Spider-Man in the other measures, Living Mummy ranks first in closeness, with a score of 6.334727e-05.
igraph::is.connected(hero_n)
## [1] FALSE
hero_cc <- components(hero_n,mode = 'strong')
BigComp <- which.max(hero_cc$csize)
Main_heron <- induced_subgraph(hero_n, which(hero_cc$membership == BigComp))
cl.n <- closeness(Main_heron)
cl.n[order(cl.n, decreasing = T)][1:5]
## LIVING MUMMY IRON MAN/TONY STARK AJAK/TECUMOTZIN [ETE
## 6.334727e-05 6.321512e-05 6.303978e-05
## HUMAN TORCH/JOHNNY S THING/BENJAMIN J. GR
## 6.290891e-05 6.279435e-05
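The very small scores follow from how raw closeness is defined: it is the reciprocal of the sum of a node's shortest-path distances to all other nodes, so it shrinks as the network grows. A minimal sketch on a hypothetical 3-node path shows the raw values and the `normalized = TRUE` variant, which multiplies them by the number of other nodes.

```r
library(igraph)

# Hypothetical toy graph: a path 1 - 2 - 3.
toy_g <- make_graph(edges = c(1, 2, 2, 3), directed = FALSE)

# Raw closeness: 1 / (sum of distances to the other nodes).
closeness(toy_g)                     # 1/3, 1/2, 1/3

# Normalized closeness: raw value times (n - 1).
closeness(toy_g, normalized = TRUE)  # 2/3, 1, 2/3
```

The ranking is the same in both versions; only the scale changes.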
To conclude the social network analysis, I looked for "communities" within it, that is, groups of nodes that are more closely connected with each other than with the rest of the network.
In R I used the walktrap.community() function. It is an approach based on random walks. The general idea is that if random walks are performed on the graph, the walks are more likely to stay within the same community because there are only a few edges leading outside a given community. walktrap.community() performs short random walks of 3-4-5 steps (depending on the parameters set) and uses the results of these random walks to merge separate communities in a bottom-up approach. It is somewhat slower than the fastgreedy.community() approach but more accurate (according to the original publication).
I set the length of the random walks to 5 steps. With which.max(sizes(net_comm)) I find the largest community that has been created within the network.
net_comm<-walktrap.community(hero_n, steps = 5)
sizes(net_comm)
## Community sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 976 2141 51 90 111 118 9 4 141 14 23 28 9 4 535 11
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 219 119 11 50 44 53 113 191 17 42 18 9 2 16 130 28
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 13 20 13 51 4 9 5 6 4 20 9 10 5 28 6 24
## 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
## 2 20 19 3 15 10 5 15 37 2 25 13 5 4 5 7
## 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 17 4 5 5 8 14 5 3 2 10 3 3 23 5 8 5
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96
## 3 4 5 8 5 13 10 7 5 6 5 3 3 22 5 4
## 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112
## 8 9 10 13 7 5 4 5 5 2 7 5 4 3 8 4
## 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128
## 5 9 7 8 3 9 3 4 2 4 2 2 2 2 3 3
## 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## 6 3 4 2 8 4 4 3 6 3 4 6 4 5 3 3
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 5 4 2 10 6 2 7 3 4 2 5 3 3 2 5 3
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
## 4 3 5 4 7 4 8 4 7 8 9 6 4 6 3 2
## 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192
## 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1
## 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 241
## 1
net_c <- set_vertex_attr(hero_n, "community", value = membership(net_comm))
which.max(sizes(net_comm))
## 2
## 2
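To see the method on something small enough to inspect by eye, the sketch below runs it on a hypothetical graph made of two 5-node cliques joined by a single link. It uses `cluster_walktrap()`, the current igraph name for `walktrap.community()`.

```r
library(igraph)

# Hypothetical toy graph: two fully connected groups of 5 nodes,
# joined by one bridge link between node 1 and node 6.
toy_g <- disjoint_union(make_full_graph(5), make_full_graph(5))
toy_g <- add_edges(toy_g, c(1, 6))

toy_comm <- cluster_walktrap(toy_g, steps = 5)

sizes(toy_comm)             # number of nodes in each detected community
membership(toy_comm)        # community id of every node
which.max(sizes(toy_comm))  # id of the largest community
```

Short random walks started inside one clique rarely cross the single bridge, which is the property the algorithm relies on when it merges nodes into communities.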
I then visualized it in Gephi as well:
With these commands, I then built a "community graph", in which each node in the plot corresponds to a whole community, labeled with its number (as shown below).
c_f<-as.factor(membership(net_comm))
hero_n_c<- contract.vertices(hero_n, membership(net_comm))
hero_n_c<-simplify(hero_n_c)
lab<-sort(unique(membership(net_comm)))
plot(hero_n_c, vertex.label=lab)
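The contraction step is easier to follow on a small case. In this sketch the graph and the membership vector `toy_memb` are hypothetical, and `contract()` is the current igraph name for `contract.vertices()`. As an optional variation, a unit weight is summed during `simplify()` so that the resulting link records how many original links connected the two communities.

```r
library(igraph)

# Hypothetical toy graph: triangles {1, 2, 3} and {4, 5, 6},
# connected by two links (3 - 4 and 1 - 6).
toy_g <- make_graph(
  edges = c(1, 2, 2, 3, 1, 3,
            4, 5, 5, 6, 4, 6,
            3, 4, 1, 6),
  directed = FALSE
)
E(toy_g)$weight <- 1

# Hypothetical membership: the first triangle is community 1, the second is 2.
toy_memb <- c(1, 1, 1, 2, 2, 2)

# Merge all nodes of a community into one node, then drop the loops and
# combine the parallel links, summing their weights.
toy_c <- contract(toy_g, mapping = toy_memb)
toy_c <- simplify(toy_c, edge.attr.comb = list(weight = "sum", "ignore"))

gorder(toy_c)    # 2 nodes, one per community
gsize(toy_c)     # 1 link between them
E(toy_c)$weight  # 2: two original links crossed between the communities
```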
Among the different communities, I decided to analyze community number 118, to see how many nodes (heroes) it contains and how many links (relationships) there are between them.
m<-118
sum(vertex_attr(net_c)$community==m)
## [1] 9
subam <- subgraph.edges(graph=net_c, eids=which(vertex_attr(net_c)$community==m), delete.vertices = TRUE)
sum(vertex_attr(net_c)$community==m)==gsize(subam)
## [1] TRUE
plot(subam, vertex.label.color = "black",edge.color = 'black', layout= layout.fruchterman.reingold(subam))
V(subam)
## + 10/10 vertices, named, from 267ba20:
## [1] IRON MAN IV/JAMES R. CAPRICORN/WILLARD WE LIBRA/GUSTAV BRANDT
## [4] SAGITTARIUS/HARLAN V CANCER/JACK KLEVENO GEMINI/JOSHUA LINK
## [7] PISCES/NOAH PERRICON SCORPIO II ARIES II/GROVER RAYM
## [10] VIRGO/ELAINE MCLAUGH
E(subam)
## + 9/9 edges from 267ba20 (vertex names):
## [1] IRON MAN IV/JAMES R.--CAPRICORN/WILLARD WE
## [2] IRON MAN IV/JAMES R.--LIBRA/GUSTAV BRANDT
## [3] IRON MAN IV/JAMES R.--SAGITTARIUS/HARLAN V
## [4] IRON MAN IV/JAMES R.--CANCER/JACK KLEVENO
## [5] IRON MAN IV/JAMES R.--GEMINI/JOSHUA LINK
## [6] IRON MAN IV/JAMES R.--PISCES/NOAH PERRICON
## [7] IRON MAN IV/JAMES R.--SCORPIO II
## [8] IRON MAN IV/JAMES R.--ARIES II/GROVER RAYM
## [9] IRON MAN IV/JAMES R.--VIRGO/ELAINE MCLAUGH
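Note that the eids argument of subgraph.edges() is read as edge ids, so passing the positions of the community's vertices selects edges by index. A vertex-based alternative is `induced_subgraph()`, which keeps exactly the given nodes and the links among them. The sketch below is only an illustration on a hypothetical toy graph with a made-up `community` attribute, not a rerun of the analysis above.

```r
library(igraph)

# Hypothetical toy graph: a triangle {1, 2, 3} plus a tail 3 - 4 - 5.
toy_g <- make_graph(
  edges = c(1, 2, 2, 3, 1, 3, 3, 4, 4, 5),
  directed = FALSE
)
V(toy_g)$community <- c(1, 1, 1, 2, 2)  # made-up community labels

m <- 1
members <- which(V(toy_g)$community == m)

# Keep the members of community m and only the links between them.
toy_sub <- induced_subgraph(toy_g, vids = members)

gorder(toy_sub)  # 3 nodes: the members of community 1
gsize(toy_sub)   # 3 links: the triangle among them
```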
To get a better visualization, I built a data frame with the encounters among these heroes and saved it as a .csv file, so that I could also analyze this community in the Gephi viewer.
eroi <- c("IRON MAN IV/JAMES R.", "CAPRICORN/WILLARD WE", "LIBRA/GUSTAV BRANDT",
"SAGITTARIUS/HARLAN V", "CANCER/JACK KLEVENO", "GEMINI/JOSHUA LINK",
"PISCES/NOAH PERRICON", "SCORPIO II", "ARIES II/GROVER RAYM", "VIRGO/ELAINE MCLAUGH")
a <- hero_df[hero_df$hero1 %in% eroi & hero_df$hero2 %in% eroi,]
write.csv(a, file = "comm118.csv", row.names = FALSE)