diff --git a/utils/joss_paper/images/tt_1.png b/utils/joss_paper/images/tt_1.png index 83056b2..0abbda6 100644 Binary files a/utils/joss_paper/images/tt_1.png and b/utils/joss_paper/images/tt_1.png differ diff --git a/utils/joss_paper/images/tt_2.png b/utils/joss_paper/images/tt_2.png index 6cfd3d1..e884852 100644 Binary files a/utils/joss_paper/images/tt_2.png and b/utils/joss_paper/images/tt_2.png differ diff --git a/utils/joss_paper/images/tt_3.png b/utils/joss_paper/images/tt_3.png deleted file mode 100644 index 3fa4aef..0000000 Binary files a/utils/joss_paper/images/tt_3.png and /dev/null differ diff --git a/utils/joss_paper/images/tt_4.png b/utils/joss_paper/images/tt_4.png deleted file mode 100644 index a7dd8d5..0000000 Binary files a/utils/joss_paper/images/tt_4.png and /dev/null differ diff --git a/utils/joss_paper/images/tt_5.png b/utils/joss_paper/images/tt_5.png deleted file mode 100644 index 2762f10..0000000 Binary files a/utils/joss_paper/images/tt_5.png and /dev/null differ diff --git a/utils/joss_paper/paper.Rmd b/utils/joss_paper/paper.Rmd index 900e19d..85f1124 100644 --- a/utils/joss_paper/paper.Rmd +++ b/utils/joss_paper/paper.Rmd @@ -38,11 +38,13 @@ options(tinytex.verbose = TRUE) # Summary -{ig.degree.betweenness} is an R package which allows users to implement the "Smith-Pittman" community detection algorithm on networks and sociograms constructed and/or loaded with the {igraph} package. {ig.degree.betweenness} also offers utility functions which enable neater plotting of densely connected networks with high number of edges and a low number of nodes and the relevant preparation of unlabeled graphs for the Smith-Pittman algorithm's present implementation in the R programming language. There presently do not exist other implementations of this algorithm which are ready to use which are compatible in the {igraph} ecosystem. 
As a result, this contribution is welcome by {igraph} users interested in exploring and applying the Smith-Pittman algorithm in SNA settings.
+{ig.degree.betweenness} is an R package that enables users to apply the "Smith-Pittman" community detection algorithm to networks and sociograms constructed or loaded with the {igraph} package. {ig.degree.betweenness} also provides utility functions for neater plotting of densely connected networks and for preparing unlabeled graphs for its present implementation of the Smith-Pittman algorithm in the R programming language. Since this algorithm is relatively new, there are presently no other ready-to-use implementations of it that are compatible with the {igraph} ecosystem. As a result, this contribution should be welcome to {igraph} users interested in exploring and applying the Smith-Pittman algorithm in social network analysis (SNA) settings.
 # Statement of Need
-{igraph} [@igraph_article] offers a suite functions and tools for interacting with graph data and engaging in social network analysis (SNA). A major area of study in SNA is the identification node clusters through methods referred to as "community detection algorithms" [@rostami2023community]. {igraph} allows users to employ a variety of popular community detection algorithms, including Girvan-Newman^[https://r.igraph.org/reference/cluster_edge_betweenness.html] [@Girvan_Newman_2002], Louvain^[https://r.igraph.org/reference/cluster_louvain.html] [@louvain_paper] and others^[For the full list of available community detection algorithms in the {igraph} R package, see the {igraph} reference manual: https://r.igraph.org/reference/index.html#community-detection].
In densely connected complex networks it has been noted by Smith, Pittman and Xu [@sp_paper] that considering the number of connections possessed by each individual node in a given network (degree centrality) along with edge-betweeness (as done by [@Girvan_Newman_2002]) offers an approach for identifying clusters which are more descriptive in certain settings. {ig.degree.betweenness} is an R package that contains a ready-to-use implementation of the Smith-Pittman community detection algorithm.
+{igraph} [@igraph_article] offers a suite of functions and tools for interacting with graph data and engaging in SNA. A major area of study and application in SNA is the identification of node clusters through methods broadly referred to as "community detection algorithms". There is no specific model that describes exactly what a "community" is. Generally, community detection algorithms employ specific optimization strategies to partition a large-scale complex network into a set of disjoint and compact subgroups, often (but not always) without prior knowledge regarding the number of subgroups and their sizes [@rostami2023community].
+
+{igraph} supports a range of popular community detection algorithms, including Girvan-Newman^[https://r.igraph.org/reference/cluster_edge_betweenness.html] [@Girvan_Newman_2002], Louvain^[https://r.igraph.org/reference/cluster_louvain.html] [@louvain_paper] and others^[For the full list of available community detection algorithms in the {igraph} R package, see the {igraph} reference manual: https://r.igraph.org/reference/index.html#community-detection]. For densely connected, complex networks, research by Smith, Pittman and Xu [@sp_paper] suggests that combining node degree (degree centrality) with edge-betweenness (as used by Girvan and Newman [@Girvan_Newman_2002]) can enhance cluster identification in certain contexts.
The {ig.degree.betweenness} package offers {igraph} users a ready-to-use implementation of the Smith-Pittman community detection algorithm in R [@base2022]. # The Smith-Pittman Algorithm @@ -52,7 +54,7 @@ The steps for the algorithm are: 1. Identify the node with the highest degree-centrality in the network. -2. Select the subgraph of the node with the highest degree centrality. Remove the edge possessing the highest calculated in the subgraph. +2. Select the subgraph of the node with the highest degree centrality. Remove the edge possessing the highest calculated (network-wide) edge-betweenness in the subgraph. 3. Recalculate the degree centrality for all nodes in the network and the betweenness for the remaining edge in the network, @@ -69,231 +71,154 @@ Conceptually, this algorithm (similar to Girvan-Newman and Louvain) can be speci The dataset commonly referred to as "Zachary's karate club network" [@zachary1977information] is a social network between members of a university club led by president John A. and karate instructor Mr. Hi (pseudonyms). At the beginning of the study there was an initial conflict between the club president, John A., and Mr. Hi over the price of karate lessons. As time passed, the entire club became divided over this issue. After a series of increasingly sharp factional confrontations over the price of lessons, the officers of the club, led by John A., fired Mr. Hi. The supporters of Mr. Hi retaliated by resigning and forming a new organization headed by Mr. Hi. Figure 2 shows the karate club network where the nodes signify individuals in the club and the edges signifies the existence of a relationship between two members. The node color indicates which group the members associated with post-split. -Since the division of the club and its members is known, this social network is a classic example dataset used and studied. 
In the context of community detection, the object of interest is seeing if the split could be identified based on the relationships between members. When applied in an unsupervised setting, the Girvan-Newman and Louvain algorthims identify communities of nodes which optimize modularity according to their approaches. However, the communities identified do not appear to identify a possible division in the group which is contextually informative or interpretative. The Smith-Pittman algorithm identifies 3 communities which could can be understood as individuals who would certainly associate with John A. or Mr. Hi and an uncertain group. Figure 3 shows the comparison between the three algorithms.
+Since the division of the club and its members is known, this social network is a classic, widely studied example dataset. The data is available in the {igraphdata} package [@igraphdatapackage]. In the context of community detection, the question of interest is whether the split can be identified from the relationships between members. When applied in an unsupervised setting, the Girvan-Newman and Louvain algorithms identify communities of nodes which optimize modularity according to their respective approaches. However, the communities identified do not appear to reveal a division in the group that is contextually informative or interpretable. The Smith-Pittman algorithm identifies three communities, which can be understood as the individuals who would certainly associate with John A., those who would certainly associate with Mr. Hi, and an uncertain group. Figure 3 shows the comparison between the three algorithms.
-![The Zachary karate club network with the true split between members defined by node colors. John A. and Mr.
Hi are denoted by 'J' and 'H', with other members being listed as numbers](./images/karate_network.png){width=60%}
+The code for reproducing Figures 2 and 3 is:
-![Unsupervised Community Detection by (a) Girvan-Newman, (b) Louvain and (c) Smith-Pittman for the karate network.](./images/algorithm_comparison_karate.png)
+```{r eval=FALSE}
+# Install relevant packages
+# install.packages(c("igraph","igraphdata","ig.degree.betweenness"))
+library(igraphdata)
+# Attach the Karate Club dataset
+# Data from {igraphdata}
+data("karate")
+# Plot the initial network (Figure 2)
+plot(karate)
+# Girvan-Newman Clustering (Figure 3 (a))
+# Function from {igraph}
+gn_karate <- igraph::cluster_edge_betweenness(karate)
-## TidyTuesday - "Monster Movies" Dataset
+# Louvain Clustering (Figure 3 (b))
+# Function from {igraph}
+louvain_karate <- igraph::cluster_louvain(karate)
+
+# Smith-Pittman Clustering (Figure 3 (c))
+# Function from {ig.degree.betweenness}
+sp_karate <- ig.degree.betweenness::cluster_degree_betweenness(karate)
+
+# Plot the 3 plots next to each other
+par(mfrow= c(1,3),mar=c(0,0,0,0)+1)
+
+plot(gn_karate, karate, main = "(a)")
+
+plot(louvain_karate, karate, main = "(b)")
+
+plot(sp_karate, karate, main = "(c)")
+```
+
+![The Zachary karate club network with the true split between members defined by node colors. John A. and Mr. Hi are denoted by 'J' and 'H', with other members being listed as numbers](./images/karate_network.png){width=70%}
-
-The first visual is the constructed network. The other visuals are clusters based on the Girvan Newman (Edge Betweenness), Louvain (Direct Modularity Maximization) and Smith(thats me!)-Pittman (Node Degree + Edge Betweenness).
+![Unsupervised Community Detection by (a) Girvan-Newman, (b) Louvain and (c) Smith-Pittman for the karate network.](./images/algorithm_comparison_karate.png)
-Girvan Newman doesn't tell any story (clustering everything in one group isn't much of a story).
Louvain might be telling us something in terms of strength of clustering but doesn't necessarily speak about the reality of "monster" movie genre interactions. Smith-Pittman clustering tells the best story (albeit biased) with popular genres forming the primary working group followed by more ambivalent smaller subgroups and outlier nodes.
+## TidyTuesday - "Monster Movies" Dataset
-This aligns with the degree (popularity) distribution (the bar graph) of the nodes as well (which is what our working paper asserts as well for certain contexts).
+The "Monster Movies" dataset, made available by the TidyTuesday project [@Rfordatascience], presents an interesting example for applying SNA and the Smith-Pittman algorithm to interactions between genres in movies with "monster" in the title. Figure 4 shows the plotted "simplified" network, with node sizes corresponding to node degree (i.e. the number of connections a given genre shares with other genres) and with edge thickness and annotated numbers corresponding to the number of edges shared between the listed genres. Figure 5 shows the genre clusters in the network as identified by Girvan-Newman, Louvain and Smith-Pittman.
-```{r fig.cap='Monster Movie genre network. Node size corresponds to the node degree and edge thickness and numbers corespond the number of connections shared between generes from the dataset.', echo=FALSE, message=FALSE, warning=FALSE}
-library(tidyverse)
-library(tidygraph)
-library(igraph)
-library(ig.degree.betweenness)
+Girvan-Newman does not tell a meaningful story, since it clusters every genre into a single group. Louvain may say something about the strength of clustering, but does not necessarily reflect the reality of "monster" movie genre interactions. Smith-Pittman clustering gives the most interpretable result, with popular genres forming the primary group, followed by smaller, more ambivalent subgroups and outlier nodes.
-monster_movie_genres <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2024/2024-10-29/monster_movie_genres.csv') +The R code for doing this is the following: +```{r eval=FALSE} +# Install relevant libraries +# pkgs <- c("dplyr","tibble","tidyr","tidygraph", +# "igraph","ig.degree.betweenness") +# install.packages(pkgs) +# Load data +tuesdata <- tidytuesdayR::tt_load(2024, week = 44) +monster_movie_genres <- tuesdata$monster_movie_genres -dummy_matrix<- monster_movie_genres |> - dplyr::mutate( - value=1 - )|> +# Prepare data for adjacency matrix +dummy_matrix <- monster_movie_genres |> + dplyr::mutate(value = 1) |> tidyr::pivot_wider( id_cols = tconst, - names_from=genres, + names_from = genres, values_from = value, - values_fill=0 - )|> - dplyr::select(-tconst)|> + values_fill = 0 + ) |> + dplyr::select(-tconst) |> as.matrix.data.frame() - -genre_adj_matrix <-t(dummy_matrix) %*% dummy_matrix -# Setting self referrals to zero +# Construct adjacency matrix +genre_adj_matrix <- t(dummy_matrix) %*% dummy_matrix +# Setting self loops to zero diag(genre_adj_matrix) <- 0 - -movie_graph<- genre_adj_matrix|> - as.data.frame()|> - tibble::rownames_to_column()|> - tidyr::pivot_longer(cols=Comedy:War)|> - dplyr::rename( - c( - from=rowname, - to = name - ) - )|> - dplyr::rowwise()|> - dplyr::mutate( - combo = paste0(sort(c(from, to)), collapse = ",") - )|> - dplyr::arrange(combo)|> - dplyr::distinct(combo,.keep_all = TRUE)|> - dplyr::select(-combo)|> - tidyr::uncount(value)|> +# Construct the graph +movie_graph <- genre_adj_matrix |> + as.data.frame() |> + tibble::rownames_to_column() |> + tidyr::pivot_longer(cols = Comedy:War) |> + dplyr::rename(c(from = rowname, to = name)) |> + dplyr::rowwise() |> + dplyr::mutate(combo = paste0(sort(c(from, to)), collapse = ",")) |> + dplyr::arrange(combo) |> + dplyr::distinct(combo, .keep_all = TRUE) |> + dplyr::select(-combo) |> + tidyr::uncount(value) |> tidygraph::as_tbl_graph() +# Resize 
nodes based on node degree -# Resize nodes based on degree - -VS <- igraph::degree(movie_graph)*0.1 - -par(mar=c(0,0,0,0)+0.1) +VS <- igraph::degree(movie_graph) * 0.1 +# Figure 4 ig.degree.betweenness::plot_simplified_edgeplot( movie_graph, - vertex.size=VS, - edge.arrow.size= 0.001 + vertex.size = VS, + edge.arrow.size = 0.001 ) - -``` - -```{r echo=FALSE, message=FALSE, warning=FALSE, fig.cap='Girvan-Newman communities. Unable to identify any communities.'} # Cluster Nodes - -gn_cluster <- movie_graph|> - igraph::as.undirected()|> +gn_cluster <- movie_graph |> + igraph::as.undirected() |> igraph::cluster_edge_betweenness() -louvain_cluster <- movie_graph|> - igraph::as.undirected()|> +louvain_cluster <- movie_graph |> + igraph::as.undirected() |> igraph::cluster_louvain() -sp_cluster <- movie_graph|> - igraph::as.undirected()|> +sp_cluster <- movie_graph |> + igraph::as.undirected() |> ig.degree.betweenness::cluster_degree_betweenness() -# Visualize -par(mar=c(0,0,0,0)+0.1) + +# Figure 5 +par(mfrow = c(1, 3), mar = c(0, 0, 0, 0) + 1) ig.degree.betweenness::plot_simplified_edgeplot( movie_graph, gn_cluster, - vertex.size=VS, - edge.arrow.size=0.001 + main = "(a)", + vertex.size = VS, + edge.arrow.size = 0.001 ) -``` - -```{r echo=FALSE, message=FALSE, warning=FALSE, fig.cap="Louvain communities. Communities identified, but do not provide descriptive value."} -par(mar=c(0,0,0,0)+0.1) - ig.degree.betweenness::plot_simplified_edgeplot( movie_graph, louvain_cluster, - main = "Louvain", - vertex.size=VS, - edge.arrow.size=0.001 + main = "(b)", + vertex.size = VS, + edge.arrow.size = 0.001 ) -``` - -```{r echo=FALSE, message=FALSE, warning=FALSE, fig.cap="Smith-Pittman communities. 
Communities identified provide descriptive value based on popular genres followed by a less popular genres, followed by isolated genres with litte to no interaction with other genres in the network."} -par(mar=c(0,0,0,0)+0.1) - ig.degree.betweenness::plot_simplified_edgeplot( - movie_graph, - sp_cluster, - main = "Smith-Pittman", - vertex.size=VS, - edge.arrow.size=0.001 + movie_graph, + sp_cluster, + main = "(c)", + vertex.size = VS, + edge.arrow.size = 0.001 ) ``` -## Other Utility Functions - - -### Preparing Unlabeled Graphs - -```{r simulated_network} +![Monster movie genre network. Node size corresponds to node degree; edge thickness and edge labels correspond to the number of connections shared between genres in movies with "monster" in the title.](./images/tt_1.png) -# Set parameters -# Number of nodes (adjust as needed) -num_nodes <- 15 -# Starting edges for preferential attachment -initial_edges <- 1 +![Communities identified in the monster movie genre network via community detection. (a) is Girvan-Newman, (b) is Louvain and (c) is Smith-Pittman.
Communities are selected based on maximized modularity.](./images/tt_2.png) -# Create a directed, scale-free network using the Barabási-Albert model -g <- igraph::sample_pa(n = num_nodes, m = initial_edges, directed = TRUE) - -# Introduce additional edges to high-degree nodes to accentuate popularity differences -num_extra_edges <- 350 # Additional edges to create more popular nodes -set.seed(123) # For reproducibility - -for (i in 1:num_extra_edges) { - # Sample nodes with probability proportional to their degree (to reinforce popularity) - from <- sample( igraph::V(g), 1, prob = igraph::degree(g, mode = "in") + 1) # +1 to avoid zero probabilities - to <- sample( igraph::V(g), 1) - - # Ensure we don't add the same edge repeatedly unless intended, allowing self-loops - g <- igraph::add_edges(g, c(from, to)) -} - -# Add self-loops to a subset of nodes -num_self_loops <- 5 -for (i in 1:num_self_loops) { - node <- sample( igraph::V(g), 1) - g <- igraph::add_edges(g, c(node, node)) -} - -g -``` - -```{r unnamed_graph} -igraph::vertex.attributes(g)$name -``` - - -```{r} -g_ <- ig.degree.betweenness::prep_unlabeled_graph(g) - -igraph::vertex.attributes(g_)$name -``` - - -### Plotting Simplified Edgeplots - - -```{r} - par(mar=c(0,0,0,0)+.1) - -plot( - g_, - edge.arrow.size = 0.2, - main = "Default Network" - ) -``` - - -```{r} - par(mar=c(0,0,0,0)+.1) - - -ig.degree.betweenness::plot_simplified_edgeplot( - g_, - edge.arrow.size = 0.2, - main = "Simplified Network" - ) -``` - -```{r} -par(mar=c(0,0,0,0)+.1) - -sp_communities <- ig.degree.betweenness::cluster_degree_betweenness(g_) - -plot(sp_communities, - g_, - edge.arrow.size = 0.2, - main = "Default Network") - -ig.degree.betweenness::plot_simplified_edgeplot( - g_, - communities = sp_communities, - main = "Simplified Network") -``` +# Licensing and Availability +{ig.degree.betweenness} is licensed under the MIT license. It is available on CRAN, and can be installed using `install.packages("ig.degree.betweenness")`.
All code is open-source and hosted on +GitHub, and bugs can be reported at https://github.com/benyamindsmith/ig.degree.betweenness/issues/. # References diff --git a/utils/joss_paper/paper.bib b/utils/joss_paper/paper.bib index 4cc7ee1..1a7fbe9 100644 --- a/utils/joss_paper/paper.bib +++ b/utils/joss_paper/paper.bib @@ -88,3 +88,19 @@ @article{ year = {1977}, doi = {10.1086/jar.33.4.3629752} } + +@Manual{ + igraphdatapackage, + title = {igraphdata: A Collection of Network Data Sets for the 'igraph' Package}, + author = {Gabor Csardi}, + year = {2015}, + note = {R package version 1.0.1}, + url = {https://CRAN.R-project.org/package=igraphdata}, + } + +@misc{Rfordatascience, +title={GitHub - rfordatascience/tidytuesday: Official repo for the #tidytuesday project}, +url={https://github.com/rfordatascience/tidytuesday}, +journal={GitHub}, +author={Rfordatascience} +} diff --git a/utils/joss_paper/paper.log b/utils/joss_paper/paper.log index ece93d4..d04df40 100644 --- a/utils/joss_paper/paper.log +++ b/utils/joss_paper/paper.log @@ -1,4 +1,4 @@ -This is XeTeX, Version 3.141592653-2.6-0.999996 (TeX Live 2024) (preloaded format=xelatex 2024.11.4) 5 NOV 2024 23:18 +This is XeTeX, Version 3.141592653-2.6-0.999996 (TeX Live 2024) (preloaded format=xelatex 2024.11.4) 6 NOV 2024 21:34 entering extended mode restricted \write18 enabled. %&-line parsing enabled. @@ -990,20 +990,16 @@ Package fancyhdr Warning: \headheight is too small (62.59596pt): (fancyhdr) \addtolength{\topmargin}{-0.95425pt}. LaTeX Font Info: Font shape `TU/lmss/m/it' in size <8> not available -(Font) Font shape `TU/lmss/m/sl' tried instead on input line 360. +(Font) Font shape `TU/lmss/m/sl' tried instead on input line 359. 
[1 ] -Underfull \hbox (badness 1454) in paragraph at lines 378--378 +Underfull \hbox (badness 1454) in paragraph at lines 384--384 [][][]\TU/lmr/m/n/8 For a more formal definition of modularity, see: [][]$[][][][][] [] [] [] [][] [] [][][][][][][][][] [] [][][] [] [][][][] [] [][][][][][][][][][] [] [] File: ./images/sp_viz2.png Graphic file (type bmp) <./images/sp_viz2.png> -File: ./images/karate_network.png Graphic file (type bmp) -<./images/karate_network.png> -File: ./images/algorithm_comparison_karate.png Graphic file (type bmp) -<./images/algorithm_comparison_karate.png> File: C:/Users/ben29/AppData/Local/R/win-library/4.2/rticles/rmarkdown/templates/joss/resources/JOSS-logo.png Graphic file (type bmp) @@ -1027,16 +1023,12 @@ Package fancyhdr Warning: \headheight is too small (62.59596pt): (fancyhdr) \addtolength{\topmargin}{-0.95425pt}. [3] -File: paper_files/figure-latex/unnamed-chunk-2-1.pdf Graphic file (type pdf) - -File: paper_files/figure-latex/unnamed-chunk-3-1.pdf Graphic file (type pdf) - -File: paper_files/figure-latex/unnamed-chunk-4-1.pdf Graphic file (type pdf) - -File: paper_files/figure-latex/unnamed-chunk-5-1.pdf Graphic file (type pdf) - LaTeX Font Info: Font shape `TU/lmtt/bx/n' in size <10> not available -(Font) Font shape `TU/lmtt/b/n' tried instead on input line 498. +(Font) Font shape `TU/lmtt/b/n' tried instead on input line 432. +File: ./images/karate_network.png Graphic file (type bmp) +<./images/karate_network.png> +File: ./images/algorithm_comparison_karate.png Graphic file (type bmp) +<./images/algorithm_comparison_karate.png> File: C:/Users/ben29/AppData/Local/R/win-library/4.2/rticles/rmarkdown/templates/joss/resources/JOSS-logo.png Graphic file (type bmp) @@ -1060,54 +1052,14 @@ Package fancyhdr Warning: \headheight is too small (62.59596pt): (fancyhdr) \addtolength{\topmargin}{-0.95425pt}. 
[5] - -File: C:/Users/ben29/AppData/Local/R/win-library/4.2/rticles/rmarkdown/templates/joss/resources/JOSS-logo.png Graphic file (type bmp) - - -Package fancyhdr Warning: \headheight is too small (62.59596pt): -(fancyhdr) Make it at least 63.55022pt, for example: -(fancyhdr) \setlength{\headheight}{63.55022pt}. -(fancyhdr) You might also make \topmargin smaller to compensate: -(fancyhdr) \addtolength{\topmargin}{-0.95425pt}. - -[6] -Overfull \hbox (16.91139pt too wide) in paragraph at lines 529--529 -[]\TU/lmtt/m/n/10 ## [1] 2-> 1 3-> 2 4-> 1 5-> 1 6-> 1 7-> 1 8-> 6 9-> 3 10-> 7 11-> 7[] +Underfull \vbox (badness 10000) detected at line 592 [] -Overfull \hbox (16.91139pt too wide) in paragraph at lines 530--530 -[]\TU/lmtt/m/n/10 ## [11] 12-> 1 13-> 7 14-> 1 15-> 1 1->15 7->14 8->10 3-> 6 3-> 5 8->14[] +Underfull \vbox (badness 10000) detected at line 592 [] -Overfull \hbox (16.91139pt too wide) in paragraph at lines 531--531 -[]\TU/lmtt/m/n/10 ## [21] 10-> 9 1->11 7-> 5 7->11 2->12 15-> 9 14-> 3 11->10 7->10 8->14[] - [] - - -Overfull \hbox (16.91139pt too wide) in paragraph at lines 532--532 -[]\TU/lmtt/m/n/10 ## [31] 9-> 4 1-> 1 6-> 7 10->12 1->10 10-> 7 1-> 9 14-> 7 15->12 12-> 7[] - [] - - -Overfull \hbox (16.91139pt too wide) in paragraph at lines 533--533 -[]\TU/lmtt/m/n/10 ## [41] 1->11 7-> 6 11-> 2 10-> 5 1->12 7->13 10-> 1 3->11 10-> 6 14->15[] - [] - - -Overfull \hbox (16.91139pt too wide) in paragraph at lines 534--534 -[]\TU/lmtt/m/n/10 ## [51] 1->15 7->12 14-> 4 7-> 6 6-> 8 1-> 6 8->15 4-> 1 1-> 2 6-> 2[] - [] - - -Overfull \hbox (16.91139pt too wide) in paragraph at lines 535--535 -[]\TU/lmtt/m/n/10 ## [61] 6->13 5-> 6 10-> 3 2-> 4 15-> 9 15->14 4-> 7 12-> 8 13->14 1->15[] - [] - -File: paper_files/figure-latex/unnamed-chunk-7-1.pdf Graphic file (type pdf) - - File: C:/Users/ben29/AppData/Local/R/win-library/4.2/rticles/rmarkdown/templates/joss/resources/JOSS-logo.png Graphic file (type bmp) @@ -1118,9 +1070,16 @@ Package fancyhdr Warning: 
\headheight is too small (62.59596pt): (fancyhdr) You might also make \topmargin smaller to compensate: (fancyhdr) \addtolength{\topmargin}{-0.95425pt}. -[7] -File: paper_files/figure-latex/unnamed-chunk-8-1.pdf Graphic file (type pdf) - +[6] +File: ./images/tt_1.png Graphic file (type bmp) +<./images/tt_1.png> +File: ./images/tt_2.png Graphic file (type bmp) +<./images/tt_2.png> + +Underfull \hbox (badness 1688) in paragraph at lines 612--617 +\TU/lmr/m/n/10 and can be installed using \TU/lmtt/m/n/10 install.packages("ig.degree.betweenness")\TU/lmr/m/n/10 . All + [] + File: C:/Users/ben29/AppData/Local/R/win-library/4.2/rticles/rmarkdown/templates/joss/resources/JOSS-logo.png Graphic file (type bmp) @@ -1132,12 +1091,7 @@ Package fancyhdr Warning: \headheight is too small (62.59596pt): (fancyhdr) You might also make \topmargin smaller to compensate: (fancyhdr) \addtolength{\topmargin}{-0.95425pt}. -[8] -File: paper_files/figure-latex/unnamed-chunk-9-1.pdf Graphic file (type pdf) - -File: paper_files/figure-latex/unnamed-chunk-9-2.pdf Graphic file (type pdf) - - +[7] File: C:/Users/ben29/AppData/Local/R/win-library/4.2/rticles/rmarkdown/templates/joss/resources/JOSS-logo.png Graphic file (type bmp) @@ -1148,7 +1102,7 @@ Package fancyhdr Warning: \headheight is too small (62.59596pt): (fancyhdr) You might also make \topmargin smaller to compensate: (fancyhdr) \addtolength{\topmargin}{-0.95425pt}. -[9] +[8] File: C:/Users/ben29/AppData/Local/R/win-library/4.2/rticles/rmarkdown/templates/joss/resources/JOSS-logo.png Graphic file (type bmp) @@ -1159,24 +1113,24 @@ Package fancyhdr Warning: \headheight is too small (62.59596pt): (fancyhdr) You might also make \topmargin smaller to compensate: (fancyhdr) \addtolength{\topmargin}{-0.95425pt}. -[10] (./paper.aux) +[9] (./paper.aux) *********** LaTeX2e <2024-11-01> L3 programming layer <2024-11-02> *********** Package rerunfilecheck Info: File `paper.out' has not changed. 
-(rerunfilecheck) Checksum: 2FC832B5CC9A66095AC7830099F77D3C;1704. +(rerunfilecheck) Checksum: ECFD01E60CB06C2103435598F29B749C;1282. Package logreq Info: Writing requests to 'paper.run.xml'. \openout1 = `paper.run.xml'. ) Here is how much of TeX's memory you used: - 36785 strings out of 475994 - 758706 string characters out of 5777285 - 1626881 words of memory out of 5000000 - 59022 multiletter control sequences out of 15000+600000 - 564965 words of font info for 87 fonts, out of 8000000 for 9000 + 36715 strings out of 475994 + 755159 string characters out of 5777285 + 1626446 words of memory out of 5000000 + 58960 multiletter control sequences out of 15000+600000 + 564957 words of font info for 86 fonts, out of 8000000 for 9000 14 hyphenation exceptions out of 8191 84i,13n,87p,1002b,850s stack positions out of 10000i,1000n,20000p,200000b,200000s -Output written on paper.pdf (10 pages). +Output written on paper.pdf (9 pages). diff --git a/utils/joss_paper/paper.md b/utils/joss_paper/paper.md index 1a2be5c..226854a 100644 --- a/utils/joss_paper/paper.md +++ b/utils/joss_paper/paper.md @@ -21,7 +21,7 @@ affiliations: index: 1 - name: "UHN's Princess Margaret Cancer Centre" index: 2 -date: "2024-11-05" +date: "2024-11-06" bibliography: paper.bib output: rticles::joss_article @@ -36,11 +36,13 @@ journal: JOSS # Summary -{ig.degree.betweenness} is an R package which allows users to implement the "Smith-Pittman" community detection algorithm on networks and sociograms constructed and/or loaded with the {igraph} package. {ig.degree.betweenness} also offers utility functions which enable neater plotting of densely connected networks with high number of edges and a low number of nodes and the relevant preparation of unlabeled graphs for the Smith-Pittman algorithm's present implementation in the R programming language. There presently do not exist other implementations of this algorithm which are ready to use which are compatible in the {igraph} ecosystem. 
As a result, this contribution is welcome by {igraph} users interested in exploring and applying the Smith-Pittman algorithm in SNA settings.
+{ig.degree.betweenness} is an R package that enables users to apply the "Smith-Pittman" community detection algorithm to networks and sociograms constructed or loaded with the {igraph} package. {ig.degree.betweenness} also provides utility functions for neater plotting of densely connected networks and for preparing unlabeled graphs for its present implementation of the Smith-Pittman algorithm in the R programming language. Since this algorithm is relatively new, there are presently no other ready-to-use implementations of it that are compatible with the {igraph} ecosystem. As a result, this contribution should be welcome to {igraph} users interested in exploring and applying the Smith-Pittman algorithm in social network analysis (SNA) settings.
 # Statement of Need
-{igraph} [@igraph_article] offers a suite functions and tools for interacting with graph data and engaging in social network analysis (SNA). A major area of study in SNA is the identification node clusters through methods referred to as "community detection algorithms" [@rostami2023community]. {igraph} allows users to employ a variety of popular community detection algorithms, including Girvan-Newman^[https://r.igraph.org/reference/cluster_edge_betweenness.html] [@Girvan_Newman_2002], Louvain^[https://r.igraph.org/reference/cluster_louvain.html] [@louvain_paper] and others^[For the full list of available community detection algorithms in the {igraph} R package, see the {igraph} reference manual: https://r.igraph.org/reference/index.html#community-detection].
In densely connected complex networks it has been noted by Smith, Pittman and Xu [@sp_paper] that considering the number of connections possessed by each individual node in a given network (degree centrality) along with edge-betweeness (as done by [@Girvan_Newman_2002]) offers an approach for identifying clusters which are more descriptive in certain settings. {ig.degree.betweenness} is an R package that contains a ready-to-use implementation of the Smith-Pittman community detection algorithm.
+{igraph} [@igraph_article] offers a suite of functions and tools for interacting with graph data and engaging in SNA. A major area of study and application in SNA is the identification of node clusters through methods broadly referred to as "community detection algorithms". There is no specific model that describes exactly what a "community" is. Generally, community detection algorithms employ specific optimization strategies to partition a large-scale complex network into a set of disjoint and compact subgroups, often (but not always) without prior knowledge regarding the number of subgroups and their sizes [@rostami2023community].
+
+{igraph} supports a range of popular community detection algorithms, including Girvan-Newman^[https://r.igraph.org/reference/cluster_edge_betweenness.html] [@Girvan_Newman_2002], Louvain^[https://r.igraph.org/reference/cluster_louvain.html] [@louvain_paper] and others^[For the full list of available community detection algorithms in the {igraph} R package, see the {igraph} reference manual: https://r.igraph.org/reference/index.html#community-detection]. For densely connected, complex networks, research by Smith, Pittman and Xu [@sp_paper] suggests that combining node degree (degree centrality) with edge-betweenness (as used by Girvan and Newman [@Girvan_Newman_2002]) can enhance cluster identification in certain contexts.
The {ig.degree.betweenness} package offers {igraph} users a ready-to-use implementation of the Smith-Pittman community detection algorithm in R [@base2022]. # The Smith-Pittman Algorithm @@ -50,7 +52,7 @@ The steps for the algorithm are: 1. Identify the node with the highest degree-centrality in the network. -2. Select the subgraph of the node with the highest degree centrality. Remove the edge possessing the highest calculated in the subgraph. +2. Select the subgraph of the node with the highest degree centrality. Remove the edge possessing the highest calculated (network-wide) edge-betweenness in the subgraph. 3. Recalculate the degree centrality for all nodes in the network and the betweenness for the remaining edge in the network, @@ -67,160 +69,155 @@ Conceptually, this algorithm (similar to Girvan-Newman and Louvain) can be speci The dataset commonly referred to as "Zachary's karate club network" [@zachary1977information] is a social network between members of a university club led by president John A. and karate instructor Mr. Hi (pseudonyms). At the beginning of the study there was an initial conflict between the club president, John A., and Mr. Hi over the price of karate lessons. As time passed, the entire club became divided over this issue. After a series of increasingly sharp factional confrontations over the price of lessons, the officers of the club, led by John A., fired Mr. Hi. The supporters of Mr. Hi retaliated by resigning and forming a new organization headed by Mr. Hi. Figure 2 shows the karate club network where the nodes signify individuals in the club and the edges signifies the existence of a relationship between two members. The node color indicates which group the members associated with post-split. -Since the division of the club and its members is known, this social network is a classic example dataset used and studied. 
In the context of community detection, the object of interest is seeing if the split could be identified based on the relationships between members. When applied in an unsupervised setting, the Girvan-Newman and Louvain algorthims identify communities of nodes which optimize modularity according to their approaches. However, the communities identified do not appear to identify a possible division in the group which is contextually informative or interpretative. The Smith-Pittman algorithm identifies 3 communities which could can be understood as individuals who would certainly associate with John A. or Mr. Hi and an uncertain group. Figure 3 shows the comparison between the three algorithms. - -![The Zachary karate club network with the true split between members defined by node colors. John A. and Mr. Hi are denoted by 'J' and 'H', with other members being listed as numbers](./images/karate_network.png){width=60%} - -![Unsupervised Community Detection by (a) Girvan-Newman, (b) Louvain and (c) Smith-Pittman for the karate network.](./images/algorithm_comparison_karate.png) - -## TidyTuesday - "Monster Movies" Dataset - - -The first visual is the constructed network. The other visuals are clusters based on the Girvan Newman (Edge Betweenness), Louvain (Direct Modularity Maximization) and Smith(thats me!)-Pittman (Node Degree + Edge Betweenness). - -Girvan Newman doesn't tell any story (clustering everything in one group isn't much of a story). Louvain might be telling us something in terms of strength of clustering but doesn't necessarily speak about the reality of "monster" movie genre interactions. Smith-Pittman clustering tells the best story (albeit biased) with popular genres forming the primary working group followed by more ambivalent smaller subgroups and outlier nodes. - -This aligns with the degree (popularity) distribution (the bar graph) of the nodes as well (which is what our working paper asserts as well for certain contexts). 
- -![Monster Movie genre network. Node size corresponds to the node degree and edge thickness and numbers corespond the number of connections shared between generes from the dataset.](paper_files/figure-latex/unnamed-chunk-2-1.pdf) - -![Girvan-Newman communities. Unable to identify any communities.](paper_files/figure-latex/unnamed-chunk-3-1.pdf) - -![Louvain communities. Communities identified, but do not provide descriptive value.](paper_files/figure-latex/unnamed-chunk-4-1.pdf) +Since the division of the club and its members is known, this social network is a classic, widely studied example dataset. The data is available in the {igraphdata} package [@igraphdatapackage]. In the context of community detection, the question of interest is whether the split could be identified based on the relationships between members. When applied in an unsupervised setting, the Girvan-Newman and Louvain algorithms identify communities of nodes which optimize modularity according to their respective approaches. However, the communities identified do not appear to correspond to a division of the group that is contextually informative or interpretable. The Smith-Pittman algorithm identifies 3 communities, which can be understood as individuals who would certainly associate with John A., individuals who would certainly associate with Mr. Hi, and an uncertain group. Figure 3 shows the comparison between the three algorithms. -![Smith-Pittman communities. 
Communities identified provide descriptive value based on popular genres followed by a less popular genres, followed by isolated genres with litte to no interaction with other genres in the network.](paper_files/figure-latex/unnamed-chunk-5-1.pdf) - -## Other Utility Functions - - -### Preparing Unlabeled Graphs - - -```r -# Set parameters -# Number of nodes (adjust as needed) -num_nodes <- 15 -# Starting edges for preferential attachment -initial_edges <- 1 - -# Create a directed, scale-free network using the Barabási-Albert model -g <- igraph::sample_pa(n = num_nodes, m = initial_edges, directed = TRUE) - -# Introduce additional edges to high-degree nodes to accentuate popularity differences -num_extra_edges <- 350 # Additional edges to create more popular nodes -set.seed(123) # For reproducibility - -for (i in 1:num_extra_edges) { - # Sample nodes with probability proportional to their degree (to reinforce popularity) - from <- sample( igraph::V(g), 1, prob = igraph::degree(g, mode = "in") + 1) # +1 to avoid zero probabilities - to <- sample( igraph::V(g), 1) - - # Ensure we don't add the same edge repeatedly unless intended, allowing self-loops - g <- igraph::add_edges(g, c(from, to)) -} - -# Add self-loops to a subset of nodes -num_self_loops <- 5 -for (i in 1:num_self_loops) { - node <- sample( igraph::V(g), 1) - g <- igraph::add_edges(g, c(node, node)) -} - -g -``` - -``` -## IGRAPH 36ae407 D--- 15 369 -- Barabasi graph -## + attr: name (g/c), power (g/n), m (g/n), zero.appeal (g/n), algorithm -## | (g/c) -## + edges from 36ae407: -## [1] 2-> 1 3-> 2 4-> 1 5-> 1 6-> 1 7-> 1 8-> 6 9-> 3 10-> 7 11-> 7 -## [11] 12-> 1 13-> 7 14-> 1 15-> 1 1->15 7->14 8->10 3-> 6 3-> 5 8->14 -## [21] 10-> 9 1->11 7-> 5 7->11 2->12 15-> 9 14-> 3 11->10 7->10 8->14 -## [31] 9-> 4 1-> 1 6-> 7 10->12 1->10 10-> 7 1-> 9 14-> 7 15->12 12-> 7 -## [41] 1->11 7-> 6 11-> 2 10-> 5 1->12 7->13 10-> 1 3->11 10-> 6 14->15 -## [51] 1->15 7->12 14-> 4 7-> 6 6-> 8 1-> 6 8->15 4-> 1 1-> 2 6-> 2 
-## [61] 6->13 5-> 6 10-> 3 2-> 4 15-> 9 15->14 4-> 7 12-> 8 13->14 1->15 -## + ... omitted several edges -``` +The code for reproducing Figures 2 and 3 is: ```r -igraph::vertex.attributes(g)$name +# Install relevant packages +# install.packages(c("igraph","igraphdata","ig.degree.betweenness")) +library(igraphdata) +# Attach the Karate Club dataset +# Data from {igraphdata} +data("karate") +# Plot the initial network (Figure 2) +plot(karate) +# Girvan-Newman Clustering (Figure 3 (a)) +# Function from {igraph} +gn_karate <- igraph::cluster_edge_betweenness(karate) + +# Louvain Clustering (Figure 3 (b)) +# Function from {igraph} +louvain_karate <- igraph::cluster_louvain(karate) + +# Smith-Pittman Clustering (Figure 3 (c)) +# Function from {ig.degree.betweenness} +sp_karate <- ig.degree.betweenness::cluster_degree_betweenness(karate) + +# Plot the 3 plots next to each other +par(mfrow= c(1,3),mar=c(0,0,0,0)+1) + +plot(gn_karate, karate, main = "(a)") + +plot(louvain_karate, karate, main = "(b)") + +plot(sp_karate, karate, main = "(c)") ``` -``` -## NULL -``` +![The Zachary karate club network with the true split between members defined by node colors. John A. and Mr. Hi are denoted by 'J' and 'H', with other members being listed as numbers](./images/karate_network.png){width=70%} +![Unsupervised Community Detection by (a) Girvan-Newman, (b) Louvain and (c) Smith-Pittman for the karate network.](./images/algorithm_comparison_karate.png) +## TidyTuesday - "Monster Movies" Dataset -```r -g_ <- ig.degree.betweenness::prep_unlabeled_graph(g) - -igraph::vertex.attributes(g_)$name -``` - -``` -## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 -``` - +The "Monster Movies" dataset, made available by the TidyTuesday project [@Rfordatascience], presents an interesting example for applying SNA and the Smith-Pittman algorithm to the interactions between genres in "monster" titled movies. Figure 4 shows the plotted "simplified" network with node sizes corresponding to node degree (i.e. 
the number of connections a given genre shares with other genres) and edge thickness and annotated numbers corresponding to the number of edges shared between listed genres. Figure 5 shows the genre clusters in the network as identified by Girvan-Newman, Louvain and Smith-Pittman. -### Plotting Simplified Edgeplots +Girvan-Newman is uninformative here, since it places every genre in a single cluster. Louvain may indicate something about the strength of clustering, but it does not necessarily reflect how "monster" movie genres actually interact. Smith-Pittman clustering gives the most interpretable result, with popular genres forming the primary group, followed by smaller, more ambivalent subgroups and outlier nodes. +The R code for reproducing Figures 4 and 5 is: ```r - par(mar=c(0,0,0,0)+.1) +# Install relevant libraries +# pkgs <- c("dplyr","tibble","tidyr","tidygraph","tidytuesdayR", +# "igraph","ig.degree.betweenness") +# install.packages(pkgs) +# Load data +tuesdata <- tidytuesdayR::tt_load(2024, week = 44) +monster_movie_genres <- tuesdata$monster_movie_genres + +# Prepare data for adjacency matrix +dummy_matrix <- monster_movie_genres |> + dplyr::mutate(value = 1) |> + tidyr::pivot_wider( + id_cols = tconst, + names_from = genres, + values_from = value, + values_fill = 0 + ) |> + dplyr::select(-tconst) |> + as.matrix.data.frame() + +# Construct adjacency matrix +genre_adj_matrix <- t(dummy_matrix) %*% dummy_matrix +# Setting self loops to zero +diag(genre_adj_matrix) <- 0 + +# Construct the graph +movie_graph <- genre_adj_matrix |> + as.data.frame() |> + tibble::rownames_to_column() |> + tidyr::pivot_longer(cols = Comedy:War) |> + dplyr::rename(c(from = rowname, to = name)) |> + dplyr::rowwise() |> + dplyr::mutate(combo = paste0(sort(c(from, to)), collapse = ",")) |> + dplyr::arrange(combo) |> + dplyr::distinct(combo, .keep_all = TRUE) |> + dplyr::select(-combo) |> + tidyr::uncount(value) |> + tidygraph::as_tbl_graph() + +# Resize nodes based on node degree + +VS 
<- igraph::degree(movie_graph) * 0.1 + +# Figure 4 +ig.degree.betweenness::plot_simplified_edgeplot( + movie_graph, + vertex.size = VS, + edge.arrow.size = 0.001 +) +# Cluster Nodes +gn_cluster <- movie_graph |> + igraph::as.undirected() |> + igraph::cluster_edge_betweenness() -plot( - g_, - edge.arrow.size = 0.2, - main = "Default Network" - ) -``` +louvain_cluster <- movie_graph |> + igraph::as.undirected() |> + igraph::cluster_louvain() -![](paper_files/figure-latex/unnamed-chunk-7-1.pdf) +sp_cluster <- movie_graph |> + igraph::as.undirected() |> + ig.degree.betweenness::cluster_degree_betweenness() -```r - par(mar=c(0,0,0,0)+.1) - +# Figure 5 +par(mfrow = c(1, 3), mar = c(0, 0, 0, 0) + 1) ig.degree.betweenness::plot_simplified_edgeplot( - g_, - edge.arrow.size = 0.2, - main = "Simplified Network" - ) -``` - -![](paper_files/figure-latex/unnamed-chunk-8-1.pdf) - - -```r -par(mar=c(0,0,0,0)+.1) + movie_graph, + gn_cluster, + main = "(a)", + vertex.size = VS, + edge.arrow.size = 0.001 +) -sp_communities <- ig.degree.betweenness::cluster_degree_betweenness(g_) +ig.degree.betweenness::plot_simplified_edgeplot( + movie_graph, + louvain_cluster, + main = "(b)", + vertex.size = VS, + edge.arrow.size = 0.001 +) -plot(sp_communities, - g_, - edge.arrow.size = 0.2, - main = "Default Network") +ig.degree.betweenness::plot_simplified_edgeplot( + movie_graph, + sp_cluster, + main = "(c)", + vertex.size = VS, + edge.arrow.size = 0.001 +) ``` -![](paper_files/figure-latex/unnamed-chunk-9-1.pdf) +![Monster movie genre network. Node size corresponds to node degree, and edge thickness and numbers correspond to the number of connections shared between genres in "monster" titled movies.](./images/tt_1.png) -```r -ig.degree.betweenness::plot_simplified_edgeplot( - g_, - communities = sp_communities, - main = "Simplified Network") -``` +![Communities identified in the monster movie genre network via community detection. (a) is Girvan-Newman, (b) is Louvain and (c) is Smith-Pittman. 
Communities are selected based on maximized modularity.](./images/tt_2.png) -![](paper_files/figure-latex/unnamed-chunk-9-2.pdf) +# Licensing and Availability +{ig.degree.betweenness} is licensed under the MIT license. It is available on CRAN, and can be installed using `install.packages("ig.degree.betweenness")`. All code is open-source and hosted on +GitHub, and bugs can be reported at https://github.com/benyamindsmith/ig.degree.betweenness/issues/. # References