SNAGeek

Sociologically Technological, and Technologically Sociological

Generate Random Graphs with Fixed Degree Distribution (R igraph)

※This article is translated version of 次数分布を固定してランダムグラフを生成する - SNAGeek

Introduction

There are situations when you want to evaluate statistically the whole network statistics, such as global clustering coefficient, average path length.

In such a case, you can use a method of comparing network statistics with that generated from a null model to find out how it statistically deviates . What is commonly used is to compare it with a random graph having an degree distribution same as the original graph.

In this article, we introduce a method to generate random graph by fixing the degree distribution, using R igraph package.

Loading graphs

For this case, we use karate club network data attached to igraph by default.

library(igraph)
library(tidyverse)
library(ggnetwork)

net <- graph("Zachary")

Visualization

Let's visualize the graph with ggnetwork. We will visualize the random graph later, but we must first acquire the coordinates of the node in the original graph so that we can use it again later.

hruchterman_reingold = layout.fruchterman.reingold(net)
gnet <- ggnetwork(net, layout = hruchterman_reingold)
g <- ggplot(gnet, aes(x = x, y = y, xend = xend, yend = yend))
g <- g + geom_edges(size = 0.5)
g <- g + geom_nodes(size = 3, col = "blue")
g <- g + geom_nodelabel_repel(aes(label = vertex.names))
g <- g + ggtitle("original karate club network")
g <- g + theme_blank()
g

Here is the image. Somehow, it seems that the clustering coefficient is high and the average path length is large, at first glance.

f:id:meana0:20190302005735p:plain
Zachary Karate Club Network

When actually calculating it

> cc_original <- transitivity(net) # global clustering factor
> pl_original <- mean_distance(net) # average path length
>
> cc_original
[1] 0.2556818
> pl_original
[1] 2.4082

Well, how far is these values ​​characteristic?

Random graph generation

In igraph it can be generated with sample_degseq(in.deg). in.deg is a vector of degree of each node. In the case of an undirected graph, only in.deg., In case of directed graph, in.deg the distribution of in-degree, out.deg specifying the distribution of out-degree.

Acquire the degree distribution of the original network.

degree_dist <- igraph::degree(net)
g <- ggplot(data = tibble(degree = degree_dist), aes(x = degree))
g <- g + geom_histogram()
g <- g + ggtitle("degree distribution of karate club network")
g

f:id:meana0:20190302183725p:plain
Degree distribution of original network

The method of generating the graph is specified by the optional argument. There are three kinds of igraph, simple(default),simple.no.multiple, vl.

simple

A method of generating n edges appropriately from the node and combining them. Although it is a simple method, multiple edges and self loops occur. This is because the possibility of duplication of the edge partners of the two nodes to be joined together is not excluded.

degree_dist <- igraph::degree(net)
simple_random_net <- sample_degseq(degree_dist)

In the graph obtained in this way, it can be confirmed that if the simplify() `and multiple edges and self-loops are eliminated, the number of edges is slightly reduced.

> simple_random_net
IGRAPH 3c731f6 U --- 34 78 - Degree sequence random graph
+ attr: name(g / c), method(g / c)
+ edges from 3c731f6:
 [1] 2 - 6 1- 1-4 3--17 - - 28 1- - 34 19 - - 28 12 - 33 7 - - 18 9 - - 29 21 - 33 5 - 34 1 - 14 3--33 1 - 28 8 - 8 22 - - 27 25 - 31
[18] 7 - 32 9 - 33 1 - 22 19 19 - 20 11 - 34 1 - 20 2- - 7 4- - 34 1 - 30 18 - 34 4 - 32 2- 5 10 - 34 1 - 1 - 6 - 1 14 1 - 3 9 - 17
[35] 6 - 33 8 - - 24 5 - 34 16 - 30 4 - 2 3 1 - 31 33 - 34 3 - 24 4 25 - 27 33 - 34 2 - 26 1- 3 3-33 33 28 - 34 26 - 33 1 - 3 23 - 34
[52] 4--29 14 - 32 10 - 34 2 - 29 9 - 32 11 - 20 34 - 34 3 - 2 4 11 - 25 5 - 1 5 3 - 1 15 1- 14 2- 1-31 1 - 1 6 7 - - 24 2- - 34 3 - 13
[69] 21 - 32 6 - 32 24 - 30 1 - 2 - 2 - 34 13 - 33 8 - 30 33 - 33 9 - 34 26 - 31
> simplify(simple_random_net, remove.multiple = TRUE, remove.loops = TRUE)
IGRAPH 40 d 3 c 01 U --- 34 65 - Degree sequence random graph
+ attr: name(g / c), method(g / c)
+ edges from 40 d 3 c 01:
 [1] 1 - 2 - 1 - - 3 1 - 4 - 1 - 1 14 1 - 1 6 1 - 20 - 1 - 22 22 1 - 28 1- 1-30 1-31 1-34 2- 5 2-- 6 2-- 7 2--26 2--29 2--31
[18] 2 - 34 - 3 - 3 - 3 - 3 - 3 - 3 - 3 - 2 - 3 - 33 4 - 1 5 4 - 2 3 4 - 2 9 4 - 32 4- - 34 5 - 34 6 - - 14 6 - 32 6 - 33 7 - - 18 7 - 24
[35] 7 - 32 8 - - 24 8 - 30 9 - - 17 9 - 29 9 - 32 9 - 33 9 - 34 10 - 34 11 - 20 11 - 25 11 - 34 12 - 33 13 - 33 14 - 32 16 - 30 17 - 28
[52] 18 - 34 19 - 20 19 - 28 21 - 32 21 - 33 22 - 27 23 - 34 24 - 30 25 - 27 25 - 31 26 - 31 26 - 33 28 - 34 33 - 34

simple.no.multiple

The generation logic is the same as simple, but when multiple edges and self loops occur, the graph is discarded and generation continues until a graph without multiple graphs or loops is generated. Therefore, the larger the average degree is , the more calculation time is required, and it is not uniform sampling from the possible graph set.

Visualize the random graph generated by this method.

gnet <- ggnetwork(simple_no_multiple_random_net, layout = hruchterman_reingold)
g <- ggplot(gnet, aes(x = x, y = y, xend = xend, yend = yend))
g <- g + geom_edges(size = 0.5)
g <- g + geom_nodes(size = 3, col = "blue")
g <- g + geom_nodelabel_repel(aes(label = vertex.names))
g <- g + ggtitle("sampled karate club network(method: simple.no.multiple)")
g <- g + theme_blank()
g

f:id:meana0:20190302005738p:plain
Generated random graph

Compared to the original network, the clustering coefficient is small and the mean distance between nodes seems to be close.

Unlike the graph generated by simple, the number of edges does not change before and aftersimplify().

> ecount(simple_no_multiple_random_net)
[1] 78
> ecount(igraph::simplify(simple_no_multiple_random_net, remove.multiple = TRUE, remove.loop = TRUE))
[1] 78

vl

  • Because it is difficult, explanation is omitted.
  • It seems to be uniform sampling here
  • Official document of igraph says

    The "vl" method is a more sophisticated generator. This generator is generated unidentified, connected simple graphs, it is an error to pass the in.deg argument to it. Then some rewiring is done to make the graph connected. Finally a Monte-Carlo algorithm is used to randomize The graph. The "vl" samples from the undirected, connected simple graphs unformly. See http://www-rp.lip6.fr/~latapy/FV/generation.html for details.

Significance Test

In fact, we evaluate the clustering coefficient and the average path length by the deviation from the null model. We generate 1,000 random graphs by the vl method and examine where the value of the original network are located in the distribution of values calculated from them.

sample_cc <- c()
sample_pl <- c()
for(i in 1: 1000) {
  vl_random_net <- sample_degseq(degree_dist, method = "vl")
  sample_cc <- c(sample_cc, transitivity(vl_random_net))
  sample_pl <- c(sample_pl, mean_distance(vl_random_net))
}

The distribution of each indicator is as follows.

# cc hist
g <- ggplot(data = tibble(cc = sample_cc), aes(x = cc))
g <- g + geom_histogram(binwidth =. 01)
g <- g + geom_vline(xintercept = cc_original, col = "red")
g

# pl hist
g <- ggplot(data = tibble(pl = sample_pl), aes(x = pl))
g <- g + geom_histogram(binwidth =. 01)
g <- g + geom_vline(xintercept = pl_original, col = "red")
g

f:id:meana0:20190302183611p:plain
Distribution of clustering coefficients Red line is original value)

f:id:meana0:20190302183615p:plain
Distribution of average path length(red line is original value)

We assume that these distributions follow a normal distribution(though it is distorted to the right with respect to the average path length...), calculate the mean, standard deviation, and z value, and calculate the p value.

cc_mean = mean(sample_cc)
pl_mean = mean(sample_pl)
cc_sd = sd(sample_cc)
pl_sd = sd(sample_pl)
cc_zscore <-(cc_original - cc_mean) / cc_sd
pl_zscore <-(pl_original - pl_mean) / pl_sd
> 1-pnorm(cc_zscore)
[1] 0.1035129
> 1-pnorm(pl_zscore)
[1] 0.0003767265

Compared with the distribution of indices obtained from the random graph, it can be seen that the original network is a statistically deviated value for both network statistics.

Finally

It may be better to keep the series of methods introduced here to the extent that it is used for searching what kind of point is characteristic in a certain graph.

Because according to the paper Comparing Brain Networks of Different Size and Connectivity Density Using Graph Theory which examined the comparison method between graphs with different clustering coefficients and average path length Since these indices are greatly influenced by the node size N and the average degree k, it is difficult to compare graphs of different sizes and densities by raw numerical values ​​based on the original size(this bias is N-independent or k- independent, etc.).

It seems that the null model-based metrics introduced here are more influenced by N and k, if you use it, please be careful.