Path diversity improves the identification of influential spreaders

Duan-Bing Chen; Rui Xiao; An Zeng; Yi-Cheng Zhang

doi:10.1209/0295-5075/104/68006

Introduction

How to identify influential nodes in complex networks is a crucial issue since it is highly related to the information spreading and epidemic controlling [1–4]. So far, a number of centrality indices have been proposed to address this problem such as degree, betweenness [5,6], closeness [7,8] and eigenvector centralities [9]. Among these indices, degree centrality is a very straightforward and efficient one. However, the performance of degree centrality is not satisfying enough. Recently, Kitsak et al. [1] claimed that the location of a node in the network actually plays a more important role than its degree. They accordingly proposed a coarse-grained method by using K-shell decomposition to quantify a node's influence based on the assumption that nodes in the same shell have similar influence and nodes in higher shells are likely to infect more nodes. After this, some methods are proposed to further improve the ranking performance of this network decomposition process [10,11]. In directed networks, the ranking methods are mainly based on the iterative process. The representative methods include the well-known HITs [12] and PageRank [13], as well as some recently proposed algorithms like LeaderRank [14] and TwitterRank [15]. It has been demonstrated that these methods outperform out-degree centrality in terms of ranking effectiveness.

With big-data era coming, the design of ranking algorithms on very large-scale social networks is becoming a big challenge nowadays [16]. The online social systems can have millions of users or even more. The spreader ranking algorithms will be very time-consuming if they are based on the global information of the network. Therefore, the spreader ranking algorithm should be not only effective but also efficient. To solve this problem, it is better to design the ranking algorithm based on the local information of the network. For example, a semi-local index by considering the second nearest neighbors is devised [17]. This index is shown to well identify influential nodes and obtain a good trade-off on effectiveness and efficiency compared with global indices.

Most of the previous ranking methods are designed based on the number of paths for propagation. Actually, the diversity of paths for spreading is also very important. The spreading ability of a node will be significantly lowered if many of its propagation paths overlap. In this case, if the virus/information fails to go through the overlapped path, the following spreading of many paths will be terminated. However, this factor has not been taken into account in designing the spreading ranking algorithm so far, to the best of our knowledge.

Accordingly, we introduced in this letter the concept of path diversity [18] which is mathematically characterized by the information entropy [19,20]. If the spreading paths of a node are not highly overlapped in a small number of links, we consider its spreading paths to be diversified. Actually, the concept of diversity has been considered in ranking spreaders before [21,22]. The diversity is defined as the dispersion of the distribution of neighboring nodes in the network. More precisely, they assume the spreading ability of a node to be high if its neighboring nodes belong to different communities. Even though a node has high diversity in this aspect, its spreading path can still highly overlap in several links (i.e. with low path diversity).

After applying the path diversity to a very simple spreader ranking method, we find that the performance of this ranking method can be further enhanced. Combining the information of path number and diversity, we further propose a local but effective ranking method (called KED method) to identify influential nodes in large-scale social networks. We make use of the SIR spreading model [23] with tunable infectivity [24–26] to test the effectiveness of our method in four real social networks, including two undirected networks, Youtube and Orkut [27], and two directed networks, EmailEU [28] and Digg [29]. Experimental results show that KED performs much better than the simplest degree centrality, PageRank and LeaderRank, and in most cases better than K-shell, i.e. KED can more accurately rank the nodes on their correct places according to their real spreading ability than other ranking algorithms. Moreover, we study in detail the effect of path number and diversity in the KED method by introducing two parameters. It shows that the path diversity is indeed an important factor in ranking spreaders and the performance of the KED method can be further improved by these parameters. Finally, since our algorithm is based on only local information, we claim that it can be efficiently applied to many large real systems.

Empirical analysis and method

To begin our analysis, we first discuss the diversity of the spreading paths. Actually, it is very difficult to trace all the spreading paths of each node. We therefore limit ourselves to study the diversity the local paths (i.e. the paths with length 2). For better illustration, we give an example in fig. 1. The red nodes have the same degree and the same number of second nearest neighbors. The number of path for the target red node to infect each node is exactly the same in those two networks. However, all the paths to infect the gray nodes in fig. 1(b) overlap in the first half. The information can spread to gray nodes only if one specific blue node is infected. On the other hand, the paths to the gray nodes in fig. 1(a) is very diverse. The information may spread to gray nodes as long as any blue node is infected. Intuitively, information could spread to gray nodes from the red node more easily in fig. 1(a) than in fig. 1(b).

**Fig. 1:** (Color online) The example of two toy networks. Red nodes in (a) and (b) have the same degree and the same number of second neighbors, while the distributions of their neighbors' degree are different.
Download figure:
Standard image

In order to compute the probability of the gray nodes getting infected, we consider the SIR model [23] with infection rate μ in this toy network. It has been pointed out that in real case an individual only contact a fraction of his/her friends each time [24–26]. For example, an active user with many friends in the online social system for sure will interact with more friends than the inactive user with much fewer friends. However, it is not likely that such active user will interact with all his/her friends at the same time. We assume that the contact fraction nonlinearly correlates with the nodes' degree. More specifically, each node i contacts $\sqrt{k_i}$ neighbors in each step (i.e. only $\sqrt{k_i}$ neighbors have the possibility μ to be infected by node i in each step). Actually, as long as the contact number Nc is growing slower than the degree (e.g. $Nc\sim k_i^{\alpha}$ and $\alpha<1$ ), our following analysis is valid. In fig. 1, if we set the red nodes as the initial infected nodes, the expected number of infected nodes in fig. 1(a) and fig. 1(b) at the end will be $1+\sqrt{5}\mu+\sqrt{15}\mu^2$ and $1+\sqrt{5}\mu+\sqrt{3}\mu^2$ , respectively. Apparently, the red node in fig. 1(a) can infect more nodes than that in fig. 1(b) simply due to the diverse paths.

In fact, the diversity of local paths can be represented by the degree evenness of the target nodes' neighbors. One can easily see that the degree of the red node's neighbors is more uneven in fig. 1(b) than that in fig. 1(a). To characterize such unevenness, we employ the information entropy [19] h_i of each node i in a network as

$\begin{equation} h_i=\sum_{j\in\Gamma_i}-p_j\log(p_j), \end{equation} \tag{ 1 }$

where $p_j=\frac{k_j}{\sum_{l\in\Gamma_i}k_l}$ , $\Gamma_i$ is the set of neighbors of node i, and k_j is the degree of node j. A high value of h_i indicates that the degree of neighbors of the target node is even, which corresponds to diverse paths for the target node to propagate information. However, there is a shortcoming in eq. (1): the entropy of two nodes with different degree will not be the same, even though the degrees of their neighbors are entirely even. Therefore, the entropy defined in eq. (1) should be normalized according to the target node's degree to overcome this drawback,

$\begin{equation} H_i=\frac{\sum_{j\in\Gamma_i}-p_j\log(p_j)}{\sum_{j\in\Gamma_i}-\frac{1}{k_i}\log(\frac{1}{k_i})}=\frac{\sum_{j\in\Gamma_i}-p_j\log(p_j)}{\log(k_i)}. \end{equation} \tag{ 2 }$

After normalization, $H_i\in [0,1]$ , independently of the target node's degree. To calculate H_i in the directed network, one can simply replace k_i by k_i^out in the above formulae.

Four real social networks including Youtube [27], Orkut [27], EmailEU [28] and Digg [29] are used to empirically investigate the information entropy distribution of nodes. Among these networks, Youtube and Orkut are undirected networks, EmailEU and Digg are directed ones. Youtube is a video-sharing website that includes a social network. Orkut is a free online social network where users make friends with each other. EmailEU is generated by using email data from a large European research institution from October 2003 to May 2005. Given a set of email messages, each node corresponds to an email address and a directed edge indicates that one node sent at least one email to the other. Digg contains data about stories promoted to Digg's front page over a period of a month in 2009. The data contains the voters' friendship networks where a link means that one node is watching the activities of (is a fan of) the other node. Some basic statistical properties of these networks are listed in table 1.

Table 1:. Basic information of four real networks. N is the network size, k_max is the highest degree, $\langle k\rangle$ is the average degree and $\langle c\rangle$ is the clustering coefficient.

Network	N	k_max	$\langle k\rangle$	$\langle c\rangle$
Youtube	1,134,890	28,754	5.2650	0.1723
Orkut	3,072,441	33,313	76.2814	0.1698
EmailEU	265,214	929	1.5838	0.3093
Digg	279,630	12,097	9.3623	0.0775

For each network, we get the counterpart random networks according to the link swap method [30]. At each step, we randomly select a pair of edges $A\text{-}B$ and $C\text{-}D$ . These two edges are then rewired to be $A\text{-}D$ and $B\text{-}C$ . We make sure no multiple edges or self-loop is created in this rewiring process. The distribution of H in the original network and the counterpart random networks are compared, as shown in fig. 2. For simplicity, we only take into account the 1000 highest-degree nodes since they are more likely to have strong spreading ability than those with smaller degree. In fig. 2, one can see that the information entropy distribution of nodes in the original networks is much wider than that in the randomized ones, especially in the Youtube and EmailEU networks. For some nodes in real network, their neighbors' degree can be very uneven. Therefore, the degree unevenness of the neighboring nodes should not be neglected when ranking the spreaders.

**Fig. 2:** (Color online) The entropy distribution in four social networks and their corresponding random ones.
Download figure:
Standard image

Many researches already show that the degree centrality is not enough to accurately identify influential nodes [1,17]. The ranking methods considering also the average degree of the neighboring nodes can effectively improve the pure degree method in ranking spreaders [17]. After taking into account the neighboring nodes' average degree, the score for each target node can be

$\begin{equation} f_i=k_i+\lambda\frac{\sum_{j\in\Gamma_i}k_j}{k_i}, \end{equation} \tag{ 3 }$

where λ is a tunable parameter.

We employ the Susceptible-Infected-Recovered (SIR) model [23] to simulate the spreading process on networks. The simulation runs in discrete time steps. In a social network, a user neither contact all of her neighbors nor only a single neighbor, but part of her neighbors [24–26]. Moreover, the node with larger degree may contact more neighbors. Therefore, at each time step in our simulation, every infected node i will select each of its neighbors (or followers) with probability $p\ (p=\frac{1}{\sqrt{k_i}})$ and then transmit the information to it with probability μ if the selected neighbor (or follower) is a susceptible one. The recovery rate is set as 1 here. The simulation stops when there is no infected node anymore. Notice that this model is slightly different from the standard SIR model where all the followers of an infected node have the chance to be infected. The present mechanism is usually used to mimic the limited neighbor contact capability that is a positive correlation to the degree of individuals [24,25].

The number of nodes that are finally infected when the infection starts from a given node i is denoted as its spreading ability $s_i^{\mu}$ , where μ is the infection rate in the SIR model. Here, we select a relatively small value of μ so that the infected percentage of the nodes is not so large ( $\mu=0.04$ in the Orkut network and $\mu=0.05$ in the other three networks). We select 1000 highest-degree nodes from each network and calculate Kendall's tau correlation coefficient $(\tau)$ between f_i and $s_i^{\mu}$ . The results are reported in fig. 3. Clearly, τ increases with λ at first, and then decreases after reaching a maximum. This indicates that the average degree of the neighboring nodes are beneficial for improving the identification of the most influential spreaders. When λ gets very large, the node with small degree but with large average degree of neighbors will have high ranking, leading to a small τ value.

**Fig. 3:** (Color online) Kendall's tau correlation coefficient τ between real spreading ability s_i and the ranking score f_i of the 1000 highest-degree nodes on four real networks when the simple ranking methods in eq. (3) and eq. (4) are applied. The results are averaged over 100 independent simulations.
Download figure:
Standard image

We further move to improve the ranking accuracy by the path diversity. Specifically, we will make use of the entropy H to modify eq. (3) as

$\begin{equation} f_i^*=k_i+H_i\lambda\frac{\sum_{j\in\Gamma_i}k_j}{k_i}. \end{equation} \tag{ 4 }$

Kendall's tau correlation coefficient between $s_i^{\mu}$ and $f_i^{*}$ is also shown in fig. 3. One can see that the optimal τ can be even higher after H_i is introduced.

So far, we have shown that the local path diversity is very valuable for identifying influential nodes. For the local path number, the nodes' degree and the average degree of neighboring nodes are two key factors. Therefore, we combine these two pieces of information (local path diversity and number) and propose a local spreader ranking method named KED in this letter. It can be described as

$\begin{equation} f_i=k_iE_iD_i, \end{equation} \tag{ 5 }$

where $E_i=H_i$ , D_i is a function of the degree of the neighbors of node i with $D_i=e^{\frac{\sum_{j\in\Gamma_i}k_j}{\max_l(\sum_{j\in\Gamma_l}k_j)}}$ . The power in D is the sum of the neighbors' degree normalized by its maximum possible value among all nodes, so that the value of D_i is bounded in $[1, e]$ . In this ranking metric, the node with larger f_i is supposed to have higher spreading ability. As an example, the f_i for the red node is 13.5914 in fig. 1(a) and 6.5672 in fig. 1(b), which indicates that the red node in fig. 1(a) has stronger spreading ability than that in fig. 1(b). It is also noted that the model described in eq. (5) is not only used to rank nodes in undirected networks, but also in directed networks.

For comparison, we consider two ranking algorithms on directed networks: PageRank [13] and LeaderRank [14], and one algorithm on undirected networks: K-shell [1,31]. PageRank is depicted as a random walk on hyperlinked networks. In the algorithm, the random walker has a certain probability to jump from a web page to a random web page and the probability is set as 0.15 in our simulation. LeaderRank is also a random-walk–based ranking algorithm. Different from PageRank, LeaderRank introduces a ground node g connecting to every node i in the network. The K-shell method is a ranking algorithm based on network decomposition. It starts by removing all nodes with one connection until no more of such nodes remain, and assigns them to the 1-shell. After that, all nodes with residual degree 2 are recursively removed and the 2-shell is created. This procedure continues as the residual degree increases until all nodes in the networks have been assigned to one of the shells.

Experiments and results

To investigate the influence of a node in information spreading, we initially set this node to be infected. As discussed above, the final coverage of this node $s_i^{\mu}$ is used to represent the spreading ability of i. The average final coverage $\langle s^{\mu} \rangle$ of top-L ranked nodes obtained by each ranking algorithm is used to investigate the performance of these algorithms. We define ρ as a ratio of $\langle s^{\mu} \rangle$ of KED to $\langle s^{\mu} \rangle$ of other methods. The results in fig. 4 show that ρ is larger than 1 in four networks. The result indicates that the information can spread more widely from the top ranked nodes obtained by KED than that obtained by degree, K-shell, PageRank and LeaderRank.

**Fig. 4:** (Color online) The ratio ρ of the average final coverage $\langle s^{\mu}\rangle$ of top-L nodes obtained by KED to that by degree, K-shell, PageRank or LeaderRank on four real networks. The spreading starts from each single node separately in the top-L list. The results are averaged over 100 independent simulations.
Download figure:
Standard image

**Fig. 4:** (Color online) The ratio ρ of the average final coverage $\langle s^{\mu}\rangle$ of top-L nodes obtained by KED to that by degree, K-shell, PageRank or LeaderRank on four real networks. The spreading starts from each single node separately in the top-L list. The results are averaged over 100 independent simulations.
Download figure:
Standard image

Kendall's tau correlation coefficient between the ranking of nodes and the real spreading ability is also calculated. We do not compare KED with K-shell since many nodes are with the same K-shell value, which makes the τ value of the K-shell method very low. Like before, we take into account 1000 highest-degree nodes. Kendall's tau correlation coefficient between the ranking of top-L (L ≤ 1000) nodes from different methods and their real spreading ability is shown in fig. 5. One can see that generally τ of KED is the largest among all the ranking methods. That is to say, the KED ranking algorithm can generate a more accurate ranking of spreading ability than degree centrality, PageRank and LeaderRank.

**Fig. 5:** (Color online) Kendall's tau correlation coefficient τ between real spreading ability s_i and the ranking score f_i of the 1000 highest-degree nodes on four real networks. The results are averaged over 100 independent simulations.
Download figure:
Standard image

In fact, D and E can be regarded as the correction factors to the existing spreader ranking algorithms. Specifically, D is the correction factor of path number and E is the correction factor of path diversity. To make our idea more general, we also apply the correction factors D and E to the K-shell, PageRank and LeaderRank algorithms. We show Kendall's tau between the rank of top-1000 nodes and their spreading coverage in table 2. The accuracy of ranking the spreaders is considerably increased due to the correction of E and D. In the K-shell, PageRank and LeaderRank methods, the information of path number is implicitly embedded. Interestingly, we find that even adding D alone can improve these methods. In the above analysis, the infection rate is the same as the values in figs. 3, 4 and 5 ( $\mu=0.04$ in the Orkut network and $\mu=0.05$ in the other three networks). We actually try different infection rates in our simulation, and the results show that adding D and E can always improve the τ value of degree, PageRank and LeaderRank methods. However, there is some parameter range in which D and E may lower the τ value of the K-shell method.

Table 2:. Kendall's tau coefficient when the correction factors D and E are applied to the degree (K), K-shell (KS), PageRank (PR) and LeaderRank (LR) algorithms

Method	Youtube	Orkut	EmailEU	Digg
K	0.5416	0.4811	0.6516	0.5850
K+D	0.5581	0.4816	0.6906	0.6519
K+E	0.5810	0.5467	0.7509	0.6115
K+E+D	0.6064	0.5559	0.7687	0.6746

KS	0.4033	0.4031
KS+D	0.4970	0.6250
KS+E	0.3914	0.6360
KS+E+D	0.4418	0.7032

PR			0.4725	0.4389
PR+D			0.5051	0.4492
PR+E			0.5518	0.4583
PR+E+D			0.5821	0.4684

LR			0.4964	0.5837
LR+D			0.5325	0.6218
LR+E			0.5868	0.6083
LR+E+D			0.6184	0.6437

In order to further understand the effect of the factor E and D in the KED method, we add the power α and β to E and D, respectively. The modified KED method reads

$\begin{equation} f_i=k_iE_i^\alpha D_i^\beta. \end{equation} \tag{ 6 }$

The dependence of τ on the α and β in the modified KED method is reported in fig. 6. It shows that β can indeed enhance τ even without E (i.e., $\alpha=0$ ). With both α and β, τ can be further improved to a much higher value. For example, the improvement in τ solely from β is $69.61\%$ in Youtube, $0.82\%$ in Orkut, $88.81\%$ in emailEU and $42.39\%$ in Digg, respectively. However, with both α and β, the improvement in τ can reach $133.97\%$ in Youtube, $232.17\%$ in Orkut, $124.27\%$ in emailEU and $49.58\%$ in Digg. The results indicate that the path diversity is indeed an important factor in ranking spreaders. Interestingly, there is an optimal β in each subfigure and it depends on the infection rate μ. We remark that a similar heat map can be observed when applying E and D to other ranking algorithms such as K-shell, PageRank and LeaderRank.

**Fig. 6:** (Color online) The dependence of τ on α and β in the modified KED method. The infection rate is set as $\mu=0.12$ in Youtube, emailEU and Digg network and $\mu=0.06$ in Orkut since it is a very dense network. The results are averaged over 100 independent simulations.
Download figure:
Standard image

Conclusion

In this letter, we introduced the concept of path diversity in ranking spreaders. Measured by information entropy, it is introduced to a simple spreader ranking algorithm. The results show that not only the number of paths that determines the spreading ability of a node, but also the path diversity is also a significant factor. We further proposed a local spreader ranking algorithm named KED, in which the information of path number and that of path diversity are combined. We use the SIR model to simulate the spreading process on four real social networks. The results show that the performance of KED is better than that of degree, PageRank and LeaderRank, and in most cases, better than that of K-shell. With two parameters to adjust the contribution of path number and path diversity in the KED method, we find that the ranking accuracy can be further improved. From the practical point of view, our method can be easily applied to very large real systems since it is based on only local information. Finally, we find that the concept of combining path number and path diversity can be applied to improve the K-shell, PageRank and LeaderRank algorithms as well.

Many extensions can be made based on this work. Most existing ranking algorithms are designed to rank the spreading ability of individual nodes. In our simulation, we notice that their performance is almost the same when they are applied to pick up a group of nodes for spreading (i.e. instead of originating from an individual node, the spreading will start from a group of the top ranking nodes). Therefore, the method for this task will require further investigation in the future. Moreover, how to identify influential nodes in temporal networks and bipartite networks is also a long-term challenge problem. Some progress has been made in this direction [32,33], but systematic analysis is still lacking. We remark here that in these systems, the combination of path number and path diversity may also lead to an improvement in ranking spreaders.

Acknowledgments

We thank the anonymous referees for helpful comments to improve the paper. This work was partially supported by the National Natural Science Foundation of China under Grant Nos. 61103109 and 61003231. D-BC acknowledges the Huawei University-Enterprise Cooperation project YBCB2011057.

Path diversity improves the identification of influential spreaders

Article metrics

Permissions

Author e-mails

Author affiliations

Dates

Abstract

Introduction

Empirical analysis and method

Experiments and results

Conclusion

Acknowledgments

Path diversity improves the identification of influential spreaders

Article metrics

Permissions

Share this article

Author e-mails

Author affiliations

Dates

Abstract

Introduction

Empirical analysis and method

Experiments and results

Conclusion

Acknowledgments