Social Media Mining: An Introduction

(Axel Boer) #1

P1: Trim: 6.125in×9.25in Top: 0.5in Gutter: 0.75in
CUUS2079-10 CUUS2079-Zafarani 978 1 107 01885 3 January 13, 2014 17:56


282 Behavior Analytics

between the same pair of nodes. For example, consider two authorsx
andywho have collaboratedctimes. In this case,|paths<x,^1 y>|=c.
 Hitting and Commute Time. Consider a random walk that starts at
nodexand moves to adjacent nodes uniformly. Hitting timeHx,yis
the expected number of random walk steps needed to reachystarting
fromx. This is a distance measure. In fact, a smaller hitting time
implies a higher similarity; therefore, a negation can turn it into a
similarity measure:

σ(x,y)=−Hx,y. (10.12)

Note that if nodeyis highly connected to other nodes in the network
(i.e., has a high stationary probabilityπy), then a random walk starting
from anyxlikely ends up visitingyearly. Hence, all hitting times toy
are very short, and all nodes become similar toy. To account for this,
one can normalize hitting time by multiplying it with the stationary
probabilityπy:

σ(x,y)=−Hx,yπy. (10.13)

Hitting time is not symmetric, and in general,Hx,y =Hy,x. Thus, one
can introduce thecommutetime to mitigate this issue:

σ(x,y)=−(Hx,y+Hy,x). (10.14)

Similarly, commute time can also be normalized,

σ(x,y)=−(Hx,yπy+Hy,xπx). (10.15)

 Rooted PageRank. A modified version of the PageRank algorithm
can be used to measure similarity between two nodesxandy.In
rooted PageRank, we measure the stationary probability of y:πy
given the condition that during each random walk step, we jump to
xwith probabilityαor a random neighbor with probability 1−α.
The matrix format discussed in Chapter 3 can be used to solve this
problem.
 SimRank. One can define similarity between two nodes recursively
based on the similarity between their neighbors. In other words, sim-
ilar nodes have similar neighbors. SimRank performs the following:

σ(x,y)=γ·


x′∈N(x)


y′∈N(y)σ(x′,y′)
|N(x)||N(y)|

, (10.16)

Free download pdf