Skip to content

Instantly share code, notes, and snippets.

@shagunsodhani
Created September 14, 2016 14:35
Show Gist options
  • Save shagunsodhani/6329486212643fd61f58a5a3eb5abb3c to your computer and use it in GitHub Desktop.
Save shagunsodhani/6329486212643fd61f58a5a3eb5abb3c to your computer and use it in GitHub Desktop.
Notes for "SimRank: A Measure of Structural-Context Similarity" paper

SimRank: A Measure of Structural-Context Similarity

Introduction

  • Algorithm to derive similarity between 2 nodes of a graph (or graphical model derived from any other kind of dataset).
  • Link to the paper

SimRank

  • Input: A directed graph G = (V, E) where V represents vertices and E represents edges.
  • SimRank defines similarity between 2 vertices (or nodes) i and j as the average of the similarity between their in-neighbours decayed by a constant factor C.
  • Any node is maximally similar to itself (with similarity = 1).
  • PageRank analyses the individual vertices of the graph with respect to the global structure, while SimRank analyses the relationship between a pair of vertices (edges).
  • SimRank scores are symmetric and can be defined between all pair of vertices.
  • G2 is defined as the node pair graph such that each node in G2 corresponds to an ordered pair of nodes of G and there exists an edge between node pair (a, b) and (c, d) if there exists an edge between (a, c) and (b, d).
  • In G2, similarity flows from node to node with singleton nodes (nodes of the form (a, a)) as the source of similarity.

Variants

Minimax Variant

  • Defines similarity of nodes i and j as the minimum of maximum similarity between i and any in-neighbour of j and between j and any in-neighbour of i.

Computing SimRank

  • A naive solution can be obtained by iteration to a fixed point.
  • Space complexity is O(n2) and time complexity is *O(kn2d) where k is the number of iterations, n is the number of vertices and d is the average of product of indegrees of pair of vertices.
  • Optimisations can be made by setting the similarity between far off nodes as 0 and considering only nearby nodes for an update.

Different Interpretations

Co-citation Score

  • The first iteration of SimRank produces results same as co-citation score between a pair of vertices.
  • Successive iterations improve these initial scores.

Random Surfer-Pairs Model

  • SimRank s(a, b) can be interpreted as the measure of how soon two random surfers are expected to meet at the same node if they start at nodes a and b and walk the graph backwards.
  • Expected Meeting Distance (EMD) between 2 nodes a and b is the expected number of steps required before 2 surfers (starting at a and b) would meet if they walked randomly in locked step.
  • Surfers are allowed to teleport with a small probability - to circumvent the infinite EMD problem.
  • Expected-f Meeting Distance (EMD) - Given length l of a tour, compute f(l) (where f is a non-negative monotonic function) to bound the expected distance to a finite interval.
  • Common choice for f is f(z) = Cz where C ε (0, 1)
  • The SimRank score for two nodes, with parameter C, is the expected-f meeting distance travelling back-edges with f(z) = Cz

Evaluation

  • Experiments on 2 datasets:
    • Corpus of scientific research papers from ResearchIndex.
    • Transcripts of undergrad students at Stanford.
  • Domain specific properties used to measure similarity and compared with SimRank scores.
  • Results show improvement over co-citation scores.
@tiankonghenlan20113046
Copy link

I read this paper before,but i cannot understand some aspects about the definition. can I have a discussion wtih you ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment