- The GraphLab abstraction exposes an asynchronous, dynamic, graph-parallel computation model in the shared-memory setting.
- This paper extends the abstraction to the distributed setting.
- Link to the paper.
- Graph Structured Computation
- Some computations require modeling dependencies between data items.
- e.g., modeling dependencies between similar users in a recommendation use case.
- Asynchronous Iterative Computation
- In many cases, asynchronous procedures outperform synchronous ones.
- e.g., linear systems, belief propagation, stochastic optimization, etc.
- Dynamic Computation
- Iterative computation converges asymmetrically.
- Convergence can be accelerated by dynamic scheduling.
- e.g., do not update parameters that have already converged.
- Serializability
- Ensuring that all parallel executions have an equivalent serial execution is desirable for both correctness and faster convergence.
- Store program state as a directed graph.
- G = (V, E, D), where D is the user-defined data (model parameters, algorithm state, statistical data, etc.).
- The graph data D is mutable, but the structure of the graph (V, E) is immutable.
- The update function is a stateless procedure that modifies the data within the scope of a vertex and schedules the future execution of the update function on other vertices.
- Scope of a vertex (Sv) - the data corresponding to the vertex, its adjacent edges, and its adjacent vertices.
- update: f(v, Sv) -> (Sv, T), where T is the set of vertices on which the update function is scheduled to be invoked.
- Scheduling of computation is decoupled from the movement of data, and no message passing is required between vertices.
- Input to the model is G and T, the initial set of vertices to be updated.
- During each step, a vertex is extracted from T, updated and a set of vertices is added to T (for future computation).
- Vertices in T can be executed in any order, with the only constraint that all scheduled vertices are eventually executed (sketched below).
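A minimal single-process Python sketch of this execution model, assuming a toy PageRank-style update on an undirected graph; `Graph`, `pagerank_update`, and `run` are hypothetical names, not the actual GraphLab API. It also illustrates the dynamic scheduling point from above: a vertex reschedules its neighbors only if its own value changed by more than a tolerance.

```python
from collections import deque

class Graph:
    """Data graph G = (V, E, D): structure is fixed, data is mutable."""
    def __init__(self):
        self.neighbors = {}    # immutable (V, E) once built
        self.vertex_data = {}  # mutable D, restricted here to vertices

    def add_edge(self, u, v):
        self.neighbors.setdefault(u, set()).add(v)
        self.neighbors.setdefault(v, set()).add(u)
        self.vertex_data.setdefault(u, 1.0)
        self.vertex_data.setdefault(v, 1.0)

def pagerank_update(g, v, tol=1e-3):
    """update: f(v, Sv) -> (Sv, T). Mutates v's data in place and
    returns T, the set of vertices to schedule next."""
    old = g.vertex_data[v]
    new = 0.15 + 0.85 * sum(g.vertex_data[u] / len(g.neighbors[u])
                            for u in g.neighbors[v])
    g.vertex_data[v] = new
    # Dynamic scheduling: converged vertices stop spreading work.
    return g.neighbors[v] if abs(new - old) > tol else set()

def run(g, T):
    """Pull vertices from T in any order until T is empty."""
    queue, in_queue = deque(T), set(T)
    while queue:
        v = queue.popleft()
        in_queue.discard(v)
        for u in pagerank_update(g, v):
            if u not in in_queue:
                queue.append(u)
                in_queue.add(u)
```

Seeding with `run(g, set(g.vertex_data))` touches every vertex at least once; afterwards only the non-converged region keeps scheduling work.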
- A sync operation runs in the background to maintain global aggregates concurrently.
- These global values are read by update functions and written by the sync operation (sketched below).
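A rough Python sketch of this, reusing the hypothetical `Graph` above; `SyncOp` and its `fold` parameter are made-up names, and a real implementation would aggregate incrementally instead of rescanning all the data.

```python
import threading
import time

class SyncOp:
    """Background thread that periodically folds the vertex data into
    a global value; update functions read .value concurrently."""
    def __init__(self, g, fold, interval=0.1):
        self.g, self.fold, self.interval = g, fold, interval
        self.value = None
        self._stop = threading.Event()

    def _loop(self):
        while not self._stop.is_set():
            # Snapshot the values to tolerate concurrent updates.
            self.value = self.fold(list(self.g.vertex_data.values()))
            time.sleep(self.interval)

    def start(self):
        threading.Thread(target=self._loop, daemon=True).start()
        return self

    def stop(self):
        self._stop.set()

# e.g. track the total rank while updates run:
# total = SyncOp(g, sum).start()
```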
- Full consistency
- Full read/write access in the scope.
- Scopes of concurrently executing updates cannot overlap.
- Edge consistency
- Read/write access on the vertex and its adjacent edges, but only read access to adjacent vertices.
- Scopes of concurrent updates may overlap slightly.
- Vertex consistency
- Write access to the vertex and read access to adjacent edges and vertices.
- All vertices can run the update function simultaneously (a lock mapping for the three models is sketched below).
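The three models can be read as different lock sets over the scope Sv. Below is a hypothetical helper mapping each model to per-vertex reader-writer lock requests, roughly how the paper's locking engine realizes them (edge data is protected through its endpoint vertices).

```python
def locks_for(g, v, model):
    """Return (write_set, read_set): vertices to write-lock and
    read-lock before running the update function on v."""
    nbrs = g.neighbors[v]
    if model == "full":    # read/write on the entire scope
        return {v} | nbrs, set()
    if model == "edge":    # write on v and its edges, read on neighbors
        return {v}, set(nbrs)
    if model == "vertex":  # write on v only (neighbor reads go unlocked)
        return {v}, set()
    raise ValueError(f"unknown consistency model: {model}")
```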
- A two-phase partitioning process load-balances the graph over clusters of arbitrary size.
- In the first phase, partition the graph into k parts (k >> number of machines).
- Each part, called an atom, is stored as a file of graph-generating commands.
- Each atom also stores information about its ghosts (the set of vertices and edges adjacent to the partition boundary).
- An atom index file contains the connectivity structure and file locations of the k atoms as a meta-graph.
- In the second phase, this meta-graph is partitioned over the physical machines (sketched below).
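A toy Python sketch of the two phases, reusing the hypothetical `Graph`; hashing stands in for the real phase-1 partitioner, round-robin placement stands in for balancing the meta-graph, and only ghost vertices (not ghost edges) are tracked.

```python
def build_atoms(g, k):
    """Phase 1: cut the graph into k atoms (k >> number of machines)
    and record each atom's ghosts."""
    part = {v: hash(v) % k for v in g.neighbors}
    atoms = [{"vertices": set(), "ghosts": set()} for _ in range(k)]
    for v, p in part.items():
        atoms[p]["vertices"].add(v)
        for u in g.neighbors[v]:
            if part[u] != p:            # edge crosses the atom boundary
                atoms[p]["ghosts"].add(u)
    return atoms, part

def place_atoms(atoms, num_machines):
    """Phase 2: distribute atoms over the physical machines."""
    return {a: a % num_machines for a in range(len(atoms))}
```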
- A vertex coloring (no two adjacent vertices have the same color) is constructed to serialize the parallel execution of dependent tasks (here, vertices in the graph).
- For the edge consistency model, execute all vertices of one color before moving to the next color, and run the sync operation between color steps (sketched below).
- Changes to ghost vertices and edges are communicated asynchronously as they are made.
- Vertex consistency is trivial: assign the same color to all vertices.
- For full consistency, construct a second-order vertex coloring (no vertex shares the same color as any of its distance-two neighbors).
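A sketch of the color-step loop in Python, reusing the hypothetical `Graph` and the update signature from the earlier sketch; `greedy_color` and `chromatic_engine` are made-up names, and the inner loop runs sequentially here where the real engine runs same-color vertices in parallel across machines.

```python
def greedy_color(g):
    """First-order coloring: no two adjacent vertices share a color."""
    color = {}
    for v in g.neighbors:
        taken = {color[u] for u in g.neighbors[v] if u in color}
        c = 0
        while c in taken:
            c += 1
        color[v] = c
    return color

def chromatic_engine(g, update, sync=None):
    color = greedy_color(g)
    for c in sorted(set(color.values())):
        # Same-color vertices have disjoint edge-consistency scopes,
        # so they could all execute in parallel.
        for v in [v for v, cv in color.items() if cv == c]:
            update(g, v)
        if sync is not None:
            sync()   # runs between color steps
```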
- An alternative locking engine associates reader-writer locks with each vertex (useful when a good coloring is hard to obtain).
- Each machine can update only its local vertices.
- Optimisations
- The ghosting system uses caching to eliminate waits on remote data that has not changed.
- Lock requests and synchronization are pipelined to hide network latency.
- Each machine maintains a pipeline of vertices for which locks have been requested but not granted.
- A vertex is executed once its lock acquisition and data synchronization are complete.
- Nonblocking reader-writer locks that work through callback functions are used (sketched below).
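A schematic Python sketch of the pipeline; `LockPipeline` is hypothetical, and `request_locks` stands in for an asynchronous distributed lock request that invokes its callback once the locks are granted and the data is synchronized.

```python
from collections import deque

class LockPipeline:
    """Keeps many lock requests in flight so network latency overlaps
    with useful work instead of blocking the engine."""
    def __init__(self, request_locks, max_in_flight=64):
        self.request_locks = request_locks  # async; calls back when granted
        self.pending = deque()   # scheduled, locks not yet requested
        self.ready = deque()     # locks granted, safe to execute
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def schedule(self, v):
        self.pending.append(v)
        self._fill()

    def _fill(self):
        while self.pending and self.in_flight < self.max_in_flight:
            v = self.pending.popleft()
            self.in_flight += 1
            # Nonblocking: register a callback instead of waiting.
            self.request_locks(v, lambda v=v: self._granted(v))

    def _granted(self, v):
        self.in_flight -= 1
        self.ready.append(v)
        self._fill()
```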
- Distributed checkpointing via two modes:
- Synchronous checkpointing
- Suspend computation to save all modified data since the last checkpoint.
- Asynchronous checkpointing based on Chandy-Lamport snapshot algorithm.
- The snapshot step becomes an update function in the GraphLab abstraction.
- Outperforms synchronous checkpointing, since computation is not suspended (the snapshot update is sketched below).
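A simplified Python sketch of the snapshot expressed as an update function, reusing the hypothetical `Graph`; it propagates the snapshot by scheduling neighbors and omits the channel-recording details of full Chandy-Lamport.

```python
def snapshot_update(g, v, saved):
    """Runs like any other update; `saved` is the snapshot under
    construction (an in-memory dict here, stable storage in reality)."""
    if v in saved:
        return set()               # v already joined this snapshot
    saved[v] = g.vertex_data[v]    # persist v's data
    # Spread the snapshot marker by scheduling unsaved neighbors.
    return {u for u in g.neighbors[v] if u not in saved}
```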
- One instance of GraphLab runs on each machine.
- These processes are symmetric and communicate via RPC.
- The first process additionally acts as the master and computes the placement of atoms based on the atom index.
- Each process maintains a local scheduler (for its vertices) and a cache to access remote data.
- A distributed consensus algorithm is used to decide when all the schedulers are empty (i.e., when the computation can terminate).
- The biggest strength of the paper is its extensive experiments.
- GraphLab benefits from background asynchronous communication and pipelined locking, but its communication layer is not as efficient as MPI's.