Out-of-core Graph Algorithms#

An important application case of GraphAr is to serve out-of-core graph processing scenarios. With the graph data saved as GAR files in the disk, GraphAr provides a set of reading interfaces to allow to load part of graph data into memory when needed, to conduct analytics. While it is more convenient and efficient to store the entirety of the graph in memory (as is done in BGL), out-of-core graph processing makes it possible to complete analytics on the large-scale graphs using limited memory/computing resources.

The are some out-of-core graph analytic algorithms that have been implemented based on GraphAr, include:

  • PageRank (PR)

  • Connected Components (CC)

  • Breadth First Search (BFS)

These algorithms represent for different compute patterns and are usually building blocks for constructing other graph algorithms.

PageRank#

PageRank (PR) is an algorithm used by Google Search to rank web pages in their search engine results. The source code of PageRank based on GraphAr located at pagerank_example.cc, and the explanations can be found in the Getting Started page.

Connected Components#

A weakly connected component is a maximal subgraph of a graph such that for every pair of vertices in it, there is an undirected path connecting them. And Connected Components (CC) is an algorithm to identify all weakly connected components in a graph. CC based on BGL is provided in GraphAr, also, we implement out-of-core algorithms for this application.

A typical method for calculating CC is label propagation. In this algorithm, each vertex is attached with a property which represents its component label, being its own vertex id initially. In the subsequent supersteps (i.e., iterations), a vertex will update its label if it receives a smaller id and then it propagates this id to all its neighbors.

This algorithm can be implemented based on streaming the edges via GraphAr’s reading interface. That is to say, the edges are accessed and processed chunk by chunk, instead of being loaded into memory at once (as in the BGL example).

// construct the edge collection in GraphAr
auto &edges = ...
auto it_begin = edges.begin(), it_end = edges.end();

// initialize for all vertices
std::vector<GraphArchive::IdType> component(num_vertices);
for (GraphArchive::IdType i = 0; i < num_vertices; i++)
  component[i] = i;

// stream all edges for each iteration
for (int iter = 0; ; iter++) {
  bool flag = false;
  for (auto it = it_begin; it != it_end; ++it) {
    GraphArchive::IdType src = it.source(), dst = it.destination();
    // update
    if (component[src] < component[dst]) {
        component[dst] = component[src];
        flag = true;
    } else if (component[src] > component[dst]) {
        component[src] = component[dst];
        flag = true;
    }
  }
  // check if it should terminate
  if (!flag) break;
}

The file cc_stream_example.cc located inside the source tree contains the complete implementation for this algorithm. Also, we can only process active vertices (the vertices which are updated in the last iteration) and the corresponding edges for each iteration, since an inactive vertex does not need to update its neighbors. Please refer to cc_push_example.cc for the complete code.

Tip

In this example, two kinds of edges are used. The ordered_by_source edges are used to access all outgoing edges of an active vertex, and ordered_by_dest edges are used to access the incoming edges. In this way, all the neighbors of an active vertex can be accessed and processed.

Although GraphAr supports to get the outgoing (incoming) edges of a single vertex for all adjList types, it is most efficient when the type is ordered_by_source (ordered_by_dest) since it can avoid to read redundant data.