GRASP aims to develop a system that delivers high performance on graph processing tasks by exploiting the resources available in a heterogeneous computing cluster. The key goals of this work are to provide: an easy-to-use programming interface; a compiler and runtime system that delivers high performance; and the ability to handle large problem sizes.
ASPIRE
Many vertex-centric graph algorithms can be expressed via asynchronous parallelism by relaxing certain read-after-write data dependences, thus allowing threads to compute vertex values using stale (i.e., not the most recent) values of their neighboring vertices. ASPIRE tolerates the high inter-node communication latency of a distributed shared-memory system by converting synchronous algorithms into their asynchronous counterparts. We observe that high inter-node communication latency can lead to excessive use of stale values, which delays the algorithm's convergence, while employing bounded staleness limits the degree to which the latency can be tolerated. ASPIRE addresses this problem with a novel relaxed memory consistency model and consistency protocol that simultaneously tolerate communication latency and minimize the use of stale values. For a range of asynchronous graph algorithms, ASPIRE on average outperforms algorithms based upon prior relaxed memory models that allow stale values by at least 2.27x, and the Bulk Synchronous Parallel (BSP) model by 4.2x. ASPIRE also performs well in comparison to GraphLab, a popular distributed graph processing framework. [OOPSLA'14]
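The core idea -- using possibly stale copies of neighbor values while a staleness bound caps how old those copies may be -- can be illustrated with a minimal C++ sketch. This is not ASPIRE's actual consistency protocol; the CachedValue, fetchRemote, and updateVertex names, and the PageRank-style combine step, are hypothetical stand-ins.

    #include <cstdint>
    #include <vector>

    // Hypothetical cached copy of a remote vertex value, tagged with the
    // iteration at which it was last refreshed.
    struct CachedValue {
        double value;
        std::uint64_t fetchedAt;
    };

    // Stub standing in for a (slow) fetch over the cluster interconnect.
    double fetchRemote(int /*vertexId*/) { return 1.0; }

    // One asynchronous update of vertex v under a bounded-staleness policy:
    // a cached neighbor value may be used as long as it is at most
    // staleBound iterations old; otherwise it must be refreshed first.
    double updateVertex(int v,
                        const std::vector<int>& neighbors,
                        std::vector<CachedValue>& cache,
                        std::uint64_t currentIter,
                        std::uint64_t staleBound) {
        double sum = 0.0;
        for (int u : neighbors) {
            CachedValue& c = cache[u];
            if (currentIter - c.fetchedAt > staleBound) {
                // Too stale: pay the communication latency and refresh.
                c.value = fetchRemote(u);
                c.fetchedAt = currentIter;
            }
            sum += c.value;  // Possibly stale, but within the bound.
        }
        return 0.15 + 0.85 * sum;  // e.g., a PageRank-style combine step
    }

A larger staleness bound hides more latency but feeds older values into the computation; ASPIRE's contribution is a consistency model and protocol that keeps both costs low at once.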
CuSha
Prior work on GPU graph processing employs the Compressed Sparse Row (CSR) format for its space efficiency; however, CSR suffers from irregular memory accesses and GPU underutilization that limit its performance. We have developed CuSha, a CUDA-based graph processing framework that overcomes these drawbacks via two novel graph representations: G-Shards and Concatenated Windows (CW). G-Shards uses a concept recently introduced for non-GPU systems that organizes a graph into autonomous sets of ordered edges called shards. CuSha's mapping of GPU hardware resources onto shards allows fully coalesced memory accesses. CW is a novel representation that enhances the use of shards to achieve higher GPU utilization when processing sparse graphs. Finally, CuSha fully utilizes the GPU by processing multiple shards in parallel on the GPU's streaming multiprocessors. Our experiments show that CuSha outperforms the state-of-the-art warp-centric method across a wide range of benchmarks and large real-world input graphs; for PageRank, an average speedup of 7.21x is observed across the input graphs. CuSha is available on GitHub for download. [HPDC'14]
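The shard layout that makes coalesced access possible can be shown with a host-side sketch. Each shard holds the edges whose destination vertices fall inside a contiguous window, and storing the per-edge fields as separate arrays (structure-of-arrays) is what lets consecutive GPU threads issue consecutive loads. The type and field names below are illustrative assumptions, not CuSha's actual definitions, and the loop is a CPU stand-in for the work one GPU thread block performs.

    #include <cstddef>
    #include <vector>

    // Hypothetical structure-of-arrays layout for one G-Shard: all edges whose
    // destination vertex falls in this shard's window, stored field-by-field so
    // that consecutive threads touch consecutive memory locations on the GPU.
    struct Shard {
        int windowStart, windowEnd;      // destination-vertex window [start, end)
        std::vector<int>    srcIndex;    // source vertex of each edge
        std::vector<double> srcValue;    // copy of each source vertex's value
        std::vector<int>    dstIndex;    // destination vertex of each edge
    };

    // CPU stand-in for processing one shard: every edge contributes its source
    // value to a destination vertex that lies inside this shard's window.
    void processShard(const Shard& s, std::vector<double>& vertexValue) {
        for (std::size_t e = 0; e < s.srcIndex.size(); ++e) {
            // On the GPU, thread e of the block would load srcValue[e] and
            // dstIndex[e]; since the arrays are contiguous, the loads coalesce.
            vertexValue[s.dstIndex[e]] += s.srcValue[e];
        }
    }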
PaRMAT
Synthetic R-MAT graphs are widely used to evaluate the response of various graph processing systems to inputs that imitate the characteristics of real-world graphs. Available R-MAT graph generators were not designed with extreme scalability in mind and usually fail to produce very large graphs. PaRMAT is a tool designed to create very large R-MAT graphs on machines with a limited amount of memory. To avoid the use of slow secondary storage during graph generation, it divides the probabilistic 2D adjacency matrix into smaller sub-matrices so that the edges generated in each sub-matrix fit in the available system memory. PaRMAT can also use multiple threads to fully utilize multicore processors and accelerate the graph generation process. In addition to generating large graphs where existing generators fail, PaRMAT creates moderately sized graphs faster than available R-MAT graph generators, while exhibiting a similar power-law degree distribution in the output. PaRMAT provides various options: directed or undirected output, disallowing duplicate edges, sorting the output, varying the R-MAT parameters, etc. It is available on GitHub for download.
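R-MAT places each edge by recursively descending into one quadrant of the adjacency matrix with probabilities (a, b, c, d); confining that recursion to a sub-matrix is what lets each batch of generated edges fit in memory. The sketch below shows only the quadrant-descent step for a single edge inside one square sub-matrix; it is illustrative and not PaRMAT's implementation, and the function and parameter names are assumptions.

    #include <cstdint>
    #include <random>
    #include <utility>

    // One R-MAT edge drawn inside the square sub-matrix
    // [rowLo, rowLo + size) x [colLo, colLo + size), where size is a power of two.
    // Each step picks quadrant a, b, c, or d (with d = 1 - a - b - c) and halves
    // the region, until a single cell (one edge) remains.
    std::pair<std::uint64_t, std::uint64_t>
    rmatEdge(std::uint64_t rowLo, std::uint64_t colLo, std::uint64_t size,
             double a, double b, double c, std::mt19937_64& rng) {
        std::uniform_real_distribution<double> uni(0.0, 1.0);
        while (size > 1) {
            std::uint64_t half = size / 2;
            double r = uni(rng);
            if (r < a) {                     // top-left quadrant: indices unchanged
            } else if (r < a + b) {          // top-right quadrant
                colLo += half;
            } else if (r < a + b + c) {      // bottom-left quadrant
                rowLo += half;
            } else {                         // bottom-right quadrant
                rowLo += half;
                colLo += half;
            }
            size = half;
        }
        return {rowLo, colLo};               // (source, destination) vertex ids
    }

Because each sub-matrix can be handed to a different thread, this same step also underlies PaRMAT's multi-threaded generation.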
InfiniMem
Most existing graph processing frameworks make programming easier but simply fail when the graph being processed does not fit in the memory of the target machine. InfiniMem is a framework that enables generation and processing of large datasets that do not fit in available memory by letting them reside on disk. InfiniMem preserves ease of programming by allowing the programmer to code the application in a size-oblivious fashion: the programmer simply identifies which collections of data items could potentially grow large, and the task of memory and I/O management for applications that process such collections is transparently handled by InfiniMem. Our experiments show that InfiniMem can successfully generate a mesh with 7.5 million nodes and 300 million edges (4.5 GB on disk) in 40 minutes, and that it performs the PageRank computation on a 14 GB graph with 134 million vertices and 805 million edges at 14 minutes per iteration on an 8-core machine with 8 GB RAM. Many graph generators and processing frameworks cannot handle such large graphs.
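The size-oblivious programming model can be pictured with a tiny container sketch: the programmer declares a collection as potentially large, and element access transparently goes to a backing file instead of memory. LargeCollection and its interface are hypothetical, not InfiniMem's API, and a real framework would additionally batch I/O and cache hot records in memory.

    #include <cstddef>
    #include <cstdio>
    #include <string>

    // Hypothetical size-oblivious container: fixed-size records of type T live
    // in a backing file, so the collection can grow far beyond available RAM.
    template <typename T>
    class LargeCollection {
    public:
        explicit LargeCollection(const std::string& path)
            : file_(std::fopen(path.c_str(), "w+b")) {}
        ~LargeCollection() { if (file_) std::fclose(file_); }

        void put(std::size_t index, const T& record) {
            std::fseek(file_, static_cast<long>(index * sizeof(T)), SEEK_SET);
            std::fwrite(&record, sizeof(T), 1, file_);
        }

        T get(std::size_t index) const {
            T record{};
            std::fseek(file_, static_cast<long>(index * sizeof(T)), SEEK_SET);
            if (std::fread(&record, sizeof(T), 1, file_) != 1) {
                // Unwritten slot: return a default-initialized record.
            }
            return record;
        }

    private:
        std::FILE* file_;
    };

    // Usage: vertex records are written and read by id without the application
    // ever worrying about whether the whole collection fits in memory.
    struct VertexRecord { double rank; int degree; };

    int main() {
        LargeCollection<VertexRecord> vertices("/tmp/vertices.bin");
        vertices.put(42, VertexRecord{1.0, 3});
        VertexRecord v = vertices.get(42);
        return v.degree == 3 ? 0 : 1;
    }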