GRASP is aimed at developing scalable, high-performance solutions for graph processing that employ the resources available on a heterogeneous computing cluster. The key goal of this work is to develop easy-to-use programming frameworks with sophisticated runtimes that handle diverse and emerging scenarios. To date, GRASP has achieved the following:
ASYNCHRONOUS DISTRIBUTED PROCESSING

The long communication latencies involved in cluster-based processing heavily impact the performance of graph processing frameworks. While communication cannot be completely eliminated, the inherent asynchrony of graph algorithms allows the inter-node latency to be tolerated by using stale vertex values. ASPIRE [OOPSLA'14] introduces a novel relaxed memory consistency model & protocol that simultaneously tolerates communication latency and minimizes the use of stale values. CoRAL [ASPLOS'17a] is the first fault-tolerance protocol for distributed asynchronous graph processing. It uses local snapshots, avoids rollback, and leverages the algorithmic asynchrony to advance the global execution state to a resumable state from which forward progress produces accurate results.
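Why stale values are safe to use can be seen with a monotone algorithm such as shortest paths: relaxations against an out-of-date copy may be wasted, but never produce a wrong final answer, and refreshing the copy periodically bounds the staleness. The following minimal sketch illustrates this idea only; the function and parameter names are hypothetical and this is not ASPIRE's actual consistency protocol.

```python
# Sketch: SSSP that relaxes edges against possibly stale "remote" copies
# of vertex values, refreshing the copies only every few iterations.
# Monotone (min-based) algorithms still converge to the exact answer.
# (Hypothetical names; not the actual ASPIRE protocol.)

INF = float('inf')

def sssp_with_staleness(edges, n, source, refresh_every=3):
    dist = [INF] * n                 # fresh local values
    dist[source] = 0
    stale = list(dist)               # copies that "other nodes" read
    for it in range(1, 1000):
        changed = False
        for u, v, w in edges:
            # read the possibly stale copy instead of the fresh value
            if stale[u] + w < dist[v]:
                dist[v] = stale[u] + w
                changed = True
        if not changed and stale == dist:
            break                    # fixed point reached
        if it % refresh_every == 0:  # periodic refresh bounds staleness
            stale = list(dist)
    return dist
```

A larger `refresh_every` models tolerating more staleness: fewer synchronizations at the cost of extra (but harmless) iterations.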
MULTIQUERY & BATCHING SCENARIOS

We have developed the distributed MultiLyra [BigData'19] system, whose scalability via overhead amortization enables the simultaneous evaluation of batches of hundreds of iterative graph queries. BEAD [BigData'20] extends MultiLyra to scenarios in which a batch of queries must be continuously reevaluated due to changes to the graph. For the shared-memory setting, we have developed VRGQ [SIGOPS'21], which employs value reuse to speed up the evaluation of future graph queries using the results of prior queries. SimGQ [HiPC'20] is an online system that combines value reuse with batching: it optimizes the evaluation of a batch of queries by reusing the results of common subcomputations. In recent work we are developing a batched system for evaluating point-to-point queries. It is based upon the PnP [ASPLOS'19] system, which efficiently evaluates point-to-point graph queries. The distinctive feature of PnP is its bidirectional phase followed by a unidirectional phase whose direction is selected dynamically.
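The amortization behind batched evaluation can be sketched simply: give each vertex one value per query in the batch, so a single sweep over the edge list serves every query at once instead of traversing the graph once per query. This is only an illustration of the principle, with hypothetical names, not MultiLyra's distributed implementation.

```python
# Sketch: batched evaluation of multiple BFS-distance queries.
# Each vertex carries a vector of values, one per query, so one edge
# visit updates the whole batch (overhead amortization).

INF = float('inf')

def batched_bfs(edges, n, sources):
    q = len(sources)
    dist = [[INF] * q for _ in range(n)]   # dist[v][i]: distance for query i
    for i, s in enumerate(sources):
        dist[s][i] = 0
    changed = True
    while changed:
        changed = False
        for u, v in edges:                 # one traversal serves all queries
            for i in range(q):
                if dist[u][i] + 1 < dist[v][i]:
                    dist[v][i] = dist[u][i] + 1
                    changed = True
    return dist
```

The graph-traversal cost (the outer loop over edges) is paid once per iteration regardless of how many queries are in the batch.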
STREAMING & EVOLVING GRAPHS

To process graphs that change over time, EvoG [TACO'16] incorporates optimizations for incrementally processing multiple graph snapshots in a pipelined fashion. We have developed BEAD for the scenario in which the graph changes, requiring the simultaneous evaluation of a batch of queries to be continuously repeated. For the streaming-graph scenario we have developed KickStarter [ASPLOS'17b], a dynamic dependence-tracking technique that correctly and efficiently reevaluates an installed query in the presence of edge additions and deletions. Our Tripoline [EuroSys'21] system generalizes the streaming model to support incremental evaluation of arbitrary queries from the results of a fixed set of (different) installed queries. It essentially employs value reuse, using the results of installed queries to incrementally evaluate other user queries.
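The core difficulty dependence tracking addresses is edge deletion: a vertex value derived through a deleted edge is no longer trustworthy, and neither is anything computed from it. A minimal sketch of the idea, in the spirit of KickStarter but with hypothetical code and names (not its actual implementation): record the parent edge each value came from, and on deletion reset only the transitively dependent vertices before resuming propagation.

```python
# Sketch: dependence tracking for streaming edge deletions.
# Each vertex remembers the parent its SSSP value was derived from;
# deleting an edge invalidates only the vertices that transitively
# depend on it, which are reset and recomputed incrementally.

INF = float('inf')

def sssp(edges, n, src, dist=None, parent=None):
    dist = dist or [INF] * n
    parent = parent or [None] * n
    dist[src] = 0
    changed = True
    while changed:
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v], parent[v], changed = dist[u] + w, u, True
    return dist, parent

def delete_edge(edges, n, src, dist, parent, edge):
    edges.remove(edge)
    u, v, _ = edge
    if parent[v] != u:           # value did not depend on the deleted edge
        return dist, parent
    tainted = {v}                # taint v and its transitive dependents
    grew = True
    while grew:
        grew = False
        for x in range(n):
            if parent[x] in tainted and x not in tainted:
                tainted.add(x)
                grew = True
    for x in tainted:            # reset only the unsafe values
        dist[x], parent[x] = INF, None
    return sssp(edges, n, src, dist, parent)   # resume from the safe state
```

Untainted values are provably unaffected by the deletion, so propagation resumes from them instead of restarting the whole query.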
CUSTOM ACCELERATORS & GPUs FOR GRAPH PROCESSING

In collaboration with Dr. Abu-Ghazaleh's group we have designed the GraphPulse [MICRO'20] custom accelerator for graph processing, which implements an asynchronous event-driven model of computation directly in hardware. GraphPulse eliminates a number of inefficiencies associated with BSP-based Push and Pull models. We have further generalized this accelerator to allow incremental reevaluation of graph queries following graph updates in the form of edge additions and deletions. The resulting JetStream [MICRO'21] accelerator is the first graph-query accelerator that handles the streaming-graph scenario.
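In an event-driven model, updates travel as events rather than as barrier-separated supersteps, and events targeting the same vertex can be coalesced while queued. The software sketch below emulates that execution style for shortest paths (coalescing by min); it is a hypothetical illustration of the model, not GraphPulse's hardware design.

```python
# Sketch: asynchronous event-driven SSSP. An event carries a candidate
# value for a vertex; pending events for the same vertex coalesce in the
# queue (min), so each vertex has at most one outstanding event.

INF = float('inf')

def event_driven_sssp(adj, n, src):
    dist = [INF] * n
    events = {src: 0}                      # vertex -> coalesced pending value
    while events:
        v, d = min(events.items(), key=lambda kv: kv[1])
        del events[v]
        if d >= dist[v]:                   # event can no longer improve v
            continue
        dist[v] = d                        # apply event to vertex state
        for u, w in adj.get(v, []):        # emit events to out-neighbors
            nd = d + w
            if nd < events.get(u, INF):    # coalesce: keep only the best
                events[u] = nd
    return dist
```

Coalescing keeps the event population bounded by the number of vertices, one of the properties an in-hardware event queue can exploit.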
CuSha [HPDC'14] is a CUDA-based graph processing framework that uses two novel graph representations, G-Shards and Concatenated Windows, to enable coalesced memory accesses and overcome GPU under-utilization. WS-VR [PACT'15] is a multi-GPU graph processing framework that delivers high SIMD efficiency and inter-GPU communication efficiency. Solutions to branch divergence and load balancing can be found in CCC [MICRO'15] and CTE [IPDPS'16]. In recent work we present G-Morph, which uses a GPU to accelerate the subgraph isomorphism problem [Euro-Par'21]. We are also developing techniques for efficiently processing large graphs that cannot fit in GPU memory. In work led by Dr. Zhao's group, we presented Subway [EuroSys'20], a partition-based graph processing framework that minimizes the overhead of transferring subgraphs between host and GPU memory.
LIMITED MEMORY SINGLE-MACHINE GRAPH GENERATION, PARTITIONING & PROCESSING

While today's machines are well equipped with multiple cores and large main memories, ever-growing graph sizes pose a memory scalability challenge when processing real-world graphs on a single machine. For in-memory scaling, our Input Reduction [HPDC'16] based processing transforms the original graph into a smaller graph that is used to compute an intermediate approximation usable for incremental processing over the original graph. For out-of-core graph processing, Dynamic Partitions [USENIX ATC'16] truly capture the dynamic nature of graph computation by adapting, at runtime, the set of edges that are held in memory. This approach greatly reduces the I/O required for out-of-core graph processing.
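The intuition behind adapting the in-memory edge set can be shown with a toy simulation: as vertices converge and become inactive, the edges sourced at them stop contributing, so each iteration's "partition" only needs the edges of still-active vertices. This is a loose, hypothetical illustration of the principle (disk is simulated with an in-memory partition list), not the Dynamic Partitions implementation.

```python
# Sketch: out-of-core-style SSSP that, per iteration, "loads" from each
# disk-resident partition only the edges whose source vertex is still
# active, so the working set shrinks as the computation converges.

INF = float('inf')

def out_of_core_sssp(partitions, n, src):
    dist = [INF] * n
    dist[src] = 0
    active = {src}
    io_edges = 0                     # edges brought into memory, a proxy for I/O
    while active:
        next_active = set()
        for part in partitions:      # stream over on-"disk" partitions
            useful = [e for e in part if e[0] in active]
            io_edges += len(useful)  # only the adapted edge set is loaded
            for u, v, w in useful:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
                    next_active.add(v)
        active = next_active
    return dist, io_edges
```

As `active` shrinks across iterations, the number of edges charged to `io_edges` falls, mirroring how adapting partitions to the computation reduces I/O.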
We have also developed graph generation and graph partitioning algorithms under limited memory constraints. PaRMAT [GitHub] is a tool designed to create very large RMAT graphs on machines with a limited amount of memory. It provides various options for the generated RMAT graph: directed or undirected edges, disallowing duplicate edges, etc. InfiniMem [LCPC'15] enables the generation and processing of large graphs that do not fit in memory by keeping them on disk while allowing the programmer to code applications in a size-oblivious fashion. Our GO [NAS'21] out-of-core system performs graph partitioning on large input data sets that do not fit in memory. Finally, we have developed OMR [ISMM'18], an out-of-core system for MapReduce computations on a single machine with limited memory, and have extended it to support graph partitioning and partition-based graph processing on a single machine.