GRASP is aimed at developing scalable, high-performance solutions for graph processing that employ the resources available on a heterogeneous computing cluster. The key goal of this work is to develop easy-to-use programming frameworks with sophisticated runtimes that handle diverse and emerging scenarios. To date, GRASP has achieved the following:

MULTI-QUERY & BATCHING SCENARIOS
We have developed the distributed MultiLyra [BigData'19] system, whose scalability via overhead amortization enables simultaneous evaluation of batches of hundreds of iterative graph queries. BEAD [BigData'20] extends MultiLyra to scenarios in which a batch of queries must be continuously reevaluated due to changes to the graph. For the shared-memory setting, we have developed VRGQ [SIGOPS'21], which employs value reuse to speed up the evaluation of future graph queries using the results of prior queries. SimGQ [HiPC'20] is an online system that combines value reuse with batching: it optimizes the evaluation of a batch of queries by reusing the results of common subcomputations. SimGQ+ [JPDC'22] extends this batched system to evaluate point-to-point queries. It is based upon the PnP [ASPLOS'19] system, which efficiently evaluates point-to-point graph queries; the distinctive feature of PnP is its bidirectional phase followed by a unidirectional phase based upon dynamic direction selection. Our most recent work on Glign [ASPLOS'23a] develops a set of techniques that improve the efficiency with which a batch of queries is evaluated by aligning their associated graph traversals. Our recent work also shows the use of a batched PnP solver in designing the P4 [IROS'24] Multi-Agent Path Finding algorithm that scales to several hundred agents.
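The core batching idea, amortizing one traversal's overhead across many queries by giving each vertex a vector of per-query values, can be sketched as follows (an illustrative shared-memory sketch, not the MultiLyra implementation; all names are ours):

```python
from collections import deque

def batched_sssp(graph, sources):
    """Evaluate a batch of shortest-path queries in one shared traversal.

    graph: dict mapping vertex -> list of (neighbor, weight) edges.
    sources: one source vertex per query in the batch.
    Each vertex holds a vector of distances, one slot per query, so the
    cost of visiting a vertex and scanning its edges is paid once for
    the whole batch instead of once per query.
    """
    INF = float("inf")
    nq = len(sources)
    dist = {v: [INF] * nq for v in graph}
    for q, s in enumerate(sources):
        dist[s][q] = 0
    worklist = deque(sources)
    queued = set(sources)
    while worklist:
        u = worklist.popleft()
        queued.discard(u)
        for v, w in graph[u]:
            changed = False
            for q in range(nq):
                if dist[u][q] + w < dist[v][q]:
                    dist[v][q] = dist[u][q] + w
                    changed = True
            if changed and v not in queued:
                queued.add(v)
                worklist.append(v)
    return dist
```

Relaxing all slots of a vertex's vector in one visit is the simplest form of the overhead amortization described above; the production systems add much more (distribution, alignment of traversals, reuse across batches).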
STREAMING & EVOLVING GRAPHS
To process graphs that change over time, EvoG [TACO'16] incorporates optimizations for incrementally processing multiple graph snapshots in a pipelined fashion. We have developed BEAD [BigData'20] for the scenario in which the graph is allowed to change, requiring the simultaneous evaluation of a batch of queries to be continuously repeated. For the streaming-graph scenario, we have developed KickStarter [ASPLOS'17b], a dynamic dependence-tracking technique that correctly and efficiently reevaluates an installed query in the presence of edge additions and deletions. Our Tripoline [EuroSys'21] system generalizes the streaming model to support incremental evaluation of arbitrary queries from the results of a fixed set of (different) installed queries; it essentially employs value reuse by using the results of installed queries to incrementally evaluate other user queries. Our most recent work on CommonGraph [ASPLOS'23b] develops a technique for efficiently supporting edge deletions in an evolving graph while scaling the parallel evaluation of a query across a large number of snapshots.
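The dependence-tracking idea behind correct reevaluation under deletions can be illustrated with a small sketch: each vertex records which neighbor its current value was derived from, so an edge deletion invalidates only the vertices that transitively depended on the deleted edge. This is a simplified single-machine model, not KickStarter's actual implementation; all names are illustrative.

```python
from collections import deque

INF = float("inf")

def sssp_with_deps(graph, source):
    """Shortest paths that also record, per vertex, the neighbor its
    current value was derived from (the dependence information)."""
    dist = {v: INF for v in graph}
    parent = {v: None for v in graph}
    dist[source] = 0
    work = deque([source])
    while work:
        u = work.popleft()
        for v, w in graph[u]:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                parent[v] = u
                work.append(v)
    return dist, parent

def delete_edge(graph, dist, parent, u, v):
    """Remove edge (u, v) and incrementally repair the query result:
    only values that transitively depended on (u, v) are recomputed."""
    graph[u] = [(x, w) for (x, w) in graph[u] if x != v]
    if parent[v] != u:
        return  # v's value did not depend on the deleted edge
    # Invalidate v and every vertex whose value was derived from it.
    tainted, frontier = {v}, deque([v])
    while frontier:
        x = frontier.popleft()
        for y in graph:
            if parent[y] == x and y not in tainted:
                tainted.add(y)
                frontier.append(y)
    for x in tainted:
        dist[x], parent[x] = INF, None
    # Re-propagate from the trusted (untainted) boundary.
    work = deque(x for x in graph if dist[x] < INF)
    while work:
        a = work.popleft()
        for b, w in graph[a]:
            if dist[a] + w < dist[b]:
                dist[b] = dist[a] + w
                parent[b] = a
                work.append(b)
```

Simply keeping the old values after a deletion would be incorrect (they may rest on the removed edge); recomputing from scratch would be wasteful. Tracking the value's provenance is what makes the middle ground safe.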
CUSTOM ACCELERATORS & GPUs FOR GRAPH PROCESSING
In collaboration with Dr. Abu-Ghazaleh's group, we have designed the GraphPulse [MICRO'20] custom accelerator for graph processing, which implements an asynchronous event-driven model of computation directly in hardware. GraphPulse eliminates a number of inefficiencies associated with BSP-based Push and Pull models. We have further generalized this accelerator to allow incremental reevaluation of graph queries following graph updates in the form of edge additions and deletions: the JetStream [MICRO'21] accelerator is the first graph-query accelerator to handle the streaming-graph scenario. We have recently developed the MEGA [MICRO'23] accelerator, which simultaneously processes multiple snapshots of an evolving graph.
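The asynchronous event-driven model that GraphPulse realizes in hardware can be modeled in software as follows (a behavioral sketch only, under our own naming): events carrying candidate values are coalesced per target vertex, and events that cannot improve a vertex's state are simply dropped.

```python
def event_driven_sssp(graph, source):
    """Behavioral model of asynchronous event-driven graph processing.

    Events carry a candidate value for a target vertex; events aimed at
    the same vertex are coalesced (here with min) before being applied,
    and an event that cannot improve the vertex's state is dropped.
    There is no notion of a global iteration or BSP superstep.
    """
    INF = float("inf")
    dist = {v: INF for v in graph}
    pending = {source: 0}  # coalesced events: vertex -> best waiting value
    while pending:
        v, val = pending.popitem()
        if val < dist[v]:            # apply the event only if it helps
            dist[v] = val
            for u, w in graph[v]:    # emit follow-on events to neighbors
                pending[u] = min(pending.get(u, INF), val + w)
    return dist
```

Because updates fire only where events arrive, vertices with no pending work consume no cycles; this is one of the inefficiencies of global BSP supersteps that the event-driven model avoids.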
CuSha [HPDC'14] is a CUDA-based graph processing framework that uses two novel graph representations, G-Shards and Concatenated Windows, to allow coalesced memory accesses and overcome GPU underutilization. WSVR [PACT'15] is a multi-GPU graph processing framework that delivers high SIMD efficiency and inter-GPU communication efficiency. Solutions to branch divergence and load balancing can be found in CCC [MICRO'15] and CTE [IPDPS'16]. In recent work, we present GMorph, which uses a GPU to accelerate the subgraph isomorphism problem [EuroPar'21]. We are also developing techniques for efficiently processing large graphs that cannot fit in GPU memory. In work led by Dr. Zhao's group, we presented Subway [EuroSys'20], a partition-based graph processing framework that minimizes the overhead of transferring subgraphs between host and GPU memory.

LIMITED-MEMORY SINGLE-MACHINE GRAPH GENERATION, PARTITIONING & PROCESSING
While today's machines are well equipped with multiple cores and large main memories, ever-growing graph sizes pose a memory-scalability challenge when processing real-world graphs on single machines.
For in-memory scaling, our Core Graph [EuroSys'24] work identifies a proxy graph of a power-law graph that contains roughly 10% of the edges and yet produces precise results for over 95% of vertices across random queries.
In addition, our Input Reduction [HPDC'16] based processing transforms the original graph into a smaller graph that is used to compute an intermediate approximation, which is then usable for incremental processing over the original graph.
For out-of-core graph processing, Dynamic Partitions [USENIX ATC'16] capture the dynamic nature of graph computation by adapting, at runtime, the set of edges that are held in memory. This approach greatly reduces I/O in out-of-core graph processing. We have also exploited a similar approach for distributed systems via ExpressWay [BigGraph'23].
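The payoff of adapting the in-memory edge set can be modeled with a small sketch (our own illustrative model, not the Dynamic Partitions implementation): each iteration touches only the partitions that contain an active vertex, so partitions with no remaining work incur no simulated disk reads.

```python
def out_of_core_sssp(partitions, num_vertices, source):
    """Model of out-of-core processing with a shrinking active edge set.

    partitions: dict of partition id -> list of (src, dst, weight) edges;
    a dict lookup stands in for reading a partition from disk. Each
    iteration only "reads" partitions containing an active source vertex,
    so simulated I/O shrinks as the computation converges.
    """
    INF = float("inf")
    dist = [INF] * num_vertices
    dist[source] = 0
    active = {source}
    reads = 0  # number of simulated partition reads
    while active:
        next_active = set()
        for pid, edges in partitions.items():
            if not any(u in active for u, _, _ in edges):
                continue   # no active work: this partition stays on disk
            reads += 1     # this partition would be fetched from disk
            for u, v, w in edges:
                if u in active and dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
                    next_active.add(v)
        active = next_active
    return dist, reads
```

A static scheme that streams every partition every iteration would pay for all of them each round; skipping partitions without active work is the source of the I/O reduction described above.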
We have also developed graph generation and graph partitioning algorithms under limited-memory constraints. PaRMAT [GitHub] is a tool designed to create very large RMAT graphs on machines with a limited amount of memory. It provides various options for the RMAT graph: directed or undirected, disallowing duplicate edges, etc. InfiniMem [LCPC'15] enables generation and processing of large graphs that do not fit in memory by having them reside on disk, allowing the programmer to code applications in a size-oblivious fashion. Our GO [NAS'21] out-of-core system performs graph partitioning on large input data sets that do not fit in memory. Finally, we have developed the OMR [ISMM'18] out-of-core system for MapReduce computations on a single machine with limited memory. We have built upon this system to develop OMRGx, which transparently performs out-of-core partitioning and partition-based graph processing on a single machine [ISMM'23].

ASYNCHRONOUS DISTRIBUTED PROCESSING
The long communication latencies involved in cluster-based processing heavily impact the performance of graph processing frameworks. While communication cannot be completely eliminated, the inherent asynchrony of graph algorithms allows internode latency to be tolerated by using stale vertex values. ASPIRE [OOPSLA'14] introduces a novel relaxed memory consistency model and protocol that simultaneously tolerates communication latency and minimizes the use of stale values. CoRAL [ASPLOS'17a] is the first fault-tolerance protocol for distributed asynchronous graph processing. It uses local snapshots, avoids rollback, and leverages algorithmic asynchrony to advance the global execution state to a resumable state from which forward progress produces accurate results.
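The bounded use of stale values can be sketched as a staleness-bounded software cache (an illustrative model under our own naming, not ASPIRE's protocol): a cached copy of a remote value may serve reads for at most a fixed number of iterations before a refresh is forced.

```python
class StalenessBoundedCache:
    """Relaxed-consistency read sketch: a locally cached copy of a
    remote vertex value may serve reads while it is at most `bound`
    iterations old; older copies force a (simulated) remote fetch.
    Names and structure are illustrative, not ASPIRE's actual API."""

    def __init__(self, remote, bound):
        self.remote = remote   # dict standing in for the remote store
        self.bound = bound     # maximum tolerated staleness, in iterations
        self.cache = {}        # key -> (value, iteration when fetched)
        self.fetches = 0       # latency-bearing remote accesses performed

    def read(self, key, iteration):
        if key in self.cache:
            value, fetched = self.cache[key]
            if iteration - fetched <= self.bound:
                return value   # bounded-stale value: no remote latency
        self.fetches += 1      # cache miss or too stale: pay the latency
        value = self.remote[key]
        self.cache[key] = (value, iteration)
        return value
```

Setting `bound = 0` recovers a fully consistent (but latency-bound) read, while a large bound maximizes latency tolerance at the cost of staler values; the tension between these two extremes is exactly what the relaxed consistency protocol manages.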

