HPC-Based Algorithms for Network Science

(Motivation | Overview | Publications)

Motivation

Over the past decade, network analysis and mining has emerged as a major research area. Its central problems include collecting and managing network data; modeling and generating random networks; and developing network mining algorithms for tasks such as frequent pattern mining, clustering and classification, centrality analysis, triangle counting, clustering coefficients, eigenvalue computation, community detection, information diffusion, and epidemic spread. Networked complex systems are abstracted as random graphs so that they can be analyzed with mathematical rigor, and a variety of social network theories, network metrics, topologies, and mathematical models have been proposed to understand their underlying properties and relationships. As real-world complex networks grow, it has become necessary to generate correspondingly massive random networks efficiently. A smaller network may not exhibit the same behavior as a larger one, even when both are generated from the same model: it has been shown that the structure of large networks is fundamentally different from that of small networks, and many patterns emerge only at massive scale. Generating such networks therefore demands algorithms that are efficient in both running time and memory consumption.

Overview

I have designed, implemented, and analyzed MPI-based distributed-memory parallel algorithms for generating massive random networks, with hundreds of billions of nodes and edges, using various models. Many random network models have been developed to capture the diversity of complex systems, including the Erdős–Rényi (ER), stochastic blockmodel (SB), small-world, preferential attachment (PA), Chung-Lu (CL), highly optimized tolerance (HOT), exponential random graph (ERGM), recursive matrix (R-MAT), stochastic Kronecker graph (SKG), and block two-level Erdős–Rényi (BTER) models. Different models capture different characteristics of networks.
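
To make the generation task concrete, the sketch below samples a graph from the simplest of these models, Erdős–Rényi G(n, p), by flipping a biased coin for every node pair. This is a minimal illustration of the model definition only, not any of the algorithms discussed here; its O(n^2) loop is exactly the cost that efficient generators avoid, for example by skipping over non-edges in geometrically distributed jumps.

    import random

    def erdos_renyi(n, p, seed=None):
        # G(n, p): each of the n*(n-1)/2 possible edges is
        # included independently with probability p.
        rng = random.Random(seed)
        edges = []
        for u in range(n):
            for v in range(u + 1, n):
                if rng.random() < p:
                    edges.append((u, v))
        return edges

    # Example: 1,000 nodes with expected degree about 5.
    print(len(erdos_renyi(1000, 5 / 999, seed=42)))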

We have designed and developed distributed-memory parallel algorithms for the PA, CL, SB, BTER, and small-world models. To achieve the best performance, we analyzed the nature of the computational dependencies and developed novel, effective parallel load-balancing techniques. As a result, our algorithms are highly scalable and can generate massive random networks; for instance, our parallel algorithm for the preferential attachment model generates a network with 400 billion edges in just five minutes using 1,024 processors. We also developed a novel algorithmic method for the CL model (sketched in its naive form below) that outperforms the best-known previous algorithms by a significant margin in both time and space. Experimental results on various large-scale networks show that our sequential and parallel algorithms require 400-3,000 times less memory than the existing sequential and parallel algorithms, respectively, making them suitable for generating very large-scale networks; moreover, both are about 3-4 times faster than their existing counterparts. The method is highly efficient, extensible, and scalable, and we have extended it to generate networks using the SB and BTER models. For example, using the BTER model we can generate a network with 5 billion edges, the size of the largest known real-world network, in just 0.37 seconds on 1,024 processors, whereas the best previously known algorithm for the BTER model requires 1,350 seconds to generate the same network.
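
For reference, the sketch below states the Chung-Lu model definition directly: given a sequence of expected degrees w, edge (i, j) appears independently with probability min(1, w_i * w_j / sum(w)). This naive version examines every node pair and so costs O(n^2) time and is shown only to fix the target edge distribution; the algorithmic method described above is, at a high level, a way to sample from that same distribution without paying the quadratic cost.

    import random

    def chung_lu(weights, seed=None):
        # Chung-Lu model: edge (i, j) appears independently with
        # probability min(1, w_i * w_j / S), where S = sum(weights),
        # so node i's expected degree is approximately weights[i].
        rng = random.Random(seed)
        s = float(sum(weights))
        n = len(weights)
        edges = []
        for i in range(n):
            for j in range(i + 1, n):
                if rng.random() < min(1.0, weights[i] * weights[j] / s):
                    edges.append((i, j))
        return edges

    # Example: a small network with a skewed expected-degree sequence.
    print(chung_lu([10, 8, 5, 3, 2, 2, 1, 1], seed=7))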

Publications

This work has led to another book chapter, a journal paper, and two conference papers. An additional article has recently been submitted to a top-tier conference.