KSC-net: Community Detection for Big Data Networks

doi:10.1201/b18050-14

CONTENTS

9.1 Introduction
9.2 KSC for Big Data Networks
    9.2.1 Notations
    9.2.2 FURS Selection
    9.2.3 KSC Framework
        9.2.3.1 Training Model
        9.2.3.2 Model Selection
        9.2.3.3 Out-of-Sample Extension
    9.2.4 Practical Issues
9.3 KSC-net Software
    9.3.1 KSC Demo on Synthetic Network
    9.3.2 KSC Subfunctions
    9.3.3 KSC Demo on Real-Life Network
9.4 Conclusion
Acknowledgments
References

ABSTRACT

In this chapter, we demonstrate the applicability of the kernel spectral clustering (KSC) method for community detection in Big Data networks. We give a practical exposition of the KSC method [1] on large-scale synthetic and real-world networks with up to 10^6 nodes and 10^7 edges. The KSC method uses a primal-dual framework to construct a model on a smaller subset of the Big Data network, because the kernel matrix of the original large-scale network cannot fit in memory. We therefore select smaller subgraphs using a fast and unique representative subset (FURS) selection technique, as proposed in Reference 2. These subsets are used for training and validation, respectively, to build the model and obtain the model parameters. This results in a powerful out-of-sample extension property, which allows the community affiliation of unseen nodes to be inferred. The KSC model requires a kernel function, which can have kernel parameters, and also requires identifying the number of clusters k in the network. A memory-efficient and computationally efficient model selection technique named balanced angular fitting (BAF), based on angular similarity in the eigenspace, is proposed in Reference 1. Another, parameter-free KSC model is proposed in Reference 3. There, the model selection technique exploits the structure of the projections in the eigenspace to automatically identify the number of clusters, and suggests that a normalized linear kernel is sufficient for networks with millions of nodes. This model selection technique uses the concept of entropy and balanced clusters to identify the number of clusters k.
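As a rough illustration of why a normalized linear kernel can be evaluated on a small node subset without forming the full kernel matrix, the sketch below computes cosine similarities between adjacency rows of a toy graph. The kernel form (cosine similarity of adjacency rows), the function name `normalized_linear_kernel`, the toy graph, and the hand-picked subset are all assumptions made for this example; in particular, the subset is only a stand-in for a FURS-selected subset and is not produced by FURS.

```python
import math


def normalized_linear_kernel(adjacency, nodes):
    """Cosine similarity between the adjacency rows of the given nodes.

    adjacency: dict mapping node -> set of neighbours (unweighted graph).
    nodes: small subset of nodes (hypothetical stand-in for a training subset).
    Returns a len(nodes) x len(nodes) list of lists.
    """
    kernel = []
    for u in nodes:
        row = []
        for v in nodes:
            # Dot product of two 0/1 adjacency rows = number of shared neighbours.
            shared = len(adjacency[u] & adjacency[v])
            # Norm of a 0/1 adjacency row = sqrt(degree).
            norm = math.sqrt(len(adjacency[u]) * len(adjacency[v]))
            row.append(shared / norm if norm > 0 else 0.0)
        kernel.append(row)
    return kernel


# Toy graph (assumed for illustration): triangles {0,1,2} and {3,4,5}
# joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adjacency = {n: set() for n in range(6)}
for a, b in edges:
    adjacency[a].add(b)
    adjacency[b].add(a)

# Hand-picked subset; only a 4 x 4 kernel block is ever built.
subset = [0, 1, 4, 5]
K = normalized_linear_kernel(adjacency, subset)

for row in K:
    print(["%.2f" % value for value in row])
```

In this toy case, nodes from the same triangle share a neighbour and obtain a positive similarity, while nodes from different triangles obtain zero, which is the kind of block structure a spectral method can exploit. Only the adjacency lists of the subset nodes are touched, so memory use scales with the subset size rather than with the full network.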