Efficient Overlapping Community Detection Using MapReduce-based Fuzzy C-Means Clustering
on Seed Nodes
KaliaperumalKumaravel1
RathodMeenakshi L.2
RajuLeo3
MahilJ.4
-
( Department of Orthodontics, Saveetha Dental College and Hospitals, Saveetha Institute
of Medical and Technical Sciences (SIMATS), Saveetha University, Chennai, India
kumaravelk.sdc@saveetha.com)
-
( Department of ECE, Dr. Ambedkar Institute of Technology, Outer Ring Road, Near Jnana
Bharathi Campus, Mallathahalli, Bengaluru, Karnataka, India meenakshilr.ec@drait.edu.in)
-
( Department of Electrical and Electronics Engineering, Sri Sivasubramaniya Nadar
College of Engineering, Rajiv Gandhi Salai (Omr), Kalavakkam, Tamilnadu, India leor@ssn.edu.in)
-
( Department of Electrical and Electronics Engineering, K.Ramakrishnan College of Technology,
Trichy, Tamilnadu, India \qquad mahilj.eee@krct.ac.in)
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Keywords
Leader nodes, Community detection social network, Big data, Map reduce
1. Introduction
The data analysis of social network and the social network are noticed as more and
more in recent study in Data Mining, Data Analysis, Machine Learning, Graph Mining
and Tweets Classification [1]. The community identification [2] describes at group of nodes concerning to the set of communities to make as strongly
linked sub- graphs from the graph [3]. The networks have modeled as graphs to detect the communities is called graph partition
issue in modern graph theory [4], also the graph clustering or dense sub graph findings issue [5] in the graph mining region.
2. Related work
V. E. Lee, et al. [6] have suggested an overlapping clique group that, according to the community assumption,
supported the Clique Percolation Method (CPM) [7,8]. By searching for neighborhood cliques, CPM finds the community's structures and
finds all size $k$ cliques [9,10]. On the basis of link clustering for community recognition, J. M. Kumpula et al.
[11]. To utilize the label [12] broadcast method to determine the overlapping community structures [13,14,15] by enabling the node to display the various labels, such as COPRA and SLPA [16,17]. The label distribution shows some label clustering in addition to rapidly revealing
an overlapping community in the non-determinacy approach of the network [18]. Remaining paper is structured as: Architecture of the proposed approach is delineated
in Section III, Find the every pair shortest pair utilizing Giraph API is portrayed
in Section IV, Computing the degree of impact is presented in Section V, Leader node
recognition is depicted in Section VI, Community based generation is displayed in
Section VII, and the results with discussions are shown in Section VIII. Section IX
gives the conclusion of this work.
3. Proposed Architecture
Fig. 1 depicts the structural design of the proposed technique. It has 6 steps: Step 1:
Find each pair on the shortest path for the particular graph of social network applying
a Giraph API step 2: Applying HADOOP Map Reduce to find the degree, betweenness, and
proximity centralities; step 3: Combining all of the centralities identified in step
2 to calculate the degree of impact (DI). Step 4: Determine the leader nodes by using
the structural centrality calculation; step 5: Utilizing leader nodes to generate
node vectors; step 6. By using the Parallel Fuzzy C-Means Clustering (FCM) Approach,
the soft and hard community clusters are developed.
Fig. 1. Structural design of proposed approach.
4. Find the Every Shortest Pair Utilizing Giraph
The social network folder is uploaded to Giraph's task input in CSV format Runner.run
(arg1; arg2) approach called up to begin the process. For instance, the technique
uses the SHORTEST PATH action to start the Giraph process from a one node. Image,
Fig. 2 portrays Giraph's implementation pattern. Fig. 2 portrays that Giraph's implementation method.
Fig. 2. Giraph's implementation method.
5. Degree of Influence (DI) computation utilizing Map-Reduce Method
This paper utilizes the DI based on the graph's fundamental centrality measurement
to display a node's relevance inside the specified social network. This part outlines
the Map-Reduce based centrality computations that we presented, and how these centralities
are finally combined to determine the DI.
5.1 Finding Betweenness centrality utilizing Map Reduce
The whole node acts as a bridge as betweenness centrality develops over the shortest
path connecting the two nodes. The following analysis examines betweenness vertex
v at the graph $G= (V,E)$ including $V$ vertices:
1. Determine which node pair's shortest paths are $(s,t)$.
2. Calculate the percentage of the shortest path that leads from vertex $v$ to each
of the node pairs $(s, t)$.
3. Total vertex pairs $(s,t)$ add up to this ratio.
The betweenness centrality indicated as follows:
where $\rho_{st}$ implies entire shortest paths through `$s$' node $t$ and $\rho_{st}(v)$
represents overall paths to pass from $v$
Algorithm 1: Map of betweeness centrality job
Algorithm 2: Betweeness centrality minimize job
5.2 Map reduce dependent nearness centrality finding
A node's closeness to all other nodes increases with its center position, this is
labeled in equation (2).
The less path distance amid the $v$ and $u$ vertices is expressed as $dist(v,u)$,
here $N$ refers total nodes in the network that is shown.
Algorithm 3: Closeness centrality map job
Algorithm 4: Closeness centrality decrease job
5.3 Map reduce dependent centrality degree finding
A simple method of calculating centrality that counts a node's neighbors is called
degree centrality. This centrality can be mathematically stated in (3)
where deg($v$) implies the quantity of node real connection node $v$, and $n-1$ is
the count of predictable link of every vertex.
Algorithm 5: The degree of centrality on a map job
Algorithm 6: Decrease the centrality degree for the job
5.4 Computing Level of Influence
Equations (4), (5), and (6) define the centrality ratios. The aggregation which refers to as the DI and is normalized
among 0 and 1 that is exhibited in equation (7). Consider $\alpha$, $\beta=0.4$, and $\gamma =0.2$.
• DI(vi) - DI for node vi.
• $\alpha$, $\beta$, $\gamma$ - weight variables $\beta \ge \gamma > \lambda$ with
$\alpha +\beta + \gamma =1$.
6. Leader nodes identification
6.1 Relative Distance
Node vi's relative distance, or $\rho$I, is defined as the shorter distance between
nodes v${}_{i}$ with high DI values is depicted in equation (8),
Typically with huge density for the node that take $\rho_{i} = \max_{j}(d_{ij})$.
Structural Centrality
The structural centers can be identified by their greater DI value for node vi is
labelled in eqn (9).
The metric essentially ignores this situation when adjacent nodes along the better
level of effect are recognized as structural centers.
Algorithm 7: Identify the leader nodes
7. Community creation
7.1 Node Vector Set Generation
The leader nodes found in the previous step were used to create the node vectors for
each node on the chosen network. Fig. 3 shows the node vectors for a certain nodes in the Karathe network.
Fig. 3. Node vector including seed nodes for karathe network.
7.2 Generation of Soft Community Clusters
Algorithm 8: Map_Reduce_FCM
Algorithm 9: Cluster_Centre_Mapper
Algorithm 10: Cluster_Center_Reducer
Algorithm 11: Membership_Mapper
Algorithm 12: Membership_Reducer
7.3 Allocating leader nodes towards the communities
The community cluster whose centroid has the greatest similarity to the leader node
receives the allocation of the leader node. This is labeled in equation (10),
V${}_{leader}$ - Leader node,
C${}_{i}$ - Centroid for $i$th cluster.
7.3.1 Hard community clusters Creation
The soft clusters are generated by parallel FCM approach. The matrix of community
membership was produced by this FCM. The node v${}_{i}$ can be assigned to two communities
c${}_{i}$ and c${}_{j}$ if node v${}_{i}$ having the community membership score difference
on two communities c${}_{i}$ and c${}_{j}$ not more than $\lambda$. i.emem${}_{vi}
($c${}_{\rm max} )$-mem${}_{vi} ($c${}_{i} )$&mem${}_{vi} ($c${}_{\rm max} )$-mem${}_{vi}
($c${}_{j} ) \le \lambda$, and both c${}_{i}$ and c${}_{j}$ must be top two scorer
community for node v${}_{i}$. Here we are setting threshold $\lambda=0.00001$. Fig. 4(a) illustrates the Communities of Karathe Network; Fig. 4(b) shows the Communities of Dolphins Network; Fig. 4(c) denotes Communities of Jazz Network and Fig. 4(d) shows communities of E-Mail Network.
Fig. 4. (a) Communities of Karathe Network. (b) Communities of Dolphins Network. (c)
Communities of Jazz Network. (d) Communities of E-Mail Network.
(a)
(b)
(c)
(d)
8. Results with discussions
8.1 Description of the Dataset
The datasets for our experiments include Dolphins, Football, Karate, Political Blogs,
Power, Yeast, Political Book Network, Netscience, Jazz, and Email networks. Detailed
information is provided in Table 1, indicating the number of vertices (|V|), links (|E|), average degree (Davg), clustering
coefficient (C), and average path length (PLavg).
Table 1. Graph features of various databases.
Networks
|
∣V∣
|
∣E∣
|
Davg
|
C-coef
|
PLavg
|
Football
|
115
|
613
|
10.661
|
0.403
|
2.508
|
Dolphin
|
62
|
159
|
5.129
|
0.303
|
3.357
|
Karathe
|
34
|
78
|
4.588
|
0.285
|
1.274
|
Jazz
|
198
|
2742
|
27.7
|
0.52
|
2.21
|
Pollblocks
|
1490
|
19025
|
25.537
|
0.172
|
3.39
|
Power
|
4941
|
6594
|
2.669
|
0.107
|
18.989
|
Yeast
|
2375
|
11693
|
9.847
|
0.153
|
4.14
|
Pol books
|
105
|
441
|
8.40
|
0.4875
|
2.341
|
E-mail
|
1133
|
5451
|
9.62
|
0.166
|
3.65
|
Net science
|
1589
|
2742
|
3.45
|
0.123
|
2.123
|
Word
|
112
|
425
|
7.59
|
0.1431
|
3.214
|
8.2 Outcomes based upon datasets
This section examines the proposed and existing CORPA, LBCD, LC, SLPA, LFM, and CPM
methods under 10 real-world databases. The results produced by these techniques are
tabulated in Table 2. Fig. 5 displays the membership matrix for the karate network. Fig. 6 reveals nodes with a membership value of 1 for one cluster, identifying them as leader
nodes.
Table 2. EQ values comparison in real world datasets.
Data sets
|
EQ (Extended Modularity score) / number of communities
|
|
CPM
|
LC
|
SLPA
|
LFM
|
GCE
|
COPRA
|
LBCD
|
Karathe
|
0.2708 / 3
|
0.2251 / 2
|
0.5192/2
|
0.4852/2
|
0.3474/2
|
0.2610/3
|
0.5295 /2
|
Dolphin
|
0.4195 / 4
|
0.216/10
|
0.4997/5
|
0.5183/5
|
0.4706/5
|
0.4640/4
|
0.5486/2
|
Football
|
0.4897 /15
|
0.1729/15
|
0.4712/11
|
0.4816/14
|
0.5907/12
|
0.3729/18
|
0.6212/10
|
Jazz
|
0.2396 /8
|
0.0381/6
|
0.4944/6
|
0.5198/4
|
0.3206/2
|
0.4560/4
|
0.4816/5
|
Polbooks
|
0.5008/5
|
0.0244/3
|
0.4959/3
|
0.4550/3
|
0.4865/3
|
0.5115/3
|
0.4996/3
|
Pollblocks
|
0.0106 /28
|
0.0118/27
|
0.4328/6
|
0.4425/8
|
0.2901/2
|
0.4290/9
|
0.4815/4
|
Power
|
0.1567/301
|
0.2149/322
|
0.559/661
|
0.5649/167
|
0.468/300
|
0.435/427
|
0.5105/290
|
Yeast
|
0.1153/17
|
0.6388/47
|
0.6268/53
|
0.6623/27
|
0.3544/40
|
0.2742/52
|
0.6667/28
|
Email
|
0.1163/4
|
0.0469/21
|
0.3634/16
|
0.4272/13
|
0.4273/35
|
0.00001/1
|
0.5619/7
|
Netscience
|
0.5745/169
|
0.8718/334
|
0.8683/325
|
0.8816/331
|
0.7250/126
|
0.6853/703
|
0.8980/310
|
Word
|
0.1003/4
|
0.0472/5
|
0.0910/3
|
0.0983/4
|
0.00001/1
|
0.00001/1
|
0.1386/7
|
The relation among EQ and $\beta$ is displayed in Fig. 6. The value of EQ is raised with $\beta$ variable then it attains higher value0.4
to 0.45 in small datasets as shown in Fig. 6(a). and it reaches the greater value 0.35 to 0.4 in large datasets as shown in Fig. 6(b).
Table 3 tabulates the implementation time of proposed approach. This is represented in Fig. 7.
Table 3. Implementation period of four real-world databases.
Database
|
Time consume at secs
|
Processor amount (p)
|
|c|=1
|
|c|=2
|
|c|=4
|
|c|=8
|
|c|=16
|
Power
|
15,500
|
9850
|
4550
|
2560
|
1752
|
Yeast
|
8150
|
4456
|
2386
|
1326
|
786
|
Net Science
|
5256
|
3001
|
1656
|
932
|
532
|
Poll Block
|
4932
|
2621
|
1412
|
802
|
486
|
Fig. 5. Membership Matrix of karathe Network.}
Fig. 6. (a) EQ value on small data sets against ${\beta}$ values. (b) EQ value on
large data sets against ${\beta}$ values.
(a)
(b)
Fig. 7. Implementation period of four real-world datasets.
9. Conclusion with future work
This section discusses structural centrality as a method for determining the leader
nodes in networks. In order to identify the overlapping communities inside the supplied
network, the Fuzzy C-Means Clustering approach, which is based on Parallel Map-Reduce,
is dependent on these leader nodes. Advantages in decidable tests are characterized
by two elements by LBCD. First, the locating leader (seed) nodes compute the community
structures count for a network. Second, in contrast to other approaches, the expansion
strategy goes around to the leader nodes and removes the need for non-linear seed
selection and human parameter adjustments, which accelerates convergence to the best
results and makes the most stable algorithm possible. In the feature, the relationship
between any two nodes in the provided social network graph may be predicted by applying
this leader-based approach moving feature.
REFERENCES
S. Fortunato, ``Community detection in graphs,'' Physics Reports, vol. 486, no. 3-5,
pp. 75-174 2010.

L .Tang and H. Liu, ``Community detection and mining in social media,'' Synthesis
Lectures on Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 1-137, 2010.

R Diestel, Random Graphs in Graph Theory, Springer, Berlin, Heidelberg, pp. 323-345,
2017.

B. Bollobás, Modern Graph Theory, vol. 184 of Graduate Texts in Mathematics, Springer,
New York, NY, p. 4, 1998.

C. C Aggarwal and H. Wang, Managing and Mining Graph Data, Springer, New York, vol.
40, 2010.

V. E. Lee, et al., ``A survey of algorithms for dense subgraph discovery,'' Managing
and Mining Graph Data, Springer, Boston, MA, pp. 303-336, 2010.

A Clauset, et al., ``Finding community structure in very large networks,'' Physical
Review E, vol. 70, no. 6, 066111, 2004.

D. Gibson, et al., ``Discovering large dense subgraphs in massive graphs,'' Proc.
of the 31st International Conference on Very Large Data Bases, pp. 721-732, August
2005.

M. E. Newman, ``Modularity and community structure in networks,'' Proceedings of the
National Academy of Sciences, vol. 103, no. 23, pp. 8577-8582, 2006.

U. N. Raghavan, et al., ``Near linear time algorithm to detect community structures
in large-scale networks,'' Physical Review E, vol. 76, no. 3, 036106, 2007.

J. M, Kumpula, et al., ``Sequential algorithm for fast clique percolation,'' Physical
Review E, vol. 78, no. 2, 026109, 2008.

F. Zhao and A. K. Tung, ``Large scale cohesive subgraphs discovery for social network
visual analysis,'' Proceedings of the VLDB Endowment, vol. 6, no. 2, pp. 85-96, 2012.

J. Saad, ``Dense subgraph extraction with application to community detection,'' IEEE
Transactions on Knowledge and Data Engineering, vol. 24, no. 7, pp. 1216-1230, 2010.

J. Huang, et al., ``Revealing density-based clustering structure from the core-connected
tree of a network,'' IEEE Transactions on Knowledge and Data Engineering, vol. 25,
no. 8, pp. 1876-1889, 2012.

M. C. Nascimento and A. C. de Carvalho, ``Spectral methods for graph clustering -
A survey,'' European Journal of Operational Research, vol. 211, no. 2, pp. 221-231,
2011.

L. C. Freeman, et al., ``Centrality in valued graphs: A measure of betweenness based
on network flow,'' Social Networks, vol. 13, no. 2, pp. 141-154, 1991.

M. M. Iqbal and K. Latha, ``An effective community-based link prediction model for
improving accuracy in social networks,'' Journal of Intelligent & Fuzzy Systems, (Preprint),
pp. 1-17, 2022.

A. Rusinowska, R. Berghammer, H. D. Swart, and M. Grabisch, ``Social networks: Prestige,
centrality, and influence,'' Proc. of International Conference on Relational and Algebraic
Methods in Computer Science, Springer, Berlin, Heidelberg, pp. 22-39, May 2011.

Kumaravel Kaliaperumal is currently working an Assistant Professor with the Department
of Orthodontics, Saveetha Dental College and Hospitals, Saveetha Institite of Medical
and Technical Sciences (SIMATS), Saveetha University, Chennai, India. His research
interests include soft computing application in power system Engineering.
Meenakshi L. Rathod is currently working an Assistant professor with the Department
of ECE, Dr.Ambedkar Institute of Technology, Bengaluru, Karnataka. She obtained her
B.E. in electronics & communication engineering from P.D.A. College of Engineering,
Gulbarga, in Feb.2002, an M.Tech. degree in VLSI design and embedded system from UTL
Technological Ltd. VTU extension center under Visvesvaraya Technological University
in 2009, and a Ph.D. degree under Visvesvaraya Technological University, Belagavi.
She has published more than 20 researchpapers in the journals of international repute
and 4 Patents filed. Her research interests include VLSI and Communications.
Leo Raju is an Associate Professor in the Department of Electrical and Electronics
Engineering with 24 years of teaching and 4 years of industrial experience. He received
his B.E. (EEE) degree from Government College of Technology, Coimbatore, Bharathiyar
University in 1994, an M.E. degree in computer science and engineering from Anna University,
Chennai in 2005, and a Ph.D. degree from Anna University, Chennai. He is a recognized
supervisor at Anna University. His research interests include Smart Grid, Micro-grid
Energy Management, Multi-Agent Systems, IoT, Big Data Analytics, Soft Computing, Machine
Learning, and Blockchain.
J. Mahil is currently working as Professor in the Department of EEE at K.Ramakrishnan
College of Technology, Trichy, Tamilnadu, India. She has completed her B.E. degree
in electrical and electronics engineering, an M.E degree in process control and instrumentation,
and a Ph.D degree in electrical engineering, from the Manonmaniam Sundaranar University,
Annamalai University and Noorul Islam University, India, in 2001, 2004, and 2013,
respectively. Her researches focus on Neural Networks, Evolutionary Optimization,
Biomedical, Computational Intelligence and Bio-Signal Processing. She has 20 years
of experience in teaching and research and published many research papers in papers
in leading peer reviewed International Journals.