Mobile QR Code QR CODE

2024

Acceptance Ratio

21%

Main Menu

※ The user interface design of www.ieiespc.org has been recently revised and updated. Please contact inter@theieie.org for any inquiries regarding paper submission.

Journal Search

IEIESPC(IEIE Transactions on Smart Processing and Computing)

IEIESPC Vol. 14, No. 01, p.1-10

ISSN (online) :

2287-5255

Received : 14 March 2023Accepted : 25 July 2025

DOI :

https://doi.org/10.5573/IEIESPC.2025.14.1.1

Regular Paper

Efficient Overlapping Community Detection Using MapReduce-based Fuzzy C-Means Clustering on Seed Nodes

KaliaperumalKumaravel¹ RathodMeenakshi L.² RajuLeo³ MahilJ.⁴

( Department of Orthodontics, Saveetha Dental College and Hospitals, Saveetha Institute of Medical and Technical Sciences (SIMATS), Saveetha University, Chennai, India kumaravelk.sdc@saveetha.com)
( Department of ECE, Dr. Ambedkar Institute of Technology, Outer Ring Road, Near Jnana Bharathi Campus, Mallathahalli, Bengaluru, Karnataka, India meenakshilr.ec@drait.edu.in)
( Department of Electrical and Electronics Engineering, Sri Sivasubramaniya Nadar College of Engineering, Rajiv Gandhi Salai (Omr), Kalavakkam, Tamilnadu, India leor@ssn.edu.in)
( Department of Electrical and Electronics Engineering, K.Ramakrishnan College of Technology, Trichy, Tamilnadu, India \qquad mahilj.eee@krct.ac.in)

^* Corresponding Author: Kumaravel Kaliaperumal

License :

This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.(www.theieie.org).

Abstract

In social and real-time network applications, community detection is the very popular and rapidly expanding field of study. Recently, many community detection approaches have been developed. In instance, community detection has proved to be effective and successful in local development strategies. Nevertheless, there are few basic problems to expose the overlapping communities. Although certain techniques are not sensitive enough to demonstrate widespread overlaps, the maximal approaches allow the seeds to be initialized and parameters to be created. A new unsupervised Map Reduce dependent local expanding technique to overlap community dependent seed node finding is presented in this study. The proposed method finds the leader or seed nodes of communities by using simple graph metrics, including closeness, centrality, degree, and betweenness. It then finds the communities that follow the leader nodes. Map-Reduce-oriented is proposed to utilize the Fuzzy C-Means Clustering approach to find which communities overlap depending on the leader nodes. The experimental outcomes illustrate the proposed method (LBCD), which assess network graph allowed the overlapping community structures, is more effectual and confident when used to entire 11 actual-world data sets.

Keywords

Leader nodes, Community detection social network, Big data, Map reduce

1. Introduction

The data analysis of social network and the social network are noticed as more and more in recent study in Data Mining, Data Analysis, Machine Learning, Graph Mining and Tweets Classification ^[1]. The community identification ^[2] describes at group of nodes concerning to the set of communities to make as strongly linked sub- graphs from the graph ^[3]. The networks have modeled as graphs to detect the communities is called graph partition issue in modern graph theory ^[4], also the graph clustering or dense sub graph findings issue ^[5] in the graph mining region.

2. Related work

V. E. Lee, et al. ^[6] have suggested an overlapping clique group that, according to the community assumption, supported the Clique Percolation Method (CPM) ^[7,^8]. By searching for neighborhood cliques, CPM finds the community's structures and finds all size $k$ cliques ^[9,^10]. On the basis of link clustering for community recognition, J. M. Kumpula et al. ^[11]. To utilize the label ^[12] broadcast method to determine the overlapping community structures ^[13,^14,^15] by enabling the node to display the various labels, such as COPRA and SLPA ^[16,^17]. The label distribution shows some label clustering in addition to rapidly revealing an overlapping community in the non-determinacy approach of the network ^[18]. Remaining paper is structured as: Architecture of the proposed approach is delineated in Section III, Find the every pair shortest pair utilizing Giraph API is portrayed in Section IV, Computing the degree of impact is presented in Section V, Leader node recognition is depicted in Section VI, Community based generation is displayed in Section VII, and the results with discussions are shown in Section VIII. Section IX gives the conclusion of this work.

3. Proposed Architecture

Fig. 1 depicts the structural design of the proposed technique. It has 6 steps: Step 1: Find each pair on the shortest path for the particular graph of social network applying a Giraph API step 2: Applying HADOOP Map Reduce to find the degree, betweenness, and proximity centralities; step 3: Combining all of the centralities identified in step 2 to calculate the degree of impact (DI). Step 4: Determine the leader nodes by using the structural centrality calculation; step 5: Utilizing leader nodes to generate node vectors; step 6. By using the Parallel Fuzzy C-Means Clustering (FCM) Approach, the soft and hard community clusters are developed.

Fig. 1. Structural design of proposed approach.

4. Find the Every Shortest Pair Utilizing Giraph

The social network folder is uploaded to Giraph's task input in CSV format Runner.run (arg1; arg2) approach called up to begin the process. For instance, the technique uses the SHORTEST PATH action to start the Giraph process from a one node. Image, Fig. 2 portrays Giraph's implementation pattern. Fig. 2 portrays that Giraph's implementation method.

Fig. 2. Giraph's implementation method.

5. Degree of Influence (DI) computation utilizing Map-Reduce Method

This paper utilizes the DI based on the graph's fundamental centrality measurement to display a node's relevance inside the specified social network. This part outlines the Map-Reduce based centrality computations that we presented, and how these centralities are finally combined to determine the DI.

5.1 Finding Betweenness centrality utilizing Map Reduce

The whole node acts as a bridge as betweenness centrality develops over the shortest path connecting the two nodes. The following analysis examines betweenness vertex v at the graph $G= (V,E)$ including $V$ vertices:

1. Determine which node pair's shortest paths are $(s,t)$.

2. Calculate the percentage of the shortest path that leads from vertex $v$ to each of the node pairs $(s, t)$.

3. Total vertex pairs $(s,t)$ add up to this ratio.

The betweenness centrality indicated as follows:

(1)

$ CB_{v=S\neq v\neq t\in V}={\rho }_{st}(v)/{\rho }_{st} , $

where $\rho_{st}$ implies entire shortest paths through `$s$' node $t$ and $\rho_{st}(v)$ represents overall paths to pass from $v$

Algorithm 1: Map of betweeness centrality job

Algorithm 2: Betweeness centrality minimize job

5.2 Map reduce dependent nearness centrality finding

A node's closeness to all other nodes increases with its center position, this is labeled in equation (2).

(2)

$ C(v )= \frac{N-1}{\sum_u dist(v,u) }. $

The less path distance amid the $v$ and $u$ vertices is expressed as $dist(v,u)$, here $N$ refers total nodes in the network that is shown.

Algorithm 3: Closeness centrality map job

Algorithm 4: Closeness centrality decrease job

5.3 Map reduce dependent centrality degree finding

A simple method of calculating centrality that counts a node's neighbors is called degree centrality. This centrality can be mathematically stated in (3)

(3)

$ Cd(v)=\frac{\mathrm{deg}(v)}{n-1}, $

where deg($v$) implies the quantity of node real connection node $v$, and $n-1$ is the count of predictable link of every vertex.

Algorithm 5: The degree of centrality on a map job

Algorithm 6: Decrease the centrality degree for the job

5.4 Computing Level of Influence

Equations (4), (5), and (6) define the centrality ratios. The aggregation which refers to as the DI and is normalized among 0 and 1 that is exhibited in equation (7). Consider $\alpha$, $\beta=0.4$, and $\gamma =0.2$.

(4)

$ {{\mathrm{deg}}_{\mathrm{ratio}} (vi)}=\frac{\mathrm{deg}\mathrm{(vi)}}{\sum_{\mathrm{vi}{\epsilon }\mathrm{V}}{\mathrm{deg} \mathrm{(vi)}}},\\ $

(5)

$ {{\mathrm{between}}_{\mathrm{ratio}} \left(vi\right)}=\frac{{\mathrm{between} \left(\mathrm{vi}\right)\ }}{\sum_{\mathrm{vi}{\epsilon }\mathrm{V}}{\mathrm{between(vi)}}} ,\\ $

(6)

$ {{\mathrm{closeness}}_{\mathrm{ratio}} (vi)}=\frac{\mathrm{closeness}\mathrm{(vi)}}{\sum_{\mathrm{vi}{\epsilon }\mathrm{V}}{\mathrm{closeness(vi)}}},\\ $

(7)

$ DI\left(vi\right)=\alpha\mathrm{*deg}\mathrm{ratio}\left(\mathrm{vi}\right)+{\beta }*\mathrm{closenessratio}\left(\mathrm{vi}\right)\nonumber\\ \hskip 3pc +{\gamma }*\mathrm{betweenratio} $

• DI(vi) - DI for node vi.

• $\alpha$, $\beta$, $\gamma$ - weight variables $\beta \ge \gamma > \lambda$ with $\alpha +\beta + \gamma =1$.

6. Leader nodes identification

6.1 Relative Distance

Node vi's relative distance, or $\rho$I, is defined as the shorter distance between nodes v${}_{i}$ with high DI values is depicted in equation (8),

(8)

$ \rho_i =\min_{j:DI_{j}>DI_i} (d_{ij}). $

Typically with huge density for the node that take $\rho_{i} = \max_{j}(d_{ij})$.

Structural Centrality

The structural centers can be identified by their greater DI value for node vi is labelled in eqn (9).

(9)

$ Sc_i=D. $

The metric essentially ignores this situation when adjacent nodes along the better level of effect are recognized as structural centers.

Algorithm 7: Identify the leader nodes

7. Community creation

7.1 Node Vector Set Generation

The leader nodes found in the previous step were used to create the node vectors for each node on the chosen network. Fig. 3 shows the node vectors for a certain nodes in the Karathe network.

Fig. 3. Node vector including seed nodes for karathe network.

7.2 Generation of Soft Community Clusters

Algorithm 8: Map_Reduce_FCM

Algorithm 9: Cluster_Centre_Mapper

Algorithm 10: Cluster_Center_Reducer

Algorithm 11: Membership_Mapper

Algorithm 12: Membership_Reducer

7.3 Allocating leader nodes towards the communities

The community cluster whose centroid has the greatest similarity to the leader node receives the allocation of the leader node. This is labeled in equation (10),

(10)

$ \text{Cluster (V${}_{leader} )= \max_{ci\in C} \{$Cosine(ci,Vleader)$\}$}, $

V${}_{leader}$ - Leader node,

C${}_{i}$ - Centroid for $i$th cluster.

7.3.1 Hard community clusters Creation

The soft clusters are generated by parallel FCM approach. The matrix of community membership was produced by this FCM. The node v${}_{i}$ can be assigned to two communities c${}_{i}$ and c${}_{j}$ if node v${}_{i}$ having the community membership score difference on two communities c${}_{i}$ and c${}_{j}$ not more than $\lambda$. i.emem${}_{vi} ($c${}_{\rm max} )$-mem${}_{vi} ($c${}_{i} )$&mem${}_{vi} ($c${}_{\rm max} )$-mem${}_{vi} ($c${}_{j} ) \le \lambda$, and both c${}_{i}$ and c${}_{j}$ must be top two scorer community for node v${}_{i}$. Here we are setting threshold $\lambda=0.00001$. Fig. 4(a) illustrates the Communities of Karathe Network; Fig. 4(b) shows the Communities of Dolphins Network; Fig. 4(c) denotes Communities of Jazz Network and Fig. 4(d) shows communities of E-Mail Network.

Fig. 4. (a) Communities of Karathe Network. (b) Communities of Dolphins Network. (c) Communities of Jazz Network. (d) Communities of E-Mail Network.

(a)

(b)

(c)

(d)

8. Results with discussions

8.1 Description of the Dataset

The datasets for our experiments include Dolphins, Football, Karate, Political Blogs, Power, Yeast, Political Book Network, Netscience, Jazz, and Email networks. Detailed information is provided in Table 1, indicating the number of vertices (|V|), links (|E|), average degree (Davg), clustering coefficient (C), and average path length (PLavg).

Table 1. Graph features of various databases.

Networks	∣V∣	∣E∣	D_avg	C-coef	PL_avg
Football	115	613	10.661	0.403	2.508
Dolphin	62	159	5.129	0.303	3.357
Karathe	34	78	4.588	0.285	1.274
Jazz	198	2742	27.7	0.52	2.21
Pollblocks	1490	19025	25.537	0.172	3.39
Power	4941	6594	2.669	0.107	18.989
Yeast	2375	11693	9.847	0.153	4.14
Pol books	105	441	8.40	0.4875	2.341
E-mail	1133	5451	9.62	0.166	3.65
Net science	1589	2742	3.45	0.123	2.123
Word	112	425	7.59	0.1431	3.214

8.2 Outcomes based upon datasets

This section examines the proposed and existing CORPA, LBCD, LC, SLPA, LFM, and CPM methods under 10 real-world databases. The results produced by these techniques are tabulated in Table 2. Fig. 5 displays the membership matrix for the karate network. Fig. 6 reveals nodes with a membership value of 1 for one cluster, identifying them as leader nodes.

Table 2. EQ values comparison in real world datasets.

Data sets	EQ (Extended Modularity score) / number of communities
	CPM	LC	SLPA	LFM	GCE	COPRA	LBCD
Karathe	0.2708 / 3	0.2251 / 2	0.5192/2	0.4852/2	0.3474/2	0.2610/3	0.5295 /2
Dolphin	0.4195 / 4	0.216/10	0.4997/5	0.5183/5	0.4706/5	0.4640/4	0.5486/2
Football	0.4897 /15	0.1729/15	0.4712/11	0.4816/14	0.5907/12	0.3729/18	0.6212/10
Jazz	0.2396 /8	0.0381/6	0.4944/6	0.5198/4	0.3206/2	0.4560/4	0.4816/5
Polbooks	0.5008/5	0.0244/3	0.4959/3	0.4550/3	0.4865/3	0.5115/3	0.4996/3
Pollblocks	0.0106 /28	0.0118/27	0.4328/6	0.4425/8	0.2901/2	0.4290/9	0.4815/4
Power	0.1567/301	0.2149/322	0.559/661	0.5649/167	0.468/300	0.435/427	0.5105/290
Yeast	0.1153/17	0.6388/47	0.6268/53	0.6623/27	0.3544/40	0.2742/52	0.6667/28
Email	0.1163/4	0.0469/21	0.3634/16	0.4272/13	0.4273/35	0.00001/1	0.5619/7
Netscience	0.5745/169	0.8718/334	0.8683/325	0.8816/331	0.7250/126	0.6853/703	0.8980/310
Word	0.1003/4	0.0472/5	0.0910/3	0.0983/4	0.00001/1	0.00001/1	0.1386/7

The relation among EQ and $\beta$ is displayed in Fig. 6. The value of EQ is raised with $\beta$ variable then it attains higher value0.4 to 0.45 in small datasets as shown in Fig. 6(a). and it reaches the greater value 0.35 to 0.4 in large datasets as shown in Fig. 6(b).

Table 3 tabulates the implementation time of proposed approach. This is represented in Fig. 7.

Table 3. Implementation period of four real-world databases.

Database	Time consume at secs
	Processor amount (p)
	\|c\|=1	\|c\|=2	\|c\|=4	\|c\|=8	\|c\|=16
Power	15,500	9850	4550	2560	1752
Yeast	8150	4456	2386	1326	786
Net Science	5256	3001	1656	932	532
Poll Block	4932	2621	1412	802	486

Fig. 5. Membership Matrix of karathe Network.}

Fig. 6. (a) EQ value on small data sets against ${\beta}$ values. (b) EQ value on large data sets against ${\beta}$ values.

(a)

(b)

Fig. 7. Implementation period of four real-world datasets.

9. Conclusion with future work

This section discusses structural centrality as a method for determining the leader nodes in networks. In order to identify the overlapping communities inside the supplied network, the Fuzzy C-Means Clustering approach, which is based on Parallel Map-Reduce, is dependent on these leader nodes. Advantages in decidable tests are characterized by two elements by LBCD. First, the locating leader (seed) nodes compute the community structures count for a network. Second, in contrast to other approaches, the expansion strategy goes around to the leader nodes and removes the need for non-linear seed selection and human parameter adjustments, which accelerates convergence to the best results and makes the most stable algorithm possible. In the feature, the relationship between any two nodes in the provided social network graph may be predicted by applying this leader-based approach moving feature.

REFERENCES

S. Fortunato, ``Community detection in graphs,'' Physics Reports, vol. 486, no. 3-5, pp. 75-174 2010.

L .Tang and H. Liu, ``Community detection and mining in social media,'' Synthesis Lectures on Data Mining and Knowledge Discovery, vol. 2, no. 1, pp. 1-137, 2010.

R Diestel, Random Graphs in Graph Theory, Springer, Berlin, Heidelberg, pp. 323-345, 2017.

B. Bollobás, Modern Graph Theory, vol. 184 of Graduate Texts in Mathematics, Springer, New York, NY, p. 4, 1998.

C. C Aggarwal and H. Wang, Managing and Mining Graph Data, Springer, New York, vol. 40, 2010.

V. E. Lee, et al., ``A survey of algorithms for dense subgraph discovery,'' Managing and Mining Graph Data, Springer, Boston, MA, pp. 303-336, 2010.

A Clauset, et al., ``Finding community structure in very large networks,'' Physical Review E, vol. 70, no. 6, 066111, 2004.

D. Gibson, et al., ``Discovering large dense subgraphs in massive graphs,'' Proc. of the 31st International Conference on Very Large Data Bases, pp. 721-732, August 2005.

M. E. Newman, ``Modularity and community structure in networks,'' Proceedings of the National Academy of Sciences, vol. 103, no. 23, pp. 8577-8582, 2006.

U. N. Raghavan, et al., ``Near linear time algorithm to detect community structures in large-scale networks,'' Physical Review E, vol. 76, no. 3, 036106, 2007.

J. M, Kumpula, et al., ``Sequential algorithm for fast clique percolation,'' Physical Review E, vol. 78, no. 2, 026109, 2008.

F. Zhao and A. K. Tung, ``Large scale cohesive subgraphs discovery for social network visual analysis,'' Proceedings of the VLDB Endowment, vol. 6, no. 2, pp. 85-96, 2012.

J. Saad, ``Dense subgraph extraction with application to community detection,'' IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 7, pp. 1216-1230, 2010.

J. Huang, et al., ``Revealing density-based clustering structure from the core-connected tree of a network,'' IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 8, pp. 1876-1889, 2012.

M. C. Nascimento and A. C. de Carvalho, ``Spectral methods for graph clustering - A survey,'' European Journal of Operational Research, vol. 211, no. 2, pp. 221-231, 2011.

L. C. Freeman, et al., ``Centrality in valued graphs: A measure of betweenness based on network flow,'' Social Networks, vol. 13, no. 2, pp. 141-154, 1991.

M. M. Iqbal and K. Latha, ``An effective community-based link prediction model for improving accuracy in social networks,'' Journal of Intelligent & Fuzzy Systems, (Preprint), pp. 1-17, 2022.

A. Rusinowska, R. Berghammer, H. D. Swart, and M. Grabisch, ``Social networks: Prestige, centrality, and influence,'' Proc. of International Conference on Relational and Algebraic Methods in Computer Science, Springer, Berlin, Heidelberg, pp. 22-39, May 2011.

Kumaravel Kaliaperumal

Kumaravel Kaliaperumal is currently working an Assistant Professor with the Department of Orthodontics, Saveetha Dental College and Hospitals, Saveetha Institite of Medical and Technical Sciences (SIMATS), Saveetha University, Chennai, India. His research interests include soft computing application in power system Engineering.

Meenakshi L. Rathod

Meenakshi L. Rathod is currently working an Assistant professor with the Department of ECE, Dr.Ambedkar Institute of Technology, Bengaluru, Karnataka. She obtained her B.E. in electronics & communication engineering from P.D.A. College of Engineering, Gulbarga, in Feb.2002, an M.Tech. degree in VLSI design and embedded system from UTL Technological Ltd. VTU extension center under Visvesvaraya Technological University in 2009, and a Ph.D. degree under Visvesvaraya Technological University, Belagavi. She has published more than 20 researchpapers in the journals of international repute and 4 Patents filed. Her research interests include VLSI and Communications.

Leo Raju

Leo Raju is an Associate Professor in the Department of Electrical and Electronics Engineering with 24 years of teaching and 4 years of industrial experience. He received his B.E. (EEE) degree from Government College of Technology, Coimbatore, Bharathiyar University in 1994, an M.E. degree in computer science and engineering from Anna University, Chennai in 2005, and a Ph.D. degree from Anna University, Chennai. He is a recognized supervisor at Anna University. His research interests include Smart Grid, Micro-grid Energy Management, Multi-Agent Systems, IoT, Big Data Analytics, Soft Computing, Machine Learning, and Blockchain.

J. Mahil

J. Mahil is currently working as Professor in the Department of EEE at K.Ramakrishnan College of Technology, Trichy, Tamilnadu, India. She has completed her B.E. degree in electrical and electronics engineering, an M.E degree in process control and instrumentation, and a Ph.D degree in electrical engineering, from the Manonmaniam Sundaranar University, Annamalai University and Noorul Islam University, India, in 2001, 2004, and 2013, respectively. Her researches focus on Neural Networks, Evolutionary Optimization, Biomedical, Computational Intelligence and Bio-Signal Processing. She has 20 years of experience in teaching and research and published many research papers in papers in leading peer reviewed International Journals.

IEIE SPC IEIE Transactions on Smart Processing & Computing

Journal Search

Journal XML

Journal Information

Efficient Overlapping Community Detection Using MapReduce-based Fuzzy C-Means Clustering on Seed Nodes

Abstract

Keywords

1. Introduction

2. Related work

3. Proposed Architecture

4. Find the Every Shortest Pair Utilizing Giraph

5. Degree of Influence (DI) computation utilizing Map-Reduce Method

5.1 Finding Betweenness centrality utilizing Map Reduce

(1)

5.2 Map reduce dependent nearness centrality finding

(2)

5.3 Map reduce dependent centrality degree finding

(3)

5.4 Computing Level of Influence

(4)

(5)

(6)

(7)

6. Leader nodes identification

6.1 Relative Distance

(8)

(9)

7. Community creation

7.1 Node Vector Set Generation

7.2 Generation of Soft Community Clusters

7.3 Allocating leader nodes towards the communities

(10)

7.3.1 Hard community clusters Creation

8. Results with discussions

8.1 Description of the Dataset

8.2 Outcomes based upon datasets

9. Conclusion with future work

REFERENCES

Kumaravel Kaliaperumal

Meenakshi L. Rathod

Leo Raju

J. Mahil

Article Information (continued)

Keywords

IEIE SPC

IEIE Transactions on Smart Processing & Computing