
IEIESPC(IEIE Transactions on Smart Processing and Computing)

IEIESPC Vol. 11, No. 04, p.276-283

ISSN (online): 2287-5255

Received: 22 May 2022; Revised: 21 June 2022; Accepted: 04 July 2022

DOI: https://doi.org/10.5573/IEIESPC.2022.11.4.276

Regular Paper

A Study on Parallel Clustering Algorithms based on MapReduce

Garvit Chugh¹, Nitesh Singh Bhati², Puneet Kumar³, Vishal Bharti³

(Department of Computer Science, Indian Institute of Technology, Jodhpur chugh.2@iitj.ac.in)
(Department of Computer Science, Delhi Technical Campus, Greater Noida (GGSIPU) niteshbhati07@gmail.com )
(Department of Computer Science, Chandigarh University, Mohali, India professor.pkumar@gmail.com, mevishalbharti@yahoo.com )

^* Corresponding Author: Nitesh Singh Bhati

License :

This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.(www.theieie.org).

Abstract

Since the advent of the global digital market, the volume of digital information has grown exponentially, as has the demand for storing it. As the price of storage devices decreases, the need to analyze vast quantities of unstructured digital data and retain only the essential information increases. MapReduce is a programming paradigm for processing and generating massive data sets. Using MapReduce to produce meaningful clusters from raw data is an efficient way to manage such voluminous amounts of data. On the other hand, the existing industry-standard data clustering algorithms present significant obstacles. Conventional clustering algorithms handle a great deal of information from various sources, such as online media, business, and the web. Nevertheless, the sequential computation in these conventional approaches is time-intensive. The many varieties of K-Means, including K-Harmonic Means, are sensitive to how cluster centers are formed in huge datasets. This work presents an analytical evaluation of such algorithms. It offers a study of the various k-means clustering algorithms employed in MapReduce, as well as an introduction to parallelism in MapReduce and its open challenges.

Keywords

MapReduce, Clustering, K means, K harmonic, Parallel execution

1. Introduction

By 2022, industry had accumulated billions of raw data points related to policies, implementations, sales, and other product-related information. Not all of these data are useful, however. Some are irrelevant to the company and hence waste storage. Verifying relevancy requires understanding the raw data, but its size is huge, and it is difficult for a human to describe and make sense of data at such a scale. Hence, various techniques have been introduced that perform this task on behalf of humans. Such techniques follow the methodology of clustering, an unsupervised technique that forms groups of similar data to reveal the underlying structure in the available raw data. Clustering has evolved and is used in various tools that perform statistical analyses on its output.

Finding an exact clustering solution is an NP-hard problem, even with only two clusters ^[2], and no polynomial-time algorithm is known for it. Hence, a more practical approach was proposed based on parallelism, i.e., parallel processing of the data, which enables faster computation and efficient distribution of the tasks. This approach is based on MapReduce, a programming model for processing and generating large data sets ^[3]. MapReduce can handle a large amount of raw data and produce meaningful clusters from it. The model is simple, flexible, fault-tolerant, and highly scalable. Hence, MapReduce is used in big data clustering ^[4].

Users specify a Map function, which processes a key/value pair and generates a set of intermediate key/value pairs, and a Reduce function, which merges all intermediate values that share the same intermediate key. This model is known as MapReduce ^[5]. Parallelism alone is insufficient, however: the job creation time and the time required to shuffle big data must also be managed. These costs can be reduced by removing the many iterations on which clustering algorithms depend. For that purpose, different clustering algorithms have been used in different studies, based on different starting points and criteria, leading to different outcomes and taxonomies of clustering algorithms. These include K-Means, K-Harmonic Means, Fuzzy C-Means, and Hierarchical clustering. Table 6 lists the advantages, disadvantages, and other properties of these algorithms.

1.1 MapReduce

MapReduce is simple, flexible, fault-tolerant, and highly scalable, which is why it is used in big data clustering. Example applications include sorting records, analyzing Web access logs, and machine learning. The most widely used implementation of the MapReduce framework is Hadoop, an open-source project ^[6]. As Fig. 1 shows, the system relies on a distributed file system (DFS) to partition the data across multiple machines. The partitions are produced in an initial phase.

Fig. 1. MapReduce Functionality.

The word count problem is a convenient example for explaining the functionality of MapReduce. The input is a collection of words, and the task is to count the occurrences of each distinct word. The mapper class tokenizes the input strings and emits each token as a key-value pair of the word and a count of one. The pairs are then sorted and grouped by key. The reducer class takes this sorted list as its input and converts it into a list of keys with the total count for each key. Fig. 2 illustrates this example.

Fig. 2. MapReduce Example.
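The word count flow can be imitated on a single machine with a short Python sketch. This is only an illustration of the map, shuffle, and reduce phases, not Hadoop code; the function names and the three sample sentences are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: tokenize one input line and emit a (word, 1) pair per token."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle/sort: group all emitted values by key and sort by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    # In a real cluster each line (or split) would go to a different mapper.
    intermediate = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle_phase(intermediate))

# Hypothetical input, used only for illustration.
counts = word_count(["deer bear river", "car car river", "deer car bear"])
```

In a distributed run, the mappers and reducers would execute on different machines and the shuffle would move data across the network, which is where most of the overhead discussed later arises.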

1.2 Clustering Algorithms

Clustering is an unsupervised technique that forms groups of similar data to reveal the underlying structure in the available raw data. It has evolved into a tool that analysts use for statistical analyses based on the clustering output.

The expectation is that similar examples (data points) in the dataset end up in the same cluster, while dissimilar examples end up in different clusters, as shown in Fig. 3. Thus, the dataset is broken down into smaller datasets on which mathematical operations are easier to perform. The parallel clustering algorithms that are compatible with MapReduce are shown in Fig. 4 and discussed in the following subsections.

Fig. 3. Clustering Procedure.

Fig. 4. Clustering Algorithms compatible for Parallelism.

1.2.1 K-means

The K-Means algorithm is a simple, computationally fast, and storage-efficient algorithm, and it is one of the most popular partitioning algorithms. It begins by initializing ‘k’ cluster centers, which are chosen by random sampling. After initialization, a similarity index is calculated, and similar data points are clustered together. The similarity index, in this case, is the distance: each data point is considered most similar to the cluster whose center is nearest and is therefore added to that cluster. Once every point has been assigned, each cluster center is recomputed from the points assigned to it, and the assignment step is repeated. This process continues until no centroid changes its location. Owing to its simplicity, the algorithm is poorly suited to highly complex datasets and performs poorly when the dataset becomes very large or its cluster shapes are irregular or unknown ^[7].

The algorithm of K-means can be found in Table 1.

Table 1. K-means Algorithm.

Step #	Steps
1.	Classify: Assign each observation to the closest cluster center: $z_{i}\leftarrow \arg\min_{j}\left\| \mu_{j}-x_{i}\right\|_{2}^{2}$
2.	Map: For each data point, given ($\{\mu_{j}\}$, $x_{i}$), emit($z_{i}$, $x_{i}$)
3.	Reduce: Average over all $n_{j}$ points in cluster $j$ ($z_{i}=j$): $\mu_{j}=\frac{1}{n_{j}}\sum_{i:z_{i}=j}x_{i}$
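The three steps of Table 1 can be sketched in plain Python as one map/reduce round that is repeated until the centers stop moving. This is a single-machine sketch under simple assumptions: the function names, the sample points, the initial centers, and the rule that an empty cluster keeps its old center are all illustrative choices, not part of the table.

```python
from collections import defaultdict

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_map(point, centers):
    """Classify + Map: emit (index of the closest center, point)."""
    z = min(range(len(centers)), key=lambda j: squared_distance(point, centers[j]))
    return z, point

def kmeans_reduce(points):
    """Reduce: average all points assigned to one cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans_iteration(data, centers):
    groups = defaultdict(list)
    for point in data:
        z, x = kmeans_map(point, centers)
        groups[z].append(x)
    # Assumption: a center that received no points keeps its previous position.
    return [kmeans_reduce(groups[j]) if groups[j] else centers[j]
            for j in range(len(centers))]

def kmeans(data, centers, max_iter=100):
    for _ in range(max_iter):
        new_centers = kmeans_iteration(data, centers)
        if new_centers == centers:  # no centroid changed its location
            break
        centers = new_centers
    return centers

# Hypothetical data and starting centers, used only for illustration.
data = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
final_centers = kmeans(data, centers=[(0.0, 0.0), (10.0, 10.0)])
```

Every call to `kmeans_iteration` corresponds to one MapReduce job, so each iteration rereads the whole dataset. This is the I/O cost that single-pass variants try to avoid.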

1.2.2 K-harmonic Means

The K-Harmonic Means algorithm is similar to the K-Means algorithm because it is also a center-based algorithm. On the other hand, the approach is different because it calculates harmonic averages of the distances between the data points and the centers, which increases the clustering quality. This converges faster than K-Means, but this algorithm requires a considerable number of iterations to reach convergence ^[7].

The algorithm of K Harmonic means can be found in Table 2.

Table 2. K Harmonic means Algorithm.

Step #	Steps
1.	KHM starts with random centers.
2.	The distance between each data point and each cluster center is calculated.
3.	New cluster centers are calculated.
4.	The distance between data point $d_{i}$ and cluster center $C_{l}$ is $\phi \left(d_{i},C_{l}\right)=\left\| d_{i}-C_{l}\right\|$; $1\leq i\leq n,1\leq l\leq k$. The harmonic weighting term of each data point is $\alpha _{i}=\frac{1}{\left(\sum _{l=1}^{k}\frac{1}{\phi \left(d_{i},C_{l}\right)^{a}}\right)^{2}}$
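A short Python sketch may help make the weighting term in Table 2 concrete. Table 2 does not spell out how the new centers are obtained from $\alpha_{i}$, so the update used here, a weighted mean with weights $\alpha_{i}/\phi(d_{i},C_{l})^{a+2}$, is the commonly stated K-Harmonic Means rule and should be read as an assumption. The function name `khm_update`, the default `a=2.0`, and the small floor `eps` that avoids division by zero are also illustrative.

```python
import math

def khm_update(data, centers, a=2.0, eps=1e-9):
    """One K-Harmonic Means update of all centers (sketch, see assumptions above)."""
    # phi[i][l]: Euclidean distance from point i to center l, floored at eps.
    phi = [[max(math.dist(d, c), eps) for c in centers] for d in data]
    # alpha_i = 1 / (sum_l 1 / phi(d_i, C_l)^a)^2
    alpha = [1.0 / sum(1.0 / dist ** a for dist in row) ** 2 for row in phi]
    new_centers = []
    for l in range(len(centers)):
        # Assumed weight of point i for center l: alpha_i / phi(d_i, C_l)^(a + 2)
        weights = [alpha[i] / phi[i][l] ** (a + 2) for i in range(len(data))]
        total = sum(weights)
        new_centers.append(tuple(
            sum(w * d[dim] for w, d in zip(weights, data)) / total
            for dim in range(len(centers[l]))
        ))
    return new_centers

# Hypothetical one-dimensional data with two obvious groups.
data = [(0.0,), (1.0,), (9.0,), (10.0,)]
updated = khm_update(data, centers=[(2.0,), (8.0,)])
```

Because every point contributes to every center, with a weight that falls off with distance, a poorly placed initial center is still pulled toward the data, which is the property the text credits for the improved clustering quality.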

1.2.3 Fuzzy C-means

The Fuzzy C-Means algorithm is the lenient version of its K-Means counterpart. Whereas traditional clustering techniques require each data point to belong to exactly one cluster, which is known as hard clustering, Fuzzy C-Means allows a data point to belong to more than one cluster simultaneously.

This technique is more natural and resembles real-world scenarios, because data points lying near cluster boundaries cannot be attached to any one cluster. Instead, it uses partial memberships to represent how strongly a point belongs to each cluster. The objective is to classify the data set into ‘c’ clusters, assuming c is known beforehand. The condition for fuzzy partition is presented in Eq. (1) ^[8].

The algorithm of Fuzzy C means can be found in Table 3 below.

Table 3. Fuzzy C means Algorithm.

Step #	Steps
1.	Let $X=\{x_{1},x_{2},x_{3},x_{4},\ldots ,x_{n}\}$ be the set of data points and $V=\{v_{1},v_{2},v_{3},v_{4},\ldots ,v_{c}\}$ be the set of centers.
2.	Randomly select ‘c’ cluster centers.
3.	Calculate the fuzzy membership as $u_{ij}=1/\sum _{k=1}^{c}\left(d_{ij}/d_{ik}\right)^{\frac{2}{m-1}}$, where $d_{ij}$ is the distance between $x_{i}$ and $v_{j}$ and $m>1$ is the fuzziness index.
4.	Compute the fuzzy centers using $v_{j}=\sum _{i=1}^{n}\left(u_{ij}\right)^{m}x_{i}/\sum _{i=1}^{n}\left(u_{ij}\right)^{m};1\leq j\leq c$
5.	Repeat steps 3 and 4 until the objective function value J reaches its minimum.
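Steps 3 to 5 of Table 3 can be sketched in Python as follows. The membership and center formulas follow the standard Fuzzy C-Means definitions with fuzziness index `m`; the stopping rule (centers moving less than `tol`) is a stand-in for "minimum J" and is an assumption, as are the function names, the floor `eps`, and the sample data.

```python
import math

def fcm_memberships(data, centers, m=2.0, eps=1e-9):
    """u[i][j] = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1))."""
    power = 2.0 / (m - 1.0)
    u = []
    for x in data:
        d = [max(math.dist(x, v), eps) for v in centers]
        u.append([1.0 / sum((d[j] / d[k]) ** power for k in range(len(centers)))
                  for j in range(len(centers))])
    return u

def fcm_centers(data, u, m=2.0):
    """v_j = sum_i u_ij^m x_i / sum_i u_ij^m."""
    dims = len(data[0])
    centers = []
    for j in range(len(u[0])):
        w = [row[j] ** m for row in u]
        total = sum(w)
        centers.append(tuple(sum(wi * x[t] for wi, x in zip(w, data)) / total
                             for t in range(dims)))
    return centers

def fuzzy_c_means(data, centers, m=2.0, tol=1e-6, max_iter=100):
    for _ in range(max_iter):
        u = fcm_memberships(data, centers, m)
        new_centers = fcm_centers(data, u, m)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # assumed convergence test, standing in for "minimum J"
            break
    return centers, u

# Hypothetical one-dimensional data and starting centers.
data = [(0.0,), (1.0,), (9.0,), (10.0,)]
centers, u = fuzzy_c_means(data, centers=[(2.0,), (8.0,)])
```

Each row of `u` sums to one, so a point near a boundary splits its membership between clusters instead of being forced into a single one.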

1.2.4 Hierarchical Clustering

Hierarchical Clustering differs from the partitional algorithms above in that it generates a hierarchy of partitions, using either agglomerative or divisive methods. The agglomerative method builds the hierarchy bottom-up by repeatedly merging the two closest clusters into a single cluster. The divisive method works the other way around, repeatedly splitting clusters ^[8]. The resulting hierarchy is commonly drawn as a dendrogram, which makes this a straightforward, progressive clustering technique. The method is described as highly scalable, but its time complexity is very high because a cluster tree must be built. The algorithm of Hierarchical Clustering can be found in Table 4.

Table 4. Hierarchical Clustering Algorithm.

Step #	Steps
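1	Data point set, $X=\{x_{1},x_{2},x_{3},x_{4},\ldots ,x_{n}\}$
2	Initialize the disjoint clustering, in which every data point is its own cluster, with level L(0) = 0 and sequence number m = 0.
3	Find the least-distance pair of clusters (r), (s): d[(r),(s)] = min d[(i),(j)]
4	Increment the sequence number, m = m + 1, merge clusters (r) and (s) into a single cluster, and set the level L(m) = d[(r),(s)].
5	Update the distance between the new cluster (r, s) and every other cluster (k): d[(k), (r, s)] = min(d[(k),(r)], d[(k),(s)]).
6	If only one cluster is left, stop; otherwise, repeat from step 3.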
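The procedure in Table 4 corresponds to agglomerative clustering with the minimum (single-linkage) distance update. A compact Python sketch follows; the representation of clusters as tuples of point indices, the function name, and the sample points are assumptions made for illustration.

```python
import math

def single_linkage(points):
    """Agglomerative clustering following Table 4.

    Returns the merge history as (level, cluster_r, cluster_s) tuples,
    where each cluster is a tuple of point indices.
    """
    clusters = [(i,) for i in range(len(points))]      # step 2: disjoint start
    # Distance between clusters, keyed by the unordered pair of clusters.
    d = {frozenset((a, b)): math.dist(points[a[0]], points[b[0]])
         for i, a in enumerate(clusters) for b in clusters[i + 1:]}
    merges = []
    while len(clusters) > 1:                           # step 6: stop at one cluster
        pair = min(d, key=d.get)                       # step 3: least-distance pair
        r, s = sorted(pair)
        level = d[pair]                                # step 4: L(m) = d[(r),(s)]
        merged = tuple(sorted(r + s))
        merges.append((level, r, s))
        clusters = [c for c in clusters if c not in (r, s)]
        for k in clusters:                             # step 5: min-distance update
            d[frozenset((k, merged))] = min(d.pop(frozenset((k, r))),
                                            d.pop(frozenset((k, s))))
        del d[pair]
        clusters.append(merged)
    return merges

# Hypothetical one-dimensional points.
merges = single_linkage([(0.0,), (1.0,), (5.0,), (7.0,)])
```

The merge history is exactly the information a dendrogram displays. The pairwise distance table grows quadratically with the number of points, which is one source of the high cost noted in the text.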

1.2.5 The Present Contributions

Different clustering algorithms have been used in different studies, based on different starting points and criteria, leading to different outcomes and taxonomies of clustering algorithms. On the other hand, their open challenges, advantages, and usefulness in parallel settings are yet to be discussed. The algorithms mentioned in this study have been implemented in parallel in various ways. This study reviews the parallel behavior of these algorithms when they are implemented with MapReduce.

The remainder of this paper is arranged as follows: Section 2 covers the related work, including recent advancements in the field. Section 3 presents a comparative analysis of the various clustering-based MapReduce algorithms. Finally, the conclusion is given in Section 4.

2. Related Work

2.1 K-means

Lv et al. ^[9] implemented one experiment based on a sequential K-Means algorithm written in C++ and another based on a parallel K-Means algorithm running on Hadoop in Java. They aimed to analyze a large number of remote sensing (RS) images. With limited hardware resources and limited tolerance for long running times, sequential processing of such a large number of RS images became a bottleneck. This was overcome by using parallel execution-based algorithms.

As mentioned before, K-Means fails to work efficiently with large datasets and with clusters of unknown shape. Li et al. ^[10] therefore proposed an improved K-Means based on the ensemble learning method of bagging. Their experiment helped overcome the inefficiency issue and the sensitivity to outliers. The results showed that their approach is acceptable for a scalable model.

Shahrivari and Jalili ^[11] performed an experiment using the mrk-means algorithm, which helps overcome the iterative nature of the traditional K-Means implementation on MapReduce. The I/O overhead increases tremendously when K-Means is implemented with MapReduce because the whole data set is read again during every iteration. With mrk-means, this process is reduced to a single-pass solution that uses a reclustering technique, i.e., the dataset is read only once. Hence, it is much faster than the traditional MapReduce setup.

2.2 K-harmonic Means

Zhang et al. ^[12] conducted an experiment in which they compared the traditional K-Means algorithm, the Expectation Maximization (EM) algorithm, and the K-Harmonic Means algorithm to determine which of the three produces better output. According to their findings, the traditional K-Means and EM share the same flaw: they depend on, and are sensitive to, the initialization of the centers. The K-Harmonic Means algorithm overcame this issue by improving the clustering quality and providing a faster and better implementation with MapReduce.

Bagde and Tripathi ^[13] proposed a hybrid combination of the traditional K-Means and the K-Harmonic Means algorithm to overcome the sensitivity to initial points, convergence to local optima, and unnecessary iterations. K-Harmonic Means is highly scalable and insensitive to the initial data points. Their experiment provided acceptable results.

Guo and Peng ^[14] proposed a further hybrid that combines the K-Harmonic Means (KHM) algorithm with dimensionality reduction. This combination improved the clustering results with less computational time and more efficient iterations. The experiment was motivated by the incompatibility of clustering algorithms with high-dimensional data. KHM is already suitable for large data sets but struggles with high-dimensional data. Hence, the most natural strategy for this problem, dimensionality reduction, was used. They used Principal Component Analysis as the basis of the dimensionality reduction.

2.3 Fuzzy C-means

All the above-mentioned algorithms were part of hard clustering, i.e., the data points are strictly part of only one cluster, but that was not the natural case. In the real world, the data points on the cluster boundary cannot strictly be considered a part of that cluster. Hence, a soft clustering technique was required, which allows one data point to be a part of more than one cluster at a time. Ludwig ^[15] presented such an experiment using the fuzzy c means algorithm on MapReduce and investigated the scalability and accuracy in terms of parallel execution in such a soft clustering scenario. The results obtained show that fuzzy c means is more accurate and realistic, but it is unable to scale well.

Dai et al. ^[16] presented an experiment that uses canopy clustering, which quickly analyzes the provided dataset, to address the scalability issue of fuzzy c means (FCM). With the help of the rapid acquisition property of the canopy-clustering algorithm ^[17], the convergence of FCM is accelerated. Their results showed that this combination provides better clustering quality and higher operation speed.
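For readers unfamiliar with canopy clustering, the following Python sketch shows the general pre-clustering idea with a loose threshold `t1` and a tight threshold `t2`. It is a generic illustration, not the implementation of Dai et al.; the thresholds, the function name, and the sample data are assumptions.

```python
import math

def canopy_clustering(points, t1, t2):
    """Cheap pre-clustering with a loose threshold t1 and a tight threshold t2.

    Returns a list of (center, members) canopies; canopies may overlap.
    """
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # take the next point not yet covered tightly
        # Every point within t1 of the center joins this canopy.
        members = [p for p in points if math.dist(p, center) < t1]
        # Points within t2 can no longer start a canopy of their own.
        remaining = [p for p in remaining if math.dist(p, center) >= t2]
        canopies.append((center, members))
    return canopies

# Hypothetical one-dimensional data and thresholds.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
canopies = canopy_clustering(points, t1=3.0, t2=2.5)
initial_centers = [center for center, _ in canopies]
```

The canopy centers can then serve as starting centers for an iterative algorithm such as FCM, which is one plausible way for the pre-clustering step to shorten convergence.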

2.4 Hierarchical Clustering

Similar to Lv et al. ^[9], Li et al. ^[18] proposed a highly scalable version of fuzzy c means for their research on underwater image segmentation. Because they had to deal with the rapid increase in data, they required a parallel execution of the tasks, for which MapReduce was convenient. Their experiment consisted of a two-layer distribution model to distribute data transfer and clean the bandwidth. Their experiment yielded results with better efficiency and worked well with high scalability.

MapReduce was further improved for data mining, which required faster computations and more memory usage. On the other hand, the current clustering techniques were not efficient in terms of the mentioned features. For that purpose, Sun et al. ^[19] proposed a hierarchical clustering method with batch updating and co-occurrence-based feature selection. These methods perform so that the computational time is reduced and noisy features are eliminated. Their experiment showed that the I/O overhead was reduced along with communication overhead, which reduced the total execution time to 1/15 of the previous one.

Based on the above work, Gao et al. ^[20] combined the hierarchical clustering algorithm with the neuron initialization method of the vector pressing self-organizing model. Their experiment was based on dividing the large text database into various data blocks and then distributing these blocks in a manner that they are executed/processed in a parallel fashion. Their experiment yielded higher efficiency and better performance in terms of text clustering and mining.

3. Analysis

3.1 Open Challenges

3.1.1 K-means

There have been recent publications on extending the K-Means algorithm with additional background information. This information takes the form of two types of instance-level constraints: must-link and cannot-link. Specifying background information as constraints is straightforward to implement. On the other hand, deciding whether a clustering exists that satisfies all the constraints is NP-complete, and feasibility results are available for each type of constraint individually and for several types combined. What is required is an iterative algorithm that minimizes the constrained vector quantization error without attempting to satisfy all constraints at each iteration.
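To make the two constraint types concrete, the sketch below performs a greedy assignment step that enforces must-link and cannot-link constraints strictly. This is deliberately the hard-constraint variant, not the relaxed algorithm called for above, and it shows how such an assignment can fail to find a feasible cluster. All names and the sample data are illustrative assumptions.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def violates_constraints(i, cluster, assignment, must_link, cannot_link):
    """True if putting point i into `cluster` breaks an instance-level constraint."""
    for a, b in must_link:
        other = b if a == i else a if b == i else None
        if other is not None and other in assignment and assignment[other] != cluster:
            return True
    for a, b in cannot_link:
        other = b if a == i else a if b == i else None
        if other is not None and assignment.get(other) == cluster:
            return True
    return False

def constrained_assign(data, centers, must_link, cannot_link):
    """Assign each point to the nearest center that violates no constraint."""
    assignment = {}
    for i, x in enumerate(data):
        order = sorted(range(len(centers)),
                       key=lambda j: squared_distance(x, centers[j]))
        for j in order:
            if not violates_constraints(i, j, assignment, must_link, cannot_link):
                assignment[i] = j
                break
        else:
            return None  # no feasible cluster for point i in this greedy order
    return assignment

# Hypothetical data: points 0 and 1 must be separated, points 2 and 3 kept together.
data = [(0.0,), (1.0,), (9.0,), (10.0,)]
result = constrained_assign(data, [(0.5,), (9.5,)],
                            must_link=[(2, 3)], cannot_link=[(0, 1)])
infeasible = constrained_assign(data[:2], [(0.5,)],
                                must_link=[], cannot_link=[(0, 1)])
```

The second call returns `None` because a single cluster cannot separate two points under a cannot-link constraint, a small instance of the feasibility problem described in the text.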

3.1.2 K-harmonic Means

The initialization issue can be resolved using K-Harmonic Means, up to a certain level. On the other hand, the issue of "local minima trapping" still exists. Several studies have sought solutions, but the fundamental problem underlying the proposed algorithms remains unsolved. Future studies can apply heuristics that produce non-local moves for the cluster centers. Such techniques can also be used to choose hyperparameters that lead to an optimal solution.

3.1.3 Fuzzy C-means

Several studies have pointed out some disadvantages associated with FCM: (1) Only point-based membership; (2) Loss of information; (3) Slower Convergence. The use of FCM is very efficient because it can blend in very well with other algorithms and produce numerous other approaches. This integration of techniques and interdisciplinary study may yield novel insights for FCM issue solutions.

3.1.4 Hierarchical Clustering

Hierarchical clustering requires both the distance metric and the linkage criterion to be specified explicitly. Rarely is there a solid theoretical foundation for such decisions. Consequently, the applicability of the technique to modern research is questionable.

Determining how to calculate a distance matrix is difficult when the data contain multiple variable types. There is no simple method for calculating distance when the variables are both quantitative and qualitative. How would one calculate the distances between a 45-year-old man, a 10-year-old girl, and a 46-year-old woman, for example? There are formulae, but they involve arbitrary decisions.
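One such formula is a Gower-style dissimilarity, which scales numeric differences by a chosen range and scores categorical fields as match or mismatch. The Python sketch below applies it to the three people in the example. The choice of this particular formula, the field names, and the range used for age are assumptions for illustration, and they are exactly the kind of arbitrary decision the text refers to.

```python
def mixed_distance(a, b, numeric_ranges):
    """Gower-style dissimilarity in [0, 1] for records with mixed field types."""
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            # Numeric field: absolute difference scaled by an assumed range.
            total += abs(a[key] - b[key]) / numeric_ranges[key]
        else:
            # Categorical field: 0 for a match, 1 for a mismatch.
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)

man = {"age": 45, "sex": "male"}
girl = {"age": 10, "sex": "female"}
woman = {"age": 46, "sex": "female"}

# Arbitrary decision 1: scale age by the range observed in this tiny sample.
sample_range = {"age": 46 - 10}
d_man_woman = mixed_distance(man, woman, sample_range)
d_girl_woman = mixed_distance(girl, woman, sample_range)

# Arbitrary decision 2: scale age by a nominal human lifespan of 100 years.
wide_range = {"age": 100}
d_man_woman_wide = mixed_distance(man, woman, wide_range)
d_girl_woman_wide = mixed_distance(girl, woman, wide_range)
```

Under both scalings the woman comes out closer to the girl than to the man, even though their ages differ by 36 years, because the single categorical mismatch carries as much weight as the entire age range. A different weighting of the fields would reverse the ordering.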

3.2 Findings

This literature review examines applications of clustering with MapReduce in various fields. In a large portion of the recent research in the area, factors such as precision, effectiveness, and other properties are evaluated on different datasets. The published papers included in this review were selected on that basis. Table 5 summarizes them in tabular form.

Table 5. Summary of Related Work.

#	Author(s)	Technique	Application Area
1	Lv et al. ^[9]	K-Means	Analysis
2	Li et al. ^[10]	K-Means	Analysis
3	Shahrivari & Jalili ^[11]	K-Means	Data Control
4	Zhang et al. ^[12]	K-Harmonic Means	Network
5	Bagde & Tripathi ^[13]	K-Harmonic Means	Comparison during analysis
6	Guo & Peng ^[14]	K-Harmonic Means	Higher dimensions
7	Ludwig ^[15]	Fuzzy C-Means	Analysis
8	Dai et al. ^[16]	Fuzzy C-Means	Analysis
9	Li et al. ^[18]	Fuzzy C-Means	Scalability of FCM
10	Sun et al. ^[19]	Hierarchical Clustering	Data mining
11	Gao et al. ^[20]	Hierarchical Clustering	Data mining

None of the researchers used a common benchmark dataset to analyze the outcomes, which makes it difficult to compare the results of the various algorithms mentioned above. When scalability is the main requirement, the other disadvantages of hierarchical clustering are overlooked, and it is preferred over the variations of K-Means. When a more realistic outlook is required and scalability is not an issue, Fuzzy C-Means is preferred because of its higher accuracy. The problems associated with hierarchical cluster analysis are addressed using more contemporary techniques, such as latent class analysis.

Table 6 compares all algorithms based on the Principle Used, Similarity Function, Advantage, Disadvantage, and Time Complexity.

Table 6. Algorithms Comparison.

Parameters	K-Means	K-Harmonic Means	Fuzzy C-Means	Hierarchical Clustering
Principle Used	Iterative Hill Climbing	Harmonic Averaging	Optimization of objective function	Cluster tree building
Similarity Function	Euclidean	Squared Euclidean	Euclidean	Euclidean
Advantage	Fast, Simple, Scalable	Robust	More accurate and realistic	High scalability
Disadvantage	Sensitive to outliers	Trapped by convergence to a local optimum	Low scalability	High time complexity
Time Complexity	O(nKd)	O(nKd)	O(n)	O(n$^{3}$)

4. Conclusion

Current trends in the digital market have boosted the volume of digital information. In addition, the need for insightful analysis of the colossal amount of raw digital data, so that only the relevant data are kept, is growing. Relevancy can be decided by clustering, and MapReduce can process such huge data efficiently because of its parallel distribution technique. The use of clustering in MapReduce, however, is based on several different techniques, which have been studied in this review paper. Based on an examination of different K-Means variations, i.e., K-Harmonic Means, Fuzzy C-Means, and Hierarchical clustering, it can be concluded that the traditional clustering algorithm is insufficient for data clustering. The traditional clustering algorithm is also insufficient in the integrated approach, where amalgamation with multiple other algorithms is required. More realistic clusters and reduced execution time are expected from hybrid models. Hybrids of these clustering algorithms in MapReduce can provide more exact outcomes because they can deal with large volumes of unstructured, distributed information in a parallel fashion. Future researchers may look at the hybrid approach, which is expected to yield better results than its traditional counterparts.

REFERENCES

[1] Ajin V. W., Kumar L. D., May 2016, Big data and clustering algorithms, In 2016 International Conference on Research Advances in Integrated Navigation Systems (RAINS), pp. 1-5, IEEE

[2] Drineas P., Frieze A., Kannan R., et al., 2004, Clustering large graphs via the singular value decomposition, Mach Learn, Vol. 56, No. 1-3, pp. 9-33

[3] Dean J., Ghemawat S., 2008, MapReduce: simplified data processing on large clusters, Commun ACM, Vol. 51, No. 1, pp. 107-113

[4] Ekanayake J., Pallickara S., Fox G., 2008, Map reduce for data intensive scientific analyses, In eScience '08, IEEE Fourth International Conference on eScience, pp. 277-284, IEEE

[5] Vattani A., 2011, K-means requires exponentially many iterations even in the plane, Discret Comput Geom, Vol. 45, No. 4, pp. 596-616

[6] Vernica R., Carey M. J., Li C., 2010, Efficient parallel set-similarity joins using mapreduce, In SIGMOD

[7] Kurasova O., Marcinkevicius V., Medvedev V., Rapecka A., Stefanovic P., November 2014, Strategies for big data clustering, In 2014 IEEE 26th International Conference on Tools with Artificial Intelligence, pp. 740-747

[8] Ludwig S. A., 2015, MapReduce-based fuzzy c-means clustering algorithm: implementation and scalability, International Journal of Machine Learning and Cybernetics, Vol. 6, No. 6, pp. 923-934

[9] Lv Z., Hu Y., Zhong H., Wu J., Li B., Zhao H., October 2010, Parallel k-means clustering of remote sensing images based on mapreduce, In International Conference on Web Information Systems and Mining, pp. 162-170, Springer, Berlin, Heidelberg

[10] Li H. G., Wu G. Q., Hu X. G., Zhang J., Li L., Wu X., January 2011, K-means clustering with bagging and mapreduce, In 2011 44th Hawaii International Conference on System Sciences, pp. 1-8, IEEE

[11] Shahrivari S., Jalili S., 2016, Single-pass and linear-time k-means clustering based on MapReduce, Information Systems, Vol. 60, pp. 1-12

[12] Zhang B., Hsu M., Dayal U., 1999, K-harmonic means-a data clustering algorithm, Hewlett-Packard Labs Technical Report HPL-1999-124, 55

[13] Bagde U., Tripathi P., February 2018, An analytic survey on mapreduce based k-means and its hybrid clustering algorithms, In 2018 Second International Conference on Computing Methodologies and Communication (ICCMC), pp. 32-36, IEEE

[14] Guo C., Peng L., October 2008, A hybrid clustering algorithm based on dimensional reduction and k-harmonic means, In 2008 4th International Conference on Wireless Communications, Networking and Mobile Computing, pp. 1-4, IEEE

[15] Ludwig S. A., 2015, MapReduce-based fuzzy c-means clustering algorithm: implementation and scalability, International Journal of Machine Learning and Cybernetics, Vol. 6, No. 6, pp. 923-934

[16] Dai W., Yu C., Jiang Z., 2016, An improved hybrid Canopy-Fuzzy C-means clustering algorithm based on MapReduce model, Journal of Computing Science and Engineering, Vol. 10, No. 1, pp. 1-8

[17] Kumar A., Ingle Y. S., Pande A., Dhule P., 2014, Canopy clustering: a review on pre-clustering approach to K-means clustering, Int. J. Innov. Adv. Comput. Sci. (IJIACS), Vol. 3, No. 5, pp. 22-29

[18] Li X., Song J., Zhang F., Ouyang X., Khan S. U., 2016, MapReduce-based fast fuzzy c-means algorithm for large-scale underwater image segmentation, Future Generation Computer Systems, Vol. 65, pp. 90-101

[19] Sun T., Shu C., Li F., Yu H., Ma L., Fang Y., December 2009, An efficient hierarchical clustering method for large datasets with map-reduce, In 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies, pp. 494-499, IEEE

[20] Gao H., Jiang J., She L., Fu Y., 2010, A new agglomerative hierarchical clustering algorithm implementation based on the map reduce framework, International Journal of Digital Content Technology and its Applications, Vol. 4, No. 3, pp. 95-100

Author

Garvit Chugh

Garvit Chugh received his bachelor's degree in computer engineering from Guru Gobind Singh Indraprastha University in 2020 and his master's degree in computer engineering from the Indian Institute of Technology, Jodhpur in 2022. He is currently pursuing a Doctor of Philosophy degree in Mobile and Pervasive Computing in Computer Engineering through a joint program of the Indian Institute of Technology Jodhpur and the Indian Institute of Technology Kharagpur. He is experienced as an Android developer. His research areas include mobile and wireless computing, IoT, telecom networks, software design, and operating systems. He has published various research papers in international journals and conferences and has been serving as a reviewer for many highly respected journals. He considers himself a ‘forever student,’ eager to both build on his academic foundations in computers and engineering and stay in tune with the latest technical strategies through continued coursework and professional development. Garvit has worked as an Android developer in various IT sectors of the Indian Govt., such as the Centre for Railway Information Systems and the Airports Authority of India, and led the teams in every project that was assigned.

Nitesh Singh Bhati

Nitesh Singh Bhati received his doctorate from G.G.S.I.P. University, Delhi. He obtained his M. Tech in CSE and B. Tech in Computer Science and Engineering from the G.G.S.I.P. University, Delhi. He is working as an Assistant Professor in the Department of Computer Science and Engineering of Delhi Technical Campus, UP. He has published various research papers in international journals and conferences. His current research area is information security.

Puneet Kumar

Puneet Kumar is a Professor in the Department of Computer Science & Engineering at University Institute of Engineering, Chandigarh University, Mohali, India. He has completed his Master’s and Ph.D. in Computer Science, and believes in the philosophy of interdisciplinary research. He has also completed a certificate course on intellectual property rights from WIPO Academy, Geneva. He has more than 18 years of teaching, research, and industrial experience. His major research interests are Machine Learning, Data Science, and e-government. He has published various research papers and articles in national and international journals, and his papers are widely cited by various stakeholders across the world. He is the recipient of several software copyrights from the Ministry of Human Resource and Development, Government of India. He has also published books on e-governance titled “E-Governance in India: Problems, Prototypes and Prospects”, “Stances of e-Government: Policies, Processes and Technology” and “Artificial Intelligence and Global Society: Impact and Practices” published by CRC Press, Taylor and Francis Group. Dr. Kumar is also the active member of the professional societies Computer Society of India (CSI) and Association for Computing Machinery (ACM).

Vishal Bharti

Vishal Bharti is working as Professor and Additional Director in the Dept. of CSE at Chandigarh University, Mohali, Punjab. He completed his Ph.D. in 2016 in the area of Information Security. He received his M.Tech. and B.E. from Birla Institute of Technology, Mesra, Ranchi. He also holds a Doctorate in Management Studies in addition to an MBA in IT and an E-MBA in S&M. His areas of specialization are Cyber Security, Network Security, and Distributed Computing. He has more than 16 years of mixed experience in both academia and the IT industry. He has seven filed patents, twenty-eight copyrights, 13 Govt. grants (SERB, DST, NSTEDB, EDI, etc.), and two seed money grants to his credit. He has published more than seventy research papers at both national and international level with publishers such as IEEE, Springer, and Taylor & Francis, and was awarded the Best Faculty Award 2010, Academic Leadership Award 2019, Emerging Leader in Higher Education 2019, Best Young HoD of the Year 2019, Best Computer Teacher Award 2019, Research Excellence Award in 2020, Outstanding contribution in promoting Education in Rural Areas Award in 2021 and Best Young Director of the in 2021.
