Since the advent of the global computerized market, the volume of digital information has grown exponentially, and with it the demand for storing that information. As the price of storage devices decreases, the need to analyze vast quantities of unstructured digital data so as to retain only the essential information increases. MapReduce is a programming paradigm for processing and generating massive datasets. Using MapReduce to produce meaningful clusters from such enormous amounts of raw data is an efficient way to manage them. On the other hand, the existing industry-standard data clustering algorithms present significant obstacles. Conventional clustering algorithms handle large volumes of data from various sources, such as social media, business, and the web, but their sequential computation is time-intensive. The many variants of K-Means, including K-Harmonic Means, are sensitive to the choice of initial cluster centers in huge datasets. This work provides a critical evaluation of such algorithms. It surveys the various K-Means clustering algorithms employed with MapReduce and discusses the state of the art and the open challenges of parallelism in MapReduce.


- (Department of Computer Science, Indian Institute of Technology, Jodhpur chugh.2@iitj.ac.in)
- (Department of Computer Science, Delhi Technical Campus, Greater Noida (GGSIPU) niteshbhati07@gmail.com )
- (Department of Computer Science, Chandigarh University, Mohali, India professor.pkumar@gmail.com, mevishalbharti@yahoo.com )

## 1. Introduction

The industrial world in 2022 has accumulated billions of raw data points related to policies, implementations, sales, and other product-related information. On the other hand, not all of these data are useful. Some are irrelevant to the company and hence a waste of storage. This raw data must be examined to verify its relevance, but its sheer size makes it difficult for a human to describe and derive meaning from it. Hence, various techniques have been introduced that perform this task on behalf of humans. Such techniques follow the methodology of clustering, an unsupervised technique of forming groups of similar data, with the objective of understanding the underlying structure of the available raw data. This technique has evolved and is used in various tools for statistical analyses based on the output provided by clustering.

With the help of clustering, the expectation is that similar examples/data points
in the dataset should be part of the same cluster. Similarly, the dissimilar examples
are part of different clusters. Thus, the dataset is broken down into smaller datasets,
making mathematical operations easier ^{[1]}.

Finding an exact clustering solution is NP-hard: it cannot be computed in polynomial time, even with only two clusters ^{[2]}. Hence, a more elegant approach to the analysis was proposed based on parallelism, i.e., parallel processing of the data, which yields faster computation and efficient distribution of the tasks. This approach is based on MapReduce, a programming model for processing and generating huge datasets ^{[3]}. The MapReduce algorithm can handle a large amount of raw data by producing meaningful clusters. The algorithm is simple, flexible, fault-tolerant, and highly scalable. Hence, MapReduce is used in big data clustering ^{[4]}.

The user specifies a Map function, which processes a key/value pair to generate a set of intermediate key/value pairs, and a Reduce function, which merges all intermediate values associated with the same intermediate key. This two-phase scheme is known as MapReduce ^{[5]}. Parallelism alone, however, is insufficient; the exponential job-creation time and the time required to shuffle big data must also be managed. These time-management issues can be addressed by removing the many iterations on which clustering algorithms depend. For that purpose, different clustering algorithms have been used in different studies, based on different starting points and criteria, leading to different outcomes and taxonomies of clustering algorithms. Some include K-Means, K-Harmonic Means, Fuzzy C-Means, and Hierarchical clustering. Table 6 lists the advantages, disadvantages, and other usages of these algorithms.

### 1.1 MapReduce

MapReduce is simple, flexible, fault-tolerant, and highly scalable; hence, it is used in big data clustering. Typical applications include indexing documents, analyzing web access logs, and machine learning. The most widely used implementation of the MapReduce framework is Hadoop, part of the open-source community ^{[6]}. Fig. 1 presents the system built on a distributed file system (DFS), which forms partitions across multiple machines. The partitions are produced in an initial phase.

The word-count problem is the best-known example of MapReduce's functionality. Different words are taken as input, and each distinct word is counted. The Mapper class tokenizes the input strings; the tokens can then be sorted and numbered, forming key-value pairs. The Reducer class takes this sorted list as input and converts it into a list of keys and their counts. Fig. 2 illustrates this example visually.
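The map-shuffle-reduce flow just described can be sketched as a single-process simulation in plain Python. The function names (`mapper`, `reducer`, `word_count`) are illustrative only and are not part of Hadoop's API:

```python
from itertools import groupby

def mapper(document):
    # Map phase: tokenize the input and emit one (word, 1) pair per token.
    for word in document.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce phase: merge all counts that share the same intermediate key.
    return (word, sum(counts))

def word_count(documents):
    # Apply the mapper to every input record.
    intermediate = [pair for doc in documents for pair in mapper(doc)]
    # Shuffle: sort and group the intermediate pairs by key,
    # mimicking what the framework does between map and reduce.
    intermediate.sort(key=lambda kv: kv[0])
    return dict(reducer(word, (count for _, count in group))
                for word, group in groupby(intermediate, key=lambda kv: kv[0]))
```

For the input `["the cat", "the dog"]`, the sketch yields `{"the": 2, "cat": 1, "dog": 1}`: the mapper emits four pairs, the shuffle groups the two `"the"` pairs together, and the reducer sums each group.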

### 1.2 Clustering Algorithms

Clustering is an unsupervised technique of clustering the data or forming groups of similar data to satisfy the objective of understanding the underlying structure in the available raw data. This technique has evolved and is being used as a tool by analysts for statistical analyses based on the output provided by clustering.

With the help of clustering, the expectation is that similar examples/data points in the dataset should be part of the same cluster. Similarly, the dissimilar examples are part of different clusters, as shown in Fig. 3. Thus, the dataset is broken down into smaller datasets, on which performing mathematical operations is easier. The different parallel clustering algorithms that are compatible with MapReduce, are discussed further and are mentioned below in Fig. 4.

#### 1.2.1 K-means

The K-Means algorithm is simple, computationally fast, and storage-efficient. It is very popular among partitioning algorithms. It is straightforward: it begins by initializing 'k' cluster centers chosen by random sampling. After initialization, a similarity index, in this case the distance, is calculated, and similar data points are clustered together. The data point nearest to a specific center is considered the most similar to that cluster and is added to it. If some data points are left un-clustered, new cluster centers are formed from the remaining data points, and the process loops. This continues until no centroid changes its location and no new centroid is produced. Owing to its simplicity, K-Means cannot handle highly complex datasets and performs poorly when the clusters' shapes become large, irregular, or unknown ^{[7]}.

The algorithm of K-means can be found in Table 1.

##### Table 1. K-means Algorithm.
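The assign-then-update loop described above can be sketched as follows. This is a minimal illustration for 2-D points, not a production implementation; random sampling is the simplest initialization, and smarter seeding is common in practice:

```python
import random

def k_means(points, k, max_iter=100, seed=0):
    """Minimal K-Means for a list of (x, y) tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # 'k' initial centers by random sampling
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: no centroid moved
            break
        centers = new_centers
    return centers, clusters
```

On two well-separated groups such as `[(0, 0), (0, 1), (10, 10), (10, 11)]` with `k=2`, the loop settles on the group means `(0, 0.5)` and `(10, 10.5)` within a few iterations.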

#### 1.2.2 K-harmonic Means

The K-Harmonic Means algorithm is similar to K-Means in that it is also a center-based algorithm. Its approach differs, however, in that it calculates harmonic averages of the distances between the data points and the centers, which improves the clustering quality. It converges faster than K-Means, but it still requires a considerable number of iterations to reach convergence ^{[7]}.

The algorithm of K Harmonic means can be found in Table 2.

##### Table 2. K Harmonic means Algorithm.
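As a sketch of the harmonic-average idea, the following one-dimensional implementation uses Zhang's soft center-update rule with distance exponent `p = 2` (the original harmonic-mean formulation; larger `p` values are also used in the literature). All names and the fixed iteration count are illustrative choices:

```python
def k_harmonic_means(points, centers, p=2.0, iters=50, eps=1e-8):
    """One-dimensional K-Harmonic Means sketch (soft, center-based)."""
    for _ in range(iters):
        num = [0.0] * len(centers)
        den = [0.0] * len(centers)
        for x in points:
            # Distances to every center; eps avoids division by zero.
            d = [max(abs(x - c), eps) for c in centers]
            s = sum(di ** (-p) for di in d)  # harmonic-style denominator
            for j, dj in enumerate(d):
                # Combined membership * weight of point x for center j.
                q = dj ** (-p - 2) / s ** 2
                num[j] += q * x
                den[j] += q
        # Each center moves to the q-weighted mean of all points.
        centers = [num[j] / den[j] for j in range(len(centers))]
    return centers
```

Unlike K-Means, every point influences every center: points that are far from all current centers receive a boosted weight, which is what reduces the sensitivity to initialization.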

#### 1.2.3 Fuzzy C-means

The Fuzzy C-Means algorithm is the lenient version of its K-Means counterpart. Whereas traditional clustering techniques require clustered data points to be mutually exclusive (known as hard clustering), Fuzzy C-Means allows a data point to belong to more than one cluster simultaneously.

This technique is more natural and resembles real-world scenarios, because data points near the boundaries cannot be attached to any single cluster. Instead, a combination of memberships represents a point's partial belonging to each cluster. The objective is to classify the dataset into 'c' clusters, assuming c is known beforehand. The condition for a fuzzy partition is presented in Eq. (1) ^{[8]}.

The algorithm of Fuzzy C means can be found in Table 3 below.

##### Table 3. Fuzzy C means Algorithm.
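A one-dimensional sketch of the standard Fuzzy C-Means iteration is shown below, with fuzzifier `m = 2` (a common default; the value is an assumption, not prescribed by the text). Each row of `u` holds one point's memberships, which by construction sum to 1, i.e., the fuzzy-partition condition:

```python
def fuzzy_c_means(points, centers, m=2.0, iters=100, eps=1e-9):
    """One-dimensional Fuzzy C-Means sketch with soft memberships."""
    u = []
    for _ in range(iters):
        # Membership step: u[i][j] is how strongly point i belongs to cluster j.
        u = []
        for x in points:
            d = [max(abs(x - c), eps) for c in centers]
            row = [1.0 / sum((d[j] / d[k]) ** (2 / (m - 1))
                             for k in range(len(centers)))
                   for j in range(len(centers))]
            u.append(row)
        # Update step: each center is the membership-weighted mean.
        centers = [
            sum(u[i][j] ** m * points[i] for i in range(len(points))) /
            sum(u[i][j] ** m for i in range(len(points)))
            for j in range(len(centers))
        ]
    return centers, u
```

A boundary point ends up with comparable membership in both clusters, whereas a point deep inside one cluster has membership near 1 for that cluster and near 0 elsewhere.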

#### 1.2.4 Hierarchical Clustering

Hierarchical clustering, unlike the center-based methods above, generates a hierarchy of partitions by agglomerative and divisive methods. The agglomerative method produces a cluster sequence by repeatedly merging the two closest clusters in the hierarchy; the divisive method works the other way around ^{[8]}. The resulting structure is known as a dendrogram, which provides a straightforward, progressive clustering technique. This method is highly scalable, but its time complexity is very high because of the tree building involved. The algorithm of hierarchical clustering can be found in Table 4.

##### Table 4. Hierarchical Clustering Algorithm.
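The agglomerative (bottom-up) direction can be sketched for one-dimensional points with single linkage. The linkage choice is an illustrative assumption, and the naive all-pairs search is what gives the method the high time complexity mentioned above:

```python
def agglomerative(points, k):
    """Bottom-up clustering with single linkage on 1-D points."""
    clusters = [[p] for p in points]  # start: every point is its own cluster
    while len(clusters) > k:
        # Find the pair of clusters with the smallest single-link distance
        # (closest pair of members, one from each cluster).
        best, best_d = (0, 1), float("inf")
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if d < best_d:
                    best_d, best = d, (i, j)
        i, j = best
        # Merge the closest pair, climbing one level of the dendrogram.
        clusters[i] += clusters.pop(j)
    return clusters
```

Stopping the merging at `k` clusters corresponds to cutting the dendrogram at one level; recording each merge instead would yield the full tree.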

#### 1.2.5 The Present Contributions

Different clustering algorithms have been used in different studies, based on different starting points and criteria, leading to different outcomes and taxonomies of clustering algorithms. Their open challenges, advantages, and usefulness for parallel execution, however, have yet to be discussed. These algorithms have been implemented in parallel in various ways. This study examines the parallel behavior of these algorithms when run with MapReduce.

The remainder of this paper is arranged as follows. Section 2 covers the related work, including recent advancements in the field. Section 3 presents a comparative analysis of the various clustering-based MapReduce algorithms. Finally, Section 4 concludes the paper.

## 2. Related Work

### 2.1 K-means

Lv et al. ^{[9]} implemented one experiment based on a sequential K-Means algorithm in C++ and another based on a parallel K-Means algorithm using Hadoop running on Java. Their aim was to analyze a large number of remote sensing (RS) images; the sequential approach reached its limit when hardware resources and tolerable runtimes became a bottleneck for processing large numbers of RS images. This was overcome by using parallel-execution-based algorithms.

As mentioned before, K-Means fails to work efficiently with large datasets and with datasets of unknown shapes. Li et al. ^{[10]} therefore proposed an improved K-Means based on the ensemble learning method of bagging. Their experiment helped overcome the inefficiency and the sensitivity to outliers, and their results showed that the approach is acceptable for a scalable model.

Shahravari and Jalili ^{[11]} performed an experiment using the mrk-means algorithm, which helps overcome the iterative nature of the traditional K-Means implementation on MapReduce. When K-Means is implemented with MapReduce, the I/O overhead increases tremendously because the whole dataset is read during each iteration. With mrk-means, the process is reduced to a single-pass solution using a re-clustering technique, i.e., the dataset is read only once. Hence, it is much faster than the traditional MapReduce setup.

### 2.2 K-harmonic Means

Zhang et al. ^{[12]} compared the traditional K-Means algorithm, the Expectation Maximization (EM) algorithm, and the K-Harmonic Means algorithm to determine which of the three produces the best output. According to their findings, traditional K-Means and EM share the same flaw: a dependency on, and sensitivity to, the initialization of the centers. The K-Harmonic Means algorithm overcame this issue, improving the clustering quality and providing a faster and better implementation with MapReduce.

Bagde and Tripathi ^{[13]} proposed a hybrid combination of traditional K-Means and K-Harmonic Means to overcome the sensitivity to initial points, the tendency toward local optima, and the excess iterations that arise when the algorithm runs k times unnecessarily. K-Harmonic Means is highly scalable and insensitive to the initialization of the data points. Their experiment provided acceptable results.

Guo and Peng ^{[14]} proposed a further hybrid that combines the K-Harmonic Means (KHM) algorithm with dimensionality reduction. This combination improved the clustering results with less computational time and more efficient iterations. The experiment was motivated by the incompatibility of clustering algorithms with high-dimensional data: KHM is already suitable for large datasets but struggles in higher dimensions. Hence, the most natural strategy, dimensionality reduction, was applied, using Principal Component Analysis as its basis.

### 2.3 Fuzzy C-means

All the algorithms mentioned above perform hard clustering, i.e., each data point belongs strictly to one cluster, but that is not the natural case. In the real world, data points on a cluster boundary cannot strictly be considered part of a single cluster. Hence, a soft clustering technique is required, which allows one data point to be part of more than one cluster at a time. Ludwig ^{[15]} presented such an experiment using the Fuzzy C-Means algorithm on MapReduce and investigated the scalability and accuracy of parallel execution in this soft-clustering scenario. The results show that Fuzzy C-Means is more accurate and realistic but does not scale well.

Dai et al. ^{[16]} presented an experiment using the canopy-clustering concept, which quickly analyzes the provided dataset to solve the scalability issue of Fuzzy C-Means (FCM). With the help of the rapid-acquisition property of the canopy-clustering algorithm ^{[17]}, the convergence rate of FCM was accelerated. Their results showed that this combination provides better clustering quality and higher operating speed.

### 2.4 Hierarchical Clustering

Similar to Lv et al. ^{[9]}, Li et al. ^{[18]} proposed a highly scalable version of Fuzzy C-Means for their research on underwater image segmentation. Because they had to deal with a rapid increase in data, they required parallel execution of the tasks, for which MapReduce was convenient. Their experiment used a two-layer distribution model to distribute the data transfer and free up bandwidth. It yielded better efficiency and worked well at high scalability.

MapReduce was further improved for data mining, which requires faster computation and more memory, but the existing clustering techniques were not efficient in these respects. For that purpose, Sun et al. ^{[19]} proposed a hierarchical clustering method with batch updating and co-occurrence-based feature selection. These methods reduce the computational time and eliminate noisy features. Their experiment showed that the I/O overhead was reduced along with the communication overhead, cutting the total execution time to 1/15 of the original.

Building on the above work, Gao et al. ^{[20]} combined the hierarchical clustering algorithm with the neuron-initialization method of the vector-pressing self-organizing model. Their experiment divided a large text database into data blocks and then distributed these blocks so that they were processed in parallel. It yielded higher efficiency and better performance in text clustering and mining.

## 3. Analysis

### 3.1 Open Challenges

#### 3.1.1 K-means

Recent publications have extended the K-Means algorithm to incorporate extra background information, expressed as instance-level must-link and cannot-link constraints. Numerous extensions exist along these lines, and specifying additional background information in the form of constraints is straightforward to implement. Finding a feasible solution that satisfies all the constraints simultaneously, however, is NP-Complete; existing results establish the feasibility of clustering under each type of constraint individually, and also under many types combined. What is needed is an iterative algorithm that minimizes the constrained vector-quantization error without attempting to satisfy all constraints at each iteration.

#### 3.1.2 K-harmonic Means

K-Harmonic Means resolves the initialization issue, up to a point. The issue of "local minima trapping," however, still exists. Several studies have sought solutions, but the fundamental problem underlying the proposed algorithms remains unsolved. Future studies could apply heuristics that produce non-local moves for the cluster centers; such techniques could also choose the optimal hyperparameters for the optimal solution.

#### 3.1.3 Fuzzy C-means

Several studies have pointed out disadvantages of FCM: (1) only point-based membership; (2) loss of information; (3) slower convergence. Nevertheless, FCM is very useful because it blends well with other algorithms, producing numerous hybrid approaches. This integration of techniques and interdisciplinary study may yield novel solutions to FCM's issues.

#### 3.1.4 Hierarchical Clustering

Hierarchical clustering requires that the distance metric and the linkage criterion be specified explicitly, and there is rarely a solid theoretical foundation for such decisions. This makes the applicability of the technique to modern research questionable.

Determining how to calculate a distance matrix when there are numerous data types is also difficult. There is no simple method for calculating distance when the variables are both quantitative and qualitative. How would one calculate the distance between a 45-year-old man, a 10-year-old girl, and a 46-year-old woman, for example? Formulae exist, but they involve arbitrary decisions.
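One commonly cited formula for mixed data is Gower's distance. The sketch below is a simplified version for the man/girl/woman example above; the equal weighting of fields and the range normalization are exactly the kinds of arbitrary decisions the text refers to, and the field layout `(age, sex)` is an illustrative assumption:

```python
def gower_distance(a, b, numeric_ranges):
    """Simplified Gower-style distance over mixed records such as (age, sex).

    numeric_ranges[i] is the observed range of field i for numeric fields,
    or None for categorical fields.
    """
    total = 0.0
    for x, y, rng in zip(a, b, numeric_ranges):
        if rng is None:
            # Categorical field: 0 if the values match, 1 otherwise.
            total += 0.0 if x == y else 1.0
        else:
            # Numeric field: absolute difference normalized by its range.
            total += abs(x - y) / rng
    return total / len(a)  # arbitrary choice: all fields weighted equally

# Records: (age, sex); age range across the three people is 46 - 10 = 36.
ranges = [36, None]
man, girl, woman = (45, "M"), (10, "F"), (46, "F")
```

Under these choices, the man is much closer to the woman (`~0.51`) than to the girl (`~0.99`), because one year of age difference is offset by the sex mismatch; change the field weights and the ranking can change, which illustrates the arbitrariness.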

### 3.2 Findings

This literature review examines the applications of clustering in various fields from the perspective of MapReduce. In most of the recent research in the area, factors such as precision, effectiveness, and other properties are measured on different datasets. On that basis, specific published papers were selected for this review. Table 5 presents a detailed summary of them in tabular form for the reader's convenience.

##### Table 5. Summary of Related Work.

| # | Author(s) | Technique | Application Area |
|---|-----------|-----------|------------------|
| 1 | Lv et al. | K-Means | Analysis |
| 2 | Li et al. | K-Means | Analysis |
| 3 | Shahravari & Jalili | K-Means | Data control |
| 4 | Zhang et al. | K-Harmonic Means | Network |
| 5 | Bagde & Tripathi | K-Harmonic Means | Comparison while analysis |
| 6 | Guo and Peng | K-Harmonic Means | Higher dimensions |
| 7 | Ludwig | Fuzzy C-Means | Analysis |
| 8 | Dai et al. | Fuzzy C-Means | Analysis |
| 9 | Li et al. | Fuzzy C-Means | Scalability of FCM |
| 10 | Sun et al. | Hierarchical clustering | Data mining |
| 11 | Gao et al. | Hierarchical clustering | Data mining |

None of the researchers used a common benchmark dataset to analyze the outcomes, which makes it difficult to compare the results of the various algorithms mentioned above. When scalability is the main requirement, the disadvantages of hierarchical clustering are overlooked and it is preferred over the K-Means variations. Where a more realistic outlook is required and scalability is not an issue, Fuzzy C-Means is preferred, and its higher accuracy makes it more desirable. The problems associated with hierarchical cluster analysis are being addressed with more contemporary techniques, such as latent class analysis.

Table 6 compares all algorithms based on the Principle Used, Similarity Function, Advantage, Disadvantage, and Time Complexity.

##### Table 6. Algorithms Comparison.

## 4. Conclusion

The current trends in the digital market have boosted the volume of digital information, and the need for insightful analysis of the colossal amount of raw digital data, so as to keep only the relevant data, is expanding. Relevance can be decided by clustering, and MapReduce can process this huge data efficiently because of its parallel distribution technique. The use of clustering in MapReduce, however, draws on different techniques as technology advances. This review paper has studied these techniques. Based on an examination of the different K-Means variations, i.e., K-Harmonic Means, Fuzzy C-Means, and Hierarchical clustering, it can be concluded that the traditional clustering algorithm is insufficient for data clustering. It is also insufficient in the integrated approach, where amalgamation with multiple other algorithms is required. More realistic clusters and reduced execution time are expected from hybrid models. A crossover of these clustering algorithms in MapReduce can provide more exact outcomes because it can deal with considerable unstructured distributed information in parallel. Future researchers may explore the hybrid approach, which can bring better results than the traditional counterparts.

### REFERENCES

## Author

Garvit Chugh received a bachelor's degree in computer engineering from Guru Gobind Singh Indraprastha University in 2020 and a master's degree in computer engineering from the Indian Institute of Technology, Jodhpur in 2022, and is currently pursuing a Ph.D. in Mobile and Pervasive Computing in Computer Engineering in a joint program of the Indian Institutes of Technology Jodhpur and Kharagpur. He is experienced as an Android developer. His research areas include mobile and wireless computing, IoT, telecom networks, software design, and operating systems. He has published various research papers in international journals and conferences and serves as a reviewer for many highly respected journals. He considers himself a 'forever student,' eager both to build on his academic foundations in computers and engineering and to stay in tune with the latest technical strategies through continued coursework and professional development. Garvit has worked in various IT sectors of the Indian Government, such as the Centre for Railway Information Systems and the Airports Authority of India, as an Android developer and has led the teams in every project assigned to him.

Nitesh Singh Bhati received his doctorate from GGISP University, Delhi. He obtained his M. Tech in CSE and B. Tech in Computer Science and Engineering from the G.G.S.I.P. University Delhi. He is working as an Assistant Professor in the Department of Computer Science and Engineering of Delhi Technical Campus, UP. He has published various research papers in international journals and conferences. His current research area is information security.

Puneet Kumar is a Professor in the Department of Computer Science & Engineering at University Institute of Engineering, Chandigarh University, Mohali, India. He completed his Master's and Ph.D. in Computer Science and believes in the philosophy of interdisciplinary research. He has also completed a certificate course on intellectual property rights from the WIPO Academy, Geneva. He has more than 18 years of teaching, research, and industrial experience. His major research interests are Machine Learning, Data Science, and e-government. He has published various research papers and articles in national and international journals, and his papers are widely cited by various stakeholders across the world. He is the recipient of several software copyrights from the Ministry of Human Resource and Development, Government of India. He has also published books on e-governance titled "E-Governance in India: Problems, Prototypes and Prospects", "Stances of e-Government: Policies, Processes and Technology", and "Artificial Intelligence and Global Society: Impact and Practices", published by CRC Press, Taylor and Francis Group. Dr. Kumar is also an active member of the professional societies Computer Society of India (CSI) and Association for Computing Machinery (ACM).

Vishal Bharti is working as a Professor and Additional Director in the Department of CSE at Chandigarh University, Mohali, Punjab. He completed his Ph.D. in 2016 in the area of information security. He received his M.Tech. and B.E. from the Birla Institute of Technology, Mesra, Ranchi. He also holds a Doctorate in Management Studies, in addition to an MBA in IT and an E-MBA in S&M. His areas of specialization are cyber security, network security, and distributed computing. He has 16+ years of mixed experience in both academia and the IT industry. He has seven filed patents, twenty-eight copyrights, thirteen government grants (SERB, DST, NSTEDB, EDI, etc.), and two seed-money grants to his credit. He has published seventy-plus research papers at both the national and international level (IEEE, Springer, Taylor & Francis) and has received the Best Faculty Award 2010, Academic Leadership Award 2019, Emerging Leader in Higher Education 2019, Best Young HoD of the Year 2019, Best Computer Teacher Award 2019, Research Excellence Award in 2020, Outstanding Contribution in Promoting Education in Rural Areas Award in 2021, and Best Young Director of the Year in 2021.