Methodology for Determining the Optimal Clustering Algorithm for Software Quality Verification

Vladyslav Parashchenko; Oleh Berest

doi:10.47839/ijc.24.2.4024

Keywords:

software quality, software metrics, clustering, K-Means, DBSCAN, OPTICS, Affinity Propagation, Gaussian Mixture, DSS

Abstract

This article examines methods for evaluating the quality of clustering algorithms used to identify patterns in codebases, in the context of a decision support system (DSS) module for software quality verification in information and communication systems. A novel feature dictionary is introduced in which evaluation metrics represent each software class as an implementation vector. The metrics are preselected on the basis of the most salient characteristics of program code. Five widely recognized clustering algorithms are evaluated: K-Means, DBSCAN, OPTICS, Affinity Propagation, and Gaussian Mixture Models. The proposed methodology is applied to five Java application projects that implement diverse architectural solutions and software patterns. These applications are distributed under an open license and are readily accessible for research purposes. The source code of the selected software is transformed into vectors by extracting relevant code characteristics, which makes subsequent model training possible. The results confirm the suitability of the proposed feature vector, and the best-performing clustering model was selected for integration into the DSS module for quality assessment in information and communication systems.
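The pipeline outlined in the abstract (class to feature vector, clustering, quality score) can be illustrated with a small standard-library sketch. It is not the authors' implementation. The metric names in `FEATURES`, the six sample vectors, and the use of the silhouette coefficient as the quality criterion are all assumptions made for illustration, since the abstract does not list the feature dictionary or the evaluation measures. K-Means stands in for any of the five algorithms.

```python
from math import dist
from statistics import mean, pstdev

# Hypothetical feature dictionary: the abstract does not list the real metrics.
FEATURES = ("lines_of_code", "method_count", "field_count", "dependency_count")

def standardize(rows):
    """Z-score each column so that no single metric dominates the distance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # constant columns keep scale 1
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sds)) for row in rows]

def k_means(points, k, iterations=50):
    """Lloyd's algorithm with deterministic farthest-point seeding."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    labels = []
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda j: dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(mean(axis) for axis in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; values near 1 mean compact, separated clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:  # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(mean(dist(p, q) for q in other)
                for key, other in clusters.items() if key != l)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return mean(scores)

# Invented sample: three small data-holder classes and three large service classes.
raw_vectors = [
    (20, 2, 3, 1), (25, 3, 4, 1), (18, 2, 2, 0),
    (400, 25, 10, 12), (380, 22, 9, 15), (420, 30, 12, 11),
]
vectors = standardize(raw_vectors)
labels = k_means(vectors, k=2)
score = silhouette(vectors, labels)
print(labels, round(score, 3))
```

Running the same scoring function over the label assignments produced by each candidate algorithm gives one comparable number per algorithm, which is the kind of comparison the methodology relies on. In practice a library implementation of the algorithms and of several complementary quality indices would likely replace these hand-written functions.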

International Journal of Computing
