A Comparative Analysis of Data Stream Clustering Algorithms

Authors

  • Tajudeen Akanbi Akinosho
  • Elias Tabane
  • Zenghui Wang

DOI:

https://doi.org/10.47839/ijc.22.4.3350

Keywords:

Data stream clustering, DenStream, CluStream, ClusTree, MOA, Python

Abstract

This study compares the performance of three stream clustering algorithms (DenStream, CluStream, and ClusTree) in Massive Online Analysis (MOA) using synthetic and real-world datasets. On the synthetic data, the algorithms are compared at noise levels of 0%, 10%, and 30%. The DenStream epsilon parameter was tuned to 0.01 and 0.03 to improve its performance. We use the evaluation metrics CMM, F1-P, F1-R, Purity, Silhouette Coefficient, and Rand statistic. On synthetic data, our results show that ClusTree outperforms CluStream and DenStream on almost all metrics, except Purity and Silhouette Coefficient, where DenStream performs better at noise levels of 10% and 30%. ClusTree also outperforms CluStream and DenStream on the Forest Cover Type dataset on the metrics CMM, F1-P, F1-R, Silhouette Coefficient, and Rand statistic, with 90%, 74%, 77% and 89% respectively. However, tuning the DenStream epsilon parameter yields some improvement. On the electricity data, DenStream with epsilon values of 0.03 and 0.05 outperforms CluStream and ClusTree on the metrics F1-P, F1-R, and Purity. An investigation of the DenStream epsilon parameter (0.03 and 0.05) on the RandomRBF generator with noise levels of 0%, 10%, and 30% shows that DenStream with epsilon 0.03 outperforms the other parameter settings.
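
To make two of the external metrics named above concrete, the following is a minimal Python sketch of Purity and the Rand statistic using their textbook definitions. It is not MOA's implementation, which evaluates clusterings over a sliding horizon of the stream; the function names and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def purity(true_labels, cluster_ids):
    """Fraction of points that fall in the majority class of their cluster."""
    by_cluster = {}
    for label, cid in zip(true_labels, cluster_ids):
        by_cluster.setdefault(cid, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(true_labels)


def rand_statistic(true_labels, cluster_ids):
    """Fraction of point pairs on which clustering and ground truth agree."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = cluster_ids[i] == cluster_ids[j]
        # A pair agrees if it is grouped together in both, or apart in both.
        agree += int(same_class == same_cluster)
        total += 1
    return agree / total


# Hypothetical toy data: six points, two true classes, two found clusters.
toy_classes = ["a", "a", "a", "b", "b", "b"]
toy_clusters = [0, 0, 1, 1, 1, 1]

print(f"purity = {purity(toy_classes, toy_clusters):.3f}")
print(f"rand   = {rand_statistic(toy_classes, toy_clusters):.3f}")
```

On this toy input one point of class "a" is placed in the wrong cluster, so five of six points match their cluster's majority class and ten of the fifteen point pairs are treated consistently. Both metrics need ground-truth labels, which is why they apply directly to labelled benchmark streams such as those used in the study.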

Published

2023-12-31

How to Cite

Akinosho, T. A., Tabane, E., & Wang, Z. (2023). A Comparative Analysis of Data Stream Clustering Algorithms. International Journal of Computing, 22(4), 439-446. https://doi.org/10.47839/ijc.22.4.3350

Section

Articles