Text Document Dimensionality Reduction and Classification using R8, R21578 Data Sets and Machine Learning Models

Suresh Reddy Gali; Sreenivasa Rao Annaluri; N. Sudhakar Yadav; Kranthi Kiran Jeevangar; Bhuvana Manchikatla; Dhanush Gummadavalli; Naga Shivani Karra

doi:10.47839/ijc.24.2.4012

Keywords:

Decision tree, KNN, Logistic regression, Machine learning

Abstract

In today's world of vast textual information, the ability to categorize documents by their content is crucial. A text classification system that automatically assigns predefined categories to documents is invaluable for managing large volumes of text data, because it simplifies access to a required document and speeds up information retrieval. This study applies natural language processing (NLP) and machine learning to determine which classification algorithm achieves the highest accuracy, with the main goal of developing a system that classifies text documents accurately. It describes how several classification algorithms work and evaluates their accuracy. Selecting the right features before feeding the dataset to the classification models is important, so the study removes unnecessary words through preprocessing and uses information gain to select the features that matter most for classification. Among the algorithms considered, logistic regression gave the best results with 98.49% balanced accuracy, followed by KNN with 96.81% and decision tree with 93.30%. Thus, by reducing the dimensionality of the documents before feeding them to classification models, this study aims to provide a method to classify them.
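The abstract names information gain as the feature selection criterion but does not give its exact formulation. The following is a minimal sketch, assuming the common term-presence definition IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)], where C is the class label and t indicates whether a term occurs in a document. The function names, the toy corpus, and the labels are hypothetical illustrations, not the authors' implementation or data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(docs, labels, term):
    """Information gain of the binary feature 'term occurs in the document'."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    # Entropy of the class label after splitting the corpus on term presence.
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def select_top_terms(docs, labels, k):
    """Return the k terms with the highest information gain (ties by name)."""
    vocabulary = sorted(set().union(*docs))
    scored = [(information_gain(docs, labels, t), t) for t in vocabulary]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for _, t in scored[:k]]


# Hypothetical toy corpus: each document is a set of preprocessed tokens.
docs = [
    {"oil", "price", "market"},
    {"oil", "barrel", "market"},
    {"wheat", "harvest", "market"},
    {"wheat", "grain", "market"},
]
labels = ["crude", "crude", "grain", "grain"]

top_terms = select_top_terms(docs, labels, 2)
print(top_terms)
```

In this toy corpus, a term such as "market" that appears in every document carries no class information and scores zero, while a term that occurs in only one class scores the full class entropy. Keeping only the top-scoring terms is what shrinks the feature space before a classifier such as logistic regression, KNN, or a decision tree is trained.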

International Journal of Computing
