Text Document Dimensionality Reduction and Classification using R8, R21578 Data Sets and Machine Learning Models
Keywords:
Decision tree, KNN, Logistic regression, Machine learning
Abstract
In today's world of vast textual information, the ability to categorize documents by their content is crucial. A text classification system that automatically assigns predefined categories to documents is invaluable for managing large volumes of text data, because it simplifies access to a required document and speeds up information retrieval. This study applies natural language processing (NLP) and machine learning to determine which algorithm yields the highest classification accuracy. The main goal is to develop a system that classifies text documents accurately. The study describes how different classification algorithms work and evaluates their accuracy. Selecting the right features before feeding the dataset to the classification models is important, so the study removes unnecessary words through preprocessing and uses information gain to select the features that matter most for classification. Among the algorithms considered, logistic regression gave the best results with 98.49% balanced accuracy, followed by KNN with 96.81% and decision tree with 93.30%. Thus, by reducing the dimensionality of the documents before feeding them to classification models, this study aims to provide a method to classify them.
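The information gain step mentioned in the abstract can be illustrated with a minimal sketch. It assumes binary term presence (a term either occurs in a document or does not) and a hypothetical toy corpus; the function names and data below are illustrative only and are not taken from the study, whose actual preprocessing and implementation are not described here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(docs, labels, term):
    """IG(term) = H(class) - H(class | term present or absent)."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = ((len(with_term) / n) * entropy(with_term)
                   + (len(without_term) / n) * entropy(without_term))
    return entropy(labels) - conditional


def select_top_terms(docs, labels, k):
    """Return the k terms with the highest information gain (ties broken alphabetically)."""
    vocabulary = sorted(set().union(*docs))
    scored = [(information_gain(docs, labels, t), t) for t in vocabulary]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for _, t in scored[:k]]


# Hypothetical toy corpus: each document is a set of already-preprocessed tokens.
toy_docs = [
    {"oil", "price", "barrel"},
    {"oil", "export", "market"},
    {"wheat", "harvest", "market"},
    {"wheat", "grain", "price"},
]
toy_labels = ["crude", "crude", "grain", "grain"]

# Keep only the two most informative terms as features.
top_terms = select_top_terms(toy_docs, toy_labels, k=2)
print(top_terms)
```

In this toy corpus, "oil" and "wheat" separate the two classes perfectly and are ranked first, while a term such as "price" occurs equally in both classes and scores zero. Keeping only the top-ranked terms is one way to reduce dimensionality before training a classifier.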
License
International Journal of Computing is an open access journal. Authors who publish with this journal agree to the following terms:
• Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
• Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
• Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.