Harnessing Pretrained Models for Arabic Idiomatic Expression Identification: LLMs

Authors

  • Salma Tace
  • Mossab Batal
  • Soumaya Ounacer
  • Sanaa El Filali
  • Mohamed Azouazi

Keywords:

Arabic Idiomatic Expressions, AraBERT, Multilingual BERT (mBERT), LLMs, Arabic Natural Language Processing, Deep Learning, Accuracy, F1 Score

Abstract

Idiomatic expressions have drawn growing research attention in recent years, Arabic idiomatic expressions in particular. These phrases, often derived from ancient stories, carry figurative, non-compositional meanings that cannot be inferred from their individual words. In this study, we explore how well large language models (LLMs) understand and identify such expressions. After collecting a dataset of Arabic idiomatic expressions, we preprocessed it and ran a comprehensive set of experiments comparing two models, ChatGPT 4 and Arabic Bidirectional Encoder Representations from Transformers (AraBERT). With 80% of the data used for training and 20% for testing, the results show a strong ability of LLMs to identify idiomatic expressions, with performance reaching up to 95% in both F1 score and accuracy. In the second part of the study, we evaluate how effectively the pretrained AraBERT model detects idiomatic expressions compared with two baseline models, Convolutional Neural Network - Long Short-Term Memory (CNN-LSTM) and Bidirectional Long Short-Term Memory (BiLSTM). The pretrained AraBERT model outperforms the conventional CNN-LSTM model by 14% in accuracy and F1 score, and the BiLSTM model by 22%.
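
The evaluation setup named in the abstract (an 80/20 train/test split scored by accuracy and F1) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy sentences, and the binary labels (1 = idiomatic, 0 = literal) are assumptions made only for illustration, and the paper does not state whether the split was random or stratified.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and return (train, test) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy dataset: (sentence, label) pairs, label 1 = idiomatic.
data = [(f"sentence {i}", i % 2) for i in range(10)]
train, test = train_test_split(data)  # 8 training and 2 test examples

# Hypothetical gold labels and model predictions for five test sentences.
gold = [1, 1, 1, 0, 0]
pred = [1, 1, 0, 0, 1]
print(accuracy(gold, pred), f1_score(gold, pred))
```

In the toy predictions above, three of five labels match, and the positive class has two true positives, one false positive, and one false negative, so precision and recall are both 2/3. Reporting F1 alongside accuracy is useful when idiomatic and literal examples are not evenly balanced.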

Published

2025-07-01

How to Cite

Tace, S., Batal, M., Ounacer, S., El Filali, S., & Azouazi, M. (2025). Harnessing Pretrained Models for Arabic Idiomatic Expression Identification: LLMs. International Journal of Computing, 24(2), 284-290. Retrieved from https://www.computingonline.net/computing/article/view/4011
