Variations of the CorDeGen+ Method for the Languages of Northern European Countries

Yakiv Yusyn

doi:10.47839/ijc.25.1.4500

Keywords:

Text Corpora, Corpora Generation, Software Engineering, Software Testing, Northern European

Abstract

This study addresses the problem of generating text corpora for use in the development and testing of natural language processing information systems. The CorDeGen and CorDeGen+ methods are among the approaches to this problem. However, as this paper shows, applying these methods to the development and testing of systems that process texts in “regional” languages (less widely spoken than English) has not yet been considered, although it poses its own challenges. This study treats the languages of Northern Europe as such “regional” languages and solves the problem that some terms are removed from the generated corpora during preprocessing because they coincide with stop words of these languages. To this end, the paper proposes seven new language variations of the CorDeGen+ method, for the Lithuanian, Danish, Swedish, Norwegian, Northern Sami, Lule Sami, and Icelandic languages. The Latvian, Estonian, Finnish, and Southern Sami languages are also considered, and the results show that the CorDeGen+^(0-9) variation, already described in the literature, is sufficient for them. Experimental verification showed that using the proposed language variations and the CorDeGen+^(0-9) variation prevents the removal of 20–43% of all terms from the generated corpus during preprocessing.
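The effect described in the abstract, generated terms being lost when they coincide with stop words, can be illustrated with a minimal Python sketch. It is not the CorDeGen+ method itself: the function name `removed_share`, the toy term list, and the four-word stop list are all assumptions made for illustration, and the sketch counts distinct terms, which may differ from how the paper measures the share of removed terms.

```python
def removed_share(terms, stop_words):
    """Return the fraction of distinct terms that a stop-word filter would remove.

    Hypothetical helper for illustration only; it is not part of CorDeGen+.
    """
    unique_terms = set(terms)
    if not unique_terms:
        return 0.0
    # A term is lost during preprocessing if it matches a stop word exactly.
    removed = {t for t in unique_terms if t.lower() in stop_words}
    return len(removed) / len(unique_terms)


# Toy data (assumption): short synthetic terms, not actual CorDeGen+ output.
toy_terms = ["a", "de", "ab", "c0", "f1", "er", "bd", "e2"]
# A tiny subset of Danish stop words, chosen for the example.
toy_stop_words = {"de", "er", "en", "og"}

share = removed_share(toy_terms, toy_stop_words)
print(f"{share:.0%} of the toy terms would be removed by stop-word filtering")
```

In this toy setting, only terms built from characters that never form a stop word (for example, those containing digits) are guaranteed to survive filtering. The sketch is meant only to show how such a share can be measured.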

International Journal of Computing
