Empowering Smaller Models: Tuning LLaMA and Gemma with Chain-of-Thought for Ukrainian Exam Tasks

Authors

  • Mykyta Syromiatnikov
  • Victoria Ruvinskaya
  • Nataliia Komleva

Keywords

LLM, LLaMA, Gemma, PEFT, Chain-of-Thought, fine-tuning, reasoning, Ukrainian, information technology

Abstract

Leading large language models have demonstrated impressive capabilities in reasoning-intensive tasks, such as standardized educational testing. However, they often require extensive training and infrastructure that are inaccessible in low-resource settings. Small or compact models, though more efficient, frequently lack sufficient support for underrepresented languages, leaving a performance gap in critical domains. This work explores the potential of parameter-efficient fine-tuning of compact open-weight language models to handle reasoning-intensive tasks in the underrepresented Ukrainian language, building on the findings of the ZNO-Eval benchmark. Parameter-efficient fine-tuning of the LLaMA 3.1 (8 billion parameters), LLaMA 3.2 (3 billion parameters), and Gemma 2 (9 billion parameters) models on chain-of-thought solutions improved test scores by up to 17.4% on complex matching tasks and by a modest 1.6% overall compared with tuning on answer letters alone, while offering enhanced interpretability and robustness. In addition, the proposed tuning method with joint task topic and step-by-step solution generation outperforms standard chain-of-thought tuning on matching tasks and provides a 5.4% gain over the best LLaMA 3.2 model by guiding the model to recall and apply domain-relevant information. Contrasting the obtained results with zero-shot evaluations of leading open-weight and proprietary models such as Qwen, DeepSeek R1, OpenAI o1 and o3, Gemini, and Claude highlights that fine-tuning LLaMA and Gemma models with 2,032 step-by-step solutions and 20 to 50 million trainable parameters on a single A100 GPU lets them outperform GPT-4o mini, Mistral Large, and larger open-weight models.
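To make the reported setup concrete, the sketch below shows how a compact model might be prepared for parameter-efficient fine-tuning on chain-of-thought exam solutions using the Hugging Face transformers, peft, and bitsandbytes libraries. The checkpoint name, LoRA rank, target modules, and example formatting are illustrative assumptions, not the exact configuration used in the paper.

# Minimal sketch: a 4-bit quantized LLaMA 3.2 3B model with LoRA adapters,
# ready for supervised tuning on chain-of-thought exam solutions.
# Assumptions: transformers, peft, and bitsandbytes are installed, the
# checkpoint is accessible, and all hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-3.2-3B-Instruct"  # assumed checkpoint

# Load the base model in 4-bit NF4 so it fits on a single GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters to the attention projections only, leaving
# roughly tens of millions of parameters trainable.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# A training example pairs the exam question with a target that contains
# the task topic, a step-by-step solution, and the final answer letter.
example_text = tokenizer.apply_chat_template(
    [
        {"role": "user", "content": "Exam question in Ukrainian ..."},
        {"role": "assistant", "content": "Topic: ...\nSolution: ...\nAnswer: A"},
    ],
    tokenize=False,
)
# From here, a standard causal language modeling fine-tuning loop over
# such formatted examples completes the setup.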

References

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020.

P. Rajpurkar, The Stanford Question Answering Leaderboard, 2025, [Online]. Available at: https://rajpurkar.github.io/SQuAD-explorer/.

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. X. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” Proceedings of the Ninth International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.

B. Romera-Paredes, M. Barekatain, A. Novikov, M. Balog, M. P. Kumar, E. Dupont, F. J. Ruiz, J. S. Ellenberg, P. Wang, O. Fawzi, P. Kohli, A. Fawzi, J. Grochow, A. Lodi, J. Mouret, T. Ringer and T. Yu, “Mathematical discoveries from program search with large language models,” Nature, vol. 625, no. 7995, pp. 468–475, 2024. https://doi.org/10.1038/s41586-023-06924-6.

OpenAI, GPT-4o System Card, 2024, [Online]. Available at: https://arxiv.org/abs/2410.21276.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, “Attention Is All You Need,” in Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, December 4-9 2017, pp. 5998-6008.

J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, S.K. Sanghai, “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints,” Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pp. 4895–4901. https://doi.org/10.18653/v1/2023.emnlp-main.298.

J. Kaplan, S. McCandlish, T. Henighan, T.B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, D. Amodei, Scaling Laws for Neural Language Models, 2020, [Online]. Available at: https://arxiv.org/abs/2001.08361.

D. Driess, F. Xia, M.S.M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y. Chebotar, P. Sermanet, D. Duckworth, S. Levine, V. Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, P. Florence, “PaLM-E: An Embodied Multimodal Language Model,” Proceedings of the 40th International Conference on Machine Learning, ICML 2023, Honolulu, Hawaii, USA, July 23-29 2023, pp. 8469-8488.

N.O. Komleva, K.S. Cherneha, B.I. Tymchenko, O.M. Komlevoy, “Intellectual Approach Application for Pulmonary Diagnosis,” IEEE First International Conference Data Stream Mining & Processing (DSMP), Lviv, Ukraine, August 23–27, 2016, pp. 48–52. https://doi.org/10.1109/DSMP.2016.7583505.

J. Myung, N. Lee, Y. Zhou, J. Jin, R.A. Putri, D. Antypas, H. Borkakoty, E. Kim, C. Pérez-Almendros, A.A. Ayele, V. Gutiérrez-Basulto, Y. Ibáñez-García, H. Lee, S.H. Muhammad, K. Park, A. Rzayev, N. White, S.M. Yimam, M.T. Pilehvar, N. Ousidhoum, J. Camacho-Collados, A. Oh, “BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages,” Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024), Vancouver, Canada, December 10-15 2024.

M.V. Syromiatnikov, V.M. Ruvinskaya, A.S. Troynina, “ZNO-Eval: Benchmarking reasoning capabilities of large language models in Ukrainian,” Informatics. Culture. Technology, vol. 1, no. 1, pp. 185–191, 2024. https://doi.org/10.15276/ict.01.2024.27.

DeepSeek-AI, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, 2025, [Online]. Available at: https://arxiv.org/abs/2501.12948.

Z. Han, C. Gao, J. Liu, J. Zhang, S.Q. Zhang, Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey, 2024, [Online]. Available at: https://arxiv.org/abs/2403.14608.

P. Sahoo, A.K. Singh, S. Saha, V. Jain, S.S. Mondal, A. Chadha, A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications, 2024, [Online]. Available at: https://arxiv.org/abs/2402.07927.

H.W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S.S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, D. Valter, S. Narang, G. Mishra, A.W. Yu, V. Zhao, Y. Huang, A.M. Dai, H. Yu, S. Petrov, E.H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q.V. Le, J. Wei, “Scaling Instruction-Finetuned Language Models,” Journal of Machine Learning Research, vol. 25, no. 70, pp. 1-53, 2024.

K. Alizadeh-Vahid, I. Mirzadeh, D. Belenko, K. Khatamifard, M. Cho, C.C. Del Mundo, M. Rastegari, M. Farajtabar, “LLM in a Flash: Efficient Large Language Model Inference with Limited Memory,” Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, August 11–16, 2024, pp. 12562–12584. https://doi.org/10.18653/v1/2024.acl-long.678.

Gemma Team, Gemma 2: Improving Open Language Models at a Practical Size, 2024, [Online]. Available at: https://arxiv.org/abs/2408.00118.

Llama Team, The Llama 3 Herd of Models, 2024, [Online]. Available at: https://arxiv.org/abs/2407.21783.

Meta AI, Introducing Llama 3.1: Our Most Capable Models to Date, 2024, [Online]. Available at: https://ai.meta.com/blog/meta-llama-3-1/.

Meta AI, Llama 3.2: Revolutionizing Edge AI and Vision with Open, Customizable Models, 2024, [Online]. Available at: https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/.

N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, S. Gelly, Parameter-Efficient Transfer Learning for NLP, 2019, [Online]. Available at: https://arxiv.org/abs/1902.00751.

X. Li, P. Liang, Prefix-Tuning: Optimizing Continuous Prompts for Generation, 2021, [Online]. Available at: https://arxiv.org/abs/2101.00190.

V. Lialin, V. Deshpande, A. Rumshisky, Scaling down to scale up: A guide to parameter-efficient fine-tuning, 2023, [Online]. Available at: https://arxiv.org/abs/2303.15647.

E.J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-Rank Adaptation of Large Language Models, 2021, [Online]. Available at: https://arxiv.org/abs/2106.09685.

B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, USA, June 18-22, 2018, pp. 2704–2713. https://doi.org/10.1109/CVPR.2018.00286.

S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and D. S. Modha, “Learned Step Size Quantization,” The Eighth International Conference on Learning Representations (ICLR 2020), Online, April 26–May 1, 2020. https://doi.org/10.48550/arXiv.1902.08153.

T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, QLoRA: Efficient Finetuning of Quantized LLMs, 2023, [Online]. Available at: https://arxiv.org/abs/2305.14314.

M.V. Syromiatnikov, V.M. Ruvinskaya, “UA-LLM: Advancing Context-Based Question Answering in Ukrainian Through Large Language Models,” Radio Electronics, Computer Science, Control, no. 1, pp. 147–161, 2024. https://doi.org/10.15588/1607-3274-2024-1-14.

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E.H. Chi, Q. Le, D. Zhou, Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, 2022, [Online]. Available at: https://arxiv.org/abs/2201.11903.

J. Liu, A. Liu, X. Lu, S. Welleck, P. West, R. Le Bras, Y. Choi, H. Hajishirzi, “Generated Knowledge Prompting for Commonsense Reasoning,” Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, May 22–27, 2022, pp. 3154–3169. https://doi.org/10.18653/v1/2022.acl-long.225.

X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou, Self-Consistency Improves Chain of Thought Reasoning in Language Models, 2023, [Online]. Available at: https://arxiv.org/abs/2203.11171v4.

S. Yao et al., Tree of Thoughts: Deliberate Problem Solving with Large Language Models, 2023, [Online]. Available at: https://arxiv.org/abs/2305.10601.

V.M. Ruvinskaya, A.S. Troynina, “Development of information technology for the generation and maintenance of knowledge-oriented control systems,” Eastern-European Journal of Enterprise Technologies, vol. 2, no. 86, pp. 41–49, 2017. https://doi.org/10.15587/1729-4061.2017.98727.

C. Molnar, Interpretable Machine Learning: A Guide for Making Black Box Models Explainable, 2025, [Online]. Available at: https://christophm.github.io/interpretable-ml-book/

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, Ł. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, J. Schulman, Training Verifiers to Solve Math Word Problems, 2021, [Online]. Available at: https://arxiv.org/abs/2110.14168.

A. Srivastava et al., “Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models,” Transactions on Machine Learning Research, vol. 5, pp. 1–95, 2023. https://doi.org/10.48550/arXiv.2206.04615.

M. Hardalov, T. Mihaylov, D. Zlatkova, Y. Dinkov, I. Koychev, and P. Nakov, “EXAMS: A Multi-subject High School Examinations Dataset for Cross-lingual and Multilingual Question Answering,” Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, November 16–20, 2020, pp. 5427–5444. https://doi.org/10.18653/v1/2020.emnlp-main.438.

J. C. de Winter, “Can ChatGPT pass high school exams on English language comprehension?,” International Journal of Artificial Intelligence in Education, vol. 34, no. 3, pp. 915–930, 2024. https://doi.org/10.1007/s40593-023-00372-z.

R. Darģis, G. Barzdins, I. Skadiņa, N. Gruzitis, B. Saulīte, “Evaluating open-source LLMs in low-resource languages: Insights from Latvian high school exams,” Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities (NLP4DH 2024), Bangkok, Thailand, November 16, 2024, pp. 289–293. https://doi.org/10.18653/v1/2024.nlp4dh-1.28.

G. G. Lee, E. Latif, X. Wu, N. Liu, and X. Zhai, “Applying large language models and chain-of-thought for automatic scoring,” Computers and Education: Artificial Intelligence, vol. 6, 100213, 2024. https://doi.org/10.1016/j.caeai.2024.100213.

M. Romanyshyn, O. Syvokon, R. Kyslyi, “The UNLP 2024 Shared Task on Fine-Tuning Large Language Models for Ukrainian,” Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024, Torino, Italia, May 25, 2024, pp. 67–74.

A. Kiulian, A. Polishko, M. Khandoga, O. Chubych, J. Connor, R. Ravishankar, A. Shirawalmath, “From Bytes to Borsch: Fine-Tuning Gemma and Mistral for the Ukrainian Language Representation,” Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024, Torino, Italia, May 25, 2024, pp. 83–94.

Y. Paniv, A. Kiulian, D. Chaplynskyi, M. Khandoga, A. Polishko, T. Bas, G. Gabrielli, Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains, 2024. https://doi.org/10.18653/v1/2025.unlp-1.2.

Z. Li, Y. Su, R. Yang, C. Xie, Z. Wang, Z. Xie, N. Wong, H. Yang, Quantization Meets Reasoning: Exploring LLM Low-Bit Quantization Degradation for Mathematical Reasoning, 2025, [Online]. Available at: https://arxiv.org/abs/2501.03035.

Osvita.ua, Ukrainian ZNO exams, 2024, [Online]. Available at: https://zno.osvita.ua/ukrainian/.

S. Balloccu, P. Schmidtová, M. Lango, O. Dušek, Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs, 2024. https://doi.org/10.18653/v1/2024.eacl-long.5.

Bitsandbytes Foundation, bitsandbytes, 2025, [Online]. Available at: https://github.com/bitsandbytes-foundation/bitsandbytes.

Ukrainian Center for Education Quality Assessment. Official reports, 2025, [Online]. Available at: https://testportal.gov.ua/ofzvit/

Published

2026-01-01

How to Cite

Syromiatnikov, M., Ruvinskaya, V., & Komleva, N. (2026). Empowering Smaller Models: Tuning LLaMA and Gemma with Chain-of-Thought for Ukrainian Exam Tasks. International Journal of Computing, 24(4), 814-825. Retrieved from https://www.computingonline.net/computing/article/view/4349

Issue

Section

Articles