Indo-WDSimpleQuAD2.0: an Indonesian Benchmark Dataset for Knowledge Graph Question Answering System
Keywords:
Indonesian benchmark, Indonesian dataset, KGQA, KGQA system evaluation
Abstract
We propose Indo-WDSimpleQuAD2.0, a silver-standard Indonesian-language benchmark dataset for knowledge graph question answering (KGQA), developed from SimpleQuestions and LC-QuAD 2.0 and based on Wikidata. The dataset addresses the current absence of a representative KGQA benchmark dataset in the Indonesian language. SimpleQuestions and LC-QuAD 2.0 were chosen because, in terms of question type variety and complexity, they serve as supersets of the other available datasets. Indo-WDSimpleQuAD2.0 comprises 27,924 questions from SimpleQuestions and 31,821 questions from LC-QuAD 2.0. It was developed through a rigorous translation process carried out by English language experts and native Indonesian speakers in three stages: initial translation, validation and verification, and finalization of the translation. To ensure the quality of the dataset, the authors applied four criteria: translation accuracy, writing quality, semantic integrity, and annotation process. Indo-WDSimpleQuAD2.0 can serve as the first Indonesian-language KGQA benchmark dataset based on Wikidata, supporting future research on and development of Indonesian KGQA systems.
License
International Journal of Computing is an open access journal. Authors who publish with this journal agree to the following terms:
• Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
• Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
• Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.