Large Language Models; Rationale Extraction; Classifier; Natural Language Processing; Reinforcement Learning; Fine Tuning; Embedding
Abstract :
[en] In this paper, we address the problem of Rationale Extraction (RE) in Natural Language Processing: given a context (C), a related question (Q) and its answer (A), the task is to find the best sentence-level rationale (R*). This rationale is loosely defined as the subset of sentences of C such that producing A requires at least R*. We have constructed a dataset in which each entry consists of the four elements (C, Q, A, R*) to explore different methods in the particular case where the answer is one or more full sentences. The methods studied are based on TF-IDF scores, embedding similarity, classifiers and attention, and have been evaluated with a sentence overlap metric akin to the Intersection over Union (IoU). Results show that the classifier-based approach achieves the best scores, with the nuance that the attention-based method scales better as the size of the context increases, which is a challenge for all the other methods. We also show that generating A significantly decreases the performance of the attention-based method, whereas training the model to generate A can improve the results, linking the ability to generate with the accomplishment of the task.
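To make the setup described in the abstract concrete, below is a minimal Python sketch of a TF-IDF sentence-ranking baseline and a sentence-level overlap score akin to the Intersection over Union (IoU). The function names, the `top_k` parameter, and the toy context/question are illustrative assumptions, not the paper's actual implementation or data.

```python
# Minimal sketch (not the paper's code): a TF-IDF baseline that ranks
# context sentences against the question, and a sentence-level
# Intersection-over-Union score between a predicted rationale and the
# reference rationale R*. All names and toy data below are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def tfidf_rationale(context_sentences, question, top_k=2):
    """Return the indices of the top_k context sentences most similar to the question."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(context_sentences + [question])
    n = len(context_sentences)
    # Cosine similarity between the question vector and each sentence vector.
    scores = cosine_similarity(matrix[n], matrix[:n]).ravel()
    ranked = sorted(range(n), key=lambda i: -scores[i])
    return set(ranked[:top_k])


def sentence_iou(predicted, reference):
    """Sentence-level overlap akin to IoU: |P ∩ R*| / |P ∪ R*| over sentence indices."""
    predicted, reference = set(predicted), set(reference)
    union = predicted | reference
    return len(predicted & reference) / len(union) if union else 1.0


if __name__ == "__main__":
    context = [
        "The committee met in Kanazawa in July.",
        "It approved a new dataset of question-answer pairs with rationales.",
        "The city is known for its gardens.",
    ]
    question = "Which dataset of question-answer pairs was approved?"
    prediction = tfidf_rationale(context, question, top_k=1)
    # With this toy data the baseline selects sentence 1, giving an IoU of 1.0.
    print(prediction, sentence_iou(prediction, reference={1}))
```

This is only meant to illustrate the evaluation idea; the paper also studies embedding-similarity, classifier-based and attention-based methods, which are not sketched here.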
Research Center/Unit :
Montefiore Institute - Montefiore Institute of Electrical Engineering and Computer Science - ULiège
Disciplines :
Computer science
Author, co-author :
Pirenne, Lize ✱; Université de Liège - ULiège > Montefiore Institute of Electrical Engineering and Computer Science
Mokeddem, Samy ✱; Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Smart grids
Ernst, Damien ; Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Smart grids
Louppe, Gilles ; Université de Liège - ULiège > Département d'électricité, électronique et informatique (Institut Montefiore) > Big Data
✱ These authors have contributed equally to this work.
Language :
English
Title :
Exploration of Rationale-Extraction Methods for Closed-Domain Question Answering with a New Sentence-Level Rationale Dataset
Publication date :
01 July 2025
Event name :
30th Annual International Conference on Natural Language & Information Systems (NLDB 2025)
Event place :
Kanazawa, Japan
Event date :
4 July 2025 - 6 July 2025
Audience :
International
Main work title :
Natural Language Processing and Information Systems
Main work alternative title :
[en] NLDB25
Editor :
Ryutaro Ichise; Institute of Science Tokyo, Tokyo, Japan