Table of Links
2. Background
2.1 Effective Tutoring Practice
2.2 Feedback for Tutor Training
2.3 Sequence Labeling for Feedback Generation
2.4 Large Language Models in Education
3. Method
3.1 Dataset and 3.2 Sequence Labeling
3.3 GPT Facilitated Sequence Labeling
4. Results
6. Limitations and Future Works
APPENDIX
B. Input for Fine-Tuning GPT-3.5
C. Scatter Matrix of the Correlation on Outcome-based Praise
D. Detailed Results of Fine-Tuned GPT-3.5 Model's Performance
2.4 Large Language Models in Education
Recent advancements in natural language processing have seen large language models (LLMs), such as the GPT models, evaluated on a variety of educational tasks through techniques such as prompting and fine-tuning [30]. GPT models (e.g., GPT-3.5 and GPT-4) have demonstrated significant potential for enhancing many educational tasks (e.g., feedback generation and learning content generation) [30]. Motivated by these developments, our study investigates the applicability of prompting and fine-tuning GPT models to identify both the desirable and less desirable aspects of tutoring responses, and evaluates the effectiveness of these approaches for developing an automated system that provides explanatory feedback.
2.4.1 Prompting large language models
Prompting, the use of specific queries or statements to guide an LLM's output, has been identified as a significant technique for leveraging the capabilities of LLMs in education [30]. The prompting strategy plays a pivotal role in guiding models such as GPT-3 and GPT-4 to produce responses that align with the context and requirements of a task. Research by Dai et al. [10] on the GPT-3.5 and GPT-4 models highlighted their ability to generate student feedback that surpassed human instructors in readability. Furthermore, Hirunyasiri et al. [24] demonstrated the superiority of the GPT-4 model over human expert tutors in assessing specific tutoring practices. The study in [34] used the GPT-4 model to generate high-quality answer responses for middle school math questions, and the work in [45] provided feedback on multiple-choice questions at the middle-school math level. Given that GPT models have shown remarkable performance on various educational tasks [9, 24, 34, 45], and given that prompting GPT models to provide explanatory feedback in response to open-ended questions remains underexplored, our study leverages GPT models to further examine their capability to automatically generate such feedback.
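To make the prompting setup concrete, the sketch below shows how such a query might be issued programmatically. It is a minimal illustration assuming the OpenAI Python SDK (v1 interface) and an `OPENAI_API_KEY` in the environment; the prompt wording, label names, and the `assess_tutor_response` helper are placeholders of our own, not the prompts used in the studies cited above.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative placeholder prompt, not the study's actual prompt.
PROMPT_TEMPLATE = (
    "You are a tutor-training assistant. Identify the desired and "
    "undesired components of the following tutor response:\n"
    "{response}\n"
    "List the desired spans and the undesired spans."
)

def assess_tutor_response(response_text: str, model: str = "gpt-4") -> str:
    """Send a task-specific prompt and return the model's raw answer."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(response=response_text)}],
        temperature=0,  # near-deterministic output for evaluation
    )
    return completion.choices[0].message.content

print(assess_tutor_response("Great effort! You are so smart at this."))
```

Setting `temperature=0` keeps the output close to deterministic, which helps when model judgments are compared against human annotations as in the evaluations discussed above.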
2.4.2 Fine-tuning large language models
In addition to prompting GPT models, fine-tuning them has also shown considerable promise across various educational tasks [30]. Fine-tuning adjusts the model's parameters to better suit a particular domain, thereby enhancing its performance in relevant contexts [26]. Latif and Zhai [33] employed a fine-tuned GPT-3.5 model for automatic scoring in science education. Their findings indicate that GPT-3.5, once fine-tuned with domain-specific data, not only surpassed the performance of the established BERT model [12] but also demonstrated superior accuracy across a variety of science education tasks. Such advancements underscore the value of fine-tuning GPT models for educational applications, showcasing their ability to provide precise, scalable solutions across diverse educational settings. Bhat et al. [3] introduced a method for generating assessment questions from text-based learning materials using a fine-tuned GPT-3 model. Human experts then assessed the generated questions with regard to their usefulness for the learning outcomes, and the findings revealed a favorable reception.

Inspired by this pioneering research, our study extends the fine-tuning of GPT models to the generation of explanatory feedback. While the aforementioned studies [3, 33] did not directly address the generation of explanatory feedback, their success in applying fine-tuned LLMs within educational domains suggests a promising avenue for our investigation. By customizing GPT models to the nuances of educational feedback, we anticipate uncovering new potential for automating and enhancing the feedback process. These efforts contribute to the growing body of evidence supporting the integration of fine-tuned LLMs in educational technology, potentially transforming the way feedback is generated and applied in learning environments.
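As a concrete illustration of this workflow, the sketch below prepares a small chat-format training file and launches a fine-tuning job. It assumes the OpenAI Python SDK (v1 interface); the file name, example texts, and label scheme are hypothetical placeholders rather than the dataset used in this study (the actual fine-tuning input is shown in Appendix B).

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical annotated examples: each pairs a tutor response (user turn)
# with its annotation (assistant turn), in the chat-format JSONL expected
# by the fine-tuning endpoint for gpt-3.5-turbo.
examples = [
    {"messages": [
        {"role": "system",
         "content": "Label the desired and undesired parts of the tutor response."},
        {"role": "user", "content": "Great effort! You are so smart."},
        {"role": "assistant",
         "content": "desired: 'Great effort!' | undesired: 'You are so smart.'"},
    ]},
    # ... more labeled examples; real jobs need substantially more data ...
]

with open("tutor_feedback_train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Upload the training file, then start the fine-tuning job.
training_file = client.files.create(
    file=open("tutor_feedback_train.jsonl", "rb"),
    purpose="fine-tune",
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id)  # poll the job status until it yields a fine-tuned model name
```

Once the job completes, the resulting fine-tuned model name can be passed to the same chat-completions call shown earlier in place of the base model.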
This paper is available on arXiv under a CC BY 4.0 DEED license.
Authors:
(1) Jionghao Lin, Carnegie Mellon University (jionghal@cs.cmu.edu);
(2) Eason Chen, Carnegie Mellon University (easonc13@cmu.edu);
(3) Zifei Han, University of Toronto (feifei.han@mail.utoronto.ca);
(4) Ashish Gurung, Carnegie Mellon University (agurung@andrew.cmu.edu);
(5) Danielle R. Thomas, Carnegie Mellon University (drthomas@cmu.edu);
(6) Wei Tan, Monash University (wei.tan2@monash.edu);
(7) Ngoc Dang Nguyen, Monash University (dan.nguyen2@monash.edu);
(8) Kenneth R. Koedinger, Carnegie Mellon University (koedinger@cmu.edu).