Table of Links
2. Background
2.1 Effective Tutoring Practice
2.2 Feedback for Tutor Training
2.3 Sequence Labeling for Feedback Generation
2.4 Large Language Models in Education
3. Method
3.1 Dataset and 3.2 Sequence Labeling
3.3 GPT Facilitated Sequence Labeling
4. Results
6. Limitation and Future Works
APPENDIX
B. Input for Fine-Tuning GPT-3.5
C. Scatter Matrix of the Correlation on the Outcome-based Praise
D. Detailed Results of Fine-Tuned GPT-3.5 Model's Performance
5. DISCUSSION
Our study examined the potential of GPT models to highlight the desired and undesired parts of praise in trainee responses and further integrated the highlighted parts into the feedback for tutor training. By employing the Modified Intersection over Union (M-IoU) as a novel metric, we measured the quality of the praise highlighted by GPT models. The M-IoU metric, validated through its correlation with human coders’ ratings, underscores the potential of combining human intuition with algorithmic metrics to enhance the specificity and relevance of educational feedback. The findings from our investigation confirmed the considerable promise of techniques such as prompting and fine-tuning GPT models to generate automated, explanatory feedback tailored for tutor training programs. By leveraging a fine-tuned GPT model, we developed an automated feedback system specifically designed for tutor training, with the objective of delivering immediate and explanatory feedback. This innovation presents a viable, scalable solution to the pressing challenge of delivering personalized feedback to learners (trainee tutors, in our study). The implementation of an automated explanatory feedback system in our study exemplifies how such technology can be leveraged to identify specific elements in tutors’ open-ended responses that are either desirable or in need of enhancement.
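To make the scoring concrete, the sketch below computes a plain token-level intersection-over-union between a model-highlighted span and an expert-annotated span. It is only an illustration: the function name, word-level tokenization, and example spans are invented for this sketch, and the M-IoU used in our evaluation modifies this basic overlap as described in Section 3.2.

# Illustrative token-level IoU between a model-highlighted span and an
# expert-annotated span. The M-IoU reported in the paper modifies this
# basic overlap (see Section 3.2); this sketch only shows the core idea.

def token_iou(predicted: str, annotated: str) -> float:
    """Return |intersection| / |union| over lower-cased word tokens."""
    pred_tokens = set(predicted.lower().split())
    gold_tokens = set(annotated.lower().split())
    if not pred_tokens and not gold_tokens:
        return 1.0  # both spans empty: treat as perfect agreement
    return len(pred_tokens & gold_tokens) / len(pred_tokens | gold_tokens)

# Hypothetical spans: the model includes one extra word beyond the annotation.
print(token_iou("keep up this hard work today", "keep up this hard work"))  # about 0.83

As in the coders’ ratings discussed below, extra words included by a model lower the overlap score even when the core praise phrase is found.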
Prompting GPT model to highlight key components. Upon evaluating the highlighted praise components from the GPT models (prompting) and the expert annotations, we observe that the praise components highlighted by experts typically outperform those highlighted by the GPT models. For instance, as indicated in the first row of Table 5, there is unanimous agreement among the human coders that the praise components highlighted by the expert annotation are better than those highlighted by either GPT model. Specifically, while the GPT-3.5 model accurately identified phrases such as “doing a great job” and “Stick with this” as forms of praise, it erroneously categorized “doing a great job” as effort-based praise, contrary to the established praise principle, which classifies it as outcome-based praise [55]. Conversely, the GPT-4 model correctly classified “doing a great job” as outcome-based praise but included additional words such as “We can finish it” in its identification of effort-based praise. Comparing the additional words included in the effort-based praise annotations of the GPT-3.5 and GPT-4 models resulted in identical scores from the coders (0.5 in Table 5, reflecting a neutral stance equivalent to a score of 3 on the Likert scale). The identical scoring stems from the equal number of additional words identified by both models. Furthermore, the M-IoU score aligns with the coders’ assessments, underscoring the metric’s utility in capturing the accuracy of the models’ annotations. In another observation, detailed in the second row of Table 5, both coders concurred that in certain instances the prompted GPT-3.5 model’s identification of praise components was superior to the expert annotations. Additionally, the third row of Table 5 presents a scenario with significant discrepancies between the two coders’ ratings of the praise components highlighted by GPT-3.5 and GPT-4. Here, the M-IoU score proved instrumental in mitigating the variance in individual assessments, effectively approximating the average of both coders’ ratings.
Fine-tuning GPT model to highlight key components. Our study then assessed the impact of fine-tuning the GPT-3.5 model with different amounts of training data to determine the optimal dataset size required to achieve satisfactory performance in generating explanatory feedback. This insight is important for researchers and educational practitioners seeking to use LLMs effectively, especially when faced with constraints on data availability. Our findings highlight the critical role of task-specific optimization for LLMs, illustrating how strategic adjustments to the quantity of training data can markedly enhance the performance of automated feedback systems. By identifying the minimum dataset requirements for fine-tuning GPT models, our study provides valuable guidelines for developing effective explanatory feedback. Furthermore, by combining our proposed prompting strategies with a sufficient amount of training data, we found that the fine-tuned GPT-3.5 model generally outperformed the prompting-based models (both GPT-3.5 and GPT-4) in identifying praise elements (including effort- and outcome-based praise). This suggests that, despite the general advancements represented by newer models such as GPT-4, fine-tuning earlier versions such as GPT-3.5 can achieve comparable or even superior performance in specific applications. This insight is important for educational practitioners and researchers, particularly those constrained by financial limitations, as fine-tuning GPT-3.5 proves to be a more cost-effective option than prompting GPT-4. Moreover, the fine-tuning approach offers a solution to challenges related to accessing the latest models or dealing with limited resources, such as a restricted amount of training data.
Comparison of prompting and fine-tuning approaches. In our study, we employed both prompting and fine-tuning approaches to adapt large language models, specifically GPT models, to highlight the desired and undesired parts of trainee responses. Prompting enables rapid model adaptation to highlight the components of effort- and outcome-based praise without extensive retraining, thus conserving computational resources and time. However, since the model parameters are not updated, prompting might not capture deeper insights from the annotated data, potentially limiting performance in highlighting key components of complex responses. For example, consider the tutor response “Great job figuring out that problem! Would you like help with anything else?” Because of the phrase “figuring out”, this response is categorized as effort-based praise; however, GPT-4 without fine-tuning mistakenly classified it as outcome-based, whereas the fine-tuned GPT-3.5 correctly classified it as effort-based. This error likely occurred because the model over-weighted the generic phrase “Great job”. Additionally, while the prompting approach offers flexibility in testing different prompts to quickly gauge the model’s capabilities on our task, its effectiveness heavily depends on the quality of the prompt design. As observed during our prompt engineering phase, inadequate prompts can lead to misleading outputs.
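To illustrate the prompting approach concretely, a span-highlighting request to a GPT model via the OpenAI chat API might look like the sketch below. The system message, model name, temperature, and JSON output format are illustrative assumptions, not the exact prompt used in our study (our prompt design is described in Section 3.3).

# Illustrative prompting call for highlighting praise components.
# The prompt wording, model name, and output schema are examples only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tutor_response = ("Great job figuring out that problem! "
                  "Would you like help with anything else?")

completion = client.chat.completions.create(
    model="gpt-4",
    temperature=0,
    messages=[
        {"role": "system",
         "content": ("You label praise in tutor responses. Quote the exact words that "
                     "form effort-based praise and the exact words that form "
                     "outcome-based praise, returned as JSON with keys "
                     "'effort' and 'outcome'.")},
        {"role": "user", "content": tutor_response},
    ],
)
print(completion.choices[0].message.content)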
On the other hand, fine-tuning allows for deeper model customization by adjusting internal parameters to closely align with our task of identifying the components of praise in tutor responses, often resulting in superior performance as measured by M-IoU scores, as observed in our study. Fine-tuning enables the GPT model to deeply integrate new knowledge and adjust its existing knowledge, better fitting the task requirements of identifying components of effort- and outcome-based praise. Despite these advantages, fine-tuning requires a substantial amount of relevant, high-quality data and significant computational resources. The data must be carefully annotated to guide the model effectively toward the desired behavior, which presents a significant limitation if such data is scarce or difficult to collect. Additionally, fine-tuning involves updating the weights of a neural network based on a specific dataset, a process that can be resource-intensive and requires access to powerful hardware, especially for larger models.
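For readers weighing the fine-tuning route, the sketch below shows the chat-formatted JSONL that OpenAI’s fine-tuning endpoint expects for GPT-3.5 and how a job can be launched. The instruction text, example response, and target labels here are placeholders; the actual fine-tuning input used in our study is documented in Appendix B.

# Sketch of the chat-formatted JSONL expected by OpenAI fine-tuning for
# gpt-3.5-turbo. The instruction and target labels below are placeholders,
# not the study's actual fine-tuning input (see Appendix B).
import json
from openai import OpenAI

examples = [
    {
        "messages": [
            {"role": "system", "content": "Highlight effort- and outcome-based praise in the tutor's response."},
            {"role": "user", "content": "Great job figuring out that problem!"},
            {"role": "assistant", "content": '{"effort": "figuring out that problem", "outcome": ""}'},
        ]
    },
    # ... one entry per annotated tutor response
]

with open("praise_train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

client = OpenAI()
training_file = client.files.create(file=open("praise_train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id, model="gpt-3.5-turbo")
print(job.id)  # the job id can be polled until the fine-tuned model is ready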
To address some of these challenges and further enhance our highlighted feedback system, we are considering the integration of Retrieval-Augmented Generation (RAG). RAG combines the strengths of both retrieval and generation models to improve the performance of language models on specific tasks [35]. RAG could enhance the performance of prompting LLMs by dynamically incorporating relevant external information into responses, providing more informed and contextually accurate outputs (e.g., [20]). Additionally, RAG can be integrated with the fine-tuning approach for providing highlighted feedback, potentially improving the model’s accuracy in highlighting components of praise. This integration aims to create a model that not only leverages external data through RAG but also adapts more finely to specialized tasks through fine-tuning, demonstrating superior performance in contextually rich and dynamic environments.
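One possible shape for such an integration is sketched below: retrieve the annotated tutor responses most similar to a new response and prepend them to the prompt as in-context examples. This is purely a sketch; the embedding model, cosine-similarity retrieval, and example bank are assumptions for illustration, not components of a system we have built.

# Illustrative retrieval step for a RAG-style prompt: fetch the annotated
# tutor responses most similar to the new response so they can be added
# as in-context examples. The embedding model and similarity measure are
# assumptions for this sketch only.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

annotated_bank = [
    "Great job figuring out that problem!",   # effort-based (placeholder annotation)
    "You got the right answer, well done!",   # outcome-based (placeholder annotation)
]
bank_vectors = embed(annotated_bank)

def retrieve_examples(new_response, k=1):
    query = embed([new_response])[0]
    scores = bank_vectors @ query / (
        np.linalg.norm(bank_vectors, axis=1) * np.linalg.norm(query))
    top = np.argsort(scores)[::-1][:k]
    return [annotated_bank[i] for i in top]

print(retrieve_examples("Nice work sticking with that tricky question!"))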
This paper is available on arxiv under CC BY 4.0 DEED license.
Authors:
(1) Jionghao Lin, Carnegie Mellon University (jionghal@cs.cmu.edu);
(2) Eason Chen, Carnegie Mellon University (easonc13@cmu.edu);
(3) Zeifei Han, University of Toronto (feifei.han@mail.utoronto.ca);
(4) Ashish Gurung, Carnegie Mellon University (agurung@andrew.cmu.edu);
(5) Danielle R. Thomas, Carnegie Mellon University (drthomas@cmu.edu);
(6) Wei Tan, Monash University (wei.tan2@monash.edu);
(7) Ngoc Dang Nguyen, Monash University (dan.nguyen2@monash.edu);
(8) Kenneth R. Koedinger, Carnegie Mellon University (koedinger@cmu.edu).