GPT Models for Sequence Labeling: Prompt Engineering & Fine-tuning

28 May 2025

Abstract and 1 Introduction

2. Background

2.1 Effective Tutoring Practice

2.2 Feedback for Tutor Training

2.3 Sequence Labeling for Feedback Generation

2.4 Large Language Models in Education

3. Method

3.1 Dataset and 3.2 Sequence Labeling

3.3 GPT Facilitated Sequence Labeling

3.4 Metrics

4. Results

4.1 Results on RQ1

4.2 Results on RQ2

5. Discussion

6. Limitation and Future Works

7. Conclusion

8. Acknowledgments

9. References

APPENDIX

A. Lesson Principles

B. Input for Fine-Tuning GPT-3.5

C. Scatter Matrix of the Correlation on the Outcome-based Praise

D. Detailed Results of Fine-Tuned GPT-3.5 Model's Performance

3.3 GPT Facilitated Sequence Labeling

As discussed, our study employed two widely used approaches for adapting GPT models to sequence labeling tasks: prompt engineering and fine-tuning. Each method offers unique advantages and impacts the process of creating automated explanatory feedback in different ways.

3.3.1 Prompt engineering for identifying praise components

To answer RQ1, we conducted prompt engineering, designing prompting strategies that enable GPT models to identify the praise components within tutor responses. Prompt engineering involves designing and structuring input prompts to guide the GPT model toward generating desired outputs [42, 56]. The art of prompt engineering lies in crafting prompts that effectively communicate the context and requirements of the task to the model [42, 56]. In our study, given the presence of tutor responses exemplifying both effort-based and outcome-based praise in the tutor training lesson, we employed a two-shot prompting strategy to guide the GPT-3.5 (gpt-3.5-turbo-0125) and GPT-4 (gpt-4-0125-preview) models to highlight praise components within tutor responses. Our prompt is shown in Table 1. The following explains our prompt design, aimed at extracting specific elements from tutor responses related to praising student effort and outcomes.

• {Lesson Principle}: This segment provides the guiding principles for desired tutor responses. It includes key aspects of effective praise in educational settings, such as sincerity, specificity, immediacy, authenticity, and focus on the learning process. This principle acts as a reference for evaluating the tutor responses and is detailed in Appendix A.

• {Tutor Response}: This part simulates an interactive environment in which the model identifies the praise components from the input tutor responses.

Table 1: Prompt for identifying praise from tutor responses
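The two-shot setup described above can be sketched with the OpenAI chat-completions API. The lesson-principle text, the two in-context examples, and the tag format below are illustrative placeholders, not the authors' exact prompt from Table 1:

```python
# Sketch of two-shot prompting for praise-component labeling.
# LESSON_PRINCIPLE and FEW_SHOT_EXAMPLES are stand-ins for the real prompt.

LESSON_PRINCIPLE = (  # placeholder summary of Appendix A
    "Effective praise is sincere, specific, immediate, authentic, "
    "and focused on the learning process."
)

# Two in-context examples: one effort-based and one outcome-based praise.
FEW_SHOT_EXAMPLES = [
    ("You worked really hard on that problem.",
     "<effort>worked really hard</effort>"),
    ("Great job, you got the right answer!",
     "<outcome>got the right answer</outcome>"),
]

def build_messages(tutor_response: str) -> list[dict]:
    """Assemble the chat messages for one labeling request."""
    messages = [{
        "role": "system",
        "content": f"Lesson principle: {LESSON_PRINCIPLE}\n"
                   "Highlight effort-based and outcome-based praise in the "
                   "tutor response using the tags shown in the examples.",
    }]
    for response, labeled in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": response})
        messages.append({"role": "assistant", "content": labeled})
    messages.append({"role": "user", "content": tutor_response})
    return messages

def label_praise(tutor_response: str,
                 model: str = "gpt-3.5-turbo-0125") -> str:
    """Send one tutor response to the model and return the tagged text."""
    from openai import OpenAI  # requires the `openai` package and an API key
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    completion = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output for a labeling task
        messages=build_messages(tutor_response),
    )
    return completion.choices[0].message.content
```

Setting `temperature=0` is a common choice for labeling tasks, where the goal is a reproducible span annotation rather than creative text.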

3.3.2 Fine-tuned GPT models for identifying praise components

Given limited access to fine-tuning capabilities for the GPT-4 model, we focused on optimizing the use of GPT-3.5 (gpt-3.5-turbo-1106) to answer RQ2, particularly within the constraints of a modestly sized training dataset. The fine-tuning approach trained the GPT-3.5 model to recognize and understand the patterns associated with identifying praise components in tutor responses. To prepare our data for fine-tuning, we converted tutor responses and their associated tags into JSON format, which provided a structured representation of our data mirroring the input style typically expected by the GPT model. The structure of our input data closely resembled the prompts used with the GPT models, with a key distinction: instead of prompting the model to generate text containing praise, we supplied it with annotated outcome- and effort-based praise. Due to the page limit, and to avoid repetitive content in the paper, we provide the details of the fine-tuning input in Appendix B.
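The conversion step can be sketched as follows: each annotated response becomes one chat-format record in a JSONL file, the format the OpenAI fine-tuning endpoint expects. The system instruction and the tag format are assumptions here, not the authors' exact schema (which is given in Appendix B):

```python
# Sketch: annotated tutor responses -> JSONL records for fine-tuning.
# The tag format (<effort>/<outcome>) and system text are illustrative.
import json

def to_finetune_record(response: str, labeled: str) -> dict:
    """One training example: raw tutor response in, tagged version out."""
    return {
        "messages": [
            {"role": "system",
             "content": "Highlight effort-based and outcome-based praise."},
            {"role": "user", "content": response},
            {"role": "assistant", "content": labeled},
        ]
    }

def write_jsonl(examples: list[tuple[str, str]], path: str) -> None:
    """Write one JSON object per line, as the fine-tuning API expects."""
    with open(path, "w", encoding="utf-8") as f:
        for response, labeled in examples:
            f.write(json.dumps(to_finetune_record(response, labeled)) + "\n")
```

The key reversal the paper describes is visible in the record layout: the annotated praise goes in the `assistant` message (the target the model learns to produce), rather than being requested from the model.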

Our approach aimed to investigate the extent to which a fine-tuned model can accurately classify and label praise components with a limited training dataset, thereby enhancing its performance on our task. To do so, we first divided our dataset evenly, allocating 50% (65 responses) for training and the remaining 50% (64 responses) for testing. The distribution of annotations is shown in Table 2, where O is the majority tag in our dataset. We then subdivided our training set into five distinct partitions of 13, 26, 39, 52, and 65 responses, representing 10%, 20%, 30%, 40%, and 50% of our original dataset, respectively. For each partition, the training process was repeated five times using different random seeds. This stratified approach allowed us to simulate different training conditions, enabling a comprehensive analysis of the model's adaptability and learning efficiency as the amount of available training data varied.
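The splitting scheme above can be sketched in a few lines: a 50/50 train/test split, then training subsets of 13, 26, 39, 52, and 65 responses, each sampled five times with different seeds. The stand-in dataset, the shuffle seed, and the five subset seeds are all assumptions for illustration:

```python
# Sketch of the partitioning scheme: 50/50 split, then five training-set
# sizes x five random seeds = 25 training runs. Dataset here is a stand-in.
import random

PARTITION_SIZES = [13, 26, 39, 52, 65]   # 10%..50% of the 129 responses
SEEDS = [0, 1, 2, 3, 4]                  # five repetitions per size

def make_partitions(train_set: list) -> dict:
    """For each (size, seed), draw a training subset without replacement."""
    runs = {}
    for size in PARTITION_SIZES:
        for seed in SEEDS:
            rng = random.Random(seed)
            runs[(size, seed)] = rng.sample(train_set, size)
    return runs

# Stand-in dataset of 129 tutor responses, split 65 train / 64 test.
dataset = [f"response_{i}" for i in range(129)]
random.Random(42).shuffle(dataset)  # assumed shuffle seed
train, test = dataset[:65], dataset[65:]
runs = make_partitions(train)
```

Fixing the seeds up front is what makes the five repetitions per partition size reproducible, so that variation across runs reflects the sampled subset rather than the training pipeline.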

Table 2: Distribution of token labels.

This paper is available on arxiv under CC BY 4.0 DEED license.

Authors:

(1) Jionghao Lin, Carnegie Mellon University (jionghal@cs.cmu.edu);

(2) Eason Chen, Carnegie Mellon University (easonc13@cmu.edu);

(3) Zifei Han, University of Toronto (feifei.han@mail.utoronto.ca);

(4) Ashish Gurung, Carnegie Mellon University (agurung@andrew.cmu.edu);

(5) Danielle R. Thomas, Carnegie Mellon University (drthomas@cmu.edu);

(6) Wei Tan, Monash University (wei.tan2@monash.edu);

(7) Ngoc Dang Nguyen, Monash University (dan.nguyen2@monash.edu);

(8) Kenneth R. Koedinger, Carnegie Mellon University (koedinger@cmu.edu).