Enhancing Automated Multi-Dimensional Analysis of Peer Feedback in Middle School Mathematics With LLM-Based Embeddings

Incorporating advanced large language model embeddings to improve predictive modeling of process-focused peer feedback and support scalable, real-time instructional interventions.
Published

December 12, 2024

Abstract

Peer feedback in mathematics can reinforce problem-solving skills, but ensuring high-quality, process-focused comments at scale remains challenging. This study enhances an existing framework for detecting process commentary in middle school math by integrating large language model (LLM) embeddings.

Using the same dataset, architecture, and cross-validation approach as prior work, I compare the performance of LLM embeddings against the earlier sentence encodings. The LLM-based model achieves a notably higher average AUC ROC and generalizes well to unseen students when identifying process-oriented feedback.

These findings demonstrate the potential of advanced language representations to capture nuanced indicators of effective peer review. By enabling more precise, automated feedback analysis, this work informs the development of educational tools that can offer targeted, real-time support to enhance students’ mathematical reasoning.

Keywords

peer feedback, process-focused comments, neural networks, large language models, educational data mining

Introduction

Peer review is a powerful tool in mathematics education that encourages students to engage critically with problem-solving strategies, not just final answers. Substantial research has shown that high-quality, process-oriented feedback can deepen conceptual understanding and enhance mathematical reasoning skills (Uesato et al. 2022; Nicol and Macfarlane-Dick 2006). However, facilitating such feedback effectively at a large scale presents significant challenges. Manually evaluating peer comments for quality is time-intensive, while automated approaches have often struggled to identify substantive, process-focused characteristics reliably.

Recent natural language processing (NLP) advancements offer promising avenues to address these issues. Prior work has leveraged machine learning techniques, such as sentence embeddings and neural networks, to classify peer feedback along dimensions including correctness, specificity, and process orientation. These efforts have achieved noteworthy improvements in accuracy and efficiency compared to manual methods (Zhang et al. 2023). However, there remains significant room for further refinement, particularly in capturing the nuanced linguistic patterns associated with high-quality, process-level commentary.

The rapid development of large language models (LLMs) presents an exciting opportunity in this regard. LLMs, which are trained on massive and diverse text corpora, have demonstrated remarkable capabilities in representing complex semantic relationships and generating contextually relevant embeddings (Radford et al. 2019). By encoding text in a high-dimensional space, LLM embeddings can potentially capture subtle indicators of effective feedback that previous methods may overlook. Integrating such advanced language representations into existing predictive frameworks thus offers a promising path to enhance the precision and robustness of automated peer review analysis.

This study aims to investigate the impact of incorporating LLM embeddings into a proven classification model for detecting process-focused feedback. Building upon the methodology established in Zhang et al. (2023), I preserve the core neural network architecture, cross-validation scheme, and construct operationalization to isolate the effect of the embedding approach. By systematically comparing the performance of LLM-based embeddings against prior sentence-level encodings, this work seeks to quantify the benefits of more sophisticated language modeling in the context of educational peer feedback.

Furthermore, I evaluate the model’s ability to generalize to completely unseen student populations. Demonstrating strong transferability is crucial for practical applications, as it suggests the model is learning meaningful linguistic patterns rather than overfitting to specific student characteristics. Improved generalization would support the development of broadly applicable tools that could provide real-time, adaptive feedback to enhance students’ learning experiences across diverse contexts.

Ultimately, this research advances the state-of-the-art in automated analysis of peer review, laying the groundwork for scalable, data-driven support systems in mathematics education. By harnessing the power of LLMs to identify effective process-oriented feedback, this work informs the design of educational technologies that can offer targeted, timely interventions to foster deeper mathematical understanding. More broadly, it contributes to ongoing efforts in leveraging artificial intelligence (AI) to enhance formative assessment and personalize learning at scale.

Methods

This study extends an established predictive framework for detecting Commenting on the Process (CP) in middle school mathematics peer feedback. The key innovation is the integration of large language model (LLM) embeddings to capture nuanced linguistic patterns associated with CP.

Overview of Approach

The methodology follows these main steps:

  1. Data Preparation: Load a previously annotated dataset of peer comments, preserving the original structure and CP labels.

  2. Embedding Generation: Feed each comment through an LLM to obtain high-dimensional vector representations that encode semantic relationships.

  3. Model Architecture: Employ the same neural network design as prior work, adjusted only to accommodate the dimensionality of LLM embeddings.

  4. Cross-Validation: Implement a student-level cross-validation scheme, ensuring that no student’s data appears in both training and test sets for any given fold.

  5. Model Training & Evaluation: Train the model on the training set, validate on a held-out portion, and evaluate on the corresponding test set using the Area Under the Receiver Operating Characteristic curve (AUC ROC) as the primary performance metric.

The following subsections provide more detailed information on dataset characteristics, embedding techniques, model specification, and evaluation procedures.

Data and Annotation

The dataset consists of peer comments on mathematics problems submitted by middle school students through an online learning platform. Each comment was manually annotated for the presence (1) or absence (0) of CP. CP is operationalized as feedback that addresses the problem-solving process, such as discussing strategies, identifying misconceptions, or suggesting alternative approaches, rather than solely evaluating the final answer.
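
To make the annotation scheme concrete, the hypothetical rows below illustrate the structure of the data. The column names (annotation_text, comment_process, created_by) match those used in the implementation code later; the comment texts themselves are invented examples, not actual dataset entries.

import pandas as pd

# Hypothetical illustration of the annotated data structure; the comments
# below are invented examples, not rows from the actual dataset.
example = pd.DataFrame({
    "created_by": ["s01", "s02", "s03"],
    "annotation_text": [
        "You set up the proportion correctly, but check how you isolated x in step 2.",  # discusses strategy -> CP = 1
        "Great job, the answer is right!",                                               # evaluates answer only -> CP = 0
        "Try drawing a diagram before writing the equation next time.",                  # suggests alternative approach -> CP = 1
    ],
    "comment_process": [1, 0, 1],
})
print(example)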

LLM Embeddings

Unlike previous studies that utilized sentence-level encodings such as the Universal Sentence Encoder, this work leverages OpenAI’s text-embedding-3-small model to generate comment embeddings. LLMs can learn more contextually rich representations by training on massive, diverse text corpora. The text-embedding-3-small model produces a high-dimensional vector for each comment, capturing latent semantic features that may be indicative of CP. I implemented an exponential backoff strategy to handle potential rate limits during embedding generation.
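
As a quick sanity check before wiring the embeddings into the network, a single call can confirm the vector dimensionality that the input layer must match. This is a minimal sketch, assuming an OPENAI_API_KEY is available in the environment; text-embedding-3-small returns 1536-dimensional vectors by default.

import os
from openai import OpenAI

# Minimal sketch: confirm the embedding dimensionality the network's
# input layer must accommodate. Assumes OPENAI_API_KEY is set.
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

response = client.embeddings.create(
    input=["Nice work showing each step of your solution."],
    model="text-embedding-3-small",
)
print(len(response.data[0].embedding))  # 1536 by default for this model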

Model Specification

The predictive model architecture remains consistent with prior work so that any change in performance can be attributed to the LLM embeddings. The core structure is a feedforward neural network with an input layer (adjusted to match the LLM embedding dimensions), two hidden layers with ReLU activation, and a sigmoid output layer for binary classification. The model is trained using binary cross-entropy loss and the Adam optimizer, with hyperparameters following the original study.

Evaluation

Model performance is assessed using AUC ROC, a threshold-agnostic metric that captures the trade-off between true and false positive rates. A five-fold student-level cross-validation scheme is employed, so any given student’s comments are restricted to either the training or testing set within each fold. This grouping strategy, consistent with the original methodology, allows evaluation of the model’s generalization to new students, not just new comments. Comparing AUC ROC scores against prior baselines quantifies the impact of integrating LLM embeddings.

Implementation Details

Below is a detailed description of the implementation steps taken to build and evaluate a predictive model for Commenting on the Process (CP), including representative code snippets. This implementation adheres closely to the methodological framework and neural network architecture described in the original study while introducing large language model (LLM) embeddings to enhance feature representations.

Data Preparation

The initial step involves loading the dataset, which contains comments annotated with the presence or absence of the CP construct. Each data row includes the text of the student’s comment, the student’s unique identifier, and a binary label for whether the comment contains process-focused feedback. I also ensure that grouping students into folds is consistent with the original experimental design, preventing any student’s work from appearing in both training and test sets.

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import tensorflow as tf
tf.get_logger().setLevel('ERROR')

import time
from dotenv import load_dotenv
from openai import OpenAI, RateLimitError
import pandas as pd

# Load environment variables, including API keys for LLM access
status = load_dotenv()

# Initialize the OpenAI client
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

# Load the dataset
df = pd.read_csv('data/Annotations_final.csv')
X_text = df['annotation_text'].tolist()
y = df['comment_process']  # Binary labels indicating presence of CP

In the code above, df is the complete data frame containing both the annotation text and CP labels. The variable y is a Pandas Series containing the target variable (CP presence).
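
Before modeling, it can also be worth checking the class balance of the CP label, since a heavily skewed distribution would affect how the AUC ROC scores should be interpreted. A quick sketch (no particular distribution is assumed here):

# Quick sketch: inspect dataset size, label balance, and student count.
print(df.shape)                                      # (number of comments, number of columns)
print(y.value_counts(normalize=True))                # proportion of CP = 1 vs. CP = 0
print(df['created_by'].nunique(), "unique students")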

Grouping and Cross-Validation Setup

Following the original paper’s methodology, I used a student-level five-fold cross-validation with GroupKFold. Each student is assigned to precisely one fold, ensuring that no student’s comments appear in both training and test data. This approach tests the model’s ability to generalize to completely new students.

import numpy as np
from sklearn.model_selection import GroupKFold

group_dict = {}
groups = np.array([])

for index, row in df.iterrows():
    s_id = row['created_by']  # Unique identifier for the student who created the Thinklet
    if s_id not in group_dict:
        group_dict[s_id] = len(group_dict)
    groups = np.append(groups, group_dict[s_id])

groups = groups.astype(int)

gkf = GroupKFold(n_splits=5)

Here, I constructed a dictionary mapping each unique student ID to a group index. The resulting groups array associates each comment with its student group, which is then passed to GroupKFold.
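
An equivalent, more compact way to build the same group indices is pandas' factorize. The sketch below also verifies that the resulting folds keep each student on only one side of the split; these are optional checks, not part of the original pipeline.

# Equivalent grouping via pandas: codes are assigned in order of first appearance,
# matching the dictionary-based construction above.
groups_alt = pd.factorize(df['created_by'])[0]
assert (groups_alt == groups).all()

# Optional check: no student group appears in both train and test of any fold.
for train_index, test_index in gkf.split(df, y, groups=groups):
    assert set(groups[train_index]).isdisjoint(set(groups[test_index]))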

LLM-Based Embeddings

Unlike the original study, which relied on pre-trained sentence encoders such as the Universal Sentence Encoder, I used OpenAI's text-embedding-3-small model to generate LLM-based embeddings. Each comment is passed to the embedding endpoint, producing a vector representation that captures nuanced semantic information.

I implemented exponential backoff to handle potential rate limits when calling the LLM API:

def get_embedding_with_backoff(text, model="text-embedding-3-small", max_retries=5):
    for attempt in range(max_retries):
        try:
            response = client.embeddings.create(input=[text], model=model)
            return response.data[0].embedding
        except RateLimitError:
            if attempt < max_retries - 1:
                sleep_time = 2 ** attempt
                print(f"Rate limit exceeded. Retrying in {sleep_time} seconds ... ")
                time.sleep(sleep_time)
            else:
                raise

X_embeddings = np.array([get_embedding_with_backoff(comment) for comment in X_text])

Here, each comment in X_text is embedded into a numerical vector. The result, X_embeddings, is a NumPy array in which each row corresponds to the embedding of a single comment.
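
Because embedding every comment requires one API call per row, it can be convenient to cache the resulting matrix locally so that repeated runs (for example, re-running cross-validation) do not re-query the API. A minimal sketch, using a hypothetical cache path:

CACHE_PATH = 'data/embeddings_cache.npy'  # hypothetical cache location

if os.path.exists(CACHE_PATH):
    # Reuse previously generated embeddings.
    X_embeddings = np.load(CACHE_PATH)
else:
    # Generate embeddings once and store them for later runs.
    X_embeddings = np.array([get_embedding_with_backoff(comment) for comment in X_text])
    np.save(CACHE_PATH, X_embeddings)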

Neural Network Architecture

I preserved the general neural network architecture, training regime, and hyperparameters to maintain comparability with the original study. The only modification is adjusting the input layer's dimensions to match the LLM embedding size. The network consists of an input layer, two hidden layers with ReLU activations, and a final sigmoid layer for binary classification.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

def create_neural_network(input_dim):
    model = Sequential()
    # Input layer matches the size of the embedding dimension
    model.add(Input(shape=(input_dim,)))
    model.add(Dense(12, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

This function returns a compiled Keras model that is ready for training.
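
As a quick usage check, the model can be instantiated with the embedding dimensionality produced above (1536 by default for text-embedding-3-small) and inspected with summary():

# Usage sketch: build the network for the LLM embedding size and inspect it.
example_model = create_neural_network(input_dim=X_embeddings.shape[1])
example_model.summary()  # Dense(12) -> Dense(8) -> Dense(1, sigmoid)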

Model Training and Evaluation

I iterated through each fold of the GroupKFold cross-validation. For each fold, I split the data into training and test sets. Then, I trained the neural network on the training folds, validated performance on a held-out portion of that training set (validation split), and finally evaluated the model on the test fold. Performance was recorded using the AUC ROC metric.

from sklearn.metrics import roc_auc_score

roc_auc_scores = []

for train_index, test_index in gkf.split(X_embeddings, y, groups=groups):
    # Split embeddings and labels
    X_train, X_test = X_embeddings[train_index], X_embeddings[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # Create and train the model
    model = create_neural_network(input_dim=X_embeddings.shape[1])
    model.fit(
        X_train,
        y_train,
        epochs=30,        # as in the original study
        batch_size=10,    # as in the original study
        validation_split=0.1,
        shuffle=True,
        verbose=0
    )

    # Predict on the test set
    predictions = model.predict(X_test, verbose=0)
    roc_auc = roc_auc_score(y_test, predictions)
    roc_auc_scores.append(roc_auc)

# Report overall performance
print("Average ROC AUC Score:", np.mean(roc_auc_scores))
print("Standard Deviation:", np.std(roc_auc_scores))
print("Maximum ROC AUC Score:", np.max(roc_auc_scores))
Average ROC AUC Score: 0.9352577417708996
Standard Deviation: 0.0411599012839265
Maximum ROC AUC Score: 0.9813519813519813

Results

Table 1: Comparison of AUC ROC Metrics, Original vs. New Results

Metric               Original Results   New Results
Average AUC ROC      0.899              0.935
Standard Deviation   0.032              0.041
Maximum AUC ROC      Not reported       0.981

Integrating LLM embeddings led to notable improvements. Average AUC ROC increased from approximately 0.899 in earlier work to about 0.935 with LLM embeddings, with some folds reaching 0.981, indicating that advanced embeddings more accurately distinguish process-focused feedback from other comment types. Fold-to-fold variability remained comparable to the original model (standard deviation of 0.041 versus 0.032), so the gain in average performance did not come at the cost of stability.

Importantly, the model generalized well to new student data. This finding implies that the embedding-based model is not simply memorizing student idiosyncrasies but learning transferable linguistic features associated with CP. The resulting model could support large-scale implementations, identifying high-quality, process-oriented feedback in real time.

Discussion

The results highlight the potential of LLM-based embeddings to enhance automated feedback analytics in educational settings. By capturing subtle semantic patterns, these embeddings enable more accurate identification of CP attributes and facilitate timely, targeted interventions. Teachers can use these insights to recognize when students engage in meaningful, process-level thinking, while platform developers can design adaptive features that prompt deeper reflection. Policymakers and curriculum specialists might leverage these tools to inform professional development and improve peer review guidelines. Still, several limitations warrant further exploration.

Limitations

Reliance on proprietary LLMs may raise cost, access, and interpretability issues. Additionally, while my results show strong performance within a middle school mathematics context, it remains unclear how well these methods transfer to other subjects, age groups, or types of feedback. Future work should explore these dimensions, assess the interpretability of LLM embeddings in educational contexts, and test different architectures or training regimes to boost performance and generalizability further.

Conclusion

This study demonstrates the significant potential of integrating large language model (LLM) embeddings into automated peer feedback analysis in mathematics education. By enhancing an established predictive framework with LLM-based representations, I substantially improved accuracy and generalizability for detecting process-focused commentary (CP).

The results highlight the power of advanced language models to capture nuanced linguistic patterns indicative of effective feedback. LLM embeddings outperformed prior sentence encodings in correctly identifying CP, suggesting their ability to learn more contextually rich features from limited data. Importantly, these gains were not merely a result of overfitting to specific student characteristics; the model’s strong performance on completely unseen students underscores its potential for broad, reliable application in real-world settings.

These findings offer promising avenues for enhancing formative assessment and personalized learning at scale. By enabling more precise, automated identification of high-quality feedback, this work lays the foundation for educational technologies that can offer immediate, targeted support to foster students’ mathematical reasoning skills. Such tools could help educators efficiently recognize and reinforce effective peer review practices, tailor instruction to individual needs, and promote richer classroom discussions around problem-solving processes.

More broadly, this study contributes to the growing body of research on AI-augmented education. It demonstrates the value of leveraging state-of-the-art NLP techniques, particularly LLMs, to tackle complex challenges in learning analytics. The approach presented here could potentially be extended to other domains and feedback dimensions, opening up new possibilities for data-driven support across diverse educational contexts.

However, it is important to acknowledge this work’s limitations and potential future directions. The reliance on a proprietary LLM may raise questions of cost, transparency, and reproducibility. Additionally, while the results are promising within the scope of middle school mathematics, further research is needed to validate the transferability of these methods to other subject areas, age groups, and feedback types. Investigating the interpretability of LLM embeddings in educational settings and exploring alternative model architectures are also important areas for future study.

Nonetheless, this research represents a significant step forward in the development of scalable, AI-powered tools to support effective peer learning. By harnessing the power of language models to identify and amplify high-quality feedback, we can create more responsive, adaptive educational environments that foster deeper engagement and understanding. Ultimately, this work contributes to the broader vision of leveraging AI to enhance education equity and outcomes, empowering all students to reach their full potential as mathematical thinkers and problem-solvers.

Submission Guidelines

This document includes all required explanations. The code and data are organized to facilitate replication and further analysis. Please let me know if additional information is needed.


References

Anthropic. 2024. “Claude 3 Opus.” Large language model. https://claude.ai/.
Kapur, Manu. 2010. “Productive Failure in Mathematical Problem Solving.” Instructional Science 38: 523–50.
Nicol, David J, and Debra Macfarlane-Dick. 2006. “Formative Assessment and Self-Regulated Learning: A Model and Seven Principles of Good Feedback Practice.” Studies in Higher Education 31 (2): 199–218.
OpenAI. 2024a. “GPT-4o.” Large language model. https://chatgpt.com/.
———. 2024b. “O1.” Large language model. https://chatgpt.com/.
Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. “Language Models Are Unsupervised Multitask Learners.” OpenAI Blog 1 (8): 9.
Uesato, Jonathan, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. 2022. “Solving Math Word Problems with Process-and Outcome-Based Feedback.” arXiv Preprint arXiv:2211.14275.
Zhang, Jiayi, Ryan S Baker, JM Andres, Stephen Hutt, and Sheela Sethuraman. 2023. “Automated Multi-Dimensional Analysis of Peer Feedback in Middle School Mathematics.” In Proceedings of the 16th International Conference on Computer-Supported Collaborative Learning (CSCL 2023), 221–24. International Society of the Learning Sciences.