Update chapter 6 with latest version of article

Charlotte Van Petegem 2024-03-06 15:11:52 +01:00
parent 9816406d66
commit c2788663f9
9 changed files with 266 additions and 191 deletions

book.org

@@ -45,6 +45,7 @@
#+LATEX_HEADER: Prof.\ Dr.\ Bram De Wever
#+LATEX_HEADER: }
#+LATEX_HEADER: \addtokomafont{caption}{\small}
#+LATEX_HEADER: \counterwithout{footnote}{chapter}
#+LATEX_HEADER: \setuptoc{toc}{numbered}
#+LATEX_HEADER: \addto\captionsenglish{\renewcommand{\contentsname}{Table of Contents}}
#+OPTIONS: ':t
@@ -2916,47 +2917,45 @@ This section is based on an article that is currently being prepared for submiss
:END:
Feedback is a key factor in student learning\nbsp{}[cite:@hattiePowerFeedback2007; @blackAssessmentClassroomLearning1998].
In programming education, many steps have been taken to give feedback using automated assessment systems\nbsp{}[cite:@paivaAutomatedAssessmentComputer2022; @ihantolaReviewRecentSystems2010; @ala-mutkaSurveyAutomatedAssessment2005].
These automated assessment systems give feedback on correctness, and can give some feedback on style and best practices through the use of linters.
They are, however, generally not able to give the same high-level feedback on program design that an experienced programmer can give.
In many educational practices, automated assessment is therefore supplemented with manual feedback, especially when grading evaluations or exams\nbsp{}[cite:@debuseEducatorsPerceptionsAutomated2008].
This requires a significant time investment from teachers\nbsp{}[cite:@tuckFeedbackgivingSocialPractice2012].
As a result, many researchers have explored the use of AI to enhance the process of giving feedback.
[cite/t:@vittoriniAIBasedSystemFormative2021] used natural language processing to automate grading, and found that students who used the system during the semester were more likely to pass the course at the end of the semester.
[cite/t:@leeSupportingStudentsGeneration2023] used supervised learning with ensemble methods to enable students to perform peer and self-assessment.
In addition, [cite/t:@berniusMachineLearningBased2022] introduced a framework based on clustering text segments in textual exercises to reduce the grading workload.
The context of our work is in our own assessment system, called Dodona, developed at Ghent University\nbsp{}[cite:@vanpetegemDodonaLearnCode2023].
Dodona gives automated feedback on each submission, but also has a module that allows teachers to give manual feedback on student submissions and assign scores to them, from within the platform.
The process of giving feedback on a solution to a programming exercise in Dodona is very similar to a code review, where errors or suggestions for improvements are annotated on the relevant line(s), as can be seen in Figure\nbsp{}[[fig:feedbackintroductionreview]].
In 2023, there were 3\thinsp{}663\thinsp{}749 submissions on our platform, of which 44\thinsp{}012 were assessed manually.
During these assessments, 22\thinsp{}888 annotations were added.
#+CAPTION: Manual assessment of a submission: a teacher gives feedback on the code by adding inline annotations and scores the submission by filling out the exercise-specific scoring rubric.
#+CAPTION: The teacher just searched for an annotation so that they could reuse it.
#+CAPTION: Automated assessment was already performed beforehand, where 22 test cases failed, as can be seen from the badge on the "Correctness" tab.
#+CAPTION: An automated annotation left by PyLint can be seen on line 22.
#+NAME: fig:feedbackintroductionreview
[[./images/feedbackintroductionreview.png]]
However, there is a crucial difference between traditional code reviews and those in an educational context: teachers often give feedback on numerous solutions to the same exercise.
Since students often make similar mistakes, it logically follows that teachers will repeatedly give the same feedback on multiple student submissions.
In response to this repetitive nature of feedback, Dodona has implemented a feature that allows teachers to save and later retrieve specific annotations.
This feature facilitates the reuse of feedback by allowing teachers to search for previously saved annotations.
In 2023, 777 annotations were saved by teachers on Dodona, and there were 7\thinsp{}180 instances of reuse of these annotations.
By using this functionality, we have generated the data we use in this study: code submissions annotated on specific lines with annotations that are shared across submissions.
Note that there are two concepts here whose distinction is important.
On the one hand, we have /annotations/: the reusable pieces of feedback text that reviewers save and can search for.
On the other hand, we have /instances/ of an annotation: each instance links that annotation to a specific line of a specific submission.
In this section we answer the following research question: In the context of manually assessing code written by students during an evaluation, can we use previously added annotations to predict what annotation a reviewer will add on a particular line?
We present a machine learning method for suggesting reuse of previously given feedback.
We begin with a detailed explanation of the design of the method.
We then present and discuss the experimental results we obtained by testing our method on student submissions.
The dataset we use is based on real (Python) code written by students for exams.
First, we test our method by predicting automated PyLint annotations.
Second, we use manual annotations left by human reviewers during assessment.
*** Methodology
:PROPERTIES:
@@ -2964,54 +2963,56 @@ For the second dataset we use actual annotations left by graders during the grad
:CUSTOM_ID: subsec:feedbackpredictionmethodology
:END:
The approach we present for predicting what feedback a reviewer will give on source code is based on mining patterns from trees.
This is a data mining technique for extracting frequently occurring patterns from data that can be represented as trees.
It was developed in the early 2000s\nbsp{}[cite:@zakiEfficientlyMiningFrequent2005; @asaiEfficientSubstructureDiscovery2004].
Program code can be represented as an abstract syntax tree (AST), where the nodes of the tree represent the language constructs used in the program.
Recent work has used this fact to investigate how these pattern mining algorithms can be used to efficiently find frequent patterns in source code\nbsp{}[cite:@phamMiningPatternsSource2019].
In an educational context, these techniques could then be used, for example, to find patterns common to solutions that failed a given exercise\nbsp{}[cite:@mensGoodBadUgly2021].
Other work has looked at automatically generating unit tests from mined patterns\nbsp{}[cite:@lienard2023extracting].
We start with a general overview of our method (Figure\nbsp{}[[fig:feedbackmethodoverview]]).
The first step is to use the tree-sitter library\nbsp{}[cite:@brunsfeldTreesitterTreesitterV02024] to generate ASTs for each submission.
Using tree-sitter should make our method independent of the programming language used, since it is a generic interface for generating syntax trees.
For each instance of an annotation, a constrained AST context surrounding the annotated line is extracted.
We then aggregate the subtrees of all instances of each annotation.
The collection of subtrees for each annotation is processed by the =TreeminerD= algorithm\nbsp{}[cite:@zakiEfficientlyMiningFrequent2005], yielding a set of frequently occurring patterns specific to that annotation.
We assign weights to these patterns based on their length and their frequency across the entire dataset of patterns for all annotations.
The result of these operations is our trained model.
The model can now be used to suggest matching patterns and thus annotations for a given code fragment.
In practice, the reviewer first selects a line of code in a given student's submission.
Next, the AST of the selected line and its surrounding context is generated.
For each annotation, each of its patterns is matched to the line, and a similarity score is calculated, given the previously determined weights.
This similarity score is used to rank the annotations and this ranking is displayed to the teacher.
A detailed explanation of this process follows, with a particular emphasis on operational efficiency.
Speed is a paramount concern throughout the model's lifecycle, from training to deployment in real-time reviewing contexts.
Given the continuous generation of training data during the reviewing process, the model's training time must be optimized to avoid significant delays, ensuring that the model remains practical for live reviewing situations.
#+CAPTION: Overview of our machine learning method for predicting annotation reuse.
#+CAPTION: Code of previously reviewed submissions is converted to its abstract syntax tree (AST) form.
#+CAPTION: Instances of the same annotation have the same colour.
#+CAPTION: For each annotation, the context of each instance is extracted and mined for patterns using the =TreeminerD= algorithm.
#+CAPTION: These patterns are then weighted to form our model.
#+CAPTION: When a teacher wants to place an annotation on a line of the submissions they are currently reviewing, all previously given annotations are ranked based on the similarity determined for that line.
#+CAPTION: The teacher can then choose which annotation they want to place.
#+NAME: fig:feedbackmethodoverview
[[./diagrams/feedbackmethodoverview.svg]]
**** Extract subtree around a line
:PROPERTIES:
:CREATED: [2024-01-19 Fri 15:44]
:CUSTOM_ID: subsubsec:feedbackpredictionsubtree
:END:
Currently, the context around a line is extracted by taking all AST nodes from that line.
For example, Figure\nbsp{}[[fig:feedbacksubtree]] shows the subtree extracted for the code on line 3 of Listing\nbsp{}[[lst:feedbacksubtreesample]].
Note that the context we extract here is very limited.
Previous iterations considered all nodes that contained the relevant line (e.g. the function node for a line in a function), but these contexts proved too large to process in an acceptable time frame.
#+ATTR_LATEX: :float t
#+CAPTION: Example code that simply adds a number to the ASCII value of a character and converts it back to a character.
#+NAME: lst:feedbacksubtreesample
#+BEGIN_SRC python
def jump(alpha, n):
@@ -3024,7 +3025,7 @@ def jump(alpha, n):
#+NAME: fig:feedbacksubtree
[[./diagrams/feedbacksubtree.svg]]
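As an illustration of this step, the sketch below collects the types of the AST nodes that lie entirely on a given line using the tree-sitter Python bindings.
It is a simplified stand-in for our extraction step rather than the actual implementation: it returns a flat list of node types instead of the subtree itself, and the exact binding API (here =Parser= and =Language= from the =tree_sitter= package) may differ between versions of the library.
#+BEGIN_SRC python
# Hedged sketch: collect the types of AST nodes that start and end on one
# line, using the tree-sitter and tree-sitter-python packages.
import tree_sitter_python
from tree_sitter import Language, Parser

parser = Parser(Language(tree_sitter_python.language()))

def node_types_on_line(code: str, line: int) -> list[str]:
    """Return the types of all nodes that lie entirely on the given 0-based line."""
    tree = parser.parse(code.encode("utf8"))
    found = []

    def visit(node):
        if node.start_point[0] == line and node.end_point[0] == line:
            found.append(node.type)
        for child in node.children:
            visit(child)

    visit(tree.root_node)
    return found

# For line 1 (0-based) of this fragment, the constrained context contains
# the return statement and the nodes it is built from.
print(node_types_on_line("def f(x):\n    return x + 1\n", 1))
#+END_SRC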
**** Find frequent patterns
:PROPERTIES:
:CREATED: [2023-11-20 Mon 13:33]
:CUSTOM_ID: subsubsec:feedbackpredictiontreeminer
@@ -3033,102 +3034,122 @@ def jump(alpha, n):
=Treeminer=\nbsp{}[cite:@zakiEfficientlyMiningFrequent2005] is an algorithm for discovering frequently occurring subtrees in datasets of rooted, ordered and labelled trees.
It does this by starting with a list of frequently occurring nodes, and then iteratively expanding the frequently occurring patterns.
In the base =Treeminer= algorithm, frequently occurring means that the number of times the pattern occurs in all trees divided by the number of trees is greater than some predefined threshold.
This is called the =minimum support= parameter.
Patterns are embedded subtrees: the nodes in a pattern are a subset of the nodes of the tree, preserving the ancestor-descendant relationships and the left-to-right order of the nodes.
=TreeminerD= is a more efficient variant of the base =Treeminer= algorithm.
It achieves this efficiency by not counting occurrences of a frequent pattern within a tree.
Since we are not interested in this information for our method, it was an obvious choice to use the =TreeminerD= variant.
We use a custom Python implementation of the =TreeminerD= algorithm to find patterns in the AST subtrees for each annotation.
In our implementation, we set the =minimum support= parameter to 0.8.
This value was determined experimentally.
For example, one annotation in our real-world dataset had 92 instances placed on 47 student submissions.
For this annotation =TreeminerD= finds 105\thinsp{}718 patterns.
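As a toy illustration of the =minimum support= criterion (and not of the =TreeminerD= algorithm itself), the sketch below keeps only those candidate patterns that occur in at least 80% of the trees.
#+BEGIN_SRC python
# Hedged toy example of the minimum support criterion: a candidate pattern
# is frequent when the fraction of trees containing it reaches the
# minimum support (0.8 in our setup).
def frequent_patterns(occurrences: dict[str, set[int]],
                      number_of_trees: int,
                      minimum_support: float = 0.8) -> list[str]:
    return [pattern
            for pattern, trees in occurrences.items()
            if len(trees) / number_of_trees >= minimum_support]

# Candidate patterns mapped to the ids of the trees they occur in.
candidates = {"call > identifier": {0, 1, 2, 3}, "for > if": {0, 2}}
print(frequent_patterns(candidates, number_of_trees=4))
# ['call > identifier']: support 1.0; 'for > if' only reaches 0.5
#+END_SRC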
**** Assign weights to patterns
:PROPERTIES:
:CREATED: [2023-11-22 Wed 14:39]
:CUSTOM_ID: subsubsec:feedbackpredictionweights
:END:
Due to the iterative nature of =TreeminerD=, many patterns are (embedded) subtrees of other patterns.
We don't do any post-processing to remove these patterns, since they might be relevant to code we haven't seen yet, but we do assign weights to them.
Weights are assigned using two criteria.
The first criterion is the size of the pattern (i.e., the number of nodes in the pattern), since a pattern with twenty nodes is much more specific than a pattern with only one node.
The second criterion is the number of occurrences of a pattern across all annotations.
If the pattern sets for all annotations contain a particular pattern, it can't be used reliably to determine which annotation should be predicted and is therefore given a lower weight.
Weights are calculated using the formula below.
\[\operatorname{weight}(pattern) = \frac{\operatorname{len}(pattern)}{\operatorname{\#occurrences}(pattern)}\]
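The sketch below is a direct transcription of this formula, under the simplifying assumption that a pattern is represented as a tuple of node labels and that we count in how many annotations' pattern sets it occurs; the actual model stores patterns in the encoding of\nbsp{}[cite/t:@zakiEfficientlyMiningFrequent2005].
#+BEGIN_SRC python
# Hedged sketch of the weighting step: weight = pattern length divided by
# the number of annotations whose pattern set contains the pattern.
from collections import Counter

def pattern_weights(patterns_per_annotation):
    occurrences = Counter(pattern
                          for patterns in patterns_per_annotation.values()
                          for pattern in set(patterns))
    return {pattern: len(pattern) / count
            for pattern, count in occurrences.items()}

patterns_per_annotation = {
    "use a with statement": [("with_statement", "call"), ("call",)],
    "too many branches": [("if_statement", "if_statement"), ("call",)],
}
weights = pattern_weights(patterns_per_annotation)
print(weights[("call",)])                   # 0.5: short and shared by both annotations
print(weights[("with_statement", "call")])  # 2.0: longer and specific to one annotation
#+END_SRC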
**** Match patterns to subtrees
:PROPERTIES:
:CREATED: [2024-02-01 Thu 14:25]
:CUSTOM_ID: subsubsec:feedbackpredictionmatching
:END:
To check whether a given pattern matches a given subtree, we iterate over all the nodes in the subtree.
At the same time, we also iterate over the nodes in the pattern.
During the iteration, we also store the current depth, both in the pattern and the subtree.
We also keep a stack to store (some of) the depths of the subtree.
If the current label in the subtree and the pattern are the same, we store the current subtree depth on the stack and move to the next node in the pattern.
Moving up in the tree is more complicated.
If the current depth and the depth of the last match (stored on the stack) are the same, we can move forwards in the pattern (and the subtree).
If not, we need to check that we are still in the embedded subtree, otherwise we need to reset our position in the pattern to the start.
Since subtrees can contain the same label multiple times, we also need to make sure that we can backtrack.
Listings\nbsp{}[[lst:feedbackmatchingpseudocode1]]\nbsp{}and\nbsp{}[[lst:feedbackmatchingpseudocode2]] contain the full pseudocode for this algorithm.
#+ATTR_LATEX: :float t
#+CAPTION: Pseudocode for checking whether a pattern matches a subtree.
#+CAPTION: Note that both the pattern and the subtree are stored in the encoding described by\nbsp{}[cite/t:@zakiEfficientlyMiningFrequent2005].
#+CAPTION: The implementation of =find_in_subtree= can be found in Listing\nbsp{}[[lst:feedbackmatchingpseudocode2]].
#+NAME: lst:feedbackmatchingpseudocode1
#+BEGIN_SRC python
start, p_i, pattern_depth, depth = 0
depth_stack, history = []

# Try to match the pattern in the full subtree first; on failure, resume
# from the positions saved in history (backtracking).
result = find_in_subtree(subtree, subtree)
while not result and history is not empty:
    to_explore, to_explore_subtree = history.pop()
    while not result and to_explore is not empty:
        start, depth, depth_stack, p_i = to_explore.pop()
        new_subtree = to_explore_subtree[start:]
        start = 0
        if pattern_length - p_i <= len(new_subtree) and new_subtree is fully contained in pattern[p_i:]:
            result = find_in_subtree(subtree, new_subtree)
return result
#+END_SRC
#+ATTR_LATEX: :float t
#+CAPTION: Continuation of Listing\nbsp{}[[lst:feedbackmatchingpseudocode1]].
#+NAME: lst:feedbackmatchingpseudocode2
#+BEGIN_SRC python
find_in_subtree(subtree, current_subtree):
    local_history = []
    for i, item in enumerate(current_subtree):
        if item == -1:  # -1 marks moving back up one level in the encoding
            if depth_stack is not empty and depth - 1 == depth_stack.last:
                depth_stack.pop()
                if pattern[p_i] != -1:
                    p_i = 0
                    if depth_stack is empty:
                        history.append((local_history, current_subtree[:i + 1]))
                        local_history = []
            if pattern[p_i] == -1:
                pattern_depth -= 1
                p_i += 1
            depth -= 1
        else:
            if pattern[p_i] == item:
                # Remember this position so matching can resume here later.
                local_history.append((start + i + 1, depth + 1, depth_stack, p_i))
                depth_stack.append(depth)
                pattern_depth += 1
                p_i += 1
            depth += 1
        if p_i == pattern_length:
            return True
    if local_history is not empty:
        history.append((local_history, current_subtree))
    return False
#+END_SRC
Checking whether a pattern matches a subtree is an operation that needs to happen a lot.
For some annotations, there are many patterns, and all patterns of all annotations are checked.
An important optimization we added was to run the algorithm in Listings\nbsp{}[[lst:feedbackmatchingpseudocode1]]\nbsp{}and\nbsp{}[[lst:feedbackmatchingpseudocode2]] only if the set of labels in the pattern is a subset of the labels in the subtree.
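The sketch below illustrates this pre-check, assuming that both the pattern and the subtree are given in the encoding of\nbsp{}[cite/t:@zakiEfficientlyMiningFrequent2005], where =-1= marks moving back up one level; only when the check succeeds is the full matching algorithm from Listings\nbsp{}[[lst:feedbackmatchingpseudocode1]]\nbsp{}and\nbsp{}[[lst:feedbackmatchingpseudocode2]] run.
#+BEGIN_SRC python
# Hedged sketch of the pre-check: the full matching algorithm is only
# worth running when every label of the pattern also occurs in the
# subtree (-1 entries are backtracking markers, not labels).
def may_match(pattern: list[int], subtree: list[int]) -> bool:
    return set(pattern) - {-1} <= set(subtree) - {-1}

print(may_match([5, 7, -1], [5, 3, -1, 7, -1]))  # True: labels 5 and 7 both occur
print(may_match([5, 9, -1], [5, 3, -1, 7, -1]))  # False: label 9 never occurs
#+END_SRC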
**** Rank annotations
:PROPERTIES:
:CREATED: [2023-11-22 Wed 14:47]
:CUSTOM_ID: subsubsec:feedbackpredictionsimilarity
:END:
Given a model where we have weighted patterns for each annotation, and a method for matching patterns to subtrees, we can now put the two together to make a final ranking of the available annotations for a given line of code.
We compute a match score for each annotation using the formula below.
\[ \operatorname{score}(annotation) = \frac{\displaystyle\sum_{pattern \atop \in\, patterns} \begin{cases} \operatorname{weight}(pattern) & pattern \text{ matches} \\ 0 & \text{otherwise} \end{cases}}{\operatorname{len}(patterns)} \]
The annotations are ranked according to this score.
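The following sketch shows one way this scoring and ranking step could look; the matching routine from the previous subsection is passed in as a parameter, and the toy matcher in the example is only there to make the snippet self-contained.
#+BEGIN_SRC python
# Hedged sketch of the scoring and ranking step; `matches` stands for the
# pattern matching routine of the previous subsection.
def score(patterns, subtree, weights, matches):
    if not patterns:
        return 0.0
    return sum(weights[p] for p in patterns if matches(p, subtree)) / len(patterns)

def rank_annotations(model, subtree, weights, matches):
    # model maps each annotation to its set of frequent patterns.
    return sorted(model,
                  key=lambda annotation: score(model[annotation], subtree, weights, matches),
                  reverse=True)

# Toy usage with a naive "all labels present" matcher.
model = {"use a with statement": [("with_statement",)], "avoid eval": [("call", "eval")]}
weights = {("with_statement",): 1.0, ("call", "eval"): 2.0}
naive_match = lambda pattern, subtree: all(label in subtree for label in pattern)
print(rank_annotations(model, ("module", "with_statement", "call"), weights, naive_match))
#+END_SRC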
*** Results and discussion
:PROPERTIES:
@@ -3136,23 +3157,41 @@ The messages are sorted using this score.
:CUSTOM_ID: subsec:feedbackpredictionresults
:END:
As a dataset to validate our method, we used real (Python) code written by students for exercises from different exams.
The dataset contains between 135 and 214 submissions per exercise.
We split the datasets evenly into a training set and a test set.
During the testing phase we iterate over all instances of annotations in the test set.
The lines these occur on are the lines we feed to our model.
We evaluate if the correct annotation is ranked first, or if it is ranked in the top five.
This gives us a good idea of how useful the suggested ranking would be in practice: if an annotation is ranked outside the top five, we would expect the reviewer to have to search for it instead of selecting it directly from the suggested ranking.
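The evaluation criterion itself can be summarized in a few lines; the sketch below, with made-up rankings, computes the fraction of test instances whose true annotation appears in the top k of the suggested ranking.
#+BEGIN_SRC python
# Hedged sketch of the evaluation metric: top-k accuracy over all test
# instances, for k = 1 (ranked first) and k = 5 (ranked in the top five).
def top_k_accuracy(rankings, truths, k):
    hits = sum(truth in ranking[:k] for ranking, truth in zip(rankings, truths))
    return hits / len(truths)

rankings = [["a1", "a2", "a3"], ["a2", "a1", "a3"], ["a3", "a2", "a1"]]
truths = ["a1", "a1", "a1"]
print(top_k_accuracy(rankings, truths, k=1))  # 0.33...: one of three instances ranked first
print(top_k_accuracy(rankings, truths, k=5))  # 1.0: all three instances in the top five
#+END_SRC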
We first ran PyLint on the student submissions, and used PyLint's machine annotations as our training and test data.
For a second evaluation, we used the manual annotations left by human reviewers on student code in Dodona.
In this case, we train and test per exercise, since the set of annotations used is also different for each exercise.
Exercises have between 55 and 469 instances of annotations.
Unique annotations vary between 11 and 92 per exercise.
Table\nbsp{}[[tab:feedbackresultsdataset]] gives an overview of some of the features of the dataset.
#+CAPTION: Statistics for the exercises used in this analysis.
#+CAPTION: Max is the maximum number of instances per annotation.
#+CAPTION: Avg is the average number of instances per annotation.
#+NAME: tab:feedbackresultsdataset
| Exercise | subm. | ann. | inst. | max | avg |
|-------------------------------------------------------------------------+-------+------+-------+-----+------|
| <l> | <r> | <r> | <r> | <r> | <r> |
| A last goodbye[fn:: https://dodona.be/en/activities/505886137/] | 135 | 35 | 334 | 92 | 9.5 |
| Symbolic[fn:: https://dodona.be/en/activities/933265977/] | 141 | 11 | 55 | 24 | 5.0 |
| Narcissus cipher[fn:: https://dodona.be/en/activities/1730686412/] | 144 | 24 | 193 | 53 | 8.0 |
| Cocktail bar[fn:: https://dodona.be/en/activities/1875043169/] | 211 | 92 | 469 | 200 | 5.09 |
| Anthropomorphic emoji[fn:: https://dodona.be/en/activities/2046492002/] | 214 | 84 | 322 | 37 | 3.83 |
| Hermit[fn:: https://dodona.be/en/activities/2146239081/] | 194 | 70 | 215 | 29 | 3.07 |
We distinguish between these two sources of annotations, because we expect PyLint to be more consistent both in when it places an instance of an annotation and also where it places the instance.
Most linting annotations are detected through explicit pattern matching in the AST, so we expect our implicit pattern matching to work fairly well.
However, we want to avoid this kind of explicit pattern matching, because of the time required to write such patterns and because annotations are often specific to a particular exercise.
Therefore we also test on real-world data.
Real-world data is expected to be more inconsistent, because a human reviewer may miss a problem in one student's code that they annotated in another student's code, or they may not place an instance of a particular annotation in a consistent location.
The method by which human reviewers place an annotation is also much more implicit than PyLint's pattern matching.
**** Machine annotations (PyLint)
:PROPERTIES:
@@ -3160,29 +3199,32 @@ The method by which graders place an annotation is also a lot more implicit than
:CREATED: [2023-11-20 Mon 13:33]
:END:
We will first discuss the results for the PyLint annotations.
Depending on the exercise, the actual annotation is ranked among the top five annotations for 50% to 80% of all test instances (Figure\nbsp{}[[fig:feedbackpredictionpylintglobal]]).
The annotation is even ranked first for 10% to 45% of all test instances.
Interestingly, the method performs worse when the instances for all exercises are combined.
This highlights the fact that our method is most useful in the context where similar code needs to be reviewed many times.
Training takes 1.5 to 50 seconds to process all submissions and instances in a training set, depending on the number of patterns found.
Testing takes 4 seconds to 22 minutes, again depending on the number of patterns.
Performance was measured on a consumer-grade MacBook Pro with a 1.4 GHz Intel quad-core processor and 16 GB of RAM.
#+CAPTION: Predictive accuracy for suggesting instances of PyLint annotations.
#+NAME: fig:feedbackpredictionpylintglobal
[[./images/feedbackpredictionpylintglobal.png]]
We have selected some annotations for further inspection, some of which perform very well, and some of which perform worse (Figure\nbsp{}[[fig:feedbackpredictionpylintmessages]]).
The differences in performance can be explained by the content of the annotation and the underlying patterns PyLint is looking for.
For example, the annotation "too many branches"[fn:: https://pylint.pycqa.org/en/latest/user_guide/messages/refactor/too-many-branches.html] performs rather poorly.
This can be explained by the fact that we do not feed enough context to =TreeminerD= to find predictive patterns for this PyLint annotation.
There are also annotations that can't be predicted at all, because no patterns are found.
Other annotations, like "consider using with"[fn:: https://pylint.pycqa.org/en/latest/user_guide/messages/refactor/consider-using-with.html], work very well.
For these annotations, =TreeminerD= does have enough context to pick up the underlying patterns.
The number of instances of an annotation in the training set also has an impact.
Annotations which have only a few instances are generally predicted worse than those with lots of instances.
#+CAPTION: Predictive accuracy for a selection of PyLint annotations.
#+CAPTION: Each line corresponds to a PyLint annotation, with the number of instances in the training and test set denoted in brackets after the name of the annotation.
#+NAME: fig:feedbackpredictionpylintmessages
[[./images/feedbackpredictionpylintmessages.png]]
@@ -3192,76 +3234,108 @@ Messages which only have a few annotations are generally predicted worse than th
:CUSTOM_ID: subsubsec:feedbackpredictionresultsrealworld
:END:
For the annotations added by human reviewers, we applied two different scenarios to evaluate our method.
Besides using the same 50/50 split as with the PyLint data, we also simulated how a human reviewer would use the method in practice by gradually increasing the training set and decreasing the test set as the reviewer progresses through the submissions during the assessment.
At the start of the assessment no annotations are available and the first instance of an annotation that applies to a reviewed submission cannot be predicted.
As more submissions have been reviewed, and more instances of annotations are placed on those submissions, the training set for modelling predictions on the next submission under review grows gradually.
If we evenly split submissions and the corresponding annotations from a human reviewer into a training and a test set, the predictive accuracy is similar or even slightly better compared to the PyLint annotations (Figure\nbsp{}[[fig:feedbackpredictionrealworldglobal]]).
The percentage of instances where the true annotation is ranked first is generally higher (between 20% and 62%, depending on the exercise), and the percentage of instances where it is ranked in the top five is between 42.5% and 81%.
However, there is quite some variance between exercises.
This can be explained by the quality of the data.
For example, for the exercise "Symbolic", very few instances were placed for most annotations, which makes it difficult to predict additional instances.
For this experiment, training took between 1.5 and 16 seconds depending on the exercise.
Testing took between 1.5 and 36 seconds depending on the exercise.
These evaluations were run on the same hardware as those for the machine annotations.
#+CAPTION: Prediction results for six exercises that were designed and used for an exam.
#+CAPTION: Models were trained on half of the submissions from the dataset and tested on the other half of the submissions from the dataset.
#+NAME: fig:feedbackpredictionrealworldglobal
[[./images/feedbackpredictionrealworldglobal.png]]
These results show that we can predict reuse with an accuracy that is quite high at the midpoint of a reviewing session for an exercise.
The accuracy depends on the number of instances per annotation and the consistency of the reviewer.
Looking at the underlying data, we can also see that short, atomic annotations can be predicted very well, as hinted by\nbsp{}[cite/t:@moonsAtomicReusableFeedback2022].
We will now look at the accuracy of our method over time, to see how it evolves as the reviewing session progresses.
#+CAPTION: Prediction results for the exercise "A last goodbye" over time.
#+CAPTION: The amount of annotations for which the message has never been seen is marked separately.
#+CAPTION: The chart on the right shows the messages that are in the training set.
For the next experiment, we introduce two specific categories of negative predictive outcomes, namely "Not yet seen" and "No patterns found".
"Not yet seen" means that the annotation corresponding to the true instance had no instances in the test set, and therefore could never have been predicted.
"No patterns found" means that =TreeminerD= was unable to find any frequent patterns for the set of subtrees.
We know beforehand that test instances of such annotations cannot be predicted.
Figures\nbsp{}[[fig:feedbackpredictionrealworldsimulation1]],\nbsp{}[[fig:feedbackpredictionrealworldsimulation2]],\nbsp{}[[fig:feedbackpredictionrealworldsimulation3]]\nbsp{}and\nbsp{}[[fig:feedbackpredictionrealworldsimulation4]] show the results of this experiment for four of the exercises we used in the previous experiments.
The exercises that performed worse in the previous experiment were not taken into account for this experiment.
We also excluded submissions that received no annotations during the human review process, which explains the lower number of submissions compared to the numbers in Table\nbsp{}[[tab:feedbackresultsdataset]].
This experiment shows that while the review process requires some build-up before sufficient training instances are available, once a critical mass of training instances is reached, the accuracy for suggesting new instances of annotations reaches its maximal predictive power.
This critical mass is reached after about 20 to 30 reviews, which is quite early in the reviewing process.
The point at which the critical mass is reached will of course depend on the nature of the exercises and the consistency of the reviewer.
#+CAPTION: Progression of the predictive accuracy for the exercise "A last goodbye" throughout the review process.
#+CAPTION: Predictions for instances whose annotation had no instances in the training set are classified as "Not yet seen".
#+CAPTION: Predictions for instances whose annotation had no corresponding patterns in the model learned from the training set are classified as "No patterns found".
#+CAPTION: The graph on the right shows the number of annotations present with at least one instance in the training set.
#+NAME: fig:feedbackpredictionrealworldsimulation1
[[./images/feedbackpredictionrealworldsimulation1.png]]
#+CAPTION: Prediction results for the exercise "Narcissus cipher" over time.
#+CAPTION: The amount of annotations for which the message has never been seen is marked separately.
#+CAPTION: The chart on the right shows the messages that are in the training set.
#+CAPTION: Progression of the predictive accuracy for the exercise "Narcissus cipher" throughout the review process.
#+CAPTION: Predictions for instances whose annotation had no instances in the training set are classified as "Not yet seen".
#+CAPTION: Predictions for instances whose annotation had no corresponding patterns in the model learned from the training set are classified as "No patterns found".
#+CAPTION: The graph on the right shows the number of annotations present with at least one instance in the training set.
#+NAME: fig:feedbackpredictionrealworldsimulation2
[[./images/feedbackpredictionrealworldsimulation2.png]]
#+CAPTION: Prediction results for the exercise "Cocktail bar" over time.
#+CAPTION: The amount of annotations for which the message has never been seen is marked separately.
#+CAPTION: The chart on the right shows the messages that are in the training set.
#+CAPTION: Progression of the predictive accuracy for the exercise "Cocktail bar" throughout the review process.
#+CAPTION: Predictions for instances whose annotation had no instances in the training set are classified as "Not yet seen".
#+CAPTION: Predictions for instances whose annotation had no corresponding patterns in the model learned from the training set are classified as "No patterns found".
#+CAPTION: The graph on the right shows the number of annotations present with at least one instance in the training set.
#+NAME: fig:feedbackpredictionrealworldsimulation3
[[./images/feedbackpredictionrealworldsimulation3.png]]
#+CAPTION: Prediction results for the exercise "Anthropomorphic emoji" over time.
#+CAPTION: The amount of annotations for which the message has never been seen is marked separately.
#+CAPTION: The chart on the right shows the messages that are in the training set.
#+CAPTION: Progression of the predictive accuracy for the exercise "Anthropomorphic emoji" throughout the review process.
#+CAPTION: Predictions for instances whose annotation had no instances in the training set are classified as "Not yet seen".
#+CAPTION: Predictions for instances whose annotation had no corresponding patterns in the model learned from the training set are classified as "No patterns found".
#+CAPTION: The graph on the right shows the number of annotations present with at least one instance in the training set.
#+NAME: fig:feedbackpredictionrealworldsimulation4
[[./images/feedbackpredictionrealworldsimulation4.png]]
As mentioned before, we are working with a slightly inconsistent data set when using annotations by human reviewers.
They will sometimes miss an instance of an annotation, place it inconsistently, or create duplicate annotations.
If this system is used in practice, the predictions could possibly be even better, since knowing about its existence might further motivate a reviewer to be consistent in their reviews.
The exercises were also reviewed by different people, which could also be an explanation for the differences in accuracy of predictions.
For example, the reviewers of the exercises in Figures\nbsp{}[[fig:feedbackpredictionrealworldsimulation3]]\nbsp{}and\nbsp{}[[fig:feedbackpredictionrealworldsimulation4]] were still creating new annotations in the last few submissions, which obviously can't be predicted.
*** Conclusions and future work
:PROPERTIES:
:CREATED: [2023-11-20 Mon 13:33]
:CUSTOM_ID: subsec:feedbackpredictionconclusion
:END:
We presented a prediction method to assist in giving feedback while reviewing student submissions for an exercise by reusing annotations.
Improving annotation reuse can both save time and improve the consistency with which feedback is given.
Increased consistency might in turn also improve the accuracy of the predictions when the method is applied during the review process.
The method has already shown promising results.
We validated the framework both by predicting automated linting annotations to establish a baseline and by predicting annotations from human reviewers.
The method has about the same predictive accuracy for machine (PyLint) and human annotations.
Thus, we can answer our research question positively: previously added annotations can be used to predict what annotation a reviewer will add on a particular line of a new submission.
We can conclude that the proposed method has achieved the desired objective.
Having this method at hand immediately raises some possible follow-up work.
Currently, the proposed model is reactive: we suggest a ranking of most likely annotations when a reviewer wants to add an annotation to a particular line.
By introducing a confidence score, we could check beforehand if we have a confident match for each line, and then immediately propose those suggestions to the reviewer.
Whether or not a reviewer accepts these suggestions could then also be used as an input to the model.
This could also have an extra advantage, since it could help reviewers be more consistent in where and when they place an annotation.
Annotations that don't lend themselves well to prediction also need further investigation.
The context used could be expanded, although the important caveat here is that the method still needs to maintain its speed.
We could also consider applying some of the source code pattern mining techniques proposed by\nbsp{}[cite/t:@phamMiningPatternsSource2019] to achieve further speed improvements.
Another important aspect that was explicitly left out of the scope of this chapter was its integration into a learning platform and user testing.
Of course, alternative methods could also be considered.
One cannot overlook the rise of Large Language Models (LLMs) and the way they could contribute to solving this problem.
LLMs can also generate feedback for students, based on their code and a well-chosen system prompt.
Fine-tuning of a model with feedback already given could also be considered.
* Looking ahead: opportunities and challenges
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:51]