Update chapter 6 with new version of article

Charlotte Van Petegem 2024-03-14 14:52:40 +01:00
parent fb33d38002
commit e58bb482b0
15 changed files with 391 additions and 116 deletions

book.org

@ -3003,32 +3003,33 @@ As a result, many researchers have explored the use of AI to enhance giving feed
In addition, [cite/t:@berniusMachineLearningBased2022] introduced a framework based on clustering text segments in textual exercises to reduce the grading workload.
The context of our work is in our own assessment system, called Dodona, developed at Ghent University\nbsp{}[cite:@vanpetegemDodonaLearnCode2023].
Dodona gives automated feedback on each submission, but also has a module that allows teachers to give manual feedback on student submissions and assign scores to them.
The process of giving manual feedback on a submission for a programming exercise in Dodona is very similar to a code review, where errors or suggestions for improvements are annotated on the relevant line(s)\nbsp{}(Figure\nbsp{}[[fig:feedbackintroductionreview]]).
In 2023 alone, 3\thinsp{}663\thinsp{}749 submissions were made to Dodona, of which 44\thinsp{}012 were also assessed manually.
During manual assessment, 22\thinsp{}888 annotations were added to specific lines of code.
#+CAPTION: Manual assessment of a submission: a teacher gives feedback on the code by adding inline annotations and scores the submission by filling out the exercise-specific scoring rubric.
#+CAPTION: The teacher just searched for a previously saved annotation so that they could reuse it.
#+CAPTION: An automated annotation left by Pylint can be seen on line 22.
#+CAPTION: After manually assessing this submission, the teacher still needs to assess 23 other submissions, as can be seen on the progress bar on the right.
#+NAME: fig:feedbackintroductionreview
[[./images/feedbackintroductionreview.png]]
However, there is a crucial difference between traditional code reviews and those in an educational context: teachers often give feedback on numerous submissions to the same exercise.
Since students often make similar mistakes in their submissions to an exercise, it logically follows that teachers will repeatedly give the same or similar feedback on multiple student submissions.
Because of this repetitive nature of feedback, Dodona allows teachers to save and later search for and retrieve specific annotations.
In 2023, 777 annotations were saved by teachers on Dodona, which were reused 7\thinsp{}180 times in total.
The use of this functionality has generated data that we can use in this study: code submissions, annotated on specific lines with annotations that are shared across submissions.
In this section we answer the following research question: Can previously added annotations be used to predict the annotations a reviewer is likely to add to a specific line of code, during the manual assessment of student-written code?
We present a machine learning approach aimed at facilitating the reuse of previously given feedback.
We begin with a detailed explanation of the design of our method.
We then present and discuss the experimental results we obtained by testing our method on student submissions.
The dataset we used for this experiment is based on real Python code written by students during examinations.
First, we test our method by predicting the machine annotations generated by Pylint.
Second, we use manual annotations left by human reviewers during assessment.
*** Methodology
@ -3037,24 +3038,24 @@ Second, we use manual annotations left by human reviewers during assessment.
:CUSTOM_ID: subsec:feedbackpredictionmethodology
:END:
The approach we present for predicting what feedback a reviewer will give on source code is based on tree mining.
This is a data mining technique for extracting frequently occurring patterns from data that can be represented as trees\nbsp{}[cite:@zakiEfficientlyMiningFrequent2005; @asaiEfficientSubstructureDiscovery2004].
Program code can be represented as an abstract syntax tree (AST), where the nodes of the tree represent the language constructs used in the program.
Recent work has used this fact to investigate how these pattern mining algorithms can be used to efficiently find frequent patterns in source code\nbsp{}[cite:@phamMiningPatternsSource2019].
In an educational context, these techniques have already been used, for example, to find patterns common to solutions that failed a given exercise\nbsp{}[cite:@mensGoodBadUgly2021].
Other work has looked at automatically generating unit tests from mined patterns\nbsp{}[cite:@lienard2023extracting].
We use tree mining to find commonalities between the lines of code where the same annotation was added.
We start with a general overview of our method (Figure\nbsp{}[[fig:feedbackmethodoverview]]).
The first step is to use the tree-sitter library\nbsp{}[cite:@brunsfeldTreesitterTreesitterV02024] to generate ASTs for each submission.
Using tree-sitter should make our method independent of the programming language used, since it is a generic interface for generating syntax trees.
For each annotation, we identify all occurrences and extract a constrained AST context around the annotated line for each instance.
The resulting subtrees are then aggregated for each annotation, after which they are processed by the =TreeminerD= algorithm\nbsp{}[cite:@zakiEfficientlyMiningFrequent2005].
This yields a set of frequently occurring patterns specific to that annotation.
We assign weights to these patterns based on their length and their frequency across the entire dataset of patterns for all annotations.
The result of these operations is our trained model.
The model can now be used to score how well an annotation matches a given code fragment.
In practice, the reviewer first selects a line of code in a given student's submission.
Next, the AST of the selected line and its surrounding context is generated.
For each annotation, each of its patterns is matched to the line, and a similarity score is calculated, given the previously determined weights.
@ -3074,16 +3075,17 @@ Given the continuous generation of training data during the reviewing process, t
#+NAME: fig:feedbackmethodoverview
[[./diagrams/feedbackmethodoverview.svg]]
**** Subtree extraction
:PROPERTIES:
:CREATED: [2024-01-19 Fri 15:44]
:CUSTOM_ID: subsubsec:feedbackpredictionsubtree
:END:
The first step of our method is to extract a subtree for every instance of an annotation and then aggregate them per annotation.
Currently, the context around a line is extracted by taking all AST nodes from that line.
For example, Figure\nbsp{}[[fig:feedbacksubtree]] shows the subtree extracted for the code on line 3 of Listing\nbsp{}[[lst:feedbacksubtreesample]].
Note that the context we extract here is very limited.
Previous iterations of our method considered all nodes that contained the relevant line (e.g. the function node for a line in a function), but these contexts proved too large to process in an acceptable time frame.
#+ATTR_LATEX: :float t
#+CAPTION: Example code that simply adds a number to the ASCII value of a character and converts it back to a character.
@ -3099,20 +3101,25 @@ def jump(alpha, n):
#+NAME: fig:feedbacksubtree
[[./diagrams/feedbacksubtree.svg]]
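To make this extraction step concrete, the sketch below collects the labels of all AST nodes that start on a given line. It uses Python's built-in =ast= module as a self-contained stand-in for the tree-sitter ASTs used in our implementation; the snippet and function names are illustrative only.
#+BEGIN_SRC python
# Illustrative sketch: the actual implementation extracts the context from
# tree-sitter ASTs; Python's built-in ast module stands in here so the
# example stays self-contained and runnable.
import ast

def labels_on_line(source: str, line: int) -> list[str]:
    """Collect the labels (node types) of all AST nodes that start on `line`."""
    return [type(node).__name__
            for node in ast.walk(ast.parse(source))
            if getattr(node, "lineno", None) == line]

example = """def jump(alpha, n):
    delta = ord(alpha) + n
    return chr(delta)
"""
# Labels of the nodes on line 3 (the return statement), e.g. ['Return', 'Call', 'Name', 'Name']
print(labels_on_line(example, 3))
#+END_SRC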
**** Pattern mining
:PROPERTIES:
:CREATED: [2023-11-20 Mon 13:33]
:CUSTOM_ID: subsubsec:feedbackpredictiontreeminer
:END:
After we have gathered a collection of subtrees for every annotation, we then need to mine patterns from these subtrees.
=Treeminer=\nbsp{}[cite:@zakiEfficientlyMiningFrequent2005] is an algorithm for discovering frequently occurring patterns in datasets of rooted, ordered and labelled trees.
It does this by starting with a list of frequently occurring nodes, and then iteratively expanding the frequently occurring patterns.
Patterns are embedded subtrees: the nodes in a pattern are a subset of the nodes of the tree, preserving the ancestor-descendant relationships and the left-to-right order of the nodes.
An example of a valid pattern for the tree in Figure\nbsp{}[[fig:feedbacksubtree]] can be seen in Figure\nbsp{}[[fig:feedbackpattern]].
#+CAPTION: Valid pattern for the tree in Figure\nbsp{}[[fig:feedbacksubtree]].
#+NAME: fig:feedbackpattern
[[./diagrams/feedbackpattern.svg]]
In the base =Treeminer= algorithm, frequently occurring means that the number of times the pattern occurs in all trees divided by the number of trees is greater than some predefined threshold.
This is called the =minimum support= parameter.
=TreeminerD= is a more efficient variant of the base =Treeminer= algorithm.
It achieves this efficiency by not counting occurrences of a frequent pattern within a tree.
Since we are not interested in this information for our method, it was an obvious choice to use the =TreeminerD= variant.
@ -3121,31 +3128,34 @@ We use a custom Python implementation of the =TreeminerD= algorithm, to find pat
In our implementation, we set the =minimum support= parameter to 0.8.
This value was determined experimentally.
For example, one annotation in our real-world dataset had 92 instances placed on 47 student submissions.
For this annotation =TreeminerD= finds 105\thinsp{}718 patterns.
The choice of Python might seem antithetical to our desire for efficiency, but the amount of work we have to do has a far larger impact on the performance than the choice of programming language.
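To illustrate what the =minimum support= criterion boils down to, the sketch below applies it to the simplest possible patterns: single node labels. The real =TreeminerD= algorithm enumerates embedded subtrees far more cleverly, so this is a conceptual illustration rather than our implementation.
#+BEGIN_SRC python
# Conceptual illustration of the minimum support criterion, applied to the
# simplest possible patterns (single node labels); TreeminerD itself
# enumerates embedded subtrees.
from collections import Counter

def frequent_labels(subtrees: list[set[str]], min_support: float = 0.8) -> set[str]:
    occurs_in = Counter()
    for labels in subtrees:
        occurs_in.update(labels)  # each subtree counts at most once per label
    return {label for label, count in occurs_in.items()
            if count / len(subtrees) >= min_support}

subtrees = [{"call", "identifier", "argument_list"},
            {"call", "identifier", "integer"},
            {"call", "identifier", "argument_list"}]
print(frequent_labels(subtrees))  # {'call', 'identifier'}
#+END_SRC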
**** Weight assignment
:PROPERTIES:
:CREATED: [2023-11-22 Wed 14:39]
:CUSTOM_ID: subsubsec:feedbackpredictionweights
:END:
Due to the iterative nature of =TreeminerD=, many patterns are (embedded) subtrees of other patterns.
We don't do any post-processing to remove these patterns, since they might be relevant to code we haven't seen yet, but we do assign weights to them.
We now have a set of patterns corresponding to each annotation.
Some patterns are more important than others, though. Therefore, we need to assign weights to the patterns we got from =TreeminerD=.
Weights are assigned using two criteria.
The first criterion is the size of the pattern (i.e., the number of nodes in the pattern), since a pattern with twenty nodes is much more specific than a pattern with only one node.
The second criterion is the number of occurrences of a pattern across all annotations.
If the pattern sets for all annotations contain a particular pattern, it can't be used reliably to determine which annotation should be predicted and is therefore given a lower weight.
Weights are calculated using the formula below.
\[\operatorname{weight}(pattern) = \frac{\operatorname{len}(pattern)}{\operatorname{\#occurrences}(pattern)}\]
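The sketch below is a direct transcription of this formula; representing a pattern as a tuple of node labels is an illustrative simplification, not the internal representation we use.
#+BEGIN_SRC python
# Direct transcription of the weighting formula; the tuple-of-labels pattern
# representation and the example annotations are illustrative only.
from collections import Counter

def assign_weights(patterns_per_annotation: dict[str, set[tuple]]) -> dict[tuple, float]:
    occurrences = Counter()  # in how many annotation pattern sets each pattern occurs
    for patterns in patterns_per_annotation.values():
        occurrences.update(patterns)
    return {pattern: len(pattern) / occurrences[pattern]
            for patterns in patterns_per_annotation.values()
            for pattern in patterns}

weights = assign_weights({
    "use a more descriptive name": {("call", "identifier"), ("assignment",)},
    "this loop can be a comprehension": {("for_statement", "call"), ("call", "identifier")},
})
# ("call", "identifier") is shared by both annotations: weight 2 / 2 = 1.0
# ("for_statement", "call") is unique to one annotation: weight 2 / 1 = 2.0
#+END_SRC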
**** Pattern matching
:PROPERTIES:
:CREATED: [2024-02-01 Thu 14:25]
:CUSTOM_ID: subsubsec:feedbackpredictionmatching
:END:
After completing the steps above, we now have our finished model.
To use our model, we need to know how to match patterns to subtrees.
To check whether a given pattern matches a given subtree, we iterate over all the nodes in the subtree.
At the same time, we also iterate over the nodes in the pattern.
During the iteration, we also store the current depth, both in the pattern and the subtree.
@ -3154,7 +3164,7 @@ If the current label in the subtree and the pattern are the same, we store the c
Moving up in the tree is more complicated.
If the current depth and the depth of the last match (stored on the stack) are the same, we can move forwards in the pattern (and the subtree).
If not, we need to check that we are still in the embedded subtree, otherwise we need to reset our position in the pattern to the start.
Since subtrees can contain multiple instances of the same label, we also need to make sure that we can backtrack.
Listings\nbsp{}[[lst:feedbackmatchingpseudocode1]]\nbsp{}and\nbsp{}[[lst:feedbackmatchingpseudocode2]] contain the full pseudocode for this algorithm.
#+ATTR_LATEX: :float t
@ -3214,7 +3224,7 @@ Checking whether a pattern matches a subtree is an operation that needs to happe
For some annotations, there are many patterns, and all patterns of all annotations are checked.
An important optimization we added was to run the algorithm in Listings\nbsp{}[[lst:feedbackmatchingpseudocode1]]\nbsp{}and\nbsp{}[[lst:feedbackmatchingpseudocode2]] only if the set of labels in the pattern is a subset of the labels in the subtree.
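A sketch of this pre-filter is shown below; =full_match= stands in for the complete matching algorithm from the pseudocode listings, and the label sets are assumed to be precomputed.
#+BEGIN_SRC python
# Sketch of the label-subset pre-filter: the expensive matching routine is
# only invoked when every label in the pattern also occurs in the subtree.
def matches_with_prefilter(pattern, subtree, pattern_labels: set, subtree_labels: set, full_match) -> bool:
    if not pattern_labels <= subtree_labels:
        return False  # cheap rejection, no full matching needed
    return full_match(pattern, subtree)  # only now run the expensive algorithm
#+END_SRC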
**** Annotation ranking
:PROPERTIES:
:CREATED: [2023-11-22 Wed 14:47]
:CUSTOM_ID: subsubsec:feedbackpredictionsimilarity
@ -3231,43 +3241,49 @@ The annotations are ranked according to this score.
:CUSTOM_ID: subsec:feedbackpredictionresults
:END:
As a dataset to validate our method, we used Python code written by students for exercises from (different) exams.
The dataset contains between 135 and 214 submissions per exercise.
We split the datasets evenly into a training set and a test set.
This simulates the midpoint of an assessment session.
During testing, we let our model suggest annotations for each of the lines that had an actual annotation associated with it in the test set.
We evaluate at which position the correct annotation is ranked.
We only look at the top five to give us a good idea of how useful the suggested ranking would be in practice: if an annotation is ranked higher than fifth, we would expect the reviewer to have to search for it instead of directly selecting it from the suggested ranking.
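The evaluation itself then boils down to a top-k accuracy computation, sketched below with =model.rank= as a hypothetical interface to the method described above.
#+BEGIN_SRC python
# Sketch of the evaluation loop; model.rank(line) is a hypothetical interface
# that returns the annotations ordered from most to least likely.
def top_k_accuracy(model, test_instances, k: int = 5) -> float:
    hits = sum(1 for line, true_annotation in test_instances
               if true_annotation in model.rank(line)[:k])
    return hits / len(test_instances)
#+END_SRC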
We first ran Pylint[fn:: https://www.pylint.org/] (version 3.1.0) on the student submissions.
Pylint is a static code analyser for Python that checks for errors and code smells, and enforces a standard programming style.
We used Pylint's machine annotations as our training and test data.
We test per exercise since that's our main use case for this method, but also perform one test where all submissions of all exercises are combined.
For a second experiment, we used the manual annotations left by human reviewers on student code in Dodona.
Exercises were reviewed by different people, but all submissions for an exercise were reviewed by the same person.
At the time, the reviewers were not aware that this method would be developed.
In this case, there is no combined test, since the set of annotations used is also different for each exercise.
We distinguish between these two sources of annotations because we expect Pylint to be more consistent both in when it places an instance of an annotation and also where it places the instance.
Most linting annotations are detected by explicit pattern matching in the AST, so we expect our implicit pattern matching to work fairly well.
However, we want to skip this explicit pattern matching for manual annotations because of the time required to assemble them and the fact that annotations will often be specific to a particular exercise.
Therefore we also test on manual annotations.
Manual annotations are expected to be more inconsistent because reviewers may miss a problem in one student's code that they annotated in another student's code, or they may not place instances of a particular annotation in consistent locations.
The method by which human reviewers place an annotation is also much more implicit than Pylint's pattern matching.
Exercises have between 55 and 469 instances of annotations.
The number of distinct annotations varies between 7 and 34 per exercise.
Table\nbsp{}[[tab:feedbackresultsdataset]] gives an overview of some of the features of the dataset.
Timings mentioned in this section were measured on a 2019 Macbook Pro with a 1.4GHz Intel quad-core processor and 16 GB of RAM.
#+CAPTION: Statistics of manually added annotations for the exercises used in this analysis.
#+CAPTION: Max is the maximum number of instances per annotation.
#+CAPTION: Avg is the average number of instances per annotation.
#+NAME: tab:feedbackresultsdataset
| Exercise | subm. | ann. | inst. | max | avg |
|-----------------------+-------+------+-------+-----+-------|
| <l> | <r> | <r> | <r> | <r> | <r> |
| A last goodbye | 135 | 34 | 334 | 92 | 9.82 |
| Symbolic | 141 | 7 | 55 | 25 | 7.85 |
| Narcissus cipher | 144 | 17 | 193 | 55 | 11.35 |
| Cocktail bar | 211 | 15 | 469 | 231 | 31.27 |
| Anthropomorphic emoji | 214 | 27 | 322 | 39 | 11.93 |
| Hermit | 194 | 32 | 215 | 27 | 6.71 |
**** Machine annotations (Pylint)
:PROPERTIES:
@ -3278,30 +3294,37 @@ Timings mentioned in this section were measured on a consumer-grade Macbook Pro
We will first discuss the results for the Pylint annotations.
During the experiment, a few Pylint annotations not related to the actual code were left out to avoid skewing the results.
These are "line too long", "trailing whitespace", "trailing newlines", "missing module docstring", "missing class docstring", and "missing function docstring".
Depending on the exercise, the actual annotation is ranked among the top five annotations for 45% to 75% of all test instances (Figure\nbsp{}[[fig:feedbackpredictionpylintglobal]]).
The annotation is even ranked first for 19% to 59% of all test instances.
Interestingly, the method mostly performs worse when the instances for all exercises are combined.
This highlights the fact that our method is most useful in the context where similar code needs to be reviewed many times.
Training on the submissions and instances in the training set takes 1.5 to 52 seconds per exercise.
The entire testing phase takes between 4 seconds and 9.5 minutes.
The average time of one prediction ranges between 30ms and 6 seconds.
The minima range between 6ms and 4 seconds, the maxima between 127ms and 55 seconds.
Note that these are very wide ranges.
These big differences can be explained by the number of patterns that are found for the annotations.
If there are annotations with a very large amount of patterns, this will be reflected in both the training and testing time.
#+CAPTION: Predictive accuracy for suggesting instances of Pylint annotations.
#+CAPTION: Numbers on the right are the total number of annotations and instances respectively.
#+CAPTION: The "Combined" test evaluated our method on the entire set of submissions for all exercises.
#+NAME: fig:feedbackpredictionpylintglobal
[[./images/feedbackpredictionpylintglobal.png]]
We have selected some interesting annotations for further inspection, some of which perform very well, and some of which perform worse (Figure\nbsp{}[[fig:feedbackpredictionpylintmessages]]).
We selected these specific annotations to demonstrate interesting behaviours our method exhibits.
The differences in performance can be explained by the content of the annotation and the underlying patterns Pylint is looking for.
For example, the annotation "unused variable"[fn:: https://pylint.pycqa.org/en/latest/user_guide/messages/warning/unused-variable.html] performs rather poorly.
This can be explained by the fact that we do not feed enough context to =TreeminerD= to find predictive patterns for this Pylint annotation.
There are also annotations that can't be predicted at all, because no patterns are found.
Other annotations, like "consider using in"[fn:: https://pylint.pycqa.org/en/latest/user_guide/messages/refactor/consider-using-in.html], work very well.
For these annotations, =TreeminerD= does have enough context to pick up the underlying patterns.
The number of instances of an annotation in the training set also has an impact.
Annotations which have only a few instances are generally predicted worse than those with lots of instances.
#+CAPTION: Predictive accuracy for a selection of machine annotations by Pylint.
#+CAPTION: Each line corresponds to a Pylint annotation, with the number of instances in the training and test set denoted in brackets after the name of the annotation.
#+NAME: fig:feedbackpredictionpylintmessages
[[./images/feedbackpredictionpylintmessages.png]]
@ -3313,22 +3336,25 @@ Annotations which have only a few instances are generally predicted worse than t
:END:
For the annotations added by human reviewers, we applied two different scenarios to evaluate our method.
Besides using the same 50/50 split between training and testing data as with the Pylint data, we also simulated how a human reviewer would use the method in practice by gradually increasing the training set and decreasing the test set as the reviewer progresses through the submissions during the assessment.
At the start of the assessment no annotations are available and the first instance of an annotation that applies to a reviewed submission cannot be predicted.
As more submissions have been reviewed, and more instances of annotations are placed on those submissions, the training set for modelling predictions on the next submission under review grows gradually.
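This simulation boils down to the loop sketched below, where =train= and =predict_for= are hypothetical stand-ins for the training and suggestion steps described in the methodology.
#+BEGIN_SRC python
# Sketch of the simulated review session: before a submission is reviewed, the
# model only knows the annotations placed on previously reviewed submissions.
def simulate_review_session(submissions, train, predict_for):
    results = []
    for i, submission in enumerate(submissions):
        model = train(submissions[:i])  # empty training set for the first submission
        results.append(predict_for(model, submission))
    return results
#+END_SRC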
If we evenly split submissions and the corresponding annotations from a human reviewer into a training and a test set, the predictive accuracy is similar or even slightly better compared to the Pylint annotations (Figure\nbsp{}[[fig:feedbackpredictionrealworldglobal]]).
The number of instances where the true annotation is ranked first is generally higher (between 36.4% and 60% depending on the exercise), and the number of instances where it is ranked in the top five is between 61% and 89% depending on the exercise.
However, there is quite some variance between exercises.
This can be explained by the quality of the data.
For example, for the exercise "Symbolic", very few instances were placed for most annotations, which makes it difficult to predict additional instances.
For this experiment, training took between 1.2 and 16.7 seconds depending on the exercise.
The entire testing phase took between 1.5 and 35 seconds depending on the exercise.
These evaluations were run on the same hardware as those for the machine annotations.
For one prediction, average times range between 0.1ms and 1 second.
The minima range between 0.1ms and 240ms and the maxima range between 0.2ms and 3 seconds.
The explanation for these big ranges remains the same as for the Pylint predictions: everything depends on the number of patterns found.
#+CAPTION: Prediction results for six exercises that were designed and used for an exam, using manual annotations.
#+CAPTION: Models were trained on half of the submissions from the dataset and tested on the other half of the submissions from the dataset.
#+CAPTION: Numbers on the right are the total number of annotations and instances respectively.
#+NAME: fig:feedbackpredictionrealworldglobal
[[./images/feedbackpredictionrealworldglobal.png]]
@ -3337,90 +3363,100 @@ The accuracy depends on the amount of instances per annotation and the consisten
Looking at the underlying data, we can also see that short, atomic messages can be predicted very well, as hinted by\nbsp{}[cite/t:@moonsAtomicReusableFeedback2022].
We will now look at the accuracy of our method over time, to test how the accuracy evolves as the reviewing session progresses.
For the next experiment, we introduce two specific categories of negative predictive outcomes, namely "No training instances" and "No patterns".
"No training instances" means that the annotation corresponding to the true instance had no instances in the training set, and therefore could never have been predicted.
"No patterns" means that =TreeminerD= was unable to find any frequent patterns for the set of subtrees extracted from the annotation instances.
This could be because the collection of subtrees is too diverse, or because we have only seen one or two instances of a particular annotation, in which case we can't run =TreeminerD=.
We know beforehand that test instances of such annotations cannot be predicted.
Figures\nbsp{}[[fig:feedbackpredictionrealworldsimulation1]],\nbsp{}[[fig:feedbackpredictionrealworldsimulation2]],\nbsp{}[[fig:feedbackpredictionrealworldsimulation3]]\nbsp{}and\nbsp{}[[fig:feedbackpredictionrealworldsimulation4]] show the results of this experiment for four of the exercises we used in the previous experiments.
The exercises that performed worse in the previous experiment were not taken into account for this experiment.
We also excluded submissions that received no annotations during the human review process, which explains the lower number of submissions compared to the numbers in Table\nbsp{}[[tab:feedbackresultsdataset]].
This experiment shows that while the review process requires some build-up before sufficient training instances are available, once a critical mass of training instances is reached, the accuracy for suggesting new instances of annotations reaches its maximal predictive power.
This critical mass is reached after about 20 to 30 reviews, which is quite early in the reviewing process\nbsp{}(Figure\nbsp{}[[fig:feedbackpredictionrealworldevolution]]).
The point at which the critical mass is reached will of course depend on the nature of the exercises and the consistency of the reviewer.
#+CAPTION: Progression of the predictive accuracy for the exercise "A last goodbye" throughout the review process.
#+CAPTION: Predictions for instances whose annotation had no instances in the training set are classified as "No training instances".
#+CAPTION: Predictions for instances whose annotation had no corresponding patterns in the model learned from the training set are classified as "No patterns".
#+CAPTION: The graph on the right shows the number of annotations present with at least one instance in the training set.
#+NAME: fig:feedbackpredictionrealworldsimulation1
[[./images/feedbackpredictionrealworldsimulation1.png]]
#+CAPTION: Progression of the predictive accuracy for the exercise "Narcissus cipher" throughout the review process.
#+CAPTION: Predictions for instances whose annotation had no instances in the training set are classified as "No training instances".
#+CAPTION: Predictions for instances whose annotation had no corresponding patterns in the model learned from the training set are classified as "No patterns".
#+CAPTION: The graph on the right shows the number of annotations present with at least one instance in the training set.
#+NAME: fig:feedbackpredictionrealworldsimulation2
[[./images/feedbackpredictionrealworldsimulation2.png]]
#+CAPTION: Progression of the predictive accuracy for the exercise "Cocktail bar" throughout the review process.
#+CAPTION: Predictions for instances whose annotation had no instances in the training set are classified as "No training instances".
#+CAPTION: Predictions for instances whose annotation had no corresponding patterns in the model learned from the training set are classified as "No patterns".
#+CAPTION: The graph on the right shows the number of annotations present with at least one instance in the training set.
#+NAME: fig:feedbackpredictionrealworldsimulation3
[[./images/feedbackpredictionrealworldsimulation3.png]]
#+CAPTION: Progression of the predictive accuracy for the exercise "Anthropomorphic emoji" throughout the review process.
#+CAPTION: Predictions for instances whose annotation had no instances in the training set are classified as "No training instances".
#+CAPTION: Predictions for instances whose annotation had no corresponding patterns in the model learned from the training set are classified as "No patterns".
#+CAPTION: The graph on the right shows the number of annotations present with at least one instance in the training set.
#+NAME: fig:feedbackpredictionrealworldsimulation4
[[./images/feedbackpredictionrealworldsimulation4.png]]
#+CAPTION: Evolution of the percentage of suggestions that are in the top 5.
#+CAPTION: The percentages are quite stable after 20 to 30 submissions have been reviewed.
#+NAME: fig:feedbackpredictionrealworldevolution
[[./images/feedbackpredictionrealworldevolution.png]]
To evaluate the performance of our model for these experiments, we measure the training times, and the times required for each prediction (this corresponds to a teacher wanting to add an annotation to a line in practice).
Figures\nbsp{}[[fig:feedbackpredictionrealworldtimings1]],\nbsp{}[[fig:feedbackpredictionrealworldtimings2]],\nbsp{}[[fig:feedbackpredictionrealworldtimings3]],\nbsp{}and\nbsp{}[[fig:feedbackpredictionrealworldtimings4]] show the timings for these experiments.
Like in the previous experiments, we can see that there is a big difference between exercises: for two of the exercises, the training time does not exceed 10 seconds, whereas for the others the training times go up to about a minute.
Prediction times show a similar diversity: for one exercise the prediction times never exceed 3 seconds, while for the others there are some outliers up to 25 seconds.
The average prediction times never exceed a few seconds though.
The timings show that even though there are some outliers, most predictions can be performed quickly enough to make this an interactive system.
The outliers also correspond with higher training times, indicating this is mainly caused by a high number of underlying patterns for some annotations.
Currently this process is also parallelized over the files, but in the real world, the process would probably be parallelized over the patterns, which would speed up the prediction even more.
Note that the training time can also go down given more training data.
If there are more instances per annotation, the diversity in related subtrees will usually increase, which decreases the number of patterns that can be found, which also decreases the training time.
#+CAPTION: Time needed for training and testing throughout the review process for the exercise "A last goodbye".
#+CAPTION: Top: training time.
#+CAPTION: Bottom: average (orange dot) and range (blue line) of time required for predicting a single instance.
#+NAME: fig:feedbackpredictionrealworldtimings1
[[./images/feedbackpredictionrealworldtimings1.png]]
#+CAPTION: Time needed for training and testing throughout the review process for the exercise "Narcissus cipher".
#+CAPTION: Top: training time.
#+CAPTION: Bottom: average (orange dot) and range (blue line) of time required for predicting a single instance.
#+NAME: fig:feedbackpredictionrealworldtimings2
[[./images/feedbackpredictionrealworldtimings2.png]]
#+CAPTION: Time needed for training and testing throughout the review process for the exercise "Cocktail bar".
#+CAPTION: Top: training time.
#+CAPTION: Bottom: average (orange dot) and range (blue line) of time required for predicting a single instance.
#+NAME: fig:feedbackpredictionrealworldtimings3
[[./images/feedbackpredictionrealworldtimings3.png]]
#+CAPTION: Time needed for training and testing throughout the review process for the exercise "Anthropomorphic emoji".
#+CAPTION: Top: training time.
#+CAPTION: Bottom: average (orange dot) and range (blue line) of time required for predicting a single instance.
#+NAME: fig:feedbackpredictionrealworldtimings4
[[./images/feedbackpredictionrealworldtimings4.png]]
As mentioned before, we are working with a slightly inconsistent data set when using annotations by human reviewers.
They will sometimes miss an instance of an annotation, place it inconsistently, or create duplicate annotations.
If this system is used in practice, the predictions could possibly be even better, since knowing about its existence might further motivate a reviewer to be consistent in their reviews.
The exercises were also reviewed by different people, which could also be an explanation for the differences in accuracy of predictions.
For example, the reviewers of the exercises in Figures\nbsp{}[[fig:feedbackpredictionrealworldsimulation3]]\nbsp{}and\nbsp{}[[fig:feedbackpredictionrealworldsimulation4]] were still creating new annotations in the last few submissions, which obviously can't be predicted.
*** Conclusions and future work
:PROPERTIES:
:CREATED: [2023-11-20 Mon 13:33]
:CUSTOM_ID: subsec:feedbackpredictionconclusion
:END:
We presented a prediction method to assist human reviewers in giving feedback while reviewing students' submissions for an exercise by reusing annotations.
Improving annotation reuse can both save time and improve the consistency with which feedback is given.
The latter itself might actually also improve the accuracy of the predictions when the strategy is applied during the review process.
@ -3429,14 +3465,14 @@ We validated the framework both by predicting automated linting annotations to e
The method has about the same predictive accuracy for machine (Pylint) and human annotations.
Thus, we can give a positive answer to our research question that reuse of feedback given previously by a human reviewer can be predicted with high accuracy on a particular line of a new submission.
We can conclude that the proposed method has achieved the desired objective as expressed in the research question.
Having this method at hand immediately raises some possible follow-up work.
Currently, the proposed model is reactive: we suggest a ranking of most likely annotations when a reviewer wants to add an annotation to a particular line.
By introducing a confidence score, we could check beforehand if we have a confident match for each line, and then immediately propose those suggestions to the reviewer.
Whether or not a reviewer accepts these suggestions could then also be used as an input to the model.
This could also have an extra advantage, since it could help reviewers be more consistent in where and when they place an annotation.
Annotations that don't lend themselves well to prediction also need further investigation.
The context used could be expanded, although the important caveat here is that the method still needs to maintain its speed.
We could also consider applying some of the source code pattern mining techniques proposed by\nbsp{}[cite/t:@phamMiningPatternsSource2019] to achieve further speed improvements.
This could also help with the outliers seen in the timing data.