Last update to feedback prediction article

parent 8fca776c1b
commit 22acfa8b4c

2 changed files with 49 additions and 50 deletions

book.org
@@ -3024,7 +3024,7 @@ This feature facilitates the reuse of feedback by allowing teachers to search fo
In 2023, 777 annotations were saved by teachers on Dodona, which were reused 7\thinsp{}180 times in total.
The utilisation of this functionality has generated data that we can use in this study: code submissions, annotated on specific lines with annotations that are being shared across submissions.

In this section we answer the following research question: Can previously added annotations be used to predict the annotations a reviewer is likely to add to a specific line of code, during the manual assessment of student-written code?
In this section we answer the following research question: Can previously added annotations be used to predict what annotations a reviewer is likely to add to a specific line of code, during the manual assessment of student-written code?

We present a machine learning approach aimed at facilitating the reuse of previously given feedback.
We begin with a detailed explanation of the design of our method.
@@ -3039,7 +3039,7 @@ Second, we use manual annotations left by human reviewers during assessment.
:CUSTOM_ID: subsec:feedbackpredictionmethodology
:END:

The approach we present for predicting what feedback a reviewer will give on source code is based on tree mining.
Our approach for predicting what feedback a reviewer will give on source code is based on tree mining.
This is a data mining technique for extracting frequently occurring patterns from data that can be represented as trees\nbsp{}[cite:@zakiEfficientlyMiningFrequent2005; @asaiEfficientSubstructureDiscovery2004].
Program code can be represented as an abstract syntax tree (AST), where the nodes of the tree represent the language constructs used in the program.
Recent work has used this fact to investigate how these pattern mining algorithms can be used to efficiently find frequent patterns in source code\nbsp{}[cite:@phamMiningPatternsSource2019].
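
As a minimal illustration of this representation, the snippet below dumps the AST that Python's built-in =ast= module produces for a single assignment statement; the method itself relies on tree-sitter rather than =ast=, so this is only meant to show what "nodes as language constructs" looks like in practice.

#+BEGIN_SRC python
import ast

# Every language construct (assignment, subscript, binary operation, ...)
# becomes a node in the tree.
print(ast.dump(ast.parse("numbers[i] = numbers[i] + 1"), indent=2))
#+END_SRC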
@@ -3049,14 +3049,14 @@ We use tree mining to find commonalities between the lines of code where the same a

We start with a general overview of our method (Figure\nbsp{}[[fig:feedbackmethodoverview]]).
The first step is to use the tree-sitter library\nbsp{}[cite:@brunsfeldTreesitterTreesitterV02024] to generate ASTs for each submission.
Using tree-sitter should make our method independent of the programming language used, since it is a generic interface for generating syntax trees.
Using tree-sitter makes our method independent of the programming language used, since it is a generic interface for generating syntax trees.
For each annotation, we identify all occurrences and extract a constrained AST context around the annotated line for each instance.
The resulting subtrees are then aggregated for each annotation, after which they are processed by the =TreeminerD= algorithm\nbsp{}[cite:@zakiEfficientlyMiningFrequent2005].
This yields a set of frequently occurring patterns specific to that annotation.
We assign weights to these patterns based on their length and their frequency across the entire dataset of patterns for all annotations.
The result of these operations is our trained model.

The model can now be used to score how well an annotation matches a given code fragment.
The model can then be used to score how well an annotation matches a given code fragment.
In practice, the reviewer first selects a line of code in a given student's submission.
Next, the AST of the selected line and its surrounding context is generated.
For each annotation, each of its patterns is matched to the line, and a similarity score is calculated, given the previously determined weights.
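
To make the scoring step concrete, the toy example below ranks annotations by the summed weight of their matching patterns. It deliberately reduces a pattern to a set of node labels and matching to a subset test; the annotation texts, labels and weights are made up for illustration, and the actual matching procedure is described in the matching subsection below.

#+BEGIN_SRC python
# Toy model: annotation -> [(pattern, weight), ...], with a "pattern" reduced
# to a frozenset of AST node labels and matching reduced to a subset test.
toy_model = {
    "use a for loop over the string": [(frozenset({"for_statement", "call"}), 2.0),
                                       (frozenset({"subscript"}), 0.5)],
    "unnecessary call to ord": [(frozenset({"call", "identifier"}), 1.0)],
}

def rank_annotations(model, line_node_labels):
    """Rank annotations by the summed weight of their matching patterns."""
    scores = {annotation: sum(weight for pattern, weight in patterns
                              if pattern <= line_node_labels)
              for annotation, patterns in model.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Node labels found on (and around) the line selected by the reviewer:
print(rank_annotations(toy_model, frozenset({"for_statement", "call", "identifier"})))
#+END_SRC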
@@ -3072,7 +3072,7 @@ Given the continuous generation of training data during the reviewing process, t
#+CAPTION: For each annotation, the context of each instance is extracted and mined for patterns using the =TreeminerD= algorithm.
#+CAPTION: These patterns are then weighted to form our model.
#+CAPTION: When a teacher wants to place an annotation on a line of the submission they are currently reviewing, all previously given annotations are ranked based on the similarity determined for that line.
#+CAPTION: The teacher can then choose which annotation they want to place.
#+CAPTION: The teacher can then choose which annotation they want to place, with the goal of having the selected annotation appear high in the ranking.
#+NAME: fig:feedbackmethodoverview
[[./diagrams/feedbackmethodoverview.svg]]
@@ -3098,7 +3098,7 @@ def jump(alpha, n):
    return chr(adjusted)
#+END_SRC

#+CAPTION: AST subtree corresponding to line 3 in Listing\nbsp{}[[lst:feedbacksubtreesample]].
#+CAPTION: AST subtree corresponding to line 3 in Listing\nbsp{}[[lst:feedbacksubtreesample]], as generated by tree-sitter.
#+NAME: fig:feedbacksubtree
[[./diagrams/feedbacksubtree.svg]]
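
The listing and figure above show the kind of per-line subtree our method works with. Purely as an illustration of extracting "a constrained AST context around the annotated line", the sketch below does this with Python's built-in =ast= module; the actual implementation uses tree-sitter, and both the helper and the stand-in code fragment are ours.

#+BEGIN_SRC python
import ast

def context_subtree(code: str, line: int) -> ast.AST:
    """Return the innermost statement node whose source range spans `line`."""
    tree = ast.parse(code)
    best = tree
    for node in ast.walk(tree):  # breadth-first, so deeper matches come later
        if isinstance(node, ast.stmt) and node.lineno <= line <= node.end_lineno:
            best = node
    return best

code = """def jump(alpha, n):
    shifted = ord(alpha) + n
    adjusted = (shifted - ord("a")) % 26 + ord("a")
    return chr(adjusted)
"""
print(ast.dump(context_subtree(code, 3)))  # the assignment on line 3
#+END_SRC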
@@ -3119,18 +3119,16 @@ An example of a valid pattern for the tree in Figure\nbsp{}[[fig:feedbacksubtree
[[./diagrams/feedbackpattern.svg]]

In the base =Treeminer= algorithm, frequently occurring means that the number of times the pattern occurs in all trees divided by the number of trees is greater than some predefined threshold.
This is called the =minimum support= parameter.
This threshold is called the =minimum support= parameter of the algorithm.

=TreeminerD= is a more efficient variant of the base =Treeminer= algorithm.
It achieves this efficiency by not counting occurrences of a frequent pattern within a tree.
Since we are not interested in this information for our method, it was an obvious choice to use the =TreeminerD= variant.

We use a custom Python implementation of the =TreeminerD= algorithm, to find patterns in the AST subtrees for each annotation.
We use a custom implementation of the =TreeminerD= algorithm to find patterns in the AST subtrees for each annotation.
In our implementation, we set the =minimum support= parameter to 0.8.
This value was determined experimentally.

The choice of Python might seem antithetical to our desire for efficiency, but the amount of work we have to do has a far larger impact on the performance than the choice of programming language.
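
The custom implementation itself is not reproduced here, but the role of the =minimum support= parameter can be illustrated with a small, self-contained sketch: a candidate pattern is only kept when the fraction of an annotation's subtrees that contain it reaches the threshold. The candidate patterns, subtrees and containment test below are toy stand-ins.

#+BEGIN_SRC python
MINIMUM_SUPPORT = 0.8

def frequent_patterns(candidates, subtrees, contains):
    """Keep the candidates that occur in at least MINIMUM_SUPPORT of the subtrees."""
    kept = []
    for pattern in candidates:
        support = sum(contains(tree, pattern) for tree in subtrees) / len(subtrees)
        if support >= MINIMUM_SUPPORT:
            kept.append(pattern)
    return kept

# Toy stand-ins: subtrees and patterns reduced to sets of node labels.
subtrees = [{"for_statement", "call"}, {"for_statement"},
            {"for_statement", "subscript"}, {"for_statement", "call"}]
candidates = [{"for_statement"}, {"call"}]
print(frequent_patterns(candidates, subtrees,
                        lambda tree, pattern: pattern <= tree))
# -> [{'for_statement'}]  (support 1.0 vs. 0.5 for {'call'})
#+END_SRC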
**** Weight assignment
:PROPERTIES:
@@ -3139,8 +3137,8 @@ The choice of Python might seem antithetical to our desire for efficiency, but t
:END:

We now have a set of patterns corresponding to each annotation.
Some patterns are more important that others though.
Therefore we need to assign weight to the patterns we got from =TreeminerD=
Some patterns are more informative than others though.
Therefore we assign weights to the patterns we obtain from =TreeminerD=.

Weights are assigned using two criteria.
The first criterion is the size of the pattern (i.e., the number of nodes in the pattern), since a pattern with twenty nodes is much more specific than a pattern with only one node.
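
The exact formula used in our model follows further on; to make the two criteria tangible, the sketch below uses a hypothetical weighting in which larger patterns score higher and patterns that appear in the pattern sets of many different annotations are discounted. It is an illustration of the idea, not the formula from our model.

#+BEGIN_SRC python
import math

def pattern_weight(pattern_size, annotations_with_pattern, total_annotations):
    """Hypothetical weighting: reward pattern size, discount patterns that
    occur for many different annotations (an IDF-style correction).
    This illustrates the two criteria, not our actual formula."""
    return pattern_size * math.log(total_annotations / annotations_with_pattern)

# A 20-node pattern unique to 1 of 10 annotations vs. a 1-node pattern
# shared by 8 of the 10 annotations:
print(pattern_weight(20, 1, 10), pattern_weight(1, 8, 10))
#+END_SRC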
@@ -3155,7 +3153,7 @@ Weights are calculated using the formula below.
:CUSTOM_ID: subsubsec:feedbackpredictionmatching
:END:

After completing the steps above we now have our finished model.
After completing the steps above, we have trained our model.
To use our model, we need to know how to match patterns to subtrees.

To check whether a given pattern matches a given subtree, we iterate over all the nodes in the subtree.
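
As a much-simplified illustration of this check, the sketch below represents a subtree as nested =(label, children)= tuples and reduces a pattern to a chain of labels that must occur along one ancestor-descendant path, in order but with gaps allowed. The real procedure also has to respect the branching structure of =TreeminerD= patterns; the labels and patterns here are toy examples.

#+BEGIN_SRC python
def matches(subtree, chain):
    """Can `chain` be embedded along one root-to-leaf path of `subtree`?"""
    label, children = subtree
    remaining = chain[1:] if chain and chain[0] == label else chain
    if not remaining:
        return True
    return any(matches(child, remaining) for child in children)

tree = ("expression_statement",
        [("assignment",
          [("identifier", []),
           ("binary_operator",
            [("call", [("identifier", [])]), ("integer", [])])])])

print(matches(tree, ["assignment", "call"]))  # True: a call below an assignment
print(matches(tree, ["call", "assignment"]))  # False: wrong order
#+END_SRC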
@@ -3245,11 +3243,11 @@ The annotations are ranked according to this score.

As a dataset to validate our method, we used Python code written by students for exercises from (different) exams.
The dataset contains between 135 and 214 submissions per exercise.
We split the datasets evenly into a training set and a test set.
This simulates the midpoint of an assessment session.
We initially split the datasets evenly into a training set and a test set.
This simulates the midpoint of an assessment session for the exercise.
During testing, we let our model suggest annotations for each of the lines that had an actual annotation associated with it in the test set.
We evaluate at which position the correct annotation is ranked.
We only look at the top five to give us a good idea of how useful the suggested ranking would be in practice: if an annotation is ranked higher than fifth, we would expect the reviewer to have to search for it instead of directly selecting it from the suggested ranking.
We only look at the top five to give us a good idea of how useful the suggested ranking would be in practice: if an annotation is not in the top five, we would expect the reviewer to have to search for it instead of directly selecting it from the suggested ranking.
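
The evaluation itself boils down to bookkeeping over ranks, as in the following self-contained sketch (the prediction data is made up): for every annotated line in the test set we record the rank of the true annotation in the suggested ranking and report how often it ends up first or within the top five.

#+BEGIN_SRC python
def top_k_rates(predictions, ks=(1, 5)):
    """predictions: list of (true annotation, suggested ranking) pairs."""
    ranks = [ranking.index(truth) + 1 if truth in ranking else None
             for truth, ranking in predictions]
    return {k: sum(r is not None and r <= k for r in ranks) / len(ranks) for k in ks}

predictions = [
    ("A", ["A", "B", "C"]),  # true annotation ranked first
    ("B", ["C", "A", "B"]),  # ranked third: counts for top five, not for first
    ("D", ["A", "B", "C"]),  # not suggested at all
]
print(top_k_rates(predictions))  # {1: 0.33..., 5: 0.66...}
#+END_SRC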

We first ran Pylint[fn:: https://www.pylint.org/] (version 3.1.0) on the student submissions.
Pylint is a static code analyser for Python that checks for errors and code smells, and enforces a standard programming style.
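
A sketch of how such machine annotations can be collected is given below: Pylint is run with JSON output on a submission and each reported message becomes an annotation instance on a specific line. The file name is a placeholder and the exact collection pipeline we used may differ.

#+BEGIN_SRC python
import json
import subprocess

# Run Pylint on one submission and keep (line, message symbol) pairs,
# e.g. (3, "consider-using-in"), as machine annotation instances.
result = subprocess.run(
    ["pylint", "--output-format=json", "submission.py"],
    capture_output=True, text=True,
)
messages = json.loads(result.stdout or "[]")
machine_annotations = [(message["line"], message["symbol"]) for message in messages]
print(machine_annotations)
#+END_SRC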
@@ -3263,17 +3261,17 @@ In this case, there is no combined test, since the set of annotations used is al

We distinguish between these two sources of annotations because we expect Pylint to be more consistent, both in when it places an instance of an annotation and in where it places that instance.
Most linting annotations are detected by explicit pattern matching in the AST, so we expect our implicit pattern matching to work fairly well.
However, we want to skip this explicit pattern matching for manual annotations because of the time required to assemble them and the fact that annotations will often be specific to a particular exercise.
However, we want to skip this explicit pattern matching for manual annotations because of the time required to assemble them and the fact that annotations will often be specific to a particular exercise and to a particular reviewer.
Therefore we also test on manual annotations.
Manual annotations are expected to be more inconsistent because reviewers may miss a problem in one student's code that they annotated in another student's code, or they may not place instances of a particular annotation in consistent locations.
The method by which human reviewers place an annotation is also much more implicit than Pylint's pattern matching.

Exercises have between 55 and 469 instances of annotations.
Exercises have between 55 and 469 instances of manual annotations.
The number of distinct annotations varies between 7 and 34 per exercise.
Table\nbsp{}[[tab:feedbackresultsdataset]] gives an overview of some of the features of the dataset.
Timings mentioned in this section were measured on a 2019 MacBook Pro with a 1.4 GHz Intel quad-core processor and 16 GB of RAM.

#+CAPTION: Statistics of manually added annotations for the exercises used in this analysis.
#+CAPTION: Statistics of manually added annotations for the exercises used in the benchmark.
#+CAPTION: Max is the maximum number of instances per annotation.
#+CAPTION: Avg is the average number of instances per annotation.
#+NAME: tab:feedbackresultsdataset
@@ -3300,10 +3298,10 @@ Depending on the exercise, the actual annotation is ranked among the top five an
The annotation is even ranked first for 19% to 59% of all test instances.
Interestingly, the method mostly performs worse when the instances for all exercises are combined.
This highlights the fact that our method is most useful in the context where similar code needs to be reviewed many times.
For the submissions and instances in the training set, training takes 1.5 to 52 seconds for an exercise.
The entire testing phase takes between 4 seconds to 9.5 minutes.
The average time of one prediction ranges between 30ms and 6 seconds.
The minima range between 6ms and 4 seconds, the maxima between 127ms and 55 seconds.
For the submissions and instances in the training set, training took 1.5 to 52 seconds for an exercise.
The entire testing phase took between 4 seconds and 9.5 minutes.
The average time of one prediction ranges between 30 milliseconds and 6 seconds.
The minima range between 6 milliseconds and 4 seconds, the maxima between 127 milliseconds and 55 seconds.
Note that these are very wide ranges.
These big differences can be explained by the number of patterns that are found for the annotations.
If there are annotations with a very large number of patterns, this will be reflected in both the training and testing time.
@@ -3322,7 +3320,7 @@ This can be explained by the fact that we do not feed enough context to =Treemin
There are also annotations that can't be predicted at all, because no patterns are found.

Other annotations, like "consider using in"[fn:: https://pylint.pycqa.org/en/latest/user_guide/messages/refactor/consider-using-in.html], work very well.
For these annotations, =TreeminerD= does have enough context to pick up the underlying patterns.
For these annotations, =TreeminerD= does have enough context to automatically determine the underlying patterns.
The number of instances of an annotation in the training set also has an impact.
Annotations which have only a few instances are generally predicted worse than those with many instances.
@@ -3347,12 +3345,6 @@ The number of instances where the true annotation is ranked first is generally h
However, there is quite some variance between exercises.
This can be explained by the quality of the data.
For example, for the exercise "Symbolic", very few instances were placed for most annotations, which makes it difficult to predict additional instances.
For this experiment, training took between 1.2 and 16.7 seconds depending on the exercise.
The entire testing phase took between 1.5 and 35 seconds depending on the exercise.
These evaluations were run on the same hardware as those for the machine annotations.
For one prediction, average times range between 0.1ms and 1 second.
The minima range between 0.1ms and 240ms and the maxima range between 0.2ms and 3 seconds.
The explanation for these big ranges remains the same as for the Pylint predictions: everything depends on the number of patterns found.

#+CAPTION: Prediction results for six exercises that were designed and used for an exam, using manual annotations.
#+CAPTION: Models were trained on half of the submissions from the dataset and tested on the other half.
@@ -3360,6 +3352,13 @@ The explanation for these big ranges remains the same as for the Pylint predicti
#+NAME: fig:feedbackpredictionrealworldglobal
[[./images/feedbackpredictionrealworldglobal.png]]

For this experiment, training took between 1.2 and 16.7 seconds depending on the exercise.
The entire testing phase took between 1.5 and 35 seconds depending on the exercise.
These evaluations were run on the same hardware as those for the machine annotations.
For one prediction, average times range between 0.1 milliseconds and 1 second.
The minima range between 0.1 and 240 milliseconds and the maxima range between 0.2 milliseconds and 3 seconds.
The explanation for these wide ranges remains the same as for the Pylint predictions: everything depends on the number of patterns found.

These results show that we can predict reuse with an accuracy that is quite high at the midpoint of a reviewing session for an exercise.
The accuracy depends on the number of instances per annotation and the consistency of the reviewer.
Looking at the underlying data, we can also see that short, atomic messages can be predicted very well, as hinted by\nbsp{}[cite/t:@moonsAtomicReusableFeedback2022].
@@ -3372,7 +3371,7 @@ This could be because the collection of subtrees is too diverse, or because we h
We know beforehand that test instances of such annotations cannot be predicted.

Figures\nbsp{}[[fig:feedbackpredictionrealworldsimulation1]],\nbsp{}[[fig:feedbackpredictionrealworldsimulation2]],\nbsp{}[[fig:feedbackpredictionrealworldsimulation3]]\nbsp{}and\nbsp{}[[fig:feedbackpredictionrealworldsimulation4]] show the results of this experiment for four of the exercises we used in the previous experiments.
The exercises that performed worse in the previous experiment were not taken into account for this experiment.
The two exercises that performed worse in the previous experiment were not taken into account for this experiment.
We also excluded submissions that received no annotations during the human review process, which explains the lower number of submissions compared to the numbers in Table\nbsp{}[[tab:feedbackresultsdataset]].
This experiment shows that while the review process requires some build-up before sufficient training instances are available, once a critical mass of training instances is reached, the accuracy with which new instances of annotations are suggested reaches its maximum.
This critical mass is reached after about 20 to 30 reviews, which is quite early in the reviewing process\nbsp{}(Figure\nbsp{}[[fig:feedbackpredictionrealworldevolution]]).
@@ -3411,10 +3410,10 @@ The point at which the critical mass is reached will of course depend on the nat
#+NAME: fig:feedbackpredictionrealworldevolution
[[./images/feedbackpredictionrealworldevolution.png]]

As mentioned before, we are working with a slightly inconsistent data set when using annotations by human reviewers.
They will sometimes miss an instance of an annotation, place it inconsistently, or create duplicate annotations.
If this system is used in practice, the predictions could possibly be even better, since knowing about its existence might further motivate a reviewer to be consistent in their reviews.
The exercises were also reviewed by different people, which could also be an explanation for the differences in accuracy of predictions.
As mentioned before, we are working with a slightly inconsistent dataset when using annotations by human reviewers.
They will sometimes miss an instance of an annotation, place it inconsistently, or unnecessarily create duplicate annotations.
If this system is used in practice, the predictions could possibly be even better, since knowing about its existence might further motivate a reviewer to be more consistent in their reviews.
The exercises were also reviewed by different people, which could also explain the differences in prediction accuracy between the exercises.

To evaluate the performance of our model for these experiments, we measure the training times and the times required for each prediction (this corresponds to a teacher wanting to add an annotation to a line in practice).
Figures\nbsp{}[[fig:feedbackpredictionrealworldtimings1]],\nbsp{}[[fig:feedbackpredictionrealworldtimings2]],\nbsp{}[[fig:feedbackpredictionrealworldtimings3]],\nbsp{}and\nbsp{}[[fig:feedbackpredictionrealworldtimings4]] show the timings for these experiments.
@@ -3424,7 +3423,7 @@ The average prediction times never exceed a few seconds though.

The timings show that even though there are some outliers, most predictions can be performed quickly enough to make this an interactive system.
The outliers also correspond with higher training times, indicating this is mainly caused by a high number of underlying patterns for some annotations.
Currently this process is also parallelized over the files, but in the real world, the process would probably be parallelized over the patterns, which would speed up the prediction even more.
Currently this process is also parallelized over the files, but in practice, the process could be parallelized over the patterns, which would speed up the prediction even more.
Note that the training time can also go down given more training data.
If there are more instances per annotation, the diversity in related subtrees will usually increase, which decreases the number of patterns that can be found and thus also the training time.
@@ -3460,7 +3459,7 @@ If there are more instances per annotation, the diversity in related subtrees wi

We presented a prediction method to assist human reviewers in giving feedback while reviewing student submissions for an exercise by reusing annotations.
Improving annotation reuse can both save time and improve the consistency with which feedback is given.
The latter itself might actually also improve the accuracy of the predictions when the strategy is applied during the review process.
The latter itself might further improve the accuracy of the predictions when the strategy is applied during the review process.

The method has already shown promising results.
We validated the framework both by predicting automated linting annotations to establish a baseline and by predicting annotations from human reviewers.
@@ -3469,21 +3468,21 @@ Thus, we can give a positive answer to our research question that reuse of feedb

We can conclude that the proposed method has achieved the desired objective as expressed in the research question.
Having this method at hand immediately suggests some possible follow-up work.
Currently, the proposed model is reactive: we suggest a ranking of most likely annotations when a reviewer wants to add an annotation to a particular line.
Currently, the proposed model is reactive: we suggest a ranking of most likely annotations when a reviewer wants to add an annotation to a particular line of a submission.
By introducing a confidence score, we could check beforehand if we have a confident match for each line, and then immediately propose those suggestions to the reviewer.
Whether or not a reviewer accepts these suggestions could then also be used as an input to the model.
This could also have an extra advantage, since it could help reviewers be more consistent in where and when they place an annotation.
This could also have an extra advantage, since it could help reviewers be more consistent in where and when they place annotations.

Annotations that don't lend themselves well to prediction also need further investigation.
The context used could be expanded, although the important caveat here is that the method still needs to maintain its speed.
The context used could be expanded, although the important caveat here is that the method still needs to maintain sufficient performance.
We could also consider applying some of the source code pattern mining techniques proposed by\nbsp{}[cite/t:@phamMiningPatternsSource2019] to achieve further speed improvements.
This could also help with the outliers seen in the timing data.
This could help with the outliers seen in the timing data.
Another important aspect that was explicitly left out of the scope of this chapter is the integration of the method into a learning platform and user testing.

Of course, alternative methods could also be considered.
One cannot overlook the rise of Large Language Models (LLMs) and the way they could contribute to solving this problem.
LLMs can also generate feedback for students, based on their code and a well-chosen system prompt.
Fine-tuning of a model with feedback already given could also be considered.
LLMs can generate feedback for students, based on their code and a well-chosen system prompt.
Fine-tuning of a model with feedback already given is another option.

* Looking ahead: opportunities and challenges
:PROPERTIES:
@@ -29,12 +29,12 @@
inkscape:pagecheckerboard="0"
inkscape:deskcolor="#d1d1d1"
inkscape:zoom="3.1395349"
inkscape:cx="135.05185"
inkscape:cx="135.21111"
inkscape:cy="152.72963"
inkscape:window-width="1534"
inkscape:window-height="2132"
inkscape:window-x="10"
inkscape:window-y="10"
inkscape:window-width="2558"
inkscape:window-height="1412"
inkscape:window-x="11"
inkscape:window-y="11"
inkscape:window-maximized="0"
inkscape:current-layer="g11" />
<defs
@@ -205,11 +205,11 @@
</g>
</g>
<path
style="fill:#000000;stroke:#000000;stroke-width:0.997736;stroke-opacity:1"
style="fill:#000000;stroke:#000000;stroke-width:0.998;stroke-opacity:1;stroke-dasharray:1.99600005,1.99600005;stroke-dashoffset:0"
d="M 100.1758,39.144 61.395572,84.62626"
id="path8" />
<path
style="fill:#000000;stroke:#000000;stroke-width:1.03269;stroke-opacity:1"
style="fill:#000000;stroke:#000000;stroke-width:1.03269;stroke-opacity:1;stroke-dasharray:2.0653801,2.0653801;stroke-dashoffset:0"
d="m 162.61678,37.592459 33.49996,47.432816"
id="path9" />
<path