Some more updates to chapter 6

Charlotte Van Petegem 2024-04-26 16:29:34 +02:00
parent 9a4377bcda
commit cdf0b7aefa

@@ -3020,7 +3020,7 @@ We consider predicting relevant annotations to be a ranking problem, which we so
The approach to determine this similarity is based on tree mining.
This is a data mining technique for extracting frequently occurring patterns from data that can be represented as trees\nbsp{}[cite:@zakiEfficientlyMiningFrequent2005; @asaiEfficientSubstructureDiscovery2004].
Program code can be represented as an abstract syntax tree (AST), where the nodes of the tree represent the language constructs used in the program.
Recent work has demonstrated the efficacy of this approach in efficiently identifying frequent patterns in source code\nbsp{}[cite:@phamMiningPatternsSource2019].
In an educational context, these techniques have already been used to find patterns common to solutions that failed a given exercise\nbsp{}[cite:@mensGoodBadUgly2021].
Other work has demonstrated the potential of automatically generating unit tests from mined patterns\nbsp{}[cite:@lienard2023extracting].
We use tree mining to find commonalities between the lines of code where the same annotation has been added.
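To make the AST representation concrete, the snippet below uses Python's built-in =ast= module to parse a small function and enumerate the language constructs it contains; this is only an illustration of the representation, not part of ECHO itself.
#+BEGIN_SRC python
import ast

# A small program fragment; each language construct becomes a node in the tree.
source = """
def shift(character, offset):
    return chr(ord(character) + offset)
"""

# Walk the abstract syntax tree and print the construct each node represents.
for node in ast.walk(ast.parse(source)):
    print(type(node).__name__)
#+END_SRC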
@@ -3069,7 +3069,7 @@ The first step of ECHO is to extract a subtree for each instance of an annotation
Currently, the context around a line is extracted by taking all the AST nodes from that line.
For example, Figure\nbsp{}[[fig:feedbacksubtree]] shows the subtree extracted for the code on line 3 of Listing\nbsp{}[[lst:feedbacksubtreesample]].
Note that the context we extract here is very limited.
Previous iterations of ECHO considered all nodes that contained the relevant line (e.g. the function node for a line in a function), but these contexts proved too large to process in an acceptable time.
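As a rough sketch of this per-line context extraction (the helper below is hypothetical and not ECHO's actual implementation), one could collect every AST node whose source location starts on the annotated line:
#+BEGIN_SRC python
import ast

def nodes_on_line(source, line):
    # Hypothetical helper: collect every AST node whose source
    # location starts on the given (1-based) line number.
    tree = ast.parse(source)
    return [node for node in ast.walk(tree)
            if getattr(node, "lineno", None) == line]

code = (
    "def shift(character, offset):\n"
    "    value = ord(character)\n"
    "    return chr(value + offset)\n"
)
for node in nodes_on_line(code, 3):
    print(type(node).__name__)  # e.g. Return, Call, Name, BinOp, ...
#+END_SRC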
#+ATTR_LATEX: :float t
#+CAPTION: Example code that simply adds a number to the ASCII value of a character and converts it back to a character.
@@ -3218,7 +3218,7 @@ We only look at the top five to get a good idea of how useful the suggested rank
We first ran Pylint[fn:: https://www.pylint.org/] (version 3.1.0) on the students' submissions.
Pylint is a static code analyser for Python that checks for errors and code smells, and enforces a standard programming style.
We used Pylint's machine annotations as our training and test data.
We test per exercise because this is our main use case for ECHO, but we also run a test that combines all submissions from all exercises.
An overview of some annotation statistics for the data generated by Pylint can be found in Table\nbsp{}[[tab:feedbackresultsdatasetpylint]].
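Such a dataset can be generated along the following lines; this is a minimal sketch using Pylint's JSON output format, and the submission path is an assumption for illustration, not part of the actual benchmark setup.
#+BEGIN_SRC python
import json
import subprocess

# Run Pylint on a single submission and collect its machine annotations
# as (line, message id) pairs. "submission.py" is a placeholder path.
result = subprocess.run(
    ["pylint", "--output-format=json", "submission.py"],
    capture_output=True, text=True,
)
annotations = [
    (message["line"], message["message-id"])
    for message in json.loads(result.stdout)
]
print(annotations)
#+END_SRC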
#+CAPTION: Statistics of Pylint annotations for the programming exercises used in the benchmark.
@@ -3322,7 +3322,7 @@ The number of instances where the true annotation is ranked first is generally h
#+NAME: fig:feedbackpredictionrealworldglobal
[[./images/feedbackpredictionrealworldglobal.png]]
In this experiment, training took between 67 milliseconds and 22.4 seconds per exercise.
The entire test phase took between 49 milliseconds and 27 seconds, depending on the exercise.
These evaluations were run on the same hardware as those for the machine annotations.
For one prediction, the average time ranged from 0.1 milliseconds to 150 milliseconds and the maxima from 0.5 milliseconds to 2.8 seconds.
@@ -3335,7 +3335,7 @@ We will now look at the longitudinal prediction accuracy of ECHO, to test how ac
For the next experiment, we introduce two specific categories of negative prediction results, namely "No training instances" and "No patterns".
"No training instances" means that the annotation corresponding to the true instance had no instances in the training set, and therefore could never have been predicted.
"No patterns" means that =TreeminerD= was unable to find any frequent patterns for the set of subtrees extracted from the annotation instances (and there were also no nodes unique to this set of subtrees in the entire set of subtrees).
"No patterns" means that =TreeminerD= was unable to find any frequent patterns for the set of subtrees extracted from the annotation instances and there were also no nodes unique to this set of subtrees in the entire set of subtrees.
This could be because the collection of subtrees is too diverse to have common patterns.
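In code, classifying a negative prediction result could look like the hypothetical sketch below; the data structures are assumptions for illustration, not ECHO's internals.
#+BEGIN_SRC python
def classify_negative(annotation, training_instances, patterns):
    # Hypothetical helper mirroring the two categories defined above.
    # training_instances and patterns map each annotation to the
    # instances seen during training and the patterns mined for them.
    if not training_instances.get(annotation):
        # The annotation never occurred in the training set,
        # so it could never have been predicted.
        return "No training instances"
    if not patterns.get(annotation):
        # TreeminerD found no frequent patterns, and no nodes were
        # unique to this annotation's subtrees.
        return "No patterns"
    return "Other"
#+END_SRC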
Figures\nbsp{}[[fig:feedbackpredictionrealworldsimulation1]],\nbsp{}[[fig:feedbackpredictionrealworldsimulation2]],\nbsp{}[[fig:feedbackpredictionrealworldsimulation3]]\nbsp{}and\nbsp{}[[fig:feedbackpredictionrealworldsimulation4]] show the results of this experiment for four of the programming exercises used in the previous experiments.
@@ -3384,10 +3384,11 @@ They will sometimes miss an instance of an annotation, place it inconsistently,
If ECHO is used in practice, the predictions may be even better, as the knowledge of its existence may further motivate reviewers to be more consistent in their reviews.
The programming exercises were also reviewed by different people, which may explain the differences in prediction accuracy between the exercises.
To evaluate the performance of ECHO for these experiments, we measure the training times and the times required for each prediction.
This corresponds to a reviewer wanting to add an annotation to a line in practice.
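The measurements follow the usual wall-clock approach; a minimal sketch is shown below, where =train= and =predict= are hypothetical stand-ins rather than ECHO's actual interface.
#+BEGIN_SRC python
import time

# Hypothetical stand-ins for ECHO's training and prediction steps.
def train(instances):
    return {"instances": instances}

def predict(model, line):
    return []

training_set, test_set = ["a", "b"], ["c", "d"]

start = time.perf_counter()
model = train(training_set)
training_time = time.perf_counter() - start

# Time each prediction separately: one prediction corresponds to a
# reviewer asking for suggestions on a single line.
prediction_times = []
for line in test_set:
    start = time.perf_counter()
    predict(model, line)
    prediction_times.append(time.perf_counter() - start)

print(training_time, prediction_times)
#+END_SRC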
Figures\nbsp{}[[fig:feedbackpredictionrealworldtimings1]],\nbsp{}[[fig:feedbackpredictionrealworldtimings2]],\nbsp{}[[fig:feedbackpredictionrealworldtimings3]],\nbsp{}and\nbsp{}[[fig:feedbackpredictionrealworldtimings4]] show the performance of running these experiments.
As in the previous experiments, we can see that there is a considerable difference between the exercises.
However, the training time only exceeds one second in a few cases and remains well below that in most cases.
The prediction times are mostly below 50 milliseconds, except for a few outliers.
The average prediction time never exceeds 500 milliseconds.