Import current version of feedback prediction article
book.org
@ -72,7 +72,6 @@
- Finish chapter [[#chap:feedback]].
- Add some screenshots to the start.
- Make sure there isn't too much overlap with\nbsp{}[[#subsec:whateval]].
- Incorporate full text from https://docs.google.com/document/d/1K0PCDdmNRigJWH1Bd6u9Nh8P39YJqdCjHnKJ7hpe4pA/edit in [[#sec:feedbackprediction]] after this has received feedback.
- Write [[#chap:summ]].
- Redo screenshots/visualizations.
  I might even wait to do this until closer to the deadline, to incorporate possible UI changes that might be made in the near future.
@ -438,7 +437,7 @@ A qualitative user experience study of Dodona was performed in 2018.
- What are the things that bother you while working with Dodona?
- What are your suggestions for improvements to Dodona?
Students praised its user-friendliness, beautiful interface, immediate feedback with visualized differences between expected and generated output, integration of the Python Tutor, linting feedback and large number of test cases.
Negative points were related to differences between the students' local execution environments and the environment in which Dodona runs the tests, and the strictness with which the tests are evaluated.
Other negative feedback was mostly related to individual courses the students were taking rather than the platform itself.

** Case study
@ -1543,7 +1542,7 @@ While this procedure does not rely on external background information, it has th
Students can't work in their preferred programming environment and have to agree to extensive behaviour tracking.

Approaches that do not use machine learning also exist.
[cite/t:@feldmanAnsweringAmRight2019] try to answer the question "Am I on the right track?" at the level of individual exercises, by checking if the student's current progress can be used as a base to synthesise a correct program.
However, there is no clear way to transform this type of approach into an estimation of success on examinations.
[cite/t:@werthPredictingStudentPerformance1986] found significant (\(p < 0.05\)) correlations between students' college grades, the number of hours worked, the number of high school mathematics classes and the students' grades for an introductory programming course.
[cite/t:@gooldFactorsAffectingPerformance2000] also looked at learning style (surveyed using LSI2) as a factor, in addition to demographics, academic ability, problem-solving ability and indicators of personal motivation.
@ -1816,7 +1815,7 @@ Compared to the accuracy results of\nbsp{}[cite/t:@kovacicPredictingStudentSucce
Our balanced accuracy results are similar to the accuracy results of\nbsp{}[cite/t:@livierisPredictingSecondarySchool2019], who used semi-supervised machine learning.
[cite/t:@asifAnalyzingUndergraduateStudents2017] achieve an accuracy of about 80% when using one cohort for training and another cohort for testing, which is again similar to our balanced accuracy results.
All of these studies used prior academic history as the basis for their methods, which we do not use in our framework.
We also see results similar to those of\nbsp{}[cite/t:@vihavainenPredictingStudentsPerformance2013], while we don't have to rely on data collection that interferes with the learning process.
Note that we are comparing the basic accuracy results of prior studies with the more reliable balanced accuracy results of our framework.

F_1-scores follow the same trend as balanced accuracy, but the increase is even more pronounced because it starts lower and ends higher.
@ -2062,7 +2061,7 @@ We applied this interpretability to some important feature types that popped up

Our study has several strengths and promising implications for future practice and research.
First, we were able to predict success based on historical metadata from earlier cohorts, and we are already able to do so early in the semester.
In addition, the accuracy of our predictions is similar to that of earlier efforts\nbsp{}[cite:@asifAnalyzingUndergraduateStudents2017; @vihavainenPredictingStudentsPerformance2013; @kovacicPredictingStudentSuccess2012], while we are not using prior academic history or interfering with the students' usual learning workflows.
However, there are also some limitations and work left for the future.
While our visualizations of the features (Figures\nbsp{}[[fig:passfailfeaturesAevaluation]]\nbsp{}through\nbsp{}[[fig:passfailfeaturesBwrong]]) are helpful to indicate which features are important at which stage of the course in view of increasing versus decreasing the odds of passing the course, they should not be oversimplified and need to be carefully interpreted and placed into context.
This is where the expertise and experience of teachers comes in.
@ -2213,7 +2212,8 @@ This is exactly what we will explore in this section, which is based on an artic

Feedback is a very important element in student learning\nbsp{}[cite:@hattiePowerFeedback2007; @blackAssessmentClassroomLearning1998].
In programming education, many steps have been taken to automate feedback using automated assessment systems\nbsp{}[cite:@paivaAutomatedAssessmentComputer2022; @ihantolaReviewRecentSystems2010; @ala-mutkaSurveyAutomatedAssessment2005].
These automated assessment systems provide feedback on correctness, and can provide some feedback on style and best practices by using linters.
They are, however, generally not able to provide the same high-level feedback on program design that a seasoned programmer can give.
In many educational practices, automated assessment is therefore supplemented with manual feedback, especially when grading evaluations or exams.
This requires a large time investment from teachers.

@ -2222,22 +2222,24 @@ Others have therefore tried to improve the process of giving feedback using AI.
Others have used AI to enable students to conduct peer and self-evaluation\nbsp{}[cite:@leeSupportingStudentsGeneration2023].
[cite/t:@berniusMachineLearningBased2022] introduced a framework based on clustering text segments to reduce the grading overhead.

In this section we present an approach based on pattern mining.
Data mining techniques for extracting frequently occurring patterns from data that can be represented as trees were already developed in the early 2000s\nbsp{}[cite:@zakiEfficientlyMiningFrequent2005; @asaiEfficientSubstructureDiscovery2004].
Since program code can be represented as an abstract syntax tree, more recent work looked into how these algorithms could be used to efficiently find frequent patterns in source code\nbsp{}[cite:@phamMiningPatternsSource2019].
In an educational context, these techniques could then be used to, for example, find patterns common to solutions that failed a given exercise\nbsp{}[cite:@mensGoodBadUgly2021].
Other work looked into generating unit tests from mined patterns\nbsp{}[cite:@lienard2023extracting].

The context of our work is our own assessment system, called Dodona, developed at Ghent University\nbsp{}[cite:@vanpetegemDodonaLearnCode2023].
It has a built-in module for giving manual feedback on and (manually) assigning scores to student submissions.
The process of giving feedback on a programming assignment in Dodona is very similar to a code review, where mistakes or suggestions for improvements are annotated at the relevant line(s).

There is, however, a very important difference with a traditional code review: the teacher gives feedback on many implementations of the same problem.
Since students often make the same mistakes as other students, it follows that the same feedback is often given by a teacher on many solutions.
In Dodona, we have already tried to anticipate this need by allowing teachers to save certain annotations, which can then later be re-used by simply searching for them.
This gives us the data we're using in this study: code submissions, where those submissions have been annotated on specific lines with messages that are shared over those submissions.
Note the terminology here: we consider an annotation to be a specific instance of a message placed on a line of code, and thus a message to be the text that can be reused over multiple annotations.

In this section we present a machine learning method for suggesting re-use of previously given feedback.
The section starts with an in-depth explanation of the design of the method, and then presents and discusses the experimental results we obtained when testing our method on student submissions.

*** Methodology
:PROPERTIES:
@ -2245,26 +2247,26 @@ We start with an in-depth explanation of the design of the method, and then pres
:CUSTOM_ID: subsec:feedbackpredictionmethodology
:END:

The general methodology used by our method is explained visually in Figure\nbsp{}[[fig:feedbackmethodoverview]].
We start by using tree-sitter to generate ASTs for every submission.
For each annotation, we then extract a limited context from the AST around the line where it was placed.
We then collect all the subtrees for each message.
Every message's forest of subtrees is given to the =TreeminerD= algorithm\nbsp{}[cite:@zakiEfficientlyMiningFrequent2005], which gives us a collection of patterns for each message.
Each pattern is then weighted according to its length and how often it occurs in the entire collection of patterns (for all messages).
The result of these operations is our trained model.
A prediction can be made when a teacher selects a line in a given student's submission.
This is done by again extracting the limited context around that line.
We then compute a similarity score for each message, using its weighted patterns.
This similarity score is used to rank the messages, and this ranking is shown to the teacher.
We will now give a more detailed explanation of these steps.
Note that in every step, we also have to consider its impact on speed.
Since the model will be used while grading (and the training data for the model is continuously generated during grading), we can't afford to train the model for multiple minutes.
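To make the context-extraction step concrete, here is a minimal sketch that collects the node types around an annotated line. It is only an analogue of our actual implementation: we use tree-sitter ASTs and keep the subtree structure, whereas this sketch uses Python's built-in =ast= module, and the function name and =margin= parameter are invented for illustration.

#+BEGIN_SRC python
import ast

def context_node_types(source, line, margin=1):
    # Hypothetical sketch: collect the AST node types that appear on the
    # annotated line or within `margin` lines of it. The real method uses
    # tree-sitter and keeps the subtree structure, not just node types.
    tree = ast.parse(source)
    types = []
    for node in ast.walk(tree):
        lineno = getattr(node, "lineno", None)
        if lineno is not None and abs(lineno - line) <= margin:
            types.append(type(node).__name__)
    return types

code = "number = input()\nfor digit in number:\n    print(digit)\n"
print(context_node_types(code, line=1, margin=0))
# ['Assign', 'Name', 'Call', 'Name']
#+END_SRC

Keeping the context this small is a deliberate trade-off between precision and the speed constraints discussed above.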

#+CAPTION: Overview of our machine learning method for predicting feedback re-use.
#+CAPTION: Code is converted to its Abstract Syntax Tree form.
#+CAPTION: Per message, the context of each annotation is extracted and mined for patterns using the =TreeminerD= algorithm.
#+CAPTION: These patterns are then weighted, after which they make up our model.
#+CAPTION: When a teacher wants to place an annotation on a line, messages are ranked based on the similarity determined for that line.
#+NAME: fig:feedbackmethodoverview
[[./diagrams/feedbackmethodoverview.svg]]

@ -2279,9 +2281,9 @@ For example the subtree extracted for the code on line 3 in Listing\nbsp{}[[lst:
Note that the context we extract here is very limited.
Previous iterations considered all the nodes that contained the relevant line (e.g. the function node for a line in a function), but these contexts turned out to be too large to process in an acceptable time frame.

#+ATTR_LATEX: :float t
#+CAPTION: Sample code that simply reads a number from standard input and prints its digits.
#+NAME: lst:feedbacksubtreesample
#+BEGIN_SRC python
number = input()
print(f'{number} has the following digits:')
@ -2299,42 +2301,214 @@ for digit in number:
:CUSTOM_ID: subsubsec:feedbackpredictiontreeminer
:END:

We use the =TreeminerD= algorithm to find patterns in the AST subtrees for each message.
=TreeminerD= is a more efficient version of the base =Treeminer= algorithm.
It achieves this efficiency by not counting the number of occurrences of a frequent pattern.
Since we are not interested in this information for our method, it was an obvious choice to use the =TreeminerD= version.

=TreeminerD= also has an important /minimum support/ parameter.
This is the percentage of trees in the forest that a pattern needs to be present in before it is considered frequent.
We set the /minimum support/ parameter to 0.8 in our implementation.
This value was experimentally determined.
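The /minimum support/ criterion itself is easy to state in code. The sketch below is not =TreeminerD= (which enumerates candidate patterns far more cleverly); it only shows the final filtering rule, with invented names:

#+BEGIN_SRC python
def frequent_patterns(occurrences, n_trees, min_support=0.8):
    # `occurrences` maps a candidate pattern to the set of tree ids
    # (submission subtrees) it was found in; a pattern is frequent when
    # it appears in at least `min_support` of the forest.
    return {pattern for pattern, trees in occurrences.items()
            if len(trees) / n_trees >= min_support}

occurrences = {"print-call": {0, 1, 2, 3}, "while-loop": {0, 1}}
print(frequent_patterns(occurrences, n_trees=4))  # {'print-call'}
#+END_SRC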

As an example, one message in our real-world dataset was placed 92 times on 47 submissions by students.
For this message =TreeminerD= finds 105\thinsp{}718 patterns.

**** Assigning weights to patterns
:PROPERTIES:
:CREATED: [2023-11-22 Wed 14:39]
:CUSTOM_ID: subsubsec:feedbackpredictionweights
:END:

Due to the iterative nature of =TreeminerD=, many patterns are (embedded) subtrees of other patterns.
We don't do any post-processing to remove these patterns, since they might be relevant for code we have not seen yet, but we do assign weights to them.
Weights are assigned using two criteria.

The first criterion is the size of the pattern, since a pattern with twenty nodes is a lot more specific than a pattern with only one node.
The second criterion is the number of times a pattern occurs across all messages.
If all messages contain a specific pattern, it cannot be reliably used to determine which message should be predicted and will therefore be assigned a smaller weight.
The weights are calculated with the following formula: \[weight(pattern) = \frac{len(pattern)}{\#occurrences(pattern)}\]
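With patterns represented as tuples of node labels, the weighting step could be sketched as follows. The names are our own; the real implementation works on =TreeminerD='s pattern encoding:

#+BEGIN_SRC python
from collections import Counter

def pattern_weights(patterns_per_message):
    # Count how often each pattern occurs in the entire collection of
    # patterns, i.e. across all messages.
    occurrences = Counter()
    for patterns in patterns_per_message.values():
        occurrences.update(patterns)
    # weight(pattern) = len(pattern) / #occurrences(pattern)
    return {message: {p: len(p) / occurrences[p] for p in patterns}
            for message, patterns in patterns_per_message.items()}

weights = pattern_weights({
    "avoid eval": [("call", "eval"), ("assign",)],
    "use f-strings": [("binop", "add", "str"), ("assign",)],
})
# The shared ("assign",) pattern is short and occurs under both messages,
# so it gets a low weight; the longer, message-specific patterns weigh more.
#+END_SRC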

**** Matching patterns to subtrees
:PROPERTIES:
:CREATED: [2024-02-01 Thu 14:25]
:END:

To know whether a given pattern matches a given subtree, we iterate over all nodes in the subtree.
Simultaneously, we also iterate over the nodes in the pattern.
During iteration, we also store the current depth, both in the pattern and the subtree.
We also keep a stack to store (some of) the subtree depths.
If the current label in the subtree and the pattern are the same, we store the current subtree depth on the stack and move to the next node in the pattern.
Moving upwards in the tree is more complicated.
If the current depth and the depth of the last match (stored on the stack) are the same, we can move forwards in the pattern (and the subtree).
If this is not the case, we need to check that we are still in the embedded subtree; otherwise we need to reset our position in the pattern to the start.
Because subtrees can contain the same label multiple times, we also need to make sure we can backtrack.
The full pseudocode for this algorithm can be seen in Listing\nbsp{}[[lst:feedbackmatchingpseudocode]].

#+ATTR_LATEX: :float t
#+CAPTION: Pseudocode for checking whether a pattern matches a subtree.
#+CAPTION: Note that both pattern and subtree are stored in the string encoding described by\nbsp{}[cite/t:@zakiEfficientlyMiningFrequent2005].
#+NAME: lst:feedbackmatchingpseudocode
#+BEGIN_SRC python
def subtree_matches(subtree, pattern):
    p_i = 0            # position in the pattern
    pattern_depth = 0  # depth of the last match in the pattern
    depth = 0          # current depth in the subtree
    depth_stack = []   # subtree depths at which the pattern matched
    for item in subtree:
        if item == -1:  # move up in the subtree
            if depth_stack and depth - 1 == depth_stack[-1]:
                last_depth = depth_stack.pop()
                if pattern[p_i] != -1 and \
                   (last_depth < pattern_depth or not depth_stack):
                    p_i = 0  # left the embedded subtree: reset
                if pattern[p_i] == -1:
                    pattern_depth -= 1
                    p_i += 1
            depth -= 1
        else:
            if pattern[p_i] == item:
                depth_stack.append(depth)
                pattern_depth += 1
                p_i += 1
            depth += 1
        if p_i == len(pattern):
            return True
    return False
#+END_SRC

Checking whether a pattern matches a subtree is an operation that needs to happen many times.
For some messages there are many patterns, and all patterns of all messages are checked.
One important optimization we added was therefore to only execute the algorithm in Listing\nbsp{}[[lst:feedbackmatchingpseudocode]] if the set of labels in the pattern is a subset of the labels in the subtree.
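In the string encoding of\nbsp{}[cite/t:@zakiEfficientlyMiningFrequent2005], where =-1= marks moving back up the tree, this prefilter amounts to a set comparison. A sketch with invented names:

#+BEGIN_SRC python
def labels(encoded):
    # every item except the -1 "move up" marker is a node label
    return {item for item in encoded if item != -1}

def candidate_patterns(patterns, subtree):
    # only run the expensive matching algorithm for patterns whose
    # labels all occur somewhere in the subtree
    subtree_labels = labels(subtree)
    return [p for p in patterns if labels(p) <= subtree_labels]

patterns = [[1, 2, -1], [1, 3, -1]]
subtree = [1, 2, -1, 2, -1]
print(candidate_patterns(patterns, subtree))  # [[1, 2, -1]]
#+END_SRC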

**** Ranking the messages
:PROPERTIES:
:CREATED: [2023-11-22 Wed 14:47]
:CUSTOM_ID: subsubsec:feedbackpredictionsimilarity
:END:

Given a model with weighted patterns for each message, and a method for matching patterns to subtrees, we can now put these two together to make a final ranking of the messages for a line of code.
We calculate a matching score for each message with the following formula: \[ score(message) = \frac{\displaystyle\sum_{pattern \in patterns} \begin{cases} weight(pattern) & \quad \text{if } pattern \text{ matches} \\ 0 & \quad \text{otherwise} \end{cases}}{len(patterns)} \]
The messages are sorted using this score.
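Putting the weights and the matcher together, the ranking step could be sketched as follows. The =matches= predicate stands in for the matching algorithm of Listing\nbsp{}[[lst:feedbackmatchingpseudocode]]; all other names are our own:

#+BEGIN_SRC python
def score(patterns, subtree, matches):
    # `patterns` maps pattern -> weight for one message; the score is the
    # summed weight of matching patterns divided by the number of patterns
    matched = sum(w for p, w in patterns.items() if matches(subtree, p))
    return matched / len(patterns)

def rank_messages(model, subtree, matches):
    # `model` maps message -> {pattern: weight}; highest score first
    return sorted(model, key=lambda m: score(model[m], subtree, matches),
                  reverse=True)

model = {"msg A": {("a",): 2.0, ("b",): 1.0}, "msg B": {("c",): 1.0}}
contains = lambda subtree, pattern: pattern[0] in subtree  # toy matcher
print(rank_messages(model, ["a", "b"], contains))  # ['msg A', 'msg B']
#+END_SRC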

*** Results and discussion
:PROPERTIES:
:CREATED: [2024-01-08 Mon 13:18]
:CUSTOM_ID: subsec:feedbackpredictionresults
:END:

**** Datasets
We used two datasets to evaluate our method.
Both are based on real (Python) code written by students for exams.
To test our method, we split the datasets in half and used the first half to train and the second half to test.

For the first dataset, we ran PyLint on the student submissions and used PyLint's annotations as our training data and test data.
For the second dataset, we used actual annotations left by graders on student code in Dodona.
We differentiate between these two datasets because we expect PyLint to be more consistent in when it places an annotation and also where it places that annotation.
Most linting messages are detected through explicit pattern matching in the AST, so we expect our implicit pattern matching to perform rather well.
Real-world data is more difficult, since graders are humans, and might miss a problem in one student's code that they annotated in another student's code, or they might not place the annotation for a certain message in a consistent location.
The pattern matching performed by graders is also a lot more implicit than PyLint's pattern matching.

**** PyLint
:PROPERTIES:
:CUSTOM_ID: subsubsec:feedbackpredictionresultspylint
:CREATED: [2023-11-20 Mon 13:33]
:END:

We will first discuss the PyLint results.
As can be seen in Figure\nbsp{}[[fig:feedbackpredictionpylintglobal]], for about 70% of the annotations, the actual message is ranked among the top five messages.
For about 30% of the annotations, the message is even ranked first.

#+CAPTION: Global overview of the performance of our method on PyLint data.
#+CAPTION: For 70% of the annotations, the expected message is ranked in the top five.
#+CAPTION: In 30% of the annotations, the expected message is ranked first.
#+NAME: fig:feedbackpredictionpylintglobal
[[./images/feedbackpredictionpylintglobal.png]]

In Figure\nbsp{}[[fig:feedbackpredictionpylintmessages]], we have highlighted some messages, some of which perform very well, and some of which perform worse.
The differences in performance can be explained through the content of the message and the underlying patterns sought by PyLint.
For example, the message "too many branches" performs rather poorly.
This can be explained by the fact that we prune away so much context that the pattern PyLint looks for can no longer be picked up by =TreeminerD=.
There are also annotations that cannot be predicted at all, because no patterns are found.

#+CAPTION: Detailed view of predictions for a few PyLint messages.
#+CAPTION: Each bar is a message; the number of occurrences in the training set and in the test set (respectively) is denoted in brackets after the name.
#+NAME: fig:feedbackpredictionpylintmessages
[[./images/feedbackpredictionpylintmessages.png]]

**** Real-world data
:PROPERTIES:
:CREATED: [2023-11-20 Mon 13:33]
:CUSTOM_ID: subsubsec:feedbackpredictionresultsrealworld
:END:

For the real-world data, we applied two different techniques to test our method.
Aside from using the same 50/50 split as with the PyLint data, we also tried to simulate how a grader would use the method, gradually increasing the size of the training set and decreasing the size of the test set.
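The grader simulation can be sketched as a simple loop: every submission is first predicted with the model trained on all previously graded submissions, and then added to the training data. The =train= and =predict= callables below are toy placeholders, not the actual method:

#+BEGIN_SRC python
def simulate_grading(annotated_submissions, train, predict):
    # annotated_submissions: list of (submission, true_message) pairs,
    # in the order the grader reviews them
    graded, first_rank_hits = [], 0
    for submission, true_message in annotated_submissions:
        if graded:  # nothing to train on before the first annotation
            model = train(graded)
            ranking = predict(model, submission)
            if ranking and ranking[0] == true_message:
                first_rank_hits += 1
        graded.append((submission, true_message))
    return first_rank_hits

# toy stand-ins: rank messages by how often they were used so far
from collections import Counter
train = lambda graded: Counter(m for _, m in graded)
predict = lambda model, _submission: [m for m, _ in model.most_common()]
print(simulate_grading([("s1", "A"), ("s2", "A"), ("s3", "B")],
                       train, predict))  # 1
#+END_SRC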

The results of the first test can be seen in Figure\nbsp{}[[fig:feedbackpredictionrealworldglobal]].
The results are similar to those for the PyLint data.
The percentages of annotations where the correct message is ranked first are generally higher (between 35 and 55 percent), and the percentages of annotations where it is ranked in the first five are between 65 and 80 percent.

#+CAPTION: Prediction results for four exercises that were part of an exam.
#+CAPTION: Our model was trained on half of the submissions in the dataset and tested with the other half.
#+NAME: fig:feedbackpredictionrealworldglobal
[[./images/feedbackpredictionrealworldglobal.png]]

For the next test, an extra category of annotations was added, namely "not yet seen".
This means that the message for that annotation was not part of the training set, and thus could never have been predicted.
Results of this test (for the same exercises as in Figure\nbsp{}[[fig:feedbackpredictionrealworldglobal]]) can be seen in Figures\nbsp{}[[fig:feedbackpredictionrealworldsimulation1]],\nbsp{}[[fig:feedbackpredictionrealworldsimulation2]],\nbsp{}[[fig:feedbackpredictionrealworldsimulation3]]\nbsp{}and\nbsp{}[[fig:feedbackpredictionrealworldsimulation4]].
These figures show that while some build-up is required, once a critical mass of annotations is reached, the accuracy of the system jumps to its final level.

#+CAPTION: Prediction results for the exercise "Afscheidswoordje" over time.
#+CAPTION: The number of annotations for which the message has never been seen is marked separately.
#+CAPTION: The chart on the right shows the messages that are in the training set.
#+NAME: fig:feedbackpredictionrealworldsimulation1
[[./images/feedbackpredictionrealworldsimulation1.png]]

#+CAPTION: Prediction results for the exercise "Narcissuscodering" over time.
#+CAPTION: The number of annotations for which the message has never been seen is marked separately.
#+CAPTION: The chart on the right shows the messages that are in the training set.
#+NAME: fig:feedbackpredictionrealworldsimulation2
[[./images/feedbackpredictionrealworldsimulation2.png]]

#+CAPTION: Prediction results for the exercise "Cocktailbar" over time.
#+CAPTION: The number of annotations for which the message has never been seen is marked separately.
#+CAPTION: The chart on the right shows the messages that are in the training set.
#+NAME: fig:feedbackpredictionrealworldsimulation3
[[./images/feedbackpredictionrealworldsimulation3.png]]

#+CAPTION: Prediction results for the exercise "Antropomorfe emoji" over time.
#+CAPTION: The number of annotations for which the message has never been seen is marked separately.
#+CAPTION: The chart on the right shows the messages that are in the training set.
#+NAME: fig:feedbackpredictionrealworldsimulation4
[[./images/feedbackpredictionrealworldsimulation4.png]]

*** Conclusions and future work
:PROPERTIES:
:CREATED: [2023-11-20 Mon 13:33]
:CUSTOM_ID: subsec:feedbackpredictionconclusion
:END:

In this section we presented a prediction method to help with giving feedback during grading by re-using messages.
Improving the re-use of messages can both save time and improve the consistency with which feedback is given.

The framework already shows promising results.
We validated the framework both by predicting automated linting messages, to establish a baseline, and by using real-world data.
The method performs about the same for real-world data as it does for PyLint's linting messages.

Of course, alternative methods could also be considered.
One cannot overlook the rise of Large Language Models (LLMs) and the way they could contribute to this problem.
LLMs can also generate feedback for students, based on their code and a well-chosen system prompt.
Fine-tuning a model with feedback that was already given could also be considered.

We can conclude that the proposed method achieved the aimed-for objective.
Having this method at hand immediately raises some possible follow-up work.
Messages that don't lend themselves well to being predicted need further investigation.
The context used could also be extended (although the important caveat here is that the method needs to maintain its speed).
Right now the model is also reactive: we propose a group of most likely messages when a grader wants to add an annotation on a line.
By introducing a confidence score, we could check beforehand whether we have a confident match for each line and then immediately propose this to the grader.
We could also look into applying some of the techniques for source code pattern mining proposed by\nbsp{}[cite/t:@phamMiningPatternsSource2019] to make further speed improvements.
Another important aspect that was explicitly left out of scope in this section was building the method into a learning platform and doing user testing.

* Discussion and future work
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:51]
BIN images/feedbackpredictionpylintglobal.png (new file, 134 KiB)
BIN images/feedbackpredictionpylintmessages.png (new file, 83 KiB)
BIN images/feedbackpredictionrealworldglobal.png (new file, 470 KiB)
BIN images/feedbackpredictionrealworldsimulation1.png (new file, 95 KiB)
BIN images/feedbackpredictionrealworldsimulation2.png (new file, 82 KiB)
BIN images/feedbackpredictionrealworldsimulation3.png (new file, 137 KiB)
BIN images/feedbackpredictionrealworldsimulation4.png (new file, 118 KiB)