#+TITLE: Dodona
#+SUBTITLE: Improving programming education through automated assessment, learning analytics, and educational data mining
#+AUTHOR: Charlotte Van Petegem
#+LANGUAGE: en-gb
#+DATE: 2024-06-19
#+LATEX_CLASS: book
#+LATEX_CLASS_OPTIONS: [paper=240mm:170mm,numbers=noendperiod,BCOR=10mm,DIV=10]
#+LATEX_COMPILER: lualatex
#+LATEX_HEADER: \usepackage[inline]{enumitem}
#+LATEX_HEADER: \usepackage{luacode}
#+LATEX_HEADER: \begin{luacode*}
#+LATEX_HEADER: function parseargv()
#+LATEX_HEADER: local rep = {}
#+LATEX_HEADER: for k, x in pairs(arg) do
#+LATEX_HEADER: local kw, vw = string.match(x, "([^=]+)=?([^=]*)")
#+LATEX_HEADER: rep[kw] = vw
#+LATEX_HEADER: end
#+LATEX_HEADER: return rep
#+LATEX_HEADER: end
#+LATEX_HEADER: local arguments = parseargv()
#+LATEX_HEADER: local outputdir = arguments["-output-directory"]
#+LATEX_HEADER: if outputdir ~= nil then
#+LATEX_HEADER: tex.print([[\PassOptionsToPackage{outputdir={]]..outputdir..[[}}{minted}]])
#+LATEX_HEADER: tex.print([[\PassOptionsToPackage{inkscapepath={]]..outputdir..[[}}{svg}]])
#+LATEX_HEADER: end
#+LATEX_HEADER: \end{luacode*}
#+LATEX_HEADER: \usepackage[newfloat]{minted}
#+LATEX_HEADER: \usepackage{color}
#+LATEX_HEADER: \usepackage{parskip}
#+LATEX_HEADER: \usepackage{url}
#+LATEX_HEADER: \usepackage{svg}
#+LATEX_HEADER: \usepackage[type=report]{ugent2016-title}
#+LATEX_HEADER: \usepackage[final]{microtype}
#+LATEX_HEADER: \usepackage[defaultlines=2,all]{nowidow}
#+LATEX_HEADER: \usepackage[dutch,AUTO]{polyglossia}
#+LATEX_HEADER: \usepackage{ragged2e}
#+LATEX_HEADER: \newenvironment{RIGHT}{\begin{FlushRight}}{\end{FlushRight}}
#+HTML_HEAD: <style>.RIGHT {text-align: right;}</style>
#+LATEX_HEADER: \academicyear{2023–2024}
#+LATEX_HEADER: \titletext{A dissertation submitted to Ghent University in partial fulfilment of\\ the requirements for the degree of Doctor of Computer Science.}
#+LATEX_HEADER: \promotors{%
#+LATEX_HEADER: Supervisors:\\
#+LATEX_HEADER: Prof.\ Dr.\ Peter Dawyndt\\
#+LATEX_HEADER: Prof.\ Dr.\ Ir.\ Bart Mesuere\\
#+LATEX_HEADER: Prof.\ Dr.\ Bram De Wever
#+LATEX_HEADER: }
#+LATEX_HEADER: \addtokomafont{caption}{\small}
#+LATEX_HEADER: \counterwithout{footnote}{chapter}
#+LATEX_HEADER: \setuptoc{toc}{numbered}
#+LATEX_HEADER: \addto\captionsenglish{\renewcommand{\contentsname}{Table of Contents}}
#+OPTIONS: ':t
#+OPTIONS: H:4
#+OPTIONS: toc:nil
#+OPTIONS: broken-links:mark
#+MACRO: num_commits 16 thousand
#+MACRO: num_prs 3\thinsp{}800
#+MACRO: num_contributors 26
#+MACRO: num_exercises 16 thousand
#+MACRO: num_releases 340
#+MACRO: num_schools 1\thinsp{}700
#+MACRO: num_submissions 17 million
#+MACRO: num_users 66\thinsp{}500
#+cite_export: csl citation-style.csl
#+bibliography: bibliography.bib

#+LATEX: \frontmatter
#+TOC: headlines 2

* Dankwoord :noexport:
|
||
:PROPERTIES:
|
||
:CREATED: [2023-10-23 Mon 09:25]
|
||
:CUSTOM_ID: chap:ack
|
||
:UNNUMBERED: t
|
||
:END:
|
||
|
||
#+LATEX: \begin{dutch}
|
||
|
||
Familie:
|
||
Mama.
|
||
Papa.
|
||
Hannelore, Tomas, Seppe.
|
||
Robbe, Esther.
|
||
Kero Kero.
|
||
|
||
Promotoren en jury.
|
||
|
||
Werk:
|
||
Rien.
|
||
Simon.
|
||
Niko.
|
||
Alexis.
|
||
Asmus.
|
||
Carol.
|
||
Dieter.
|
||
Steven.
|
||
Louise.
|
||
Robbert.
|
||
Tom.
|
||
Jonathan.
|
||
Heidi.
|
||
Felix.
|
||
Toon.
|
||
Pieter.
|
||
Tibo.
|
||
Mustapha.
|
||
Nico & Joyce.
|
||
Benjamin.
|
||
Oliver.
|
||
Roy.
|
||
|
||
Mede-lesgevers:
|
||
Annick.
|
||
Henri.
|
||
Adnan.
|
||
Niko.
|
||
Felix.
|
||
Louise.
|
||
Toon.
|
||
Lotte.
|
||
Yentl.
|
||
Felipe.
|
||
Silvija.
|
||
Antoine.
|
||
Oliver.
|
||
Dieter.
|
||
Ellen.
|
||
Tibo.
|
||
|
||
FR-collegas:
|
||
Toon.
|
||
Pieter.
|
||
Jozefien.
|
||
Francis.
|
||
Boris.
|
||
|
||
Zeus:
|
||
Jasper.
|
||
Klimcrew (Tom "gewoon doorstappen" Naessens, Felix, Ruben, Arthur, Titouan).
|
||
|
||
Rode Kruis:
|
||
Luc.
|
||
Wim.
|
||
Sarah.
|
||
Henk.
|
||
Pascal.
|
||
Jonas.
|
||
Rien.
|
||
Jietse.
|
||
|
||
Anje De Baets van de resto.
|
||
|
||
D&D:
|
||
Bart.
|
||
Kenneth.
|
||
Maxiem.
|
||
Arne.
|
||
|
||
Muziek:
|
||
Jan Swerts.
|
||
Eliza McLamb.
|
||
Katy Kirby.
|
||
Marika Hackman.
|
||
Boygenius: Phoebe Bridgers, Lucy Dacus, Julien Baker.
|
||
Pinegrove.
|
||
Charlotte Cardin.
|
||
Tate McRae.
|
||
Spinvis.
|
||
SOPHIE.
|
||
ANOHNI.
|
||
|
||
#+BEGIN_RIGHT
|
||
{{{author}}}
|
||
|
||
{{{date}}}
|
||
#+END_RIGHT
|
||
|
||
#+LATEX: \end{dutch}
|
||
|
||
* Summary in English
|
||
:PROPERTIES:
|
||
:CREATED: [2023-10-23 Mon 17:54]
|
||
:CUSTOM_ID: chap:summen
|
||
:UNNUMBERED: t
|
||
:END:
|
||
|
||
Ever since programming has been taught, its teachers have sought to automate and optimize their teaching.
|
||
Due to the ever-increasing digitalization of society, programming is also being taught to ever more and ever larger groups, and these groups often include students for whom programming is not necessarily their main subject.
|
||
This has led to the development of myriad automated assessment tools\nbsp{}[cite:@paivaAutomatedAssessmentComputer2022; @ihantolaReviewRecentSystems2010; @douceAutomaticTestbasedAssessment2005; @ala-mutkaSurveyAutomatedAssessment2005].
|
||
One of these platforms is Dodona[fn:: https://dodona.be], the platform around which this dissertation is centred.
|
||
|
||
Chapters\nbsp{}[[#chap:what]],\nbsp{}[[#chap:use]],\nbsp{}and\nbsp{}[[#chap:technical]] focus on Dodona itself.
|
||
In Chapter\nbsp{}[[#chap:what]] we give an overview of the user-facing features of Dodona, from user management to how feedback is represented.
|
||
Chapter\nbsp{}[[#chap:use]] then focuses on how Dodona is used in practice, by presenting some facts and figures of its use, students' opinions of the platform, and an extensive case study on how Dodona's features are used to optimize teaching.
|
||
This case study also provides insight into the educational context for the research described in Chapters\nbsp{}[[#chap:passfail]]\nbsp{}and\nbsp{}[[#chap:feedback]].
|
||
Chapter\nbsp{}[[#chap:technical]] focuses on the technical aspects of developing Dodona and its related ecosystem of software tools.
|
||
This includes a discussion of the technical challenges related to developing a platform like Dodona, and how the Dodona team adheres to modern standards of software development.
|
||
|
||
Chapters\nbsp{}[[#chap:passfail]]\nbsp{}and\nbsp{}[[#chap:feedback]] are a bit different.
|
||
These chapters each detail a learning analytics/educational data mining study we performed, using the data that Dodona collects about the learning process.
|
||
Learning analytics and educational data mining stand at the intersection of computer science, data analytics, and the social sciences, and focus on understanding and improving learning.
|
||
They are made possible by the increased availability of data about students who are learning, due to the increasing move of education to digital platforms\nbsp{}[cite:@romeroDataMiningCourse2008].
|
||
They can also serve different actors in the educational landscape: they can help learners directly, help teachers to evaluate their own teaching, allow developers of education platforms to know what to focus on, allow educational institutions to guide their decisions, and even allow governments to take on data-driven policies\nbsp{}[cite:@fergusonLearningAnalyticsDrivers2012].
|
||
|
||
Chapter\nbsp{}[[#chap:passfail]] discusses a study where we tried to predict whether students would pass or fail a course at the end of the semester based solely on their submission history in Dodona.
|
||
It also briefly details a study we collaborated on with researchers from Jyväskylä University in Finland, where we replicated our study in their educational context, with data from their educational platform.
|
||
|
||
In Chapter\nbsp{}[[#chap:feedback]], we first give an overview of how Dodona changed manual assessment in our own educational context.
|
||
We then finish the chapter with some recent work on a machine learning method we developed to predict what feedback teachers will give when manually assessing student submissions.
|
||
|
||
Finally, Chapter\nbsp{}[[#chap:discussion]] concludes the dissertation with some discussion on Dodona's opportunities and challenges for the future.
|
||
|
||
* Nederlandstalige samenvatting
|
||
:PROPERTIES:
|
||
:CREATED: [2023-10-23 Mon 17:54]
|
||
:CUSTOM_ID: chap:summnl
|
||
:UNNUMBERED: t
|
||
:END:
|
||
|
||
#+LATEX: \begin{dutch}
|
||
|
||
Al van bij de start van het programmeeronderwijs, proberen lesgevers hun taken te automatiseren en optimaliseren.
|
||
De digitalisering van de samenleving gaat ook steeds verder, waardoor steeds meer en grotere groepen studenten leren programmeren.
|
||
Deze groepen bevatten ook vaker studenten voor wie programmeren niet het hoofdonderwerp van hun studies is.
|
||
Dit heeft geleid tot de ontwikkeling van zeer veel platformen voor de geautomatiseerde beoordeling van programmeeropdrachten\nbsp{}[cite:@paivaAutomatedAssessmentComputer2022; @ihantolaReviewRecentSystems2010; @douceAutomaticTestbasedAssessment2005; @ala-mutkaSurveyAutomatedAssessment2005].
|
||
Eén van deze platformen is Dodona[fn:: https://dodona.be], het platform waar dit proefschrift over handelt.
|
||
|
||
Hoofdstukken\nbsp{}[[#chap:what]],\nbsp{}[[#chap:use]]\nbsp{}en\nbsp{}[[#chap:technical]] focussen op Dodona zelf.
|
||
In Hoofdstuk\nbsp{}[[#chap:what]] geven we een overzicht van de gebruikersgerichte features van Dodona, van gebruikersbeheer tot hoe feedback getoond wordt.
|
||
Hoofdstuk\nbsp{}[[#chap:use]] focust zich dan op hoe Dodona in de praktijk gebruikt wordt, door enkele statistieken over het gebruik en de meningen van studenten over het platform te presenteren, en met een uitgebreide case study waarin getoond wordt hoe de verschillende features van Dodona kunnen bijdragen tot het optimaliseren van onderwijs.
|
||
Deze case study presenteert ook de context waarin Hoofdstukken\nbsp{}[[#chap:passfail]]\nbsp{}en\nbsp{}[[#chap:feedback]] zich situeren.
|
||
Hoofdstuk\nbsp{}[[#chap:technical]] focust op het technische aspect van het ontwikkelen van Dodona en het gerelateerde ecosysteem van software.
|
||
Dit bevat onder meer een bespreking van de technische uitdagingen gerelateerd aan het ontwikkelen van een platform zoals Dodona en hoe het Dodona-team zich aan de moderne standaarden van softwareontwikkeling houdt.
|
||
|
||
Hoofdstukken\nbsp{}[[#chap:passfail]]\nbsp{}en\nbsp{}[[#chap:feedback]] verschillen van de vorige hoofdstukken, in die zin dat ze elk een studie in /learning analytics/ en /educational data mining/ bespreken.
|
||
Deze studies werden uitgevoerd met de data die Dodona verzamelt over het leerproces.
|
||
/Learning analytics/ en /educational data mining/ bevinden zich op het kruispunt tussen informatica, datawetenschap en de sociale wetenschappen, en focussen zich op het begrijpen en verbeteren van leren.
|
||
Ze worden mogelijk gemaakt door de toegenomen beschikbaarheid van data over lerende studenten, wat op zijn beurt komt door de toegenomen beweging van onderwijs naar digitale platformen\nbsp{}[cite:@romeroDataMiningCourse2008].
|
||
Ze kunnen ook dienen voor verschillende actoren in het onderwijsveld: ze kunnen studenten direct helpen, lesgevers helpen om hun eigen onderwijs te evalueren, ontwikkelaars van onderwijsplatformen laten weten waar ze zich op moeten focussen, de beslissingen van onderwijsinstellingen helpen gidsen, en zelfs overheden toelaten om op data gebaseerd beleid te voeren\nbsp{}[cite:@fergusonLearningAnalyticsDrivers2012].
|
||
|
||
Hoofdstuk\nbsp{}[[#chap:passfail]] bespreekt een studie waarin we geprobeerd hebben te voorspellen of studenten al dan niet zouden slagen voor een vak op het einde van het semester, enkel en alleen gebaseerd op hun indiengedrag op Dodona.
|
||
Daarnaast wordt er kort een samenwerking besproken met onderzoekers van de universiteit van Jyväskylä in Finland, waar we onze studie herhaald hebben in hun educationele context, gebruikmakend van data afkomstig van hun platform.
|
||
|
||
In Hoofdstuk\nbsp{}[[#chap:feedback]] geven we eerst een overzicht van hoe Dodona het manueel verbeteren in onze eigen educationele context veranderd heeft.
|
||
We sluiten dan het hoofdstuk af met een recent door ons ontwikkelde /machine-learning/-methode om te voorspellen welke feedback lesgevers zullen geven tijdens het manueel verbeteren van indieningen van studenten.
|
||
|
||
We sluiten af in Hoofdstuk\nbsp{}[[#chap:discussion]] met een bespreking van de mogelijkheden en uitdagingen waar Dodona in de toekomst voor staat.
|
||
|
||
#+LATEX: \end{dutch}
|
||
|
||
* List of Publications
|
||
:PROPERTIES:
|
||
:CREATED: [2024-03-05 Tue 17:40]
|
||
:END:
|
||
|
||
Strijbol, N., Sels, B., *Van Petegem, C.*, Maertens, R., Scholliers, C., Mesuere B., Dawyndt, P., submitted.
|
||
TESTed-DSL: a domain-specific language to create programming exercises with language-agnostic automated assessment.
|
||
/Software Testing, Verification and Reliability/.
|
||
|
||
Maertens, R., Van Neyghem, M., Geldhof, M., *Van Petegem, C.*, Strijbol, N., Dawyndt, P., Mesuere, B., submitted.
|
||
Discovering and exploring cases of educational source code plagiarism with Dolos.
|
||
/SoftwareX/
|
||
|
||
Zhidkikh, D., Heilala, V., *Van Petegem, C.*, Dawyndt, P., Järvinen, M., Viitanen, S., De Wever, B., Mesuere, B., Lappalainen, V., Kettunen, L., & Hämäläinen, R., 2024.
|
||
Reproducing Predictive Learning Analytics in CS1: Toward Generalizable and Explainable Models for Enhancing Student Retention.
|
||
/Journal of Learning Analytics/, 1-21.
|
||
https://doi.org/10.18608/jla.2024.7979
|
||
|
||
*Van Petegem, C.*, Maertens, R., Strijbol, N., Van Renterghem, J., Van der Jeugt, F., De Wever, B., Dawyndt, P., Mesuere, B., 2023b.
|
||
Dodona: Learn to code with a virtual co-teacher that supports active learning.
|
||
/SoftwareX/ 24, 101578.
|
||
https://doi.org/10.1016/j.softx.2023.101578
|
||
|
||
*Van Petegem, C.*, Dawyndt, P., Mesuere, B., 2023a.
|
||
Dodona: Learn to Code with a Virtual Co-teacher that Supports Active Learning.
|
||
/Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 2/ ITiCSE 2023, 633.
|
||
https://doi.org/10.1145/3587103.3594165
|
||
|
||
Strijbol, N., *Van Petegem, C.*, Maertens, R., Sels, B., Scholliers, C., Dawyndt, P., Mesuere, B., 2023b.
|
||
TESTed -- An educational testing framework with language-agnostic test suites for programming exercises.
|
||
/SoftwareX/ 22, 101404.
|
||
https://doi.org/10.1016/j.softx.2023.101404
|
||
|
||
*Van Petegem, C.*, Deconinck, L., Mourisse, D., Maertens, R., Strijbol, N., Dhoedt, B., De Wever, B., Dawyndt, P., Mesuere, B., 2022.
|
||
Pass/Fail Prediction in Programming Courses.
|
||
/Journal of Educational Computing Research/ 68–95.
|
||
https://doi.org/10.1177/07356331221085595
|
||
|
||
Maertens, R., *Van Petegem, C.*, Strijbol, N., Baeyens, T., Jacobs, A.C., Dawyndt, P., Mesuere, B., 2022.
|
||
Dolos: Language-agnostic plagiarism detection in source code.
|
||
/Journal of Computer Assisted Learning/.
|
||
https://doi.org/10.1111/jcal.12662
|
||
|
||
Nüst, D., Eddelbuettel, D., Bennett, D., Cannoodt, R., Clark, D., Daróczi, G., Edmondson, M., Fay, C., Hughes, E., Kjeldgaard, L., Lopp, S., Marwick, B., Nolis, H., Nolis, J., Ooi, H., Ram, K., Ross, N., Shepherd, L., Sólymos, P., Swetnam, T.L., Turaga, N., *Van Petegem, C.*, Williams, J., Willis, C. Xiao, N., 2020.
|
||
The Rockerverse: Packages and Applications for Containerisation with R.
|
||
/The R Journal/, 12(1), 437–461.
|
||
https://doi.org/10.32614/RJ-2020-007
|
||
|
||
#+LATEX: \mainmatter
|
||
|
||
* Introduction
|
||
:PROPERTIES:
|
||
:CREATED: [2023-10-23 Mon 08:47]
|
||
:CUSTOM_ID: chap:intro
|
||
:END:
|
||
|
||
Ever since programming has been taught, its teachers have sought to automate and optimize their teaching.
|
||
Due to the ever-increasing digitalization of society, programming is also being taught to ever more and ever larger groups, and these groups often include students for whom programming is not necessarily their main subject.
|
||
This has led to the development of myriad automated assessment tools\nbsp{}[cite:@paivaAutomatedAssessmentComputer2022; @ihantolaReviewRecentSystems2010; @douceAutomaticTestbasedAssessment2005; @ala-mutkaSurveyAutomatedAssessment2005], of which we give a historical overview in this introduction.
|
||
We also discuss learning analytics and educational data mining, and how these techniques can help us to cope with the growing class sizes.
|
||
We then give an overview of programming education in Flanders, including recent societal changes around this topic.
|
||
Finally, we give a brief overview of the remaining chapters of this dissertation.
|
||
|
||
** Automated assessment in programming education
|
||
:PROPERTIES:
|
||
:CREATED: [2024-02-01 Thu 10:46]
|
||
:CUSTOM_ID: sec:introhistory
|
||
:END:
|
||
|
||
Increasing interactivity in learning has long been considered important, and also something that can be achieved through the addition of (web-based) IT components to a course\nbsp{}[cite:@vanpetegemPowerfulLearningInteractive2004].
|
||
This isn't any different when learning to program: learning how to solve problems with computer programs requires practice, and programming assignments are the main way in which such practice is generated\nbsp{}[cite:@gibbsConditionsWhichAssessment2005].
|
||
[cite/t:@cheangAutomatedGradingProgramming2003] identified the labor-intensive nature of assessing programming assignments as the main reason why students are given few such assignments when in an ideal world they should be given many more.
|
||
Automated assessment allows students to receive immediate and personalized feedback on each submitted solution without the need for human intervention.
|
||
Because of its potential to provide feedback loops that are scalable and responsive enough for an active learning environment, automated source code assessment has become a driving force in programming courses.
|
||
|
||
*** Humble beginnings
|
||
:PROPERTIES:
|
||
:CREATED: [2024-02-06 Tue 15:30]
|
||
:END:
|
||
|
||
Automated assessment was introduced into programming education in the late 1950s\nbsp{}[cite:@hollingsworthAutomaticGradersProgramming1960].
|
||
In this first system, programs were submitted in assembly on punch cards.
|
||
For the reader who is not familiar with punch cards, an example of one can be seen in Figure\nbsp{}[[fig:introductionpunchard]].
|
||
The assessment was then performed by combining the student's punch cards with the autograder's punch cards.
|
||
In the early days of computing, the time of tutors was not the only valuable resource that needed to be shared between students; the actual compute time was also a shared and limited resource.
|
||
Their system made more efficient use of both.
|
||
[cite/t:@hollingsworthAutomaticGradersProgramming1960] already notes that the class sizes were a main motivator to introduce their auto-grader.
|
||
At the time of publication, they had tested about 3\thinsp{}000 student submissions which, given a grading run took about 30 to 60 seconds, represents about a day and a half of computation time.
|
||
|
||
They also immediately identified some limitations, which are common problems that modern assessment systems still need to consider.
|
||
These limitations include handling faults in the student code, making sure students can't modify the grader, and having to define an interface through which the student code is run.
|
||
|
||
#+CAPTION: Example of a punch card.
|
||
#+CAPTION: Picture by Arnold Reinhold, released under the CC BY-SA 4.0 licence via WikiMedia Commons.
|
||
#+NAME: fig:introductionpunchard
|
||
[[./images/introductionpunchcard.jpg]]
|
||
|
||
In the next ten years, significant advances were already made.
|
||
Students could submit code written in a text-based programming language instead of assembly, and the actual testing was done by running their code using modified compilers and operating systems.
|
||
|
||
[cite/t:@naurAutomaticGradingStudents1964] was the first to explicitly note the difference between the formal correctness, and the efficiency and completeness of the programs being tested.
|
||
The distinction between formal correctness and completeness that he makes can be somewhat confusing from a modern standpoint: we would only consider a program or algorithm formally correct if it is complete (i.e. gives the correct response in all cases).
|
||
In more modern terminology, Naur's "formally correct" would be called "free of syntax errors".
|
||
|
||
[cite/t:@forsytheAutomaticGradingPrograms1965] note another issue when using automatic graders: students could use the feedback they get to hard-code the expected response in their programs.
|
||
This is again an issue that modern assessment systems (or the teachers creating exercises) still need to consider.
|
||
Forsythe & Wirth solve this issue by randomizing the inputs to the student's program.
|
||
While not explicitly explained by them, we can assume that to check the correctness of a student's answer, they calculate the expected answer themselves as well.
|
||
Note that in this system, they were still writing a grading program for each individual exercise.
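
To make the idea concrete, the sketch below shows what such a randomized grader could look like in present-day Python, for a hypothetical exercise that asks students to sum two numbers; the exercise and all names are our own illustration and not part of the original system.

#+BEGIN_SRC python
import random
import subprocess

def grade_submission(executable_path, runs=10):
    """Run a student's program on random inputs and compare its output
    with the answer computed by the grader itself."""
    for _ in range(runs):
        a, b = random.randint(0, 100), random.randint(0, 100)
        expected = str(a + b)  # the grader computes the expected answer itself
        result = subprocess.run(
            [executable_path], input=f"{a} {b}\n",
            capture_output=True, text=True, timeout=10,
        )
        if result.stdout.strip() != expected:
            return False  # hard-coded outputs are very unlikely to survive random inputs
    return True
#+END_SRC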
|
||
|
||
[cite/t:@hextAutomaticGradingScheme1969] introduce a new innovation: their system could be used for exercises in multiple different programming languages.
|
||
They are also the first to implement a history of students' attempts in the assessment tool itself, and mention explicitly that enough data should be recorded in this history so that it can be used to calculate a mark for a student.
|
||
|
||
Other grader programs were in use at the time, but these did not necessarily bring any new innovations or ideas to the table\nbsp{}[cite:@braden1965introductory; @berryGraderPrograms1966; @temperlyGradingProcedurePL1968].
|
||
|
||
The systems described above share an important limitation, which is inherent to the time at which they were built.
|
||
Computers were big and heavy, and had operators who did not necessarily know whose program they were running or what those programs were.
|
||
The Mother of All Demos by\nbsp{}[cite/t:@engelbart1968research], widely considered the birth of the /idea/ of the personal computer, only happened after these systems were already running.
|
||
So, it should not come as a surprise that the feedback these systems gave was slow to return to the students.
|
||
|
||
*** Tool- and script-based assessment
|
||
:PROPERTIES:
|
||
:CREATED: [2024-02-06 Tue 17:29]
|
||
:END:
|
||
|
||
We now take a leap forward in time.
|
||
The way people use computers has changed significantly, and the way assessment systems are implemented changed accordingly.
|
||
Note that while the previous section was complete (as far as we could find in published literature), this section is decidedly not so.
|
||
At this point, the explosion of automated assessment systems/automated grading systems for programming education had already set in.
|
||
To describe all platforms would take a full dissertation in and of itself.
|
||
So from now on, we will pick and choose systems that brought new and interesting ideas that stood the test of time.[fn::
|
||
The ideas, not the platforms.
|
||
As far as we know none of the platforms described in this section are still in use.
|
||
]
|
||
|
||
ACSES, by\nbsp{}[cite/t:@nievergeltACSESAutomatedComputer1976], was envisioned as a full course for learning computer programming.
|
||
They even designed it as a full replacement for a course: it was the first system that integrated both instructional texts and exercises.
|
||
Students following this course would not need personal instruction.
|
||
In the modern day, this would probably be considered a MOOC.[fn::
|
||
Except that it obviously wasn't an online course; TCP/IP wouldn't be standardized until 1982.
|
||
]
|
||
|
||
Another good example of this generation of grading systems is the system by\nbsp{}[cite/t:@isaacson1989automating].
|
||
They describe a UNIX shell script that automatically e-mailed students if their code did not compile or if it produced incorrect output.
|
||
It also had a configurable output file size limit and time limit.
|
||
Student programs would be stopped if they exceeded these limits.
|
||
Like all assessment systems up to this point, their system focuses only on whether the output of the student's program is correct, and not on code style.
|
||
|
||
[cite/t:@reekTRYSystemHow1989] takes a different approach.
|
||
He identifies several issues with gathering students' source files, and then compiling and executing them in the teacher's environment.
|
||
Students could write destructive code that destroys the teacher's files, or even write a clever program that alters their grades (and covers its tracks while doing so).
|
||
Note that this is not a new issue: as we discussed before, this was already mentioned as a possibility by\nbsp{}[cite/t:@hollingsworthAutomaticGradersProgramming1960].
|
||
This was, however, the first system that tried to solve this problem.
|
||
An explicit goal of his TRY system was therefore to avoid teachers having to run their students' programs themselves.
|
||
Another goal was avoiding giving the inputs that the program was tested on to students.
|
||
These goals were mostly achieved using the UNIX =setuid= mechanism.
|
||
Note that students were using a true multi-user system, as was common at the time.
|
||
Every attempt was also recorded in a log file in the teacher's personal directory.
|
||
Generality of programming language was achieved through intermediate build and test scripts that had to be provided by the teacher.
|
||
|
||
This is also the first study we could find that pays explicit attention to how expected and generated output is compared.
|
||
In addition to a basic character-by-character comparison, the system also supports defining the interface of a function that students have to call with their output.
|
||
The instructor can then link an implementation of this function in the build script.
|
||
|
||
Even later, automated assessment systems were built with graphical user interfaces.
|
||
A good example of this is ASSYST\nbsp{}[cite:@jacksonGradingStudentPrograms1997].
|
||
ASSYST also added evaluation on other metrics, such as runtime or cyclomatic complexity as suggested by\nbsp{}[cite/t:@hungAutomaticProgrammingAssessment1993].
|
||
|
||
*** Moving to the web
|
||
:PROPERTIES:
|
||
:CREATED: [2024-02-06 Tue 17:29]
|
||
:END:
|
||
|
||
After Tim Berners-Lee invented the web in 1989\nbsp{}[cite:@berners-leeWorldWideWeb1992], automated assessment systems also started moving to the web.
|
||
Especially with the rise of Web 2.0\nbsp{}[cite:@oreillyWhatWebDesign2007] and its increased interactivity, this became more and more common.
|
||
Systems like the one by\nbsp{}[cite/t:@reekTRYSystemHow1989] also became impossible to use because of the rise of the personal computer: the typical multi-user system was used less and less, and the primary way people interacted with a computer was no longer the command line, but graphical user interfaces.
|
||
|
||
[cite/t:@higginsCourseMarkerCBASystem2003] developed CourseMarker, which is a more general assessment system (not exclusively developed for programming assessment).
|
||
This was initially not yet a web-based platform, but it did communicate over the network using Java's Remote Method Invocation mechanism.
|
||
The system it was designed to replace, Ceilidh, did have a basic web submission interface\nbsp{}[cite:@hughesCeilidhCollaborativeWriting1998].
|
||
Designing a web client was also mentioned as future work in the paper announcing CourseMarker.
|
||
|
||
Perhaps the most famous example of the first web-based platforms is Web-CAT\nbsp{}[cite:@shah2003web].
|
||
In addition to being one of the first web-based automated assessment platforms, it also asked the students to write their own tests.
|
||
The coverage that these tests achieved was part of the testing done by the platform.
|
||
Tests are written using standard unit testing frameworks\nbsp{}[cite:@edwardsExperiencesUsingTestdriven2007].
|
||
An example of Web-CAT's submission screen can be seen in Figure\nbsp{}[[fig:introductionwebcatsubmission]].
|
||
|
||
#+CAPTION: Web-CAT's submission screen for students.
|
||
#+CAPTION: Image taken from\nbsp{}[cite/t:@edwardsWebCATWebbasedCenter2006].
|
||
#+NAME: fig:introductionwebcatsubmission
|
||
[[./images/introductionwebcatsubmission.png]]
|
||
|
||
This is also the time when we first start to see mentions of plagiarism and plagiarism detection in the context of automated assessment, presumably because the internet made plagiarizing a lot easier.
|
||
In one case at MIT over 30% of students were found to be plagiarizing\nbsp{}[cite:@wagner2000plagiarism].
|
||
[cite/t:@dalyPatternsPlagiarism2005] analysed plagiarizing behaviour by watermarking student submissions, where the watermark consisted of added whitespace at the end of lines.
|
||
If students carelessly copied another student's submission, they would also copy the whitespace.
|
||
[cite/t:@schleimerWinnowingLocalAlgorithms2003] also published MOSS around this time.
|
||
|
||
Another important platform is SPOJ\nbsp{}[cite:@kosowskiApplicationOnlineJudge2008].
|
||
SPOJ is especially important in the context of this dissertation, since it was the platform we used before Dodona.
|
||
SPOJ specifically notes the influence of online contest platforms (and in fact, is a platform that can be used to organize contests).
|
||
Online contest platforms usually differ from the automated assessment platforms for education in the way they handle feedback.
|
||
For online contests, the amount of feedback given to participants is often far less than the feedback given to students in an educational context, although, depending on the educational vision of the teacher, feedback is sometimes kept equally limited in education as well.
|
||
|
||
The SPOJ paper also details the security measures they took when executing untrusted code.
|
||
They use a patched Linux kernel's =rlimits=, the =chroot= mechanism, and traditional user isolation to prevent student code from performing malicious actions.
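
As an illustration of the resource-limiting layer, the sketch below shows how a grader written in present-day Python could impose comparable CPU-time and memory limits on an untrusted child process; the =chroot= jail and user isolation that SPOJ additionally relies on are not shown, and the concrete limits chosen here are arbitrary.

#+BEGIN_SRC python
import resource
import subprocess

def limit_resources():
    # Called in the child process just before exec: cap CPU time and memory.
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))               # 5 s of CPU time
    resource.setrlimit(resource.RLIMIT_AS, (256 * 1024**2,) * 2)  # 256 MiB of memory

result = subprocess.run(
    ["./student_program"],       # hypothetical path to the untrusted executable
    preexec_fn=limit_resources,  # POSIX only
    capture_output=True,
    text=True,
)
print(result.returncode, result.stdout)
#+END_SRC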
|
||
|
||
Another interesting idea was contributed by\nbsp{}[cite/t:@brusilovskyIndividualizedExercisesSelfassessment2005] in QuizPACK.
|
||
They combined the idea of parametric exercises with automated assessment by executing source code.
|
||
In QuizPACK, teachers provide a parameterized piece of code, where the value of a specific variable is the answer that a student needs to give.
|
||
The piece of code is then evaluated, and the result is compared to the student's answer.
|
||
Note that in this platform, it is not the students themselves who are writing code.
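
The sketch below gives a rough impression of how such a parametric exercise could work, using Python both as the implementation language and as the language of the fragment shown to students; the template and parameter names are our own illustration, not QuizPACK's.

#+BEGIN_SRC python
import random

# A parameterized code fragment: the student must predict the final value of `total`.
TEMPLATE = """
total = 0
for i in range({n}):
    total += i * {step}
"""

def new_question():
    params = {"n": random.randint(3, 8), "step": random.randint(2, 5)}
    return TEMPLATE.format(**params)

def check_answer(question_code, student_answer):
    namespace = {}
    exec(question_code, namespace)                # evaluate the parameterized fragment
    return namespace["total"] == student_answer  # compare with the student's prediction
#+END_SRC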
|
||
|
||
*** Adding features
|
||
:PROPERTIES:
|
||
:CREATED: [2024-02-06 Tue 15:31]
|
||
:END:
|
||
|
||
At this point in history, the idea of a web-based automated assessment system for programming education is no longer new.
|
||
But still, more and more new platforms are being written.
|
||
For a possible explanation, see Figure\nbsp{}[[fig:introductionxkcdstandards]].
|
||
|
||
#+CAPTION: Comic on the proliferation of standards, which is also applicable to the proliferation of automated assessment platforms.
|
||
#+CAPTION: Created by Randall Munroe, released under the CC\nbsp{}BY-NC\nbsp{}2.5 licence via https://xkcd.com/927/.
|
||
#+NAME: fig:introductionxkcdstandards
|
||
[[./images/introductionxkcdstandards.png]]
|
||
|
||
|
||
All of these platforms support automated assessment of code submitted by students, but try to differentiate themselves through the features they offer.
|
||
The FPGE platform by\nbsp{}[cite/t:@paivaManagingGamifiedProgramming2022] offers gamification, iWeb-TD\nbsp{}[cite:@fonsecaWebbasedPlatformMethodology2023] integrates a full-fledged editor, PLearn\nbsp{}[cite:@vasyliukDesignImplementationUkrainianLanguage2023] recommends extra exercises to its users, JavAssess\nbsp{}[cite:@insaAutomaticAssessmentJava2018] tries to automate grading, and GradeIT\nbsp{}[cite:@pariharAutomaticGradingFeedback2017] features automatic hint generation.
|
||
|
||
** Learning analytics and educational data mining
|
||
:PROPERTIES:
|
||
:CREATED: [2024-02-01 Thu 10:47]
|
||
:CUSTOM_ID: sec:introlaedm
|
||
:END:
|
||
|
||
Learning analytics and educational data mining stand at the intersection of computer science, data analytics, and the social sciences, and focus on understanding and improving learning.
|
||
They are made possible by the increased availability of data about students who are learning, due to the increasing move of education to digital platforms\nbsp{}[cite:@romeroDataMiningCourse2008].
|
||
They can also serve different actors in the educational landscape: they can help learners directly, help teachers to evaluate their own teaching, allow developers of educational platforms to know what to focus on, allow educational institutions to guide their decisions, and even allow governments to take on data-driven policies\nbsp{}[cite:@fergusonLearningAnalyticsDrivers2012].
|
||
Learning analytics and educational data mining are overlapping fields, but in general, learning analytics is seen as focusing on the educational challenge, while educational data mining is more focused on the technical challenge\nbsp{}[cite:@fergusonLearningAnalyticsDrivers2012].
|
||
Analytics focusing on governments or educational institutions is referred to as academic analytics.
|
||
|
||
[cite/t:@chattiReferenceModelLearning2012] defined a reference model for learning analytics and educational data mining based on four dimensions:
|
||
#+ATTR_LATEX: :environment enumerate*
|
||
#+ATTR_LATEX: :options [label={\emph{\roman*)}}, itemjoin={{ }}, itemjoin*={{ }}]
|
||
- What data is gathered and used?
|
||
- Who is targeted by the analysis?
|
||
- Why is the data analysed?
|
||
- How is the data analysed?
|
||
This gives researchers an idea of what to focus on when conceptualizing, executing, and publishing their research.
|
||
|
||
An example of educational data mining research is\nbsp{}[cite/t:@daudPredictingStudentPerformance2017], where the students' background (including family income, family expenditures, gender, marital status,\nbsp{}...) is used to predict the student's learning outcome at the end of the semester.
|
||
Evaluating this study using the reference model by\nbsp{}[cite/t:@chattiReferenceModelLearning2012], we can see that the data used is very personal and hard to collect.
|
||
As mentioned in the study, its primary target audience is policymakers.
|
||
The data is analysed to evaluate the influence of a student's background on their performance, and this is done by using a number of machine learning techniques (which are compared to one another).
|
||
|
||
Another example of the research in this field is a study by\nbsp{}[cite/t:@akcapinarUsingLearningAnalytics2019].
|
||
They focus on the concept of an early warning system, where student performance can be predicted early so that appropriate action can be taken.
|
||
Their study uses data from a blended learning environment, where students can see the lesson's resources, participate in discussions, and write down their own thoughts about the lessons.
|
||
Here, the primary target audience is the student.
|
||
Although the related actions are not yet within the scope of the study, the primary goal is to develop such an early warning system.
|
||
Again, a number of machine learning techniques are compared, to determine which one gives the best results.
|
||
|
||
** Programming education in Flanders
|
||
:PROPERTIES:
|
||
:CREATED: [2024-02-20 Tue 17:16]
|
||
:END:
|
||
|
||
In Flanders (Belgium), programming is taught in lots of ways, and at many levels.
|
||
This includes secondary and higher education, but it is also something children can do in their free time, as a hobby.
|
||
There are also training programmes for the workforce, but these are not the focus of this dissertation.
|
||
|
||
Programming education as a hobby for children is provided by organizations such as CoderDojo[fn:: https://coderdojobelgium.be/nl (in Dutch)] and CodeFever[fn:: https://www.codefever.be/nl (in Dutch)].
|
||
CoderDojo is a non-profit organization that relies on volunteers to organize free sessions for children from 7 up to 18 years old.
|
||
They use tools like Scratch\nbsp{}[cite:@maloneyScratchProgrammingLanguage2010], AppInventor\nbsp{}[cite:@patton2019app], and Code.org[fn:: https://code.org/] to teach children the basics of programming in a fun, gamified way.
|
||
CodeFever is also a non-profit organization, but it charges a registration fee for enrolling in one of its courses.
|
||
They focus on children aged 8 to 15, and primarily use Scratch and JavaScript to teach programming concepts.
|
||
|
||
In secondary education, things recently changed.
|
||
Before 2021, education related to computing was very much up to the individual school and teacher.
|
||
While there were some schools where programming was taught, it remained a rare occurrence outside a few specific IT-oriented programmes.
|
||
In 2021, however, the Flemish parliament approved a new set of educational goals.
|
||
In these educational goals, there was an explicit focus on digital competence, which for many educational programmes explicitly included programming.
|
||
Not much later though, one of the umbrella organizations for schools challenged the new educational goals in Belgium's constitutional court.
|
||
They felt that the government was overreaching in the specificity of the educational goals.[fn::
|
||
Traditionally, the educational goals were quite loose, allowing the umbrella organizations to add their own accents to the subjects being taught.
|
||
]
|
||
The constitutional court agreed, after which the government went back to the drawing board, and made a lot of the goals less detailed.
|
||
Digital competence is still part of the new educational goals, but programming is no longer explicitly listed.
For programmes focused on the sciences or with a stronger mathematics component, however, specific educational goals do list competences that students should have when finishing secondary education.
|
||
These include programming, algorithms, data structures, numerical methods, etc.
|
||
For programmes focused on IT, there is an even bigger list of related competences that the students should have.
|
||
Python is the most common programming language used at this level, but other programming languages like Java and C# are also used.
|
||
|
||
In higher education, programming has made its way into a lot of programmes.
|
||
Almost all students studying exact sciences or engineering have at least one programming course, but programming is also taught outside these domains (e.g. as part of a statistics course).
|
||
Here we see the greatest diversity in the programming languages that are taught.
|
||
Python, Java, and R are the most common languages for students for whom computer science is not the main subject.
|
||
Computer science students are taught a plethora of languages, from Python and Java to Prolog, Haskell and Scheme.
|
||
|
||
** Structure of this dissertation
|
||
:PROPERTIES:
|
||
:CREATED: [2024-02-01 Thu 10:18]
|
||
:CUSTOM_ID: sec:introstructure
|
||
:END:
|
||
|
||
This dissertation is centred around Dodona[fn:: https://dodona.be/].
|
||
Dodona is an online learning environment that recognizes the importance of active learning and just-in-time feedback in courses involving programming assignments.
|
||
Dodona was started because our own educational needs outgrew SPOJ\nbsp{}[cite:@kosowskiApplicationOnlineJudge2008].
|
||
SPOJ was chosen because it was one of the rare platforms that allowed the addition of courses, exercises (and even judges) by teachers.
|
||
This also informed the development of Dodona.
|
||
Every year since its inception in 2016, more and more teachers have started using Dodona.
|
||
It is now used in most higher education institutions in Flanders, and many secondary education institutions as well.
|
||
|
||
Chapters\nbsp{}[[#chap:what]],\nbsp{}[[#chap:use]],\nbsp{}and\nbsp{}[[#chap:technical]] focus on Dodona itself.
|
||
In Chapter\nbsp{}[[#chap:what]] we will give an overview of the user-facing features of Dodona, from user management to how feedback is represented.
|
||
Chapter\nbsp{}[[#chap:use]] then focuses on how Dodona is used in practice, by presenting some facts and figures of its use, students' opinions of the platform, and an extensive case study on how Dodona's features are used to optimize teaching.
|
||
This case study also provides insight into the educational context for the research described in Chapters\nbsp{}[[#chap:passfail]]\nbsp{}and\nbsp{}[[#chap:feedback]].
|
||
Chapter\nbsp{}[[#chap:technical]] focuses on the technical aspect of developing Dodona and its related ecosystem of software.
|
||
This includes a discussion of the technical challenges related to developing a platform like Dodona, and how the Dodona team adheres to modern standards of software development.
|
||
|
||
Chapter\nbsp{}[[#chap:passfail]] discusses an educational data mining study where we tried to predict whether students would pass or fail a programming course at the end of the semester based solely on their submission history in Dodona.
|
||
It also briefly details a study we collaborated on with researchers from Jyväskylä University in Finland, where we replicated our study in their own educational context, with data from their own educational platform.
|
||
|
||
In Chapter\nbsp{}[[#chap:feedback]], we first give an overview of how Dodona changed manual assessment in our own educational context.
|
||
We then finish the chapter with some recent work on a machine learning method we developed to predict what feedback teachers will give when manually assessing student submissions.
|
||
|
||
Finally, Chapter\nbsp{}[[#chap:discussion]] concludes the dissertation with some discussion on Dodona's opportunities and challenges for the future.
|
||
|
||
* A closer look at Dodona
|
||
:PROPERTIES:
|
||
:CREATED: [2023-10-23 Mon 08:47]
|
||
:CUSTOM_ID: chap:what
|
||
:END:
|
||
|
||
In this chapter, we will give an overview of Dodona's most important features.
|
||
We finish the chapter with a short overview of Dodona's most important releases and which features they included.
|
||
|
||
This chapter is partially based on\nbsp{}[cite/t:@vanpetegemDodonaLearnCode2023], published in SoftwareX.
|
||
|
||
** User management
|
||
:PROPERTIES:
|
||
:CREATED: [2023-10-24 Tue 09:44]
|
||
:CUSTOM_ID: subsec:whatuser
|
||
:END:
|
||
|
||
Establishing the identity of its users is very important for an educational platform.
|
||
For this reason, instead of providing its own authentication and authorization, Dodona delegates authentication to external identity providers (e.g.\nbsp{}educational and research institutions) through SAML\nbsp{}[cite:@farrellAssertionsProtocolOASIS2002], OAuth\nbsp{}[cite:@leibaOAuthWebAuthorization2012; @hardtOAuthAuthorizationFramework2012] and OpenID Connect\nbsp{}[cite:@sakimuraOpenidConnectCore2014].
|
||
The configured OAuth providers are Microsoft, Google, and Smartschool.
|
||
This support for *decentralized authentication* allows users to benefit from single sign-on when using their institutional account across multiple platforms, and allows teachers to trust their students' identities when those students take high-stakes tests and exams in Dodona.
|
||
|
||
Dodona automatically creates user accounts upon successful authentication and uses the association with external identity providers to assign an institution to users.
|
||
These institutions can have multiple sign-in methods.
|
||
If a user uses more than one of those methods, these logins are linked to the same user.
|
||
Institutions within Dodona can be used by teachers to restrict who is allowed to register for their courses, providing an extra level of trust that their students have signed in correctly.
Institutions are also categorized internally as secondary education, higher education, or other (e.g.\nbsp{}the Flemish government).
|
||
|
||
By default, newly created users are assigned a student role.
|
||
Teachers and instructors who wish to create content (courses, learning activities and judges) must first request teacher rights using a streamlined form[fn:: https://dodona.be/rights_requests/new/].
|
||
The sign-in page can be seen in Figure\nbsp{}[[fig:whatsignin]].
|
||
After logging in, a user sees an overview of the courses they are registered with.
|
||
|
||
#+CAPTION: Sign-in page showing the different options for users to sign in.
|
||
#+NAME: fig:whatsignin
|
||
[[./images/whatsignin.png]]
|
||
|
||
** Course management
|
||
:PROPERTIES:
|
||
:CREATED: [2023-10-24 Tue 09:31]
|
||
:CUSTOM_ID: subsec:whatclassroom
|
||
:END:
|
||
|
||
In Dodona, a *course* is where teachers and instructors effectively manage a learning environment by instructing, monitoring and evaluating their students and interacting with them, either individually or as a group.
|
||
A Dodona user who created a course becomes its first administrator and can promote other registered users as *course administrators*.
|
||
In what follows, we will also use the generic term teacher as a synonym for course administrators if this Dodona-specific interpretation is clear from the context, but keep in mind that courses may have multiple administrators.
|
||
|
||
The course itself is laid out as a *learning path* that consists of course units called *series*, each containing a sequence of *learning activities* (Figure\nbsp{}[[fig:whatcourse]]).
|
||
Among the learning activities we differentiate between *reading activities* that can be marked as read and *programming assignments* with support for automated assessment of submitted solutions.
|
||
Learning paths are composed as a recommended sequence of learning activities to build knowledge progressively, allowing students to monitor their own progress at any point in time.
|
||
Courses can either be created from scratch or from copying an existing course and making additions, deletions and rearrangements to its learning path.
|
||
|
||
#+CAPTION: Main course page (administrator view) showing some series with deadlines, reading activities and programming assignments in its learning path.
|
||
#+CAPTION: At any point in time, students can see their own progress through the learning path of the course.
|
||
#+CAPTION: Teachers have some additional icons in the navigation bar (top) that lead to an overview of all students and their progress, an overview of all submissions for programming assignments, general learning analytics about the course, course management and a dashboard with questions from students in various stages from being answered (Figure\nbsp{}[[fig:whatquestions]]).
|
||
#+CAPTION: The red dot on the latter icon notifies that some student questions are still pending.
|
||
#+NAME: fig:whatcourse
|
||
[[./images/whatcourse.png]]
|
||
|
||
Students can *self-register* to courses in order to avoid unnecessary user management.
|
||
A course can either be announced in the public overview of Dodona for everyone to see, or be limited in visibility to students from a certain educational institution.
|
||
Alternatively, students can be invited to a hidden course by sharing a secret link.
|
||
Independent of course visibility, registration for a course can either be open to everyone, restricted to users from the institution the course is associated with, or new registrations can be disabled altogether.
|
||
Registrations are either approved automatically or require explicit approval by a teacher.
|
||
Registered users can be tagged with one or more labels to create subgroups that may play a role in learning analytics and reporting.
|
||
|
||
Students and teachers more or less see the same course page, except for some management features and learning analytics that are reserved for teachers.
|
||
Teachers can make content in the learning path temporarily inaccessible and/or invisible to students.
|
||
Content is typically made inaccessible when it is still in preparation or if it will be used for evaluating students during a specific period.
|
||
A token link can be used to grant access to invisible content, e.g.\nbsp{}when taking a test or exam from a subgroup of students.
|
||
|
||
Students can only mark reading activities as read once, but there is no restriction on the number of solutions they can submit for programming assignments.
|
||
Submitted solutions are automatically assessed and students receive immediate feedback as soon as the assessment has completed, usually within a few seconds.
|
||
Dodona stores all submissions, along with submission metadata and generated feedback, such that the submission and feedback history can be reclaimed at all times.
|
||
On top of automated assessment, student submissions may be further assessed and graded manually by a teacher (see Chapter\nbsp{}[[#chap:feedback]]).
|
||
|
||
Series can have a *deadline*.
|
||
Passed deadlines do not prevent students from marking reading activities or submitting solutions for programming assignments in their series.
|
||
However, learning analytics, reports and exports usually only take into account submissions before the deadline.
|
||
Because of the importance of deadlines and to avoid discussions with students about missed deadlines, series deadlines are not only announced on the course page.
|
||
The student's home page highlights upcoming deadlines for individual courses and across all courses.
|
||
While working on a programming assignment, students will also see a clear warning starting from ten minutes before a deadline.
|
||
Courses also provide an iCalendar link\nbsp{}[cite:@stenersonInternetCalendaringScheduling1998] that students can use to publish course deadlines in their personal calendar application.
|
||
|
||
Because Dodona logs all student submissions and their metadata, including feedback and grades from automated and manual assessment, we use that data to integrate reports and learning analytics in the course page\nbsp{}[cite:@fergusonLearningAnalyticsDrivers2012].
|
||
This includes heatmaps (Figure\nbsp{}[[fig:whatcourseheatmap]]) and punch cards (Figure\nbsp{}[[fig:whatcoursepunchcard]]) of user activity, graphs showing class progress (Figure\nbsp{}[[fig:whatcourseprogress]]), and so on.
|
||
|
||
#+CAPTION: Heatmap showing on which days in the semester students are more active or less active.
|
||
#+NAME: fig:whatcourseheatmap
|
||
[[./images/whatcourseheatmap.png]]
|
||
|
||
#+CAPTION: Punchcard showing when during the week students are working on their exercises.
|
||
#+NAME: fig:whatcoursepunchcard
|
||
[[./images/whatcoursepunchcard.png]]
|
||
|
||
#+CAPTION: Graph showing the percentage of students that correctly solved the exercises in a certain series over time.
|
||
#+NAME: fig:whatcourseprogress
|
||
[[./images/whatcourseprogress.png]]
|
||
|
||
We also provide export wizards that enable the extraction of raw and aggregated data in CSV format for downstream processing and educational data mining\nbsp{}[cite:@romeroEducationalDataMining2010; @bakerStateEducationalData2009].
|
||
This allows teachers to better understand student behaviour, progress and knowledge, and might give deeper insight into the underlying factors that contribute to student actions\nbsp{}[cite:@ihantolaReviewRecentSystems2010].
|
||
This understanding, knowledge and insight can be used to make informed decisions about courses and their pedagogy, to increase student engagement, and to identify at-risk students\nbsp{}(see\nbsp{}Chapter\nbsp{}[[#chap:passfail]]).
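
As a small illustration of the kind of downstream processing these exports enable, the sketch below computes a punch card (number of submissions per weekday and hour) from an exported list of submissions; the file name and column name used here are assumptions for the sake of the example and do not necessarily match the actual export format.

#+BEGIN_SRC python
import csv
from collections import Counter
from datetime import datetime

# Count submissions per (weekday, hour), similar to the punch card shown earlier.
punch_card = Counter()
with open("submissions.csv", newline="") as export:              # assumed file name
    for row in csv.DictReader(export):
        created_at = datetime.fromisoformat(row["created_at"])   # assumed column name
        punch_card[(created_at.strftime("%A"), created_at.hour)] += 1

# Print the five busiest weekday/hour combinations.
for (weekday, hour), count in sorted(punch_card.items(), key=lambda item: -item[1])[:5]:
    print(f"{weekday} {hour:02d}:00 - {count} submissions")
#+END_SRC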
|
||
|
||
** Exercises
|
||
:PROPERTIES:
|
||
:CREATED: [2024-02-20 Tue 14:32]
|
||
:END:
|
||
|
||
There are two types of assignments in Dodona: reading activities and programming exercises.
|
||
While reading activities only consist of descriptions, programming exercises need an additional *assessment configuration* that sets a programming language and a judge (for more information on judges, see Section\nbsp{}[[#subsec:whatjudges]]).
|
||
This configuration is used to perform the automated assessment.
|
||
The configuration may also set a Docker image, a time limit and a memory limit, and may grant Internet access to the container that is instantiated from the image, but all of these settings have sensible default values.
|
||
The configuration might also provide additional *assessment resources*: files made accessible to the judge during assessment.
|
||
The specification of how these resources must be structured and how they are used during assessment is completely up to the judge developers.
|
||
Finally, the configuration might also contain *boilerplate code*: a skeleton students can use to start the implementation that is provided in the code editor along with the description.
|
||
Directories that contain a learning activity also have their own internal directory structure that includes a *description* in HTML or Markdown.
|
||
Descriptions may reference data files and multimedia content included in the repository, and such content can be shared across all learning activities in the repository.
|
||
Embedded images are automatically encapsulated in a responsive lightbox to improve readability.
|
||
Mathematical formulas in descriptions are supported through MathJax\nbsp{}[cite:@cervoneMathJaxPlatformMathematics2012].
|
||
|
||
Where automatic assessment and feedback generation is outsourced to the judge linked to an assignment, Dodona itself takes up the responsibility for rendering the feedback.
|
||
This frees judge developers from putting effort into feedback rendering and gives a coherent look-and-feel, even for students who solve programming assignments assessed by different judges.
|
||
Because the way feedback is presented is very important\nbsp{}[cite:@maniBetterFeedbackEducational2014], we took great care in designing how feedback is displayed to make its interpretation as easy as possible (Figure\nbsp{}[[fig:whatfeedback]]).
|
||
Differences between generated and expected output are automatically highlighted for each failed test\nbsp{}[cite:@myersAnONDDifference1986], and users can swap between displaying the output lines side-by-side or interleaved to make differences more comparable.
|
||
We even provide specific support for highlighting differences between tabular data such as CSV files, database tables and data frames.
|
||
Users have the option to dynamically hide contexts whose test cases all succeeded, allowing them to immediately pinpoint reported mistakes in feedback that contains lots of succeeded test cases.
|
||
To ease debugging the source code of submissions for Python assignments, the Python Tutor\nbsp{}[cite:@guoOnlinePythonTutor2013] can be launched directly from any context with a combination of the submitted source code and the test code from the context.
|
||
Students typically report this as one of the most useful features of Dodona.
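
The core of this output comparison can be illustrated with a few lines of Python; the sketch below uses the standard =difflib= module to compute a line-based diff between expected and generated output, and is only meant to illustrate the idea, not Dodona's actual implementation (which highlights differences using the algorithm of\nbsp{}[cite/t:@myersAnONDDifference1986]).

#+BEGIN_SRC python
import difflib

expected = "line 1\nline 2\nline 3\n".splitlines()
generated = "line 1\nline two\nline 3\n".splitlines()

# Emit a unified diff marking which lines of the generated output deviate from the expected output.
for line in difflib.unified_diff(expected, generated, fromfile="expected", tofile="generated", lineterm=""):
    print(line)
#+END_SRC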
|
||
|
||
#+CAPTION: Dodona rendering of feedback generated for a submission of the Python programming assignment "Curling".
|
||
#+CAPTION: The feedback is split across three tabs: ~isinside~, ~isvalid~ and ~score~.
|
||
#+CAPTION: 48 tests under the ~score~ tab failed as can be seen from the badge in the tab header.
|
||
#+CAPTION: The "Code" tab displays the source code of the submission with annotations added during automatic and/or manual assessment (Figure\nbsp{}[[fig:whatannotations]]).
|
||
#+CAPTION: The differences between the generated and expected return values were automatically highlighted and the judge used HTML snippets to add a graphical representation (SVG) of the problem for the failed test cases.
|
||
#+CAPTION: In addition to highlighting differences between the generated and expected return values of the first (failed) test case, the judge also added a text snippet that points the user to a type error.
|
||
#+NAME: fig:whatfeedback
|
||
[[./images/whatfeedback.png]]
|
||
|
||
** Judges
:PROPERTIES:
:CREATED: [2024-02-20 Tue 15:28]
:CUSTOM_ID: subsec:whatjudges
:END:

The range of approaches, techniques and tools for software testing that may underpin the assessment of the quality of software under test is incredibly diverse.
Static testing directly analyses the syntax, structure and data flow of source code, whereas dynamic testing involves running the code with a given set of test cases\nbsp{}[cite:@oberkampfVerificationValidationScientific2010; @grahamFoundationsSoftwareTesting2021].
Black-box testing uses test cases that examine functionality exposed to end-users without looking at the actual source code, whereas white-box testing hooks test cases onto the internal structure of the code to test specific paths within a single unit, between units during integration, or between subsystems\nbsp{}[cite:@nidhraBlackBoxWhite2012].
So, broadly speaking, there are three levels of white-box testing: unit testing, integration testing and system testing\nbsp{}[cite:@wiegersCreatingSoftwareEngineering1996; @dooleySoftwareDevelopmentProfessional2011].
Source code submitted by students can therefore be verified and validated against a multitude of criteria: functional completeness and correctness, architectural design, usability, performance and scalability in terms of speed, concurrency and memory footprint, security, readability (programming style), maintainability (test quality) and reliability\nbsp{}[cite:@staubitzPracticalProgrammingExercises2015].
This is also reflected in the diverse range of metrics for measuring software quality that have emerged, such as cohesion/coupling\nbsp{}[cite:@yourdonStructuredDesignFundamentals1979; @stevensStructuredDesign1999], cyclomatic complexity\nbsp{}[cite:@mccabeComplexityMeasure1976] or test coverage\nbsp{}[cite:@millerSystematicMistakeAnalysis1963].

To cope with such a diversity in software testing alternatives, Dodona is centred around a generic infrastructure for *programming assignments that support automated assessment*.
Assessment of a student submission for an assignment comprises three loosely coupled components: containers, judges and assignment-specific assessment configurations.
Judges have a default Docker image that is used if the configuration of a programming assignment does not specify one explicitly.
Dodona builds the available images from Dockerfiles specified in a separate git repository.
More information on this underlying mechanism can be found in Chapter\nbsp{}[[#chap:technical]].
An overview of the existing judges and the corresponding number of exercises and submissions in Dodona can be found in Table\nbsp{}[[tab:whatoverviewjudges]].

#+CAPTION: Overview of the judges in Dodona, together with the corresponding number of exercises and submissions.
#+CAPTION: The data was gathered in March 2024.
#+CAPTION: The TESTed judge is a special case in that it supports multiple programming languages.
#+CAPTION: More information on it can be found in Section\nbsp{}[[#sec:techtested]].
#+CAPTION: The number of exercises and submissions for the JavaScript judge is undercounted: most of its exercises were converted to TESTed exercises, which also moved the submissions for those exercises to TESTed.
#+NAME: tab:whatoverviewjudges
| Judge      |   # exercises |              # submissions |
|------------+---------------+----------------------------|
| <l>        |           <r> |                        <r> |
| Bash       |           289 |            675\thinsp{}902 |
| C          |            77 |             31\thinsp{}822 |
| C#         |           256 |             44\thinsp{}294 |
| Compilers  |             3 |                         38 |
| HTML       |           187 |             24\thinsp{}947 |
| Haskell    |            76 |             76\thinsp{}556 |
| Java 8     |            93 |             90\thinsp{}084 |
| Java 21    |           450 |            730\thinsp{}383 |
| JavaScript |            36 |                         68 |
| Markdown   |            14 |                        354 |
| Prolog     |            54 |             37\thinsp{}609 |
| Python     | 8\thinsp{}481 | 13\thinsp{}798\thinsp{}051 |
| R          | 1\thinsp{}293 |            958\thinsp{}069 |
| SQL        |           298 |            114\thinsp{}725 |
| Scheme     |           277 |            125\thinsp{}138 |
| TESTed     | 1\thinsp{}139 |            333\thinsp{}507 |
| Turtle     |            17 |                        446 |

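To make the loose coupling between Dodona and its judges more concrete, the sketch below shows a hypothetical, minimal judge in Python.
The field names (~source~, ~accepted~, ~status~, ~description~) are illustrative assumptions rather than Dodona's actual judge interface; the essential point is that a judge is an independent program that receives submission metadata, assesses the submission inside its container, and reports structured feedback for Dodona to render.

#+BEGIN_SRC python
# Hypothetical minimal judge, for illustration only.
# Assumption: the judge receives a JSON description of the submission on
# standard input and writes a JSON feedback structure to standard output;
# the concrete keys used here are illustrative, not Dodona's actual schema.
import json
import sys

def main() -> None:
    task = json.load(sys.stdin)                  # submission metadata
    with open(task["source"]) as submission:     # path to the submitted code
        source = submission.read()

    # A trivial "assessment": accept any non-empty submission.
    accepted = bool(source.strip())
    feedback = {
        "accepted": accepted,
        "status": "correct" if accepted else "wrong",
        "description": "Submission received" if accepted else "Empty submission",
    }
    json.dump(feedback, sys.stdout)

if __name__ == "__main__":
    main()
#+END_SRC
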
** Repositories
:PROPERTIES:
:CREATED: [2024-02-20 Tue 15:20]
:END:

Where courses are created and managed in Dodona itself, other content is managed in external git *repositories* (Figure\nbsp{}[[fig:whatrepositories]]).
In this distributed content management model, a repository either contains a single judge or a collection of learning activities.
Setting up a *webhook* for the repository guarantees that any changes pushed to its default branch are automatically and immediately synchronized with Dodona.
This even works without the need to make repositories public, as they may contain information that should not be disclosed, such as programming assignments that are still under construction, model solutions, or assignments that will be used during tests or exams.
Instead, a *Dodona service account* must be granted push and pull access to the repository.
Some settings of a learning activity can be modified through the web interface of Dodona, but any changes are always pushed back to the repository in which the learning activity is configured, so that the repository always remains the master copy.

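The sketch below illustrates the webhook mechanism with a hypothetical, minimal receiving endpoint written in Python with Flask; Dodona's actual implementation is part of its Ruby on Rails code base, and the route and repository path used here are made up for the example.

#+BEGIN_SRC python
# Hypothetical webhook endpoint, for illustration only: when the git host
# notifies us of a push to the default branch, pull the repository so the
# learning activities it contains can be re-read.  The route and local path
# below are assumptions made for this sketch.
import subprocess
from flask import Flask, request

app = Flask(__name__)
REPOSITORY_PATH = "/data/repositories/example-exercises"

@app.route("/webhook", methods=["POST"])
def handle_push():
    payload = request.get_json(silent=True) or {}
    if payload.get("ref", "").endswith("/main"):   # push to the default branch?
        subprocess.run(["git", "-C", REPOSITORY_PATH, "pull"], check=True)
        # ... at this point the repository would be rescanned for changes ...
    return "", 204

if __name__ == "__main__":
    app.run(port=8080)
#+END_SRC
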
#+CAPTION: Distributed content management model that allows seamless integration of custom learning activities (reading activities and programming assignments with support for automated assessment) and judges (frameworks for automated assessment) into Dodona.
#+CAPTION: Content creators manage their content in external git repositories, keep ownership over their content, control who can co-create, and set up webhooks to automatically synchronize any changes with the content as published on Dodona.
#+NAME: fig:whatrepositories
[[./images/whatrepositories.png]]

Due to the distributed nature of content management, creators also keep ownership over their content and control who may co-create.
After all, access to a repository is completely independent of access to its learning activities that are published in Dodona.
The latter is part of the configuration of learning activities, with the option either to share a learning activity so that all teachers can include it in their courses, or to restrict its inclusion to courses that are explicitly granted access.
Dodona automatically stores metadata about all learning activities such as content type, natural language, programming language and repository to increase their findability in our large collection.
Learning activities may also be tagged with additional labels as part of their configuration.
Any repository containing learning activities must have a predefined directory structure[fn:: https://docs.dodona.be/en/references/exercise-directory-structure/].

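By way of illustration, a repository with a single programming assignment could be laid out roughly as follows; this is a simplified sketch, and the linked documentation remains the authoritative reference for the exact layout and configuration options.

#+BEGIN_EXAMPLE
my-exercises/
└── example-exercise/
    ├── config.json          # assignment configuration (judge, access, labels, ...)
    ├── description/
    │   ├── description.en.md
    │   ├── description.nl.md
    │   └── media/           # images and data files referenced by the description
    ├── evaluation/          # resources used by the judge during automated assessment
    └── solution/            # model solution(s)
#+END_EXAMPLE
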
** Internationalization and localization
:PROPERTIES:
:CREATED: [2023-10-24 Tue 10:55]
:CUSTOM_ID: subsec:whati18n
:END:
*Internationalization* (i18n) is a shared responsibility between Dodona, learning activities and judges.
All boilerplate text in the user interface that comes from Dodona itself is supported in English and Dutch, and users can select their preferred language.
Content creators can specify descriptions of learning activities in both languages, and Dodona will render a learning activity in the user's preferred language if available.
When users submit solutions for a programming assignment, their preferred language is passed as submission metadata to the judge.
It's then up to the judge to take this information into account while generating feedback.

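A hypothetical fragment of judge code shows what this could look like in practice; the metadata key ~natural_language~ is an assumption made for this sketch rather than the exact name used by Dodona.

#+BEGIN_SRC python
# Illustrative only: pick the feedback language based on the preferred
# language that is passed along with the submission metadata.
MESSAGES = {
    "en": "Your function returned {actual}, but {expected} was expected.",
    "nl": "Je functie gaf {actual} terug, maar {expected} werd verwacht.",
}

def feedback_message(metadata: dict, expected: object, actual: object) -> str:
    language = metadata.get("natural_language", "en")   # assumed key name
    template = MESSAGES.get(language, MESSAGES["en"])
    return template.format(expected=expected, actual=actual)

print(feedback_message({"natural_language": "nl"}, expected=42, actual=41))
#+END_SRC
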
Dodona always displays *localized deadlines* based on a time zone setting in the user profile, and users are warned when the current time zone detected by their browser differs from the one in their profile.

** Questions, answers and code reviews
:PROPERTIES:
:CREATED: [2023-10-24 Tue 10:56]
:CUSTOM_ID: subsec:whatqa
:END:

A downside of using discussion forums in programming courses is that students can ask questions about programming assignments that are either disconnected from their current implementation or contain code snippets that may give away (part of) the solution to other students\nbsp{}[cite:@nandiEvaluatingQualityInteraction2012].
Dodona therefore allows students to address teachers with questions they directly attach to their submitted source code.
We support both general questions and questions linked to specific lines of their submission (Figure\nbsp{}[[fig:whatquestion]]).
Questions are written in Markdown (e.g. to include markup, tables, syntax-highlighted code snippets or multimedia), with support for MathJax (e.g. to include mathematical formulas).

#+CAPTION: A student (Matilda) previously asked a question that has already been answered by her teacher (Miss Honey).
#+CAPTION: Based on this response, the student is now asking a follow-up question that can be formatted using Markdown.
#+NAME: fig:whatquestion
[[./images/whatquestion.png]]

Teachers are notified whenever there are pending questions (Figure\nbsp{}[[fig:whatcourse]]).
They can process these questions from a dedicated dashboard with live updates (Figure\nbsp{}[[fig:whatquestions]]).
The dashboard immediately guides them from an incoming question to the location in the source code of the submission it relates to, where they can answer the question in the same way that students ask questions.
To avoid questions being inadvertently handled simultaneously by multiple teachers, questions have a three-state lifecycle: pending, in progress and answered.
In addition to teachers changing question states while answering them, students can also mark their own questions as being answered.
The latter might reflect the rubber duck debugging\nbsp{}[cite:@huntPragmaticProgrammer1999] effect that is triggered when students are forced to explain a problem to someone else while asking questions in Dodona.
Teachers can (temporarily) disable the option for students to ask questions in a course, e.g.\nbsp{}when a course is over or during hands-on sessions or exams when students are expected to ask questions face-to-face rather than online.

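The three-state lifecycle can be summarized with a small sketch; the transition rules shown here are a simplified reading of the behaviour described above, not a verbatim extract of Dodona's implementation.

#+BEGIN_SRC python
# Simplified sketch of the question lifecycle described above.
from enum import Enum

class QuestionState(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in progress"
    ANSWERED = "answered"

# A teacher claims a pending question before answering it, so that no two
# teachers inadvertently handle the same question; a student may also mark
# their own question as answered directly.
ALLOWED_TRANSITIONS = {
    QuestionState.PENDING: {QuestionState.IN_PROGRESS, QuestionState.ANSWERED},
    QuestionState.IN_PROGRESS: {QuestionState.ANSWERED},
    QuestionState.ANSWERED: set(),
}

def transition(current: QuestionState, target: QuestionState) -> QuestionState:
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target

state = transition(QuestionState.PENDING, QuestionState.IN_PROGRESS)
state = transition(state, QuestionState.ANSWERED)
#+END_SRC
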
#+CAPTION: Live-updated dashboard showing all incoming questions in a course while asking questions is enabled.
#+CAPTION: Questions are grouped into three categories: unanswered, in progress and answered.
#+NAME: fig:whatquestions
[[./images/whatquestions.png]]

Manual source code annotations from students (questions) and teachers (answers) are rendered in the same way as source code annotations resulting from automated assessment.
They are shown together in the source code displayed in the "Code" tab, underlining their complementary nature.
It is not required that students take the initiative for the conversation.
Teachers can also start adding source code annotations while reviewing a submission.
Such *code reviews* will be used as a building block for manual assessment.

** Manual assessment
:PROPERTIES:
:CREATED: [2023-10-24 Tue 11:01]
:CUSTOM_ID: subsec:whateval
:END:

Teachers can create an *evaluation* for a series to manually assess student submissions for its programming assignments after a specific period, typically following the deadline of some homework, an intermediate test or a final exam.
An example of an evaluation overview can be seen in Figure\nbsp{}[[fig:whatevaluationoverview]].
The evaluation comprises all programming assignments in the series and a group of students that submitted solutions for these assignments.
Because a student may have submitted multiple solutions for the same assignment, the last submission before a given deadline is automatically selected for each student and each assignment in the evaluation.
This automatic selection can be manually overruled afterwards.
The evaluation deadline defaults to the deadline set for the associated series, if any, but an alternative deadline can be selected as well.

#+CAPTION: Pseudonymized overview of an evaluation in Dodona.
#+CAPTION: For each student, both the correctness of their submission and whether it has been graded is shown.
#+NAME: fig:whatevaluationoverview
[[./images/whatevaluationoverview.png]]

Evaluations support *two-way navigation* through all selected submissions: per assignment and per student.
For evaluations with multiple assignments, it is generally recommended to assess per assignment and not per student, as students can build a reputation throughout an assessment\nbsp{}[cite:@malouffBiasGradingMetaanalysis2016].
As a result, they might be rated more favourably with a moderate solution if they had excellent solutions for assignments that were assessed previously, and vice versa\nbsp{}[cite:@malouffRiskHaloBias2013].
Assessment per assignment breaks this reputation effect, as the quality of previously assessed submissions from the same student interferes less with the assessment at hand.
Possible bias from such sequence effects is reduced during assessment per assignment, as students are visited in random order for each assignment in the evaluation.
In addition, *anonymous mode* can be activated as a measure to eliminate the actual or perceived halo effect conveyed through seeing a student's name during assessment\nbsp{}[cite:@lebudaTellMeYour2013].
While anonymous mode is active, all students are automatically pseudonymized.
Anonymous mode is not restricted to the context of assessment and can be used across Dodona, for example while giving in-class demos.

When reviewing a selected submission from a student, assessors have direct access to the feedback that was previously generated during automated assessment: source code annotations in the "Code" tab and other structured and unstructured feedback in the remaining tabs.
Moreover, in addition to the feedback that was made available to the student, the specification of the assignment may also add feedback generated by the judge that is only visible to the assessor.
Assessors might then complement the assessment made by the judge by adding *source code annotations* as formative feedback and by *grading* the evaluative criteria in a scoring rubric as summative feedback (Figure\nbsp{}[[fig:whatannotations]]).
Previous annotations can be reused to speed up the code review process, because remarks or suggestions tend to recur frequently when reviewing submissions for the same assignment.
Grading requires setting up a specific *scoring rubric* for each assignment in the evaluation, as a guidance for evaluating the quality of submissions\nbsp{}[cite:@dawsonAssessmentRubricsClearer2017; @pophamWhatWrongWhat1997].
The evaluation tracks which submissions have been manually assessed, so that analytics about the assessment progress can be displayed and multiple assessors can work simultaneously on the same evaluation, for example each taking (part of) one programming assignment.

#+CAPTION: Manual assessment of a submission: a teacher (Miss Honey) is giving feedback on the source code by adding inline annotations and is grading the submission by filling in the scoring rubric that was set up for the programming assignment "The Feynman ciphers".
#+NAME: fig:whatannotations
[[./images/whatannotations.png]]

** Overview of Dodona releases
:PROPERTIES:
:CREATED: [2024-02-27 Tue 14:12]
:END:

In this section, we give an overview of the most important Dodona releases, and the changes they introduced, organized per academic year.
This is not a full overview of all Dodona releases, and does not mention all changes in a particular release.[fn::
A full overview of all Dodona releases, with their full changelog, can be found at https://github.com/dodona-edu/dodona/releases/.
]

**** 2015--2016
:PROPERTIES:
:CREATED: [2024-02-28 Wed 11:53]
:END:

- 0.1 (2016-04-07) :: Minimal Rails app, where a list of exercises is shown based on files in the filesystem.
  This was only for JavaScript, and the code was executed locally in the browser.
- 0.2 (2016-04-14) :: Addition of a webhook to automatically update exercises.
  Assignments are rendered from Markdown, and can include media and formulas.
  Ace was introduced as the editor.
- 0.5 (2016-08-10) :: This is the first release supporting Python through server-side judging.
- 0.6 (2016-08-16) :: Judges can now be auto-updated through a webhook.
- 0.7 (2016-09-07) :: The concept of a series was introduced.

**** 2016--2017
:PROPERTIES:
:CREATED: [2024-02-28 Wed 11:53]
:END:

- 1.0 (2016-09-23) :: Dodona now runs on multiple servers, and series have gained a deadline.
- 1.1 (2016-09-28) :: Teachers can now configure boilerplate code per exercise.
- 1.2 (2016-10-10) :: The Python Tutor was added to Dodona.
- 1.3 (2016-11-02) :: Hidden series using a token link were added.
  Users could now also download their solutions for a series.
- 1.4.6 (2017-03-17) :: Use the student's latest submission instead of their best submission in most places.

**** 2017--2018
:PROPERTIES:
:CREATED: [2024-02-28 Wed 11:53]
:END:

- 2.0 (2017-09-15) :: Introduction of the concept of a course administrator.
  Courses could also be set to hidden, and options for managing registration were added.
- 2.3 (2018-07-26) :: OAuth sign-in support was added, allowing users from other institutions to use Dodona.[fn:: This is also the first release where I was personally involved with Dodona's development.]

**** 2018--2019
:PROPERTIES:
:CREATED: [2024-02-28 Wed 11:53]
:END:

- 2.4 (2018-09-17) :: Add management and ownership of exercises and repositories by users.
  Users with teacher rights could no longer see and edit all users.
- 2.5 (2018-10-26) :: Improved search functionality. Courses were now also linked to an institution for improved searchability.
- 2.6 (2018-11-21) :: Diffing in the feedback table was fully reworked (see Chapter\nbsp{}[[#chap:technical]] for more details).
- 2.7 (2018-12-04) :: The punchcard was added to the course page.
  Labels could now also be added to course members.
- 2.8 (2019-03-05) :: Submissions and their feedback were moved from the database to the filesystem.
- 2.9 (2019-03-27) :: Large UI rework of Dodona, adding the class progress visualization.
  This release also adds a page with Dodona's privacy policy.
- 2.10 (2019-05-06) :: Allow courses to be copied.
  Anonymous mode (called demo mode at the time) was added.
- 2.11 (2019-06-27) :: Introduction of dark mode.
  This release also adds the heatmap visualization.

**** 2019--2020
:PROPERTIES:
:CREATED: [2024-02-28 Wed 11:53]
:END:

- 3.0 (2019-09-12) :: Dodona was made open source.
  Support for the R programming language was added around this time.
- 3.1 (2019-10-17) :: Exercise descriptions were moved to iframes.
  Diffing was further improved.
- 3.2 (2019-11-28) :: Fully reworks the exporting of submissions.
- 3.3 (2020-02-26) :: The exercise info page was added.
- 3.4 (2020-04-19) :: Allow adding annotations on a submission.
- 3.6 (2020-04-27) :: Add reading activities as a new assignment type.
- 3.7 (2020-06-03) :: This release adds evaluations to Dodona.

**** 2020--2021
:PROPERTIES:
:CREATED: [2024-02-28 Wed 11:53]
:END:

- 4.0 (2020-09-16) :: Q&A support was added in this release, along with LTI support.
- 4.3 (2021-04-26) :: Add the teacher rights request form.
- 4.4 (2021-05-05) :: Add grading to evaluations (as a private beta).
- 4.6 (2021-06-18) :: Featured courses were added in this release.

**** 2021--2022
:PROPERTIES:
:CREATED: [2024-02-28 Wed 11:53]
:END:

- 5.0 (2021-09-13) :: New learning analytics were added to each series.
  This release also includes the full release of grading after an extensive private beta.
- 5.3 (2022-02-04) :: A new heatmap graph was added to the series analytics.
- 5.5 (2022-04-25) :: Introduction of Papyros, our own online code editor.
- 5.6 (2022-07-04) :: Another visual refresh of Dodona, this time to follow the Material Design 3 spec.

**** 2022--2023
:PROPERTIES:
:CREATED: [2024-02-28 Wed 11:53]
:END:

- 6.0 (2022-08-18) :: Allow users to sign in with a personal Google or Microsoft account.
- 6.1 (2022-09-19) :: Allow reuse of annotations in evaluations.
- 6.8 (2023-05-17) :: Threading of questions was added.
- 2023.07 (2023-07-04) :: Introduction of monthly releases, whose contents are continuously deployed.
- 2023.08 (2023-08-01) :: Switch from dodona.ugent.be to dodona.be.

**** 2023--2024
:PROPERTIES:
:CREATED: [2024-02-28 Wed 11:53]
:END:

- 2023.10 (2023-10-01) :: Annotation reuse is rolled out to all users.
- 2023.11 (2023-11-01) :: The Python Tutor is moved client-side.
- 2023.12 (2023-12-01) :: The feedback table was reworked, moving every context to its own card.
- 2024.02 (2024-02-01) :: Papyros now also has an integrated debugger based on the Python Tutor.


* Dodona in educational practice
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:48]
:CUSTOM_ID: chap:use
:END:

This chapter discusses the use of Dodona.
We start by presenting some facts and figures and discussing a user study we performed.
We then explain, on the basis of a case study, how Dodona can be used.
This case study also provides insight into the educational context for the research described in Chapters\nbsp{}[[#chap:passfail]]\nbsp{}and\nbsp{}[[#chap:feedback]].
The chapter is partially based on\nbsp{}[cite/t:@vanpetegemDodonaLearnCode2023], published in SoftwareX.

** Facts and figures
:PROPERTIES:
:CREATED: [2024-01-22 Mon 18:15]
:CUSTOM_ID: sec:usefacts
:END:

Dodona's design decisions have allowed it to spread to more than {{{num_schools}}} schools, colleges and universities, mainly in Flanders (Belgium) and the Netherlands.
The renewed interest in embedding computational thinking in formal education has undoubtedly been an important stimulus for such a wide uptake\nbsp{}[cite:@wingComputationalThinking2006].
All other educational institutions use the instance of Dodona hosted at Ghent University, which is free to use for educational purposes.

Dodona currently hosts a collection of {{{num_exercises}}} learning activities that are freely available to all teachers, allowing them to create their own learning paths tailored to their teaching practice.
In total, {{{num_users}}} students have submitted more than {{{num_submissions}}} solutions to Dodona in the seven years that it has been running (Figures\nbsp{}[[fig:useadoption1]],\nbsp{}[[fig:useadoption2]]\nbsp{}&\nbsp{}[[fig:useadoption3]]).

#+CAPTION: Overview of the number of submitted solutions by academic year.
#+CAPTION: Note that the data for the academic year 2023--2024 is incomplete, since the academic year had not finished yet at the time of data collection (March 2024).
#+NAME: fig:useadoption1
[[./images/useadoption1.png]]

#+CAPTION: Overview of the number of active users by academic year.
#+CAPTION: Users were active when they submitted at least one solution for a programming assignment during the academic year.
#+CAPTION: Note that the data for the academic year 2023--2024 is incomplete, since the academic year had not finished yet at the time of data collection (March 2024).
#+NAME: fig:useadoption2
[[./images/useadoption2.png]]

#+CAPTION: Overview of the number of active users by academic year per institution type.
#+CAPTION: Users were active when they submitted at least one solution for a programming assignment during the academic year.
#+CAPTION: Note that the data for the academic year 2023--2024 is incomplete, since the academic year had not finished yet at the time of data collection (March 2024).
#+NAME: fig:useadoption3
[[./images/useadoption3.png]]

In the year 2023, the highest number of monthly active users was reached in November, when 9\thinsp{}678 users submitted at least one solution.
About half of these users are from secondary education, a quarter from Ghent University, and the rest mostly from other higher education institutions.
Every year, we see the largest increase in new users during September, with the same ratios between Ghent University, higher and secondary education.
The record for most submissions in one day was recently broken on the 12th of January 2024, when the course described in Section\nbsp{}[[#sec:usecasestudy]] had one exam for all students for the first time in its history, and those students submitted 38\thinsp{}771 solutions in total.
Interestingly enough, the day before (the 11th of January) was the third-busiest day ever.
The day with the most distinct users was the 23rd of October 2023, when there were 2\thinsp{}680 users who submitted at least one solution.
This is due to the fact that there were a lot of exercise sessions on Mondays in the first semester of the academic year; many of the other Mondays at the start of the semester are also in the top 10 of busiest days ever (both in submissions and in number of users).
The full top 10 of submissions can be seen in Table\nbsp{}[[tab:usetop10submissions]].
The top 10 of active users can be seen in Table\nbsp{}[[tab:usetop10users]].

#+CAPTION: Top 10 of days with the most submissions on Dodona.
#+CAPTION: This analysis was done in March 2024.
#+NAME: tab:usetop10submissions
| Date       |  # submissions |
|------------+----------------|
| <l>        |            <r> |
| 2024-01-12 | 38\thinsp{}771 |
| 2023-10-23 | 38\thinsp{}431 |
| 2024-01-11 | 38\thinsp{}148 |
| 2020-01-22 | 33\thinsp{}161 |
| 2023-10-09 | 32\thinsp{}668 |
| 2019-01-23 | 32\thinsp{}464 |
| 2023-10-02 | 32\thinsp{}447 |
| 2019-01-24 | 32\thinsp{}113 |
| 2023-11-06 | 30\thinsp{}896 |
| 2023-10-16 | 30\thinsp{}103 |

#+CAPTION: Top 10 of days with the most users who submitted at least once on Dodona.
#+CAPTION: This analysis was done in March 2024.
#+NAME: tab:usetop10users
| Date       | # active users |
|------------+----------------|
| <l>        |            <r> |
| 2023-10-23 |  2\thinsp{}680 |
| 2023-10-09 |  2\thinsp{}659 |
| 2023-11-20 |  2\thinsp{}581 |
| 2023-10-02 |  2\thinsp{}381 |
| 2023-10-16 |  2\thinsp{}364 |
| 2023-11-06 |  2\thinsp{}343 |
| 2023-10-17 |  2\thinsp{}287 |
| 2023-11-27 |  2\thinsp{}274 |
| 2022-10-03 |  2\thinsp{}265 |
| 2023-11-13 |  2\thinsp{}167 |

In addition to the quantitative figures above, we also performed a qualitative user experience study of Dodona in 2018.
271 tertiary education students responded to a questionnaire that contained the following three questions:
#+ATTR_LATEX: :environment enumerate*
#+ATTR_LATEX: :options [label={\emph{\roman*)}}, itemjoin={{ }}, itemjoin*={{ }}]
- What are the things you value while working with Dodona?
- What are the things that bother you while working with Dodona?
- What are your suggestions for improvements to Dodona?
Students praised its user-friendliness, beautiful interface, immediate feedback with visualized differences between expected and generated output, integration of the Python Tutor, linting feedback and large number of test cases.
Negative points were related to differences between the students' local execution environments and the environment in which Dodona runs the tests, and the strictness with which the tests are evaluated.
Other negative feedback was mostly related to individual courses the students were taking rather than to the platform itself.

** Use in a programming course
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:48]
:CUSTOM_ID: sec:usecasestudy
:END:

Since the academic year 2011--2012, we have organized an introductory Python course at Ghent University (Belgium) with a strong focus on active and online learning.
Initially the course was offered twice a year in the first and second term, but from the academic year 2014--2015 onwards it has only been offered in the first term.
The course is taken by a mix of undergraduate, graduate, and postgraduate students enrolled in various study programmes (mainly formal and natural sciences, but not computer science), with 442 students enrolled for the 2021--2022 edition[fn:: https://dodona.be/courses/773/].

*** Course structure
:PROPERTIES:
:CREATED: [2023-10-24 Tue 11:47]
:CUSTOM_ID: subsec:usecourse
:END:

Each course edition has a fixed structure, with 13 weeks of educational activities subdivided into two successive instructional units that each cover five topics of the Python programming language -- one topic per week -- followed by a graded test about all topics covered in the unit (Figure\nbsp{}[[fig:usefwecoursestructure]]).
The final exam at the end of the term evaluates all topics covered in the entire course.
Students who fail the course during the first exam in January can take a resit exam in August/September that gives them a second chance to pass the course.

#+CAPTION: *Top*: Structure of the Python course that runs each academic year across a 13-week term (September--December).
#+CAPTION: Students submit solutions for ten series with six mandatory assignments, two tests with two assignments and an exam with three assignments.
#+CAPTION: There is also a resit exam with three assignments in August/September if they failed the first exam in January.
#+CAPTION: *Bottom*: Heatmap from the Dodona learning analytics page showing the distribution per day of all 331\thinsp{}734 solutions submitted during the 2021--2022 edition of the course (442 students).
#+CAPTION: The darker the colour, the more solutions were submitted that day.
#+CAPTION: Weekly lab sessions for different groups on Monday afternoon, Friday morning and Friday afternoon are visible as darker squares.
#+CAPTION: Weekly deadlines for mandatory assignments on Tuesdays at 22:00.
#+CAPTION: Three exam sessions for different groups in January.
#+CAPTION: Two more exam sessions for different groups in August/September.
#+NAME: fig:usefwecoursestructure
[[./images/usefwecoursestructure.png]]

In the regular weeks, when a new programming topic is covered, students prepare themselves by reading the textbook chapters covering the topic, following the flipped classroom approach\nbsp{}[cite:@bishopFlippedClassroomSurvey2013; @akcayirFlippedClassroomReview2018].
Lectures are interactive programming sessions that aim at bridging the initial gap between theory and practice, advancing concepts, and engaging in collaborative learning\nbsp{}[cite:@tuckerFlippedClassroom2012].
Along the same lines, the first assignment for each topic is an ISBN-themed programming challenge whose model solution is shared with the students, together with an instructional video that works step-by-step towards the model solution.
Students must then try to solve five other programming assignments on that topic before a deadline one week later.
That results in 60 mandatory assignments across the semester.
Students can work on their programming assignments during weekly computer labs, where they can collaborate in small groups and ask for help from teaching assistants.
They can also work on their assignments and submit solutions outside lab sessions.
In addition to the mandatory assignments, students can further develop their programming skills by tackling additional programming exercises they select from a pool of over 900 exercises linked to the ten programming topics.
Submissions for these additional exercises are not taken into account in the final grade.

*** Assessment, feedback and grading
:PROPERTIES:
:CREATED: [2023-10-24 Tue 11:47]
:CUSTOM_ID: subsec:useassessment
:END:

We use Dodona to promote students' active learning through problem-solving\nbsp{}[cite:@princeDoesActiveLearning2004].
Each course edition has its own dedicated course in Dodona, with a learning path containing all mandatory, test, and exam assignments grouped into series with corresponding deadlines.
Mandatory assignments for the first unit are published at the start of the semester, and those for the second unit after the test of the first unit.
For each test and exam we organize multiple sessions for different groups of students.
Assignments for test and exam sessions are provided in a hidden series that is only accessible to students participating in the session using a shared token link.
The test and exam assignments are published afterwards for all students, when grades are announced.
Students can see class progress when working on their mandatory assignments, nudging them to avoid procrastination.
Only teachers can see class progress for test and exam series so as not to accidentally stress out students.
For the same reason, we intentionally organize tests and exams following exactly the same procedure, so that students can take high-stakes exams in a familiar context and adjust their approach based on previous experiences.
The only difference is that test assignments are not as hard as exam assignments, as students are still in the midst of learning programming skills when tests are taken.

Students are encouraged to use an integrated development environment (IDE) to work on their programming assignments.
IDEs bundle a battery of programming tools to support today's generation of software developers in writing, building, running, testing, and debugging software.
Working with such tools can be a true blessing for both seasoned and novice programmers, but there is no silver bullet\nbsp{}[cite:@brooksNoSilverBullet1987].
Learning to code remains inherently hard\nbsp{}[cite:@kelleherAlice2ProgrammingSyntax2002] and consists of challenges that are different from those of reading and learning natural languages\nbsp{}[cite:@fincherWhatAreWe1999].
As an additional aid, students can continuously submit (intermediate) solutions for their programming assignments and immediately receive automatically generated feedback upon each submission, even during tests and exams.
Guided by that feedback, they can track potential errors in their code, remedy them and submit updated solutions.
There is no restriction on the number of solutions that can be submitted per assignment.
All submitted solutions are stored, but for each assignment only the last submission before the deadline is taken into account to grade students.
This allows students to update their solutions after the deadline (i.e.\nbsp{}after model solutions are published) without impacting their grades, as a way to further practice their programming skills.
One effect of active learning, triggered by mandatory assignments with weekly deadlines and intermediate tests, is that most learning happens during the term (Figure\nbsp{}[[fig:usefwecoursestructure]]).
In contrast to other courses, students do not spend a lot of time practising their coding skills for this course in the days before an exam.
We want to explicitly encourage this behaviour, because we strongly believe that one cannot learn to code in a few days' time\nbsp{}[cite:@peternorvigTeachYourselfProgramming2001].

For the assessment of tests and exams, we follow the line of thought that human expert feedback through source code annotations is a valuable complement to feedback coming from automated assessment, and that human interpretation is an absolute necessity when it comes to grading\nbsp{}[cite:@staubitzPracticalProgrammingExercises2015; @jacksonGradingStudentPrograms1997; @ala-mutkaSurveyAutomatedAssessment2005].
We shifted from paper-based to digital code reviews and grading when support for manual assessment was released in version 3.7 of Dodona (summer 2020).
Although online reviewing positively impacted our productivity, the biggest gain did not come from an immediate speed-up in the process of generating feedback and grades compared to the paper-based approach.
While time-on-task remained about the same, our online source code reviews were much more elaborate than what we produced before on printed copies of student submissions.
This was triggered by the improved reusability of digital annotations and the prospect of streamlined feedback delivery.
Where delivering custom feedback only requires a single click once the assessment of an evaluation has been completed in Dodona, it previously took us much more effort to distribute our paper-based feedback.
Students were direct beneficiaries of more and richer feedback, as observed from the fact that 75% of our students looked at their personalized feedback within 24 hours after it had been released, before we even published grades in Dodona.
What did not change is that we complement personalized feedback with collective feedback sessions in which we discuss model solutions for test and exam assignments, nor did the low number of questions we receive from students about their personalized feedback.
As a future development, we hope to reduce the time spent on manual assessment through improved computer-assisted reuse of digital source code annotations in Dodona (see Chapter\nbsp{}[[#chap:feedback]]).

We primarily rely on automated assessment as a first step in providing formative feedback while students work on their mandatory assignments.
After all, a back-of-the-envelope calculation tells us it would take us 72 full-time equivalents (FTE) to generate equivalent amounts of manual feedback for mandatory assignments compared to what we do for tests and exams.
In addition to volume, automated assessment also yields the responsiveness needed to establish an interactive feedback loop\nbsp{}[cite:@gibbsConditionsWhichAssessment2005].
Automated assessment thus allows us to motivate students to work through enough programming assignments and to stimulate their self-monitoring and self-regulated learning\nbsp{}[cite:@schunkSelfregulationLearningPerformance1994; @pintrichUnderstandingSelfregulatedLearning1995].
It also triggers additional questions from students, which we manage to respond to with one-to-one personalized human tutoring, either synchronously during hands-on sessions or asynchronously through Dodona's Q&A module.
We observe that individual students seem to have a strong bias towards either asking for face-to-face help during hands-on sessions or asking questions online.
This could be influenced by the time when they mainly work on their assignments, by their way of collaborating on assignments, or by reservations because of perceived threats to self-esteem or social embarrassment\nbsp{}[cite:@newmanStudentsPerceptionsTeacher1993; @karabenickRelationshipAcademicHelp1991].

In computing a final score for the course, we try to find an appropriate balance between stimulating students to find solutions for programming assignments themselves and encouraging them to collaborate with and learn from peers, instructors and teachers while working on assignments.
The final score is computed as the sum of a score obtained for the exam (80%) and a score for each unit that combines the student's performance on the mandatory and test assignments (10% per unit).
We use Dodona's grading module to determine scores for tests and exams based on correctness, programming style, the choice between different programming techniques and the overall quality of the implementation.
The score for a unit is calculated as the score \(s\) for the two test assignments multiplied by the fraction \(f\) of mandatory assignments the student has solved correctly.
A solution for a mandatory assignment is considered correct if it passes all unit tests.
Evaluating mandatory assignments therefore doesn't require any human intervention, except for writing unit tests when designing the assignments, and is performed entirely by our Python judge.
In our experience, most students traditionally perform much better on mandatory assignments compared to test and exam assignments\nbsp{}[cite:@glassFewerStudentsAre2022], given the possibilities for collaboration on mandatory assignments.

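As a worked example (with the maximum scores chosen purely for illustration), the sketch below computes a final course score from an exam score, the two test scores \(s\) and the fractions \(f\) of correctly solved mandatory assignments.

#+BEGIN_SRC python
# Worked example of the scoring scheme described above.
# Assumption for this sketch: the exam and each unit are graded out of 20,
# and the final score weights them as 80% and 10% + 10%.

def unit_score(test_score: float, solved: int, mandatory: int = 30) -> float:
    """Unit score = test score s multiplied by the fraction f of correctly
    solved mandatory assignments in that unit (30 per unit, 60 in total)."""
    return test_score * (solved / mandatory)

def final_score(exam: float, unit1: float, unit2: float) -> float:
    return 0.8 * exam + 0.1 * unit1 + 0.1 * unit2

# A student with 15/20 on the exam, 16/20 and 14/20 on the two unit tests,
# and 27/30 and 24/30 correctly solved mandatory assignments per unit:
u1 = unit_score(16, solved=27)   # 16 * 0.9 = 14.4
u2 = unit_score(14, solved=24)   # 14 * 0.8 = 11.2
print(final_score(15, u1, u2))   # 0.8*15 + 0.1*14.4 + 0.1*11.2 = 14.56
#+END_SRC
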
*** Open and collaborative learning environment
:PROPERTIES:
:CREATED: [2023-10-24 Tue 11:59]
:CUSTOM_ID: subsec:useopen
:END:

We strongly believe that effective collaboration among small groups of students is beneficial for learning\nbsp{}[cite:@princeDoesActiveLearning2004], and encourage students to collaborate and ask questions to tutors and other students during and outside lab sessions.
We also demonstrate how they can embrace collaborative coding and pair programming services provided by modern integrated development environments\nbsp{}[cite:@williamsSupportPairProgramming2002; @hanksPairProgrammingEducation2011].
However, we do recommend that they collaborate in groups of no more than three students, and that they exchange and discuss ideas and strategies for solving assignments rather than sharing literal code with each other.
After all, our main reason for working with mandatory assignments is to give students sufficient opportunity to learn topic-oriented programming skills by applying them in practice, and shared solutions spoil the learning experience.
The factor \(f\) in the score for a unit encourages students to keep fine-tuning their solutions for programming assignments until all test cases succeed before the deadline passes.
But maximizing that factor without properly learning the programming skills will likely yield a low test score \(s\) and thus an overall low score for the unit, even if many mandatory exercises were solved correctly.

Fostering an open collaboration environment to work on mandatory assignments with strict deadlines, and taking those assignments into account for computing the final score, is a potential promoter of plagiarism, but using them as a weight factor for the test score rather than as an independent score item should promote learning by making sure that plagiarism is not rewarded.
It takes some effort to properly explain this to students.
We initially used MOSS\nbsp{}[cite:@schleimerWinnowingLocalAlgorithms2003] and now use Dolos\nbsp{}[cite:@maertensDolosLanguageagnosticPlagiarism2022] to monitor submitted solutions for mandatory assignments, both before and at the deadline.
The solution space for the first few mandatory assignments is too small for linking high similarity to plagiarism: submitted solutions only contain a few lines of code and the diversity of implementation strategies is small.
But at some point, as the solution space broadens, we start to see highly similar solutions that are reliable signals of code exchange among larger groups of students.
Strikingly, this usually happens among students enrolled in the same study programme (Figure\nbsp{}[[fig:usefweplagiarism]]).
As soon as this happens -- typically in week 3 or 4 of the course -- plagiarism is discussed during the next lecture.
Usually this is a lecture about working with the string data type, so we can introduce plagiarism detection as a possible application of string processing.

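The plagiarism graphs in Figure\nbsp{}[[fig:usefweplagiarism]] can be read as follows: every submission is a node, an edge connects two submissions whose pairwise similarity exceeds a threshold (0.8 in the figure), and clusters are the connected components. The sketch below illustrates that construction with made-up similarity scores and the ~networkx~ library; it is not how Dolos itself is implemented.

#+BEGIN_SRC python
# Illustrative sketch with made-up data: build a plagiarism graph from
# pairwise similarity scores and report the clusters (connected components).
# Dolos computes such similarities from the submitted source files itself;
# this sketch is not its implementation.
import networkx as nx

similarities = {                      # hypothetical pairwise similarities
    ("alice", "bob"): 0.95,
    ("bob", "carol"): 0.85,
    ("carol", "dave"): 0.30,
    ("dave", "erin"): 0.88,
}
THRESHOLD = 0.8                       # same threshold as used in the figure

graph = nx.Graph()
graph.add_nodes_from({name for pair in similarities for name in pair})
for (left, right), similarity in similarities.items():
    if similarity >= THRESHOLD:
        graph.add_edge(left, right, weight=similarity)

for cluster in nx.connected_components(graph):
    if len(cluster) > 1:
        print(sorted(cluster))        # e.g. ['alice', 'bob', 'carol']
#+END_SRC
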
#+CAPTION: Dolos plagiarism graphs for the Python programming assignment "\pi{}-ramidal constants" that was created and used for a test of the 2020--2021 edition of the course (left) and reused as a mandatory assignment in the 2021--2022 edition (right).
#+CAPTION: Graphs constructed from the last submission before the deadline of 142 and 382 students respectively.
#+CAPTION: The colour of each node represents the student's study programme.
#+CAPTION: Edges connect highly similar pairs of submissions, with the similarity threshold set to 0.8 in both graphs.
#+CAPTION: Edge directions are based on submission timestamps in Dodona.
#+CAPTION: Clusters of connected nodes are highlighted with a distinct background colour and have one node with a solid border that indicates the first correct submission among all submissions in that cluster.
#+CAPTION: All students submitted unique solutions during the test, except for two students who confessed they exchanged a solution during the test.
#+CAPTION: Submissions for the mandatory assignment show that most students work either individually or in groups of two or three students, but we also observe some clusters of four or more students who exchanged solutions and submitted them with varying types and amounts of modifications, often hardly any at all.
#+NAME: fig:usefweplagiarism
[[./images/usefweplagiarism.png]]

In an announcement entitled "copy-paste \neq{} learn to code", we show students some pseudonymized Dolos plagiarism graphs that act as mirrors to make them reflect upon which node in the graph they could be (Figure\nbsp{}[[fig:usefweplagiarism]]).
We stress that the learning effect dramatically drops in groups of four or more students.
Typically, we notice that in such a group only one or a few students make the effort to learn to code, while the other students usually piggyback by copy-pasting solutions.
We make students aware that understanding someone else's code for programming assignments is a lot easier than trying to find solutions themselves.
Over the years, we have experienced that a lot of students are caught in the trap of genuinely believing that being able to understand code is the same as being able to write code that solves a problem, until they take a test at the end of a unit.
That's where the test score \(s\) comes into play.
After all, the goal of summative tests is to evaluate if individual students have acquired the skills to solve programming challenges on their own.

When talking to students about plagiarism, we also point out that the plagiarism graphs are directed graphs, indicating which student is the potential source of exchanging a solution among a cluster of students.
We specifically address these students by pointing out that they are probably good at programming and might want to exchange their solutions with other students as a way to help their peers.
Instead of really helping them out though, they actually take away learning opportunities from their fellow students by giving away the solution as a spoiler.
Stated differently, they help maximize the factor \(f\) but effectively also reduce the factor \(s\), where both factors need to be high to yield a high score for the unit.
After this lecture, we usually notice a stark decline in the number of plagiarized solutions.

The goal of plagiarism detection at this stage is prevention rather than penalization, because we want students to take responsibility for their learning.
The combination of realizing that teachers and instructors can easily detect plagiarism and an upcoming test that evaluates whether students can solve programming challenges on their own usually has an immediate and persistent effect on reducing cluster sizes in the plagiarism graphs to at most three students.
At the same time, the signal is given that plagiarism detection is one of the tools we have to detect fraud during tests and exams.
The entire group of students is only addressed once about plagiarism, without going into detail about how plagiarism detection itself works, because we believe that overemphasizing this topic is not very effective, and explaining how it works might drive students towards spending time thinking about how to bypass the detection process, which is time they would better spend on learning to code.
Every three or four years we see a persistent cluster of students exchanging code for mandatory assignments over multiple weeks.
If this is the case, we individually address these students to point them to their responsibilities once more, differentiating between students who share their solution and students who receive solutions from others.

Tests and exams, on the other hand, are taken on campus under human surveillance and allow no communication with fellow students or other persons (and, more recently, no use of generative AI).
Students can work on their personal computers and get exactly two hours to solve two programming assignments during a test, and three hours and thirty minutes to solve three programming assignments during an exam.

Tests and exams are "open book/open Internet", so any hard copy and digital resources can be consulted while solving test or exam assignments.
Students are instructed that they can only be passive users of the Internet: all information available on the Internet at the start of a test or exam can be consulted, but no new information can be added.
When taking over code fragments from the Internet, students have to add a proper citation as a comment in their submitted source code.

After each test and exam, we again use MOSS/Dolos to detect and inspect highly similar code snippets among submitted solutions and to find convincing evidence that they result from exchange of code or other forms of interpersonal communication (Figure\nbsp{}[[fig:usefweplagiarism]]).
If we catalogue cases as plagiarism beyond reasonable doubt, the examination board is informed to take further action\nbsp{}[cite:@maertensDolosLanguageagnosticPlagiarism2022].

*** Workload for running a course edition
:PROPERTIES:
:CREATED: [2023-10-24 Tue 13:46]
:CUSTOM_ID: subsec:useworkload
:END:

To organize "open book/open Internet" tests and exams that are valid and reliable, we always create new assignments and avoid assignments whose solutions or parts thereof are readily available online.
At the start of a test or exam, we share a token link that gives students access to the assignments in a hidden series on Dodona.

For each edition of the course, mandatory assignments were initially a combination of selected test and exam exercises reused from the previous edition of the course and newly designed exercises.
The former gave students an idea about the level of exercises they could expect during tests and exams, while the latter avoided solution slippage.
As feedback for the students, we publish sample solutions for all mandatory exercises after the weekly deadline has passed.
This also implies that students must strictly adhere to deadlines, because sample solutions are available afterwards.
As deadlines are very clear and adjusted to the time zone settings in Dodona, we never experience discussions with students about deadlines.

After nine editions of the course, we felt we had a large enough portfolio of exercises to start reusing mandatory exercises from four or more years ago instead of designing new exercises for each edition.
However, we still continue to design new exercises for each test and exam.
After each test and exam, exercises are published and students receive manual reviews on the code they submitted, on top of the automated feedback they already got during the test or exam.
But in contrast to mandatory exercises, we do not publish sample solutions for test and exam exercises, so that these exercises can be reused during the next edition of the course.
When students ask for sample solutions of test or exam exercises, we explain that we want to give the next generation of students the same learning opportunities they had.

So far, we have created more than 900 programming assignments for this introductory Python course alone.
All these assignments are publicly shared on Dodona as open educational resources\nbsp{}[cite:@hylenOpenEducationalResources2021; @tuomiOpenEducationalResources2013; @wileyOpenEducationalResources2014; @downesModelsSustainableOpen2007; @caswellOpenEducationalResources2008].
They are used in many other courses on Dodona (on average 10.8 courses per assignment) and by many students (on average 503.7 students and 4\thinsp{}801.5 submitted solutions per assignment).
We estimate that it takes about 10 person-hours on average to create a new assignment for a test or an exam: 2 hours for coming up with an idea, 30 minutes for implementing and tweaking a sample solution that meets the educational goals of the assignment and can be used to generate a test suite for automated assessment, 4 hours for describing the assignment (including background research), 30 minutes for translating the description from Dutch into English, one hour to configure support for automated assessment, and another 2 hours for reviewing the result by some extra pairs of eyes.

Generating a test suite usually takes 30 to 60 minutes for assignments that can rely on basic test and feedback generation features that are built into the judge.
The configuration for automated assessment might take 2 to 3 hours for assignments that require more elaborate test generation or that need to extend the judge with custom components for dedicated forms of assessment (e.g.\nbsp{}assessing non-deterministic behaviour) or feedback generation (e.g.\nbsp{}generating visual feedback).
[cite/t:@keuningSystematicLiteratureReview2018] found that publications rarely describe how difficult and time-consuming it is to add assignments to automated assessment platforms, or even if this is possible at all.

The ease of extending Dodona with new programming assignments is reflected by more than {{{num_exercises}}} assignments that have been added to the platform so far.
Our experience is that configuring support for automated assessment only takes a fraction of the total time for designing and implementing assignments for our programming course, and in absolute numbers stays far away from the one person-week reported for adding assignments to Bridge\nbsp{}[cite:@bonarBridgeIntelligentTutoring1988].
Because the automated assessment infrastructure of Dodona provides common resources and functionality through a Docker container and a judge, the assignment-specific configuration usually remains lightweight.
Only around 5% of the assignments need extensions on top of the built-in test and feedback generation features of the judge.

So how much effort does it cost us to run one edition of our programming course?
For the most recent 2021--2022 edition we estimate about 34 person-weeks in total (Table\nbsp{}[[tab:usefweworkload]]), the bulk of which is spent on on-campus tutoring of students during hands-on sessions (30%), manual assessment and grading (22%), and creating new assignments (21%).
About half of the workload (53%) is devoted to summative feedback through tests and exams: creating assignments, supervision, manual assessment and grading.
Most of the other work (42%) goes into providing formative feedback through on-campus and online assistance while students work on their mandatory assignments.
Out of 2\thinsp{}215 questions that students asked through Dodona's online Q&A module, 1\thinsp{}983 (90%) were answered by teaching assistants and 232 (10%) were marked as answered by the student who originally asked the question.
Because automated assessment provides first-line support, the need for human tutoring is already heavily reduced.
We have drastically cut the time we initially spent on mandatory assignments by reusing existing assignments and because the Python judge is stable enough to require hardly any maintenance or further development.

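The percentages quoted above follow directly from the hour estimates in Table\nbsp{}[[tab:usefweworkload]], as the following small calculation illustrates.

#+BEGIN_SRC python
# Recomputing the workload percentages from the estimates in the table.
hours = {
    "lectures": 60,
    "hands-on sessions": 390,
    "answering online questions": 100,
    "other mandatory-assignment work": 50,     # 10 + 30 + 10
    "create new assignments": 270,
    "supervise tests and exams": 130,
    "manual assessment": 288,
    "plagiarism detection": 2,
}
total = sum(hours.values())                    # 1290 hours, about 34 person-weeks

print(round(100 * hours["hands-on sessions"] / total))        # 30 (tutoring)
print(round(100 * hours["manual assessment"] / total))        # 22 (grading)
print(round(100 * hours["create new assignments"] / total))   # 21
print(round(100 * (270 + 130 + 288 + 2) / total))             # 53 (tests & exams)
print(round(100 * (50 + 390 + 100) / total))                  # 42 (mandatory assignments)
#+END_SRC
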
#+CAPTION: Estimated workload to run the 2021--2022 edition of the introductory Python programming course for 442 students with 1 lecturer, 7 teaching assistants and 3 undergraduate students who serve as teaching assistants\nbsp{}[cite:@gordonUndergraduateTeachingAssistants2013].
#+NAME: tab:usefweworkload
| Task                                | Estimated workload (hours) |
|-------------------------------------+----------------------------|
| <l>                                 |                        <r> |
| Lectures                            |                         60 |
|-------------------------------------+----------------------------|
| Mandatory assignments               |                        540 |
| \emsp{} Select assignments          |                         10 |
| \emsp{} Review selected assignments |                         30 |
| \emsp{} Tips & tricks               |                         10 |
| \emsp{} Automated assessment        |                          0 |
| \emsp{} Hands-on sessions           |                        390 |
| \emsp{} Answering online questions  |                        100 |
|-------------------------------------+----------------------------|
| Tests & exams                       |                        690 |
| \emsp{} Create new assignments      |                        270 |
| \emsp{} Supervise tests and exams   |                        130 |
| \emsp{} Automated assessment        |                          0 |
| \emsp{} Manual assessment           |                        288 |
| \emsp{} Plagiarism detection        |                          2 |
|-------------------------------------+----------------------------|
| Total                               |              1\thinsp{}290 |

*** Learning analytics and educational data mining
|
||
:PROPERTIES:
|
||
:CREATED: [2023-10-24 Tue 14:04]
|
||
:CUSTOM_ID: subsec:uselearninganalytics
|
||
:END:
|
||
|
||
A longitudinal analysis of student submissions across the term shows that most learning happens during the 13 weeks of educational activities and that students don't have to catch up on practising their programming skills during the exam period (Figure\nbsp{}[[fig:usefwecoursestructure]]).
Active learning thus effectively avoids procrastination.
We observe that students submit solutions every day of the week and show increased activity around hands-on sessions and in the run-up to the weekly deadlines (Figure\nbsp{}[[fig:usefwepunchcard]]).
Weekends are also used to work further on programming assignments, but students seem to value a good night's sleep.
|
||
|
||
#+CAPTION: Punchcard from the Dodona learning analytics page showing the distribution per weekday and per hour of all 331\thinsp{}734 solutions submitted during the 2021--2022 edition of the course (442 students).
|
||
#+NAME: fig:usefwepunchcard
|
||
[[./images/usefwepunchcard.png]]
|
||
|
||
Throughout a course edition, we use Dodona's series analytics to monitor how students perform on our selection of programming assignments (Figures\nbsp{}[[fig:usefweanalyticssubmissions]],\nbsp{}[[fig:usefweanalyticsstatuses]],\nbsp{}and\nbsp{}[[fig:usefweanalyticscorrect]]).
This allows us to make informed decisions and appropriate interventions, for example when students experience issues with the automated assessment configuration of a particular assignment, or when the original order of assignments in a series does not seem to align with our design goal to present them in increasing order of difficulty.
The first students to start working on assignments are usually strong performers.
Seeing these early birds having trouble with one of the assignments may give an early warning that action is needed, such as improving the problem specification, adding extra tips & tricks, or better explaining certain programming concepts to all students during lectures or hands-on sessions.
Conversely, observing that many students postpone working on their assignments until just before the deadline might indicate that some assignments are simply too hard at that point in the students' learning pathway, or that completing the collection of programming assignments interferes with the workload from other courses.
Such "deadline hugging" patterns are also a good breeding ground for students to resort to exchanging solutions with each other.
|
||
|
||
#+CAPTION: Distribution of the number of student submissions per programming assignment.
|
||
#+CAPTION: The larger the zone, the more students submitted a particular number of solutions.
|
||
#+CAPTION: Black dot indicates the average number of submissions per student.
|
||
#+NAME: fig:usefweanalyticssubmissions
|
||
[[./images/usefweanalyticssubmissions.png]]
|
||
|
||
#+CAPTION: Distribution of top-level submission statuses per programming assignment.
|
||
#+NAME: fig:usefweanalyticsstatuses
|
||
[[./images/usefweanalyticsstatuses.png]]
|
||
|
||
#+CAPTION: Progression over time of the percentage of students that correctly solved each assignment.
|
||
#+CAPTION: The visualisation starts two weeks before the deadline, which is on the 19th of October.
|
||
#+NAME: fig:usefweanalyticscorrect
|
||
[[./images/usefweanalyticscorrect.png]]
|
||
|
||
Using educational data mining techniques on historical data exported from several editions of the course, we further investigated which aspects of practising programming skills promote or inhibit learning, and which have little or no effect on the learning process\nbsp{}(see Chapter\nbsp{}[[#chap:passfail]]).
It won't come as a surprise that midterm test scores are good predictors of a student's final grade, because tests and exams are both summative assessments that are organized and graded in the same way.
However, we found that organizing a final exam at the end of the term is still a catalyst for learning, even for courses with a strong focus on active learning during the weeks of educational activities.
|
||
|
||
In evaluating if students gain deeper understanding when learning from their mistakes while working progressively on their programming assignments, we found the old adage that practice makes perfect to depend on what kind of mistakes students make.
|
||
Learning to code requires mastering two major competences:
|
||
#+ATTR_LATEX: :environment enumerate*
|
||
#+ATTR_LATEX: :options [label={\emph{\roman*)}}, itemjoin={{, }}, itemjoin*={{, and }}]
|
||
- getting familiar with the syntax and semantics of a programming language to express the steps for solving a problem in a formal way, so that the algorithm can be executed by a computer
|
||
- problem-solving itself.
|
||
It turns out that staying stuck longer on compilation errors (mistakes against the syntax of the programming language) inhibits learning, whereas taking progressively more time to get rid of logical errors (reflective of solving a problem with a wrong algorithm) as assignments get more complex actually promotes learning.
|
||
After all, time spent in discovering solution strategies while thinking about logical errors can be reclaimed multifold when confronted with similar issues in later assignments\nbsp{}[cite:@glassFewerStudentsAre2022].
|
||
|
||
These findings neatly align with the claim of\nbsp{}[cite/t:@edwardsSeparationSyntaxProblem2018] that problem-solving is a higher-order learning task in the Taxonomy by\nbsp{}[cite/t:@bloom1956handbook] (analysis and synthesis) than language syntax (knowledge, comprehension, and application).
|
||
|
||
Using historical data from previous course editions, we can also make highly accurate predictions about which students will pass or fail the current course edition\nbsp{}(see Chapter\nbsp{}[[#chap:passfail]]).
This can already be done a few weeks into the course, so remedial actions for at-risk students can be started well in time.
The approach is privacy-friendly, as we only need to process metadata on student submissions for programming assignments and results from automated and manual assessment extracted from Dodona.
Provided that cohort sizes are large enough, historical data from a single course edition already suffice to make accurate predictions.
|
||
|
||
* Under the hood: technical architecture and design
|
||
:PROPERTIES:
|
||
:CREATED: [2023-10-23 Mon 08:49]
|
||
:CUSTOM_ID: chap:technical
|
||
:END:
|
||
|
||
Dodona and its ecosystem comprise a lot of code.
|
||
This chapter discusses the technical background of Dodona itself\nbsp{}[cite:@vanpetegemDodonaLearnCode2023] and a stand-alone online code editor, Papyros (\url{https://papyros.dodona.be}), that was integrated into Dodona\nbsp{}[cite:@deridderPapyrosSchrijvenUitvoeren2022].
|
||
We also discuss two judges that were developed in the context of this dissertation.
|
||
The R judge was written entirely by myself\nbsp{}[cite:@nustRockerversePackagesApplications2020].
|
||
The TESTed judge was first prototyped in a master's thesis\nbsp{}[cite:@vanpetegemComputationeleBenaderingenVoor2018] and was further developed in two other master's theses\nbsp{}[cite:@selsTESTedProgrammeertaalonafhankelijkTesten2021; @strijbolTESTedOneJudge2020].
|
||
In this chapter we assume the reader is familiar with Dodona's features and how they are used, as detailed in Chapters\nbsp{}[[#chap:what]]\nbsp{}and\nbsp{}[[#chap:use]].
|
||
|
||
** Dodona
|
||
:PROPERTIES:
|
||
:CREATED: [2023-10-23 Mon 08:49]
|
||
:CUSTOM_ID: sec:techdodona
|
||
:END:
|
||
|
||
To ensure that Dodona[fn:: https://github.com/dodona-edu/dodona] is robust against sudden increases in workload and can serve hundreds of concurrent users, it has a multi-tier service architecture that delegates different parts of the application to different servers, as can be seen in Figure\nbsp{}[[fig:technicaldodonaservers]].
More specifically, the web server, database (MySQL) and caching system (Memcached) each run on their own machine.
In addition, a scalable pool of interchangeable worker servers is available to automatically assess incoming student submissions.
In this section, we will highlight a few of these components.
|
||
|
||
#+CAPTION: Diagram of all the servers involved with running and developing Dodona.
|
||
#+CAPTION: The role of each server in the deployment is listed below its name.
|
||
#+CAPTION: Worker servers are marked in blue, development servers are marked in red.
|
||
#+CAPTION: Servers are connected if they communicate.
|
||
#+CAPTION: The direction of the connection signifies which server initiates the connection.
|
||
#+CAPTION: Every server also has an implicit connection with Phocus (the monitoring server), since metrics such as load, CPU usage and disk usage are collected on every server and sent to Phocus.
|
||
#+CAPTION: The Pandora server is greyed out because it has been decommissioned (see Section\nbsp{}[[#subsec:techdodonatutor]] for more info).
|
||
#+NAME: fig:technicaldodonaservers
|
||
[[./diagrams/technicaldodonaservers.svg]]
|
||
|
||
*** The Dodona web application
|
||
:PROPERTIES:
|
||
:CREATED: [2023-11-23 Thu 17:12]
|
||
:CUSTOM_ID: subsec:techdodonaweb
|
||
:END:
|
||
|
||
The user-facing part of Dodona runs on the main web server, which is also called Dodona (see Figure\nbsp{}[[fig:technicaldodonaservers]]).
Dodona is a Ruby-on-Rails web application, currently running on Ruby 3.1 and Rails 7.1.
We use Apache 2.4.52 to proxy requests to the actual application.
We follow the Rails-standard way of organizing functionality in models, views and controllers.
In Rails, requests are sent to the appropriate action in the appropriate controller by the router.
In these actions, models are queried and/or edited, after which they are used to construct the data for a response.
This data is then rendered by the corresponding view, which can be HTML, JSON, JavaScript, or even CSV.

The way we handle complex logic in the frontend has seen a number of changes over the years.
When Dodona was started, there were only a few places where JavaScript was used.
Dodona also used the Rails-standard way of serving dynamically generated JavaScript to replace parts of pages (e.g. for pagination or search).
With the introduction of more complex features like evaluations, we switched to using lightweight web components where this made sense.
We also eliminated jQuery, because more and more of its functionality was implemented natively by browsers.
Lastly, all JavaScript was rewritten in TypeScript.
|
||
|
||
**** Security and performance
|
||
:PROPERTIES:
|
||
:CREATED: [2024-01-10 Wed 14:23]
|
||
:CUSTOM_ID: subsubsec:techdodonawebsecurity
|
||
:END:
|
||
|
||
Another important aspect of running a public web application is its security.
|
||
Dodona needs to operate in a challenging environment where students simultaneously submit untrusted code to be executed on its servers ("remote code execution as a service") and expect automatically generated feedback, ideally within a few seconds.
|
||
Many design decisions are therefore aimed at maintaining and improving the performance, reliability, and security of its systems.
|
||
This includes using Cloudflare as a CDN and common protections such as a Content Security Policy or Cross Site Request Forgery protection, but is also reflected in the implementation of Dodona itself, as we will explain in this section.
|
||
|
||
Dodona grew from a platform used mostly by people we knew personally into one used in secondary schools all over Flanders.
As a consequence, we went from being able to fully trust exercise authors to having to reduce that trust, since it is impossible for a team of our size to vet all the people we give teacher's rights in Dodona.
This meant that our threat model, and therefore the security measures we had to take, also changed over the years.
Once Dodona was opened up to more and more teachers, we gradually locked down what teachers could do with e.g. their exercise descriptions.
Content where teachers can inject raw HTML into Dodona was moved to iframes, so that teachers can still be as creative as they want while writing exercises, without being able to execute JavaScript in a session where users are logged in.
For user content where this creative freedom is less necessary (e.g. series or course descriptions), but where some Markdown/HTML content is still wanted, we sanitize the (generated) HTML so that it can only include HTML elements and attributes that are explicitly allowed.
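
To illustrate the idea behind this allowlist-based sanitization, the sketch below uses Python and the =bleach= library; Dodona itself relies on the sanitization helpers that ship with Rails, and the allowed tags and attributes shown here are made up for the example.

#+BEGIN_SRC python
import bleach

# Hypothetical allowlist: anything not mentioned here is stripped from the HTML.
ALLOWED_TAGS = ["a", "b", "em", "strong", "p", "ul", "ol", "li", "code", "pre", "img"]
ALLOWED_ATTRIBUTES = {"a": ["href", "title"], "img": ["src", "alt"]}


def sanitize_description(html: str) -> str:
    """Keep only explicitly allowed elements and attributes."""
    return bleach.clean(html, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRIBUTES, strip=True)


# The onclick handler and the <script> element do not survive sanitization,
# while the harmless markup is kept as-is.
print(sanitize_description('<p onclick="evil()">Hello <b>world</b><script>steal()</script></p>'))
#+END_SRC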
|
||
|
||
One of the most important components of Dodona is the feedback table.
It has therefore seen a lot of security, optimization and UI work over the years.
Judge and exercise authors (and even students, through their submissions) determine a lot of the content that eventually ends up in the feedback table.
Therefore, the same sanitization that is used for series and course descriptions is applied to the messages that are added to the feedback table (since these can contain Markdown and arbitrary HTML as well).
The increase in the number of teachers adding exercises to Dodona also meant that the variety in the feedback they give grew, sometimes resulting in a huge number of test cases and very long outputs.

Optimization work was needed to cope with this volume of feedback.
For example, one of the biggest optimizations was in how expected and generated results are diffed and how these diffs are rendered.
When Dodona was first written, the library used for creating diffs of the generated and expected results (=diffy=[fn:: https://github.com/samg/diffy]) actually shelled out to the GNU =diff= command.
Its output was then parsed and transformed into HTML by the library using find and replace operations.
As one might expect, starting a new process and doing a lot of string operations every time outputs had to be diffed resulted in very slow loading times for the feedback table.
The library was replaced with a pure Ruby library (=diff-lcs=[fn:: https://github.com/halostatue/diff-lcs]), and its outputs were built into HTML using Rails' efficient =Builder= class.
This change of diffing method also fixed a number of bugs we had been experiencing along the way.

Even this was not enough to handle the most extreme exercises, though.
Diffing hundreds of lines hundreds of times still takes a long time, even when done in-process by JIT-optimized code.
The resulting feedback tables also contained so much HTML that the browsers on our development machines (which are pretty powerful) noticeably slowed down when loading and rendering them.
To handle these cases, we needed to do less work and to output less HTML.
We decided to only diff line-by-line (instead of character-by-character) in most of these cases, and to not diff at all in the most extreme cases, which also reduces the amount of HTML required to render the results.
This was also motivated by usability: if there are lots of small differences between a very long generated and expected output, the diff view in the feedback table can become visually overwhelming for students.
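
The difference between the two granularities can be sketched with Python's standard =difflib= module; Dodona's actual implementation uses the Ruby =diff-lcs= gem mentioned above, so the snippet is only meant to illustrate why line-level diffing is so much cheaper.

#+BEGIN_SRC python
import difflib

# Two outputs of 1000 lines that differ on every seventh line.
expected = "\n".join(f"line {i}" for i in range(1000))
generated = "\n".join(f"line {i}" if i % 7 else f"LINE {i}" for i in range(1000))

# Character-level diffing compares two strings of roughly 8000 characters each.
char_opcodes = difflib.SequenceMatcher(a=expected, b=generated).get_opcodes()

# Line-level diffing only compares 1000 tokens, which is far cheaper and
# produces far fewer HTML fragments to render in the feedback table.
line_opcodes = difflib.SequenceMatcher(
    a=expected.splitlines(), b=generated.splitlines()
).get_opcodes()

print(len(char_opcodes), "character-level edits vs", len(line_opcodes), "line-level edits")
#+END_SRC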
|
||
|
||
*** Judging submissions
|
||
:PROPERTIES:
|
||
:CREATED: [2024-01-10 Wed 14:01]
|
||
:CUSTOM_ID: subsec:techdodonajudging
|
||
:END:
|
||
|
||
Student submissions are automatically assessed in background jobs by our worker servers (Salmoneus, Sisyphus, Tantalus, Tityos and Ixion; Figure\nbsp{}[[fig:technicaldodonaservers]]).
To divide the work over these servers we make use of a job queue, based on =delayed_job=[fn:: https://github.com/collectiveidea/delayed_job].
Each worker server has 6 job runners, which regularly poll the job queue when idle.

For proper isolation we use Docker containers\nbsp{}[cite:@pevelerComparingJailedSandboxes2019] that use OS-level containerization technologies and define runtime environments in which all data and executable software (e.g. scripts, compilers, interpreters, linters, database systems) are provided and executed.
These resources are typically pre-installed in the image of the container.
Prior to launching the actual assessment, the container is extended with the submission, the judge and the resources included in the assessment configuration (Figure\nbsp{}[[fig:technicaloutline]]).
Additional resources can be downloaded and/or installed during the assessment itself, provided that internet access is granted to the container.
When the container is started, limits are placed on the resources it can consume.
This includes limits on runtime, memory usage, disk usage, network access and the number of processes a container can have running at the same time.
Some of these limits are (partially) configurable per exercise, but sane upper bounds are always applied.
This is also the case for network access: even if the container is allowed internet access, it cannot reach other Dodona hosts (such as the database server).
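
What such limits look like in practice can be sketched with the standard Docker CLI; the image name, entry point and concrete limit values below are hypothetical (Dodona configures its containers from Ruby), but the flags are the stock Docker options for memory, CPU, process and network restrictions.

#+BEGIN_SRC python
import subprocess

def run_assessment(workdir: str) -> subprocess.CompletedProcess:
    """Start a (hypothetical) assessment container with hard resource limits."""
    return subprocess.run(
        [
            "docker", "run", "--rm",
            "--network", "none",      # no network access at all in this example
            "--memory", "512m",       # hard cap on memory usage
            "--cpus", "1",            # at most one CPU worth of compute
            "--pids-limit", "64",     # bound the number of processes
            "--volume", f"{workdir}:/home/runner/workdir:ro",
            "dodona-judge-image",     # hypothetical image name
            "/main/run",              # hypothetical judge entry point
        ],
        capture_output=True,
        text=True,
        timeout=60,                   # wall-clock time limit, enforced from the outside
    )
#+END_SRC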
|
||
|
||
#+CAPTION: Outline of the procedure to automatically assess a student submission for a programming assignment.
|
||
#+CAPTION: Dodona instantiates a Docker container (1) from the image linked to the assignment (or from the default image linked to the judge of the assignment) and loads the submission and its metadata (2), the judge linked to the assignment (3) and the assessment resources of the assignment (4) into the container.
|
||
#+CAPTION: Dodona then launches the actual assessment, collects and bundles the generated feedback (5), and stores it into a database along with the submission and its metadata.
|
||
#+NAME: fig:technicaloutline
|
||
[[./images/technicaloutline.png]]
|
||
|
||
The actual assessment of the student submission is done by a software component called a /judge/\nbsp{}[cite:@wasikSurveyOnlineJudge2018].
The judge must be robust enough to provide feedback on all possible submissions for the assignment, especially submissions that are incorrect or that deliberately attempt to tamper with the automatic assessment procedure\nbsp{}[cite:@forisekSuitabilityProgrammingTasks2006].
Following the principles of software reuse, the judge is ideally also a generic framework that can be used to assess submissions for multiple assignments.
This is enabled by the submission metadata that is passed when calling the judge, which includes the path to the source code of the submission, the path to the assessment resources of the assignment and other metadata such as programming language, natural language, time limit and memory limit.

Rather than providing a fixed set of judges, Dodona adopts a minimalistic interface that allows third parties to create new judges: automatic assessment is bootstrapped by launching the judge's =run= executable, which fetches the JSON-formatted submission metadata from standard input and must generate JSON-formatted feedback on standard output.
The feedback has a standardized hierarchical structure that is specified in a JSON schema[fn:: https://github.com/dodona-edu/dodona/tree/main/public/schemas].
At the lowest level, /tests/ are a form of structured feedback expressed as a pair of generated and expected results.
They typically test some behaviour of the submitted code against expected behaviour.
Tests can have a brief description and snippets of unstructured feedback called messages.
Descriptions and messages can be formatted as plain text, HTML (including images), Markdown, or source code.
Tests can be grouped into /test cases/, which in turn can be grouped into /contexts/ and eventually into /tabs/.
All these hierarchical levels can have descriptions and messages of their own and serve no other purpose than visually grouping tests in the user interface.
At the top level, a submission has a fine-grained status that reflects the overall assessment of the submission: =compilation error= (the submitted code did not compile), =runtime error= (executing the submitted code failed during assessment), =memory limit exceeded= (the memory limit was exceeded during assessment), =time limit exceeded= (assessment did not complete within the given time), =output limit exceeded= (too much output was generated during assessment), =wrong= (assessment completed but not all strict requirements were fulfilled), or =correct= (assessment completed and all strict requirements were fulfilled).
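
To make this interface concrete, below is a deliberately minimal, hypothetical judge written in Python: it reads the submission metadata from standard input and emits feedback with a single tab, context, test case and test. The field names are simplified and purely illustrative; the authoritative structure is the JSON schema referenced above.

#+BEGIN_SRC python
#!/usr/bin/env python3
"""A toy judge: it accepts every submission that contains a print call."""
import json
import sys

metadata = json.load(sys.stdin)            # submission metadata passed by Dodona
with open(metadata["source"]) as handle:   # path to the submitted source code (key name illustrative)
    submission = handle.read()

correct = "print" in submission            # the actual 'assessment'
feedback = {
    "accepted": correct,
    "status": "correct" if correct else "wrong",
    "groups": [{                           # one tab ...
        "groups": [{                       # ... containing one context ...
            "groups": [{                   # ... containing one test case ...
                "tests": [{                # ... with a single structured test
                    "generated": "a print call" if correct else "no print call",
                    "expected": "a print call",
                    "accepted": correct,
                }],
            }],
        }],
    }],
}
json.dump(feedback, sys.stdout)            # feedback read back in by Dodona
#+END_SRC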
|
||
|
||
Taken together, a Docker image, a judge and a programming assignment configuration (including both a description and an assessment configuration) constitute a /task package/ as defined by\nbsp{}[cite:@verhoeffProgrammingTaskPackages2008]: a unit Dodona uses to render the description of the assignment and to automatically assess its submissions.
|
||
However, Dodona's layered design embodies the separation of concerns\nbsp{}[cite:@laplanteWhatEveryEngineer2007] needed to develop, update and maintain the three modules in isolation and to maximize their reuse: multiple judges can use the same Docker image and multiple programming assignments can use the same judge.
|
||
Related to this, an explicit design goal for judges is to make the assessment configuration for individual assignments as lightweight as possible.
|
||
After all, minimal configurations reduce the time and effort teachers and instructors need to create programming assignments that support automated assessment.
|
||
Sharing of data files and multimedia content among the programming assignments in a repository also implements the inheritance mechanism for /bundle packages/ as hinted by\nbsp{}[cite/t:@verhoeffProgrammingTaskPackages2008].
|
||
Another form of inheritance is specifying default assessment configurations at the directory level, which takes advantage of the hierarchical grouping of learning activities in a repository to share common settings.
|
||
|
||
*** Python Tutor
|
||
:PROPERTIES:
|
||
:CREATED: [2024-01-17 Wed 13:23]
|
||
:CUSTOM_ID: subsec:techdodonatutor
|
||
:END:
|
||
|
||
The Python Tutor\nbsp{}[cite:@guoOnlinePythonTutor2013] is a debugger built into Dodona.
It provides timeline debugging: for each step in the timeline, corresponding to a line being executed, all variables on the stack are visualized.

The deployment of the Python Tutor has also seen a number of changes over the years.
The Python Tutor itself is written in Python, so it could not be part of the Dodona web application itself.
It started out as a Docker container on the same server as the main Dodona web application.
Because it is used mainly by students who want to figure out their mistakes, the service responsible for running student code could become overwhelmed, and in extreme cases even make the entire server unresponsive.
After we identified this issue, the Python Tutor was moved to its own server (Pandora in Figure\nbsp{}[[fig:technicaldodonaservers]]).
This did not prevent the Tutor itself from becoming overwhelmed, however, which meant that students who depended on the Tutor were sometimes unable to use it.
This of course happened most during the periods when the Tutor was used a lot, such as evaluations and exams.
One can imagine that having the Tutor suddenly fail is not a great experience for students who are already quite stressed out about the exam they are taking.
In the meantime, we had started to experiment with running Python code client-side in the browser (see Section\nbsp{}[[#sec:papyros]] for more info).
Because these experiments were successful, we migrated the Python Tutor from its own server to being run by students in their own browser using Pyodide.
This means that the only student who can be impacted by the Python Tutor failing for a testcase is that student themselves, and because the Tutor now runs on a device that is under a far lighter load, it fails much less often.
In practice, we received no questions or complaints about the Python Tutor's performance after these changes, even during exams where 460 students were submitting simultaneously.
|
||
|
||
*** Development process
|
||
:PROPERTIES:
|
||
:CREATED: [2023-11-23 Thu 17:13]
|
||
:CUSTOM_ID: subsec:techdodonadevelopment
|
||
:END:
|
||
|
||
Development of Dodona is done on GitHub.
|
||
Over the years, Dodona has seen over {{{num_commits}}} commits by {{{num_contributors}}} contributors, and there have been {{{num_releases}}} releases.
|
||
All new features and bug fixes are added to the =main= branch through pull requests, of which there have been about {{{num_prs}}}.
|
||
These pull requests are reviewed by (at least) two developers of the Dodona team before they are merged.
|
||
We also treat pull requests as a form of internal documentation by writing an extensive PR description and adding screenshots for all visual changes or additions.
|
||
The extensive test suite also runs automatically for every pull request (using GitHub Actions), and developers are encouraged to add new tests for each feature or bug fix.
|
||
We've also made it very easy to deploy to our testing (Mestra) and staging (Naos) environments so that reviewers can test changes without having to spin up their local development instance of Dodona.
|
||
These are the two unconnected servers seen in Figure\nbsp{}[[fig:technicaldodonaservers]].
|
||
Mestra runs a Dodona instance much like the instance developers use locally.
|
||
There is no production data present and in fact, the database is wiped and reseeded on every deploy.
|
||
Naos is much closer to the production setup.
|
||
It runs on a pseudonymized version of the production database, and has all the judges configured.
|
||
|
||
We also make sure that our dependencies are always up-to-date using Dependabot[fn:: https://docs.github.com/en/code-security/dependabot/working-with-dependabot].
By updating our dependencies regularly, we avoid being confronted with incompatibilities that take a long time to resolve when an important security update comes along.
Since Dodona is accessible over the public web, it would be problematic if we could not apply security updates quickly.
|
||
|
||
The way we release Dodona has seen a few changes over the years.
|
||
We've gone from a few large releases with bugfix point-releases between them, to lots of smaller releases, to now a /release/ per pull request.
|
||
Releasing every pull request immediately after merging makes getting feedback from our users a very quick process.
|
||
When we did versioned releases we also wrote release notes at the time of release.
|
||
Because we don't have versioned releases any more, we now bundle the changes into release notes for every month.
|
||
They are mostly autogenerated from the merged PRs, but bigger features are given more context and explanation.
|
||
|
||
*** Deployment process
|
||
:PROPERTIES:
|
||
:CREATED: [2023-11-23 Thu 17:13]
|
||
:CUSTOM_ID: subsec:techdodonadeployment
|
||
:END:
|
||
|
||
After a pull request is merged, it is automatically deployed by a GitHub action.
This action first runs all the tests again, then deploys to the staging server and finally to the production servers.
Since Naos runs on a copy of the production database, any migration that would fail in production fails there first and stops the deployment before it reaches production.
This way we can be sure the actual production database is never left in an inconsistent migration state.
The actual deployment is done by Capistrano[fn:: https://capistranorb.com/].
Capistrano allows us to roll back any deploy and makes clever use of symlinking to make sure that deploys happen without any service interruption.

Backups of the database are automatically saved every day and kept for 12 months.
The backups are rotated according to a grandfather-father-son scheme\nbsp{}[cite:@jessen2010overview].
The backups are taken by dumping a replica database.
The replica is used because dumping the main database write-locks it while it is being dumped, which would leave Dodona unusable for a significant amount of time.
We regularly test the backups by restoring them on Naos.
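
The grandfather-father-son rotation itself boils down to thinning out older dumps, as the toy sketch below shows; the concrete retention windows are made up for the example and are not Dodona's actual configuration.

#+BEGIN_SRC python
from datetime import date, timedelta

def keep_backup(backup_day: date, today: date) -> bool:
    """Decide whether a daily dump is retained under a GFS-style scheme."""
    age = (today - backup_day).days
    if age < 0 or age > 365:                  # nothing older than a year is kept
        return False
    if age <= 7:                              # sons: every dump of the past week
        return True
    if age <= 31:                             # fathers: one dump per week
        return backup_day.isoweekday() == 7   # keep the Sunday dump
    return backup_day.day == 1                # grandfathers: one dump per month

today = date(2024, 6, 19)
kept = [today - timedelta(days=n) for n in range(366)
        if keep_backup(today - timedelta(days=n), today)]
print(f"{len(kept)} of 366 daily dumps retained")
#+END_SRC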
|
||
|
||
We also have an extensive monitoring and alerting system in place, based on Grafana[fn:: https://grafana.com/].
|
||
This gives us some superficial analytics about Dodona usage, but can also tell us if there are problems with one of our servers.
|
||
See Figure\nbsp{}[[fig:technicaldashboard]] for an example of the data this dashboard gives us.
|
||
The analytics are also calculated using the replica database to avoid putting unnecessary load on our main production database.
|
||
|
||
The web server and worker servers also send notifications when an error occurs at runtime.
This is one of the main ways we discover bugs that got through our tests, since our users don't regularly report bugs themselves.
We also get notified when there are long-running requests, since we consider it a bug in itself when users have to wait a long time for the page they requested.
These notifications were an important driver to optimize some pages or to make certain operations asynchronous.
|
||
|
||
#+CAPTION: Grafana dashboard for Dodona, giving a quick overview of important metrics.
|
||
#+NAME: fig:technicaldashboard
|
||
[[./images/technicaldashboard.png]]
|
||
|
||
** Papyros
|
||
:PROPERTIES:
|
||
:CREATED: [2023-11-23 Thu 17:29]
|
||
:CUSTOM_ID: sec:papyros
|
||
:END:
|
||
|
||
Papyros[fn:: https://github.com/dodona-edu/papyros] is a stand-alone basic online IDE we developed, primarily focused on secondary education (see Figure\nbsp{}[[fig:technicalpapyros]] for a screenshot).
|
||
Recurring feedback we got from secondary education teachers when introducing Dodona to them was that Dodona did not have a simple way for students to run and test their code themselves.
|
||
Testing their code in this case also means manually typing a response to an input prompt when an =input= statement is run by the interpreter.
|
||
In the educational practice that Dodona was born out of, this omission was an explicit design decision.
We wanted to guide students towards using a local IDE instead of programming in Dodona directly, since they will not have Dodona available as their programming environment if they need to program later in life.
This goal does not apply to secondary education.
In that context, the challenge of programming is already big enough without complicating things by installing a full IDE, with lots of buttons and menus that students will never use.
Students might also be working on devices that they don't own (e.g. school PCs), where installing an IDE might not even be possible.
|
||
|
||
#+CAPTION: User interface of Papyros.
|
||
#+CAPTION: The editor can be seen on the left, with the output window to the right of it.
|
||
#+CAPTION: The input window is below the output window and is currently in batch mode.
|
||
#+CAPTION: All empty text fields have placeholder text that explains how they can be used.
|
||
#+NAME: fig:technicalpapyros
|
||
[[./images/technicalpapyros.png]]
|
||
|
||
There are a few reasons why we could not initially offer a simple online IDE.
Even though we can use a lot of the infrastructure generously offered by Ghent University, these resources are not limitless.
The extra (interactive) evaluation of student code was something we did not have the resources for, nor did we have any architectural components in place to easily integrate this into Dodona.
The main goal of Papyros was thus to provide a client-side Python execution environment that we could then include in Dodona.
We focused on Python because it is the most widely used programming language in secondary education, at least in Flanders.
Note that we don't want to replace Dodona's entire execution model with client-side execution, as the client is an untrusted execution environment where debugging tools could be used to manipulate the results.
Because the main idea was integration into Dodona, we primarily wanted users to be able to execute entire programs; offering a REPL was not an initial goal.

Given that the target audience for Papyros is secondary education students, we identified a number of secondary requirements:
- The editor of our online IDE should have syntax highlighting.
  Recent literature\nbsp{}[cite:@hannebauerDoesSyntaxHighlighting2018] has shown that this does not necessarily have an impact on students' learning, but as the authors point out, it was the prevailing wisdom for a long time that it does help.
- It should also include linting.
  Linters notify students about syntax errors, but also about style guide violations and anti-patterns.
- Error messages for errors that occur during execution should be user-friendly\nbsp{}[cite:@beckerCompilerErrorMessages2019].
- Code completion should be available.
  When starting out with programming, it is hard to remember all the different functions available.
  Completion frameworks allow students to search for functions, and can show inline documentation for these functions.
|
||
|
||
*** Execution
|
||
:PROPERTIES:
|
||
:CREATED: [2023-11-27 Mon 17:28]
|
||
:CUSTOM_ID: subsec:papyrosexecution
|
||
:END:
|
||
|
||
Python can't be executed directly by a browser, since only JavaScript and WebAssembly are natively supported.
|
||
We investigated a number of solutions for running Python code in the browser.
|
||
|
||
The first of these is Brython[fn:: https://brython.info].
|
||
Brython works by transpiling Python code to JavaScript, where the transpilation is implemented in JavaScript.
|
||
The project is conceptualized as a way to develop web applications in Python, and not to run arbitrary Python code in the browser, so a lot of its tooling is not directly applicable to our use case, especially concerning interactive input prompts.
|
||
It also runs on the main thread of the browser, so executing a student's code would freeze the browser until it is done running.
|
||
|
||
Another solution we looked into is Skulpt[fn:: https://skulpt.org].
|
||
It also transpiles Python code to JavaScript, and supports Python 2 and Python 3.7.
|
||
After loading Skulpt, a global object is added to the page where Python code can be executed through JavaScript.
|
||
|
||
The final option we looked into was Pyodide[fn:: https://pyodide.org/en/stable].
|
||
Pyodide was initially developed by Mozilla as part of their Iodide project, aiming to make scientific research shareable and reproducible via the browser.
|
||
It is now a stand-alone project.
|
||
Pyodide is a port of the Python interpreter to WebAssembly, allowing it to be executed by the browser.
|
||
Since the project is focused on scientific research, it has wide support for external libraries such as NumPy.
|
||
Because Pyodide can be treated as a regular JavaScript library, it can be run in a web worker, making sure that the page stays responsive while the user's code is being executed.
|
||
|
||
We also looked into integrating other platforms such as Repl.it, but all of them were either not free or did not provide a suitable interface for integration.
|
||
We chose to base Papyros on Pyodide given its active development, support for recent Python versions and its ability to be executed on a separate thread.
|
||
|
||
*** Implementation
|
||
:PROPERTIES:
|
||
:CREATED: [2023-11-27 Mon 17:28]
|
||
:CUSTOM_ID: subsec:papyrosimplementation
|
||
:END:
|
||
|
||
There are two aspects to the implementation: the user interface and the technical inner workings.
Given that Papyros will primarily be used by secondary school students, the user interface is an important part that should not be neglected.
|
||
|
||
**** User interface
|
||
:PROPERTIES:
|
||
:CREATED: [2023-11-29 Wed 14:48]
|
||
:CUSTOM_ID: subsubsec:papyrosui
|
||
:END:
|
||
|
||
The most important choice in the user interface was the choice of the editor.
|
||
There were three main options:
|
||
#+ATTR_LATEX: :environment enumerate*
|
||
#+ATTR_LATEX: :options [label={\emph{\roman*)}}, itemjoin={{, }}, itemjoin*={{, and }}]
|
||
- Ace[fn:: https://ace.c9.io/]
|
||
- Monaco[fn:: https://microsoft.github.io/monaco-editor/]
|
||
- CodeMirror[fn:: https://codemirror.net/].
|
||
|
||
Ace was the editor used by Dodona at the time.
It supports syntax highlighting and has some built-in linting.
However, it is not very extensible, it doesn't support mobile devices well, and it is no longer actively developed.

Monaco is the editor extracted from Visual Studio Code and is often used to build full-fledged web IDEs.
It also has syntax highlighting and linting and is much more extensible.
As with Ace, though, support for mobile devices is lacking.

CodeMirror is a modern editor made for the web, and not linked to any specific project.
It is also extensible and has modular syntax highlighting and linting support.
In contrast to Ace and Monaco, it has very good support for mobile devices.
Its documentation is also very clear and extensive.
Given these clear advantages, we decided to use CodeMirror for Papyros.

The two other main components of Papyros are the output window and the input window.
The output window is a simple read-only text area.
The input window is a text area that has two modes: interactive mode and batch input.
In interactive mode, the user is expected to type the input needed by their program the moment it asks for it (similar to running their program on the command line and answering the prompts when they appear).
In batch mode, the user can prefill all the input required by their program.
||
|
||
**** Inner workings
|
||
:PROPERTIES:
|
||
:CREATED: [2023-11-29 Wed 14:48]
|
||
:CUSTOM_ID: subsubsec:papyrosinner
|
||
:END:
|
||
|
||
Since Pyodide does the heavy lifting of executing the actual Python code, most of the implementation work consisted of making Pyodide run in a web worker and hooking up the Python internals to our user interface.
The communication between the main UI thread and the web worker happens via message passing.
With message passing, all data has to be copied: functions, classes or HTML elements cannot be transferred at all.
To avoid copying large amounts of data back and forth, shared memory can be used instead.
To work correctly with shared memory, synchronization primitives have to be used.

After loading Pyodide, we load a Python script that overwrites certain functions with our own versions.
For example, base Pyodide overwrites =input= with a function that calls into JavaScript-land and executes =prompt=.
Since we run Pyodide in a web worker, =prompt= is not available (and we want to implement custom input handling anyway).
For =input= we run into another problem: =input= is synchronous in Python.
In a normal Python environment, =input= only returns a value once the user has entered a line of text on the command line.
We don't want to rewrite user code (to make it asynchronous), because that process is error-prone and fragile.
So we need a way to make our overwritten version of =input= synchronous as well.
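
The Python side of that replacement is conceptually simple, as the minimal sketch below shows: =input= is swapped for a wrapper that asks the host environment for a line and blocks until one arrives. The =read_line_from_host= hook is hypothetical; making such a hook genuinely synchronous is exactly the problem discussed next, and the real implementation lives in the =python_runner= package mentioned below.

#+BEGIN_SRC python
import builtins

def make_input(read_line_from_host):
    """Build a replacement for input() that delegates to a synchronous host hook.

    `read_line_from_host` is assumed to block until the user has entered a line
    in the Papyros input window, and to then return that line as a string.
    """
    def patched_input(prompt=""):
        if prompt:
            print(prompt, end="", flush=True)  # echo the prompt to the output window
        return read_line_from_host()           # blocks, just like input() normally does
    return patched_input

# Installed once, right after the Pyodide environment has been set up.
builtins.input = make_input(read_line_from_host=lambda: "stubbed user input")
#+END_SRC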
|
||
|
||
The best way to do this is by using the synchronization primitives that shared memory offers.
We can block until another thread writes to a certain memory location, and since blocking is synchronous, this makes our =input= synchronous as well.
Unfortunately, not all browsers supported shared memory at the time.
Browsers that did support it also severely constrain the environment in which it can be used, since a number of CPU side-channel attacks related to it were discovered.

Luckily, there is another way to make the browser perform indefinite synchronous operations from a web worker.
Web workers can perform synchronous HTTP requests.
We can intercept these HTTP requests with a service worker.
Service workers were originally conceived to allow web applications to continue functioning even when devices go offline; in that case, a service worker can respond to network requests with data it has in its cache.
So, putting this together, the web worker tells the main thread that it needs input and then fires off a synchronous HTTP request to some non-existent endpoint.
The service worker intercepts this request and only responds once it receives the input from the main thread.

The functionality for performing synchronous communication with the main thread from a web worker was parcelled off into its own library (=sync-message=[fn:: https://github.com/alexmojaki/sync-message]).
This library decides which of the two methods to use, depending on the environment that is available.
Another package, =python_runner=[fn:: https://github.com/alexmojaki/python_runner], bundles all required modifications to the Python environment in Pyodide.
This work was done in collaboration with Alex Hall.
|
||
|
||
**** Extensions
|
||
:PROPERTIES:
|
||
:CREATED: [2023-12-07 Thu 15:19]
|
||
:CUSTOM_ID: subsubsec:papyrosextensions
|
||
:END:
|
||
|
||
CodeMirror supports a number of functionalities out of the box, such as linting and code completion.
It is, however, a pure JavaScript library.
This means that Python-specific linting and code completion had to be provided separately, since the standard tooling for Python is almost entirely implemented in Python itself.
Fortunately, CodeMirror also supports supplying one's own linting messages and code completions.
Since we already have a working Python environment, we can use it to run the standard Python tools for linting (Pylint) and code completion (Jedi) and hook their results up to CodeMirror.
For code completion this has the added benefit of also showing the documentation for the completed items, which is especially useful for people new to programming (which is exactly our target audience).
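
The sketch below gives the flavour of the Python side of that integration: Jedi produces completion candidates, together with their documentation, for a given cursor position, and the result only needs to be reshaped into whatever structure the CodeMirror completion source expects. The reshaping shown here is hypothetical and simplified, and the Pylint side is omitted; the real implementation also deals with web worker communication.

#+BEGIN_SRC python
import jedi

def complete(code: str, line: int, column: int) -> list:
    """Return completion candidates with documentation for a cursor position."""
    script = jedi.Script(code)
    return [
        {
            "label": completion.name,
            "type": completion.type,         # e.g. 'function' or 'keyword'
            "info": completion.docstring(),  # shown as inline documentation
        }
        for completion in script.complete(line, column)
    ]

# Completing 'pri' at line 1, column 3 suggests 'print', including its docstring.
print(complete("pri", 1, 3)[0]["label"])
#+END_SRC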
|
||
|
||
Usability was further improved by adding the =FriendlyTraceback= library.
=FriendlyTraceback= is a Python library that rewrites Python error messages to be clearer to beginners, by explicitly answering questions such as where and why an error occurred.
An example of what this looks like can be seen in Figure\nbsp{}[[fig:technicalpapyrostraceback]].
|
||
|
||
#+CAPTION: Papyros execution where a student tried to add a type declaration to a variable, which =FriendlyTraceback= shows a fitting error message for.
|
||
#+NAME: fig:technicalpapyrostraceback
|
||
[[./images/technicalpapyrostraceback.png]]
|
||
|
||
** R judge
|
||
:PROPERTIES:
|
||
:CREATED: [2023-10-23 Mon 08:49]
|
||
:CUSTOM_ID: sec:techr
|
||
:END:
|
||
|
||
Because Dodona had proven itself as a useful tool for teaching Python and Java to students, colleagues teaching statistics started asking if we could build R support into Dodona.
We started working on an R judge[fn:: https://github.com/dodona-edu/judge-r] soon after.
By now, more than 1\thinsp{}250 R exercises have been added, and almost 1 million submissions have been made to R exercises.

Because R is the /lingua franca/ of statistics, a few extra features are needed that are not typically handled by judges, such as handling data frames and outputting visual graphs (or even evaluating whether a graph was built correctly).
Another feature that teachers wanted, and that we had not built into a judge before, is support for inspecting the student's source code, e.g. to make sure that certain functions were or were not used.
|
||
|
||
*** Exercise API
|
||
:PROPERTIES:
|
||
:CREATED: [2024-01-05 Fri 14:06]
|
||
:CUSTOM_ID: subsec:techrapi
|
||
:END:
|
||
|
||
The API for the R judge was designed to follow the visual structure of the feedback table as closely as possible, as can be seen in the sample evaluation code in Listing\nbsp{}[[lst:technicalrsample]].
Tabs are represented by different evaluation files.
In addition to the =testEqual= function demonstrated in Listing\nbsp{}[[lst:technicalrsample]], there are a number of other functions that specifically support the requested functionality.
=testImage= sets up some handlers in the R environment so that generated plots (or other images) are sent to the feedback table (as a base64-encoded string) instead of to the filesystem.
By default it also makes the test fail if no image was generated (but it does not verify the image contents).
An example of what the feedback table looks like when an image is generated can be seen in Figure\nbsp{}[[fig:technicalrplot]].
=testDF= has some extra functionality for testing the equality of data frames, where it is possible to ignore row and column order.
The generated feedback is also limited to 5 lines of output, to avoid overwhelming students (and their browsers) with the entire table.
=testGGPlot= can be used to introspect plots generated with GGPlot\nbsp{}[cite:@wickhamGgplot2CreateElegant2023].
To test whether students use certain functions, =testFunctionUsed= and =testFunctionUsedInVar= can be used.
The latter tests whether a specific function is used when initializing a specific variable.
|
||
|
||
#+CAPTION: Feedback table showing the feedback for an R exercise where the goal is to generate a plot.
|
||
#+CAPTION: The code generates a plot showing a simple sine function, which is reflected in the feedback table.
|
||
#+NAME: fig:technicalrplot
|
||
[[./images/technicalrplot.png]]
|
||
|
||
If some code needs to be executed in the student's environment before the student's code is run (e.g. to make some dataset available, or to fix a random seed), the =preExec= argument of the =context= function can be used to do so.
|
||
|
||
#+CAPTION: Sample evaluation code for a simple R exercise.
#+CAPTION: The feedback table will contain one context with two test cases in it.
#+CAPTION: The first test case checks whether some t-test was performed correctly, and does this by performing two equality checks.
#+CAPTION: The second test case checks that the \(p\)-value calculated by the t-test is correct.
#+CAPTION: The =preExec= is executed in the student's environment and here fixes a random seed for the student's execution.
#+NAME: lst:technicalrsample
#+ATTR_LATEX: :float t
#+BEGIN_SRC r
context({
  testcase('The correct method was used', {
    testEqual("test$alternative",
              function(studentEnv) {
                studentEnv$test$alternative
              },
              'two.sided')
    testEqual("test$method",
              function(studentEnv) {
                studentEnv$test$method
              },
              ' Two Sample t-test')
  })
  testcase('p value is correct', {
    testEqual("test$p.value",
              function(studentEnv) {
                studentEnv$test$p.value
              },
              0.175)
  })
}, preExec = {
  set.seed(20190322)
})
#+END_SRC
|
||
|
||
*** Security
|
||
:PROPERTIES:
|
||
:CREATED: [2024-01-05 Fri 14:06]
|
||
:CUSTOM_ID: subsec:techrsecurity
|
||
:END:
|
||
|
||
Besides the API for teachers creating exercises, encapsulation of student code is also an important part of a judge.
Students should not be able to access functions defined by the judge, or be able to find the correct solution or the evaluating code.
The R judge ensures this by making extensive use of environments.
This is also reflected in the teacher API: teachers can access variables or execute functions in the student environment, but this environment has to be explicitly passed to the function generating the student result.
In R, all environments except the root environment have a parent, essentially creating a tree structure of environments.
In most cases this tree is actually a path, but in the R judge, the student environment is explicitly attached to the base environment.
This even ensures that libraries loaded by the judge are not initially available to the student code (thus allowing teachers to test whether students can correctly load libraries).
The judge itself runs in an anonymous environment, so that even students with intimate knowledge of the inner workings of R and the judge itself would not be able to find it.
|
||
|
||
The judge is also programmed very defensively.
|
||
Every time execution is handed off to student code (or even teacher code), appropriate error handlers and output redirections are installed.
|
||
This prevents the student and teacher code from e.g. writing to standard output (and thus messing up the JSON expected by Dodona).
|
||
|
||
** TESTed
|
||
:PROPERTIES:
|
||
:CREATED: [2023-10-23 Mon 08:49]
|
||
:CUSTOM_ID: sec:techtested
|
||
:END:
|
||
|
||
TESTed[fn:: https://github.com/dodona-edu/universal-judge] is a universal judge for Dodona.
TESTed was developed to solve two major drawbacks of Dodona's existing judge system:
- When creating the same exercise in multiple programming languages, the exercise description and test cases need to be redone for every programming language.
  This is especially relevant for the very simple exercises that students almost always start with, and for exercises in algorithms courses, where the programming language a student solves an exercise in matters less than the way they solve it.
  Mistakes in an exercise also have to be fixed in every instance of that exercise.
- The judges themselves have to be created from scratch every time.
  Most judges offer the same basic concepts and features, most of which are independent of the programming language (communication with Dodona, checking correctness, I/O,\nbsp{}...).

The goal of TESTed was to implement a judge so that programming exercises only have to be created once to be available in all programming languages TESTed supports.
An exercise should also not have to be changed when support for a new programming language is added.
As a secondary goal, we also wanted to make it as easy as possible to create new exercises.
Teachers who have not used Dodona before should be able to create a new basic exercise without too many issues.
|
||
|
||
We first developed it as a proof of concept in my master's thesis\nbsp{}[cite:@vanpetegemComputationeleBenaderingenVoor2018], which presented a method for estimating the computational complexity of solutions for programming exercises.
|
||
One of the goals was to make this method work over many programming languages.
|
||
To do this, we wrote a framework based on Jupyter kernels[fn:: https://jupyter.org] where the interaction with each programming language was abstracted away behind a common interface.
|
||
We realized this framework could be useful in itself, but it was only developed as far as we needed for the thesis.
|
||
Further work then developed this proof of concept into the full judge we will present in the following sections.
|
||
|
||
*** Overview
:PROPERTIES:
:CREATED: [2024-01-05 Fri 14:03]
:CUSTOM_ID: subsec:techtestedoverview
:END:

TESTed generally works using the following steps:
1. Receive the submission, exercise test plan, and any auxiliary files from Dodona.
1. Validate the test plan and make sure the submission's programming language is supported for the given exercise.
1. Generate test code for each context in the test plan.
1. Optionally compile the test code, either in batch mode or per context.
   This step is skipped when evaluating a submission written in an interpreted language.
1. Execute the test code.
1. Evaluate the results, either with programming language-specific evaluation, programmed evaluation, or generic evaluation.
1. Send the evaluation results to Dodona.

In the following sections we will expand on these steps, using an example exercise to demonstrate them in practice.
In this exercise, students need to rotate a list.
For example, in Python, ~rotate([0, 1, 2, 3, 4], 2)~ should return ~[3, 4, 0, 1, 2]~.
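
One possible reference solution for this exercise (not necessarily the one used in the actual exercise) is a short Python function based on slicing:

#+BEGIN_SRC python
def rotate(numbers, shift):
    """Rotate a list to the right by `shift` positions."""
    shift %= len(numbers)  # tolerate shifts larger than the list
    return numbers[-shift:] + numbers[:-shift] if shift else list(numbers)

assert rotate([0, 1, 2, 3, 4], 2) == [3, 4, 0, 1, 2]
#+END_SRC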
|
||
|
||
*** Test plan
|
||
:PROPERTIES:
|
||
:CREATED: [2024-01-02 Tue 10:23]
|
||
:CUSTOM_ID: subsec:techtestedtestplan
|
||
:END:
|
||
|
||
One of the most important elements that is needed to perform these steps is the test plan.
|
||
This test plan is a hierarchical structure, which closely resembles the underlying structure of Dodona's feedback table.
|
||
There are, however, a few important differences.
|
||
The first of these is the /context testcase/.
|
||
This is a special testcase per context that executes the main function (or the entire program in case this is more appropriate for the language being executed).
|
||
The only possible inputs for this testcase are text for the standard input stream, command-line arguments and files in the working directory.
|
||
The exit status code can only be checked in this testcase as well.
|
||
|
||
Like the communication with Dodona, this test plan is a JSON document under the hood.
In the following sections, we will use the JSON representation of the test plan to discuss how TESTed works.
Exercise authors write their tests in a DSL, which we discuss in Section\nbsp{}[[#subsec:techtesteddsl]].
This DSL is internally converted by TESTed to the more extensive underlying structure before execution.
A test plan for the example exercise can be seen in Listing\nbsp{}[[lst:technicaltestedtestplan]].
|
||
|
||
#+CAPTION: Basic structure of a test plan.
#+CAPTION: The structure of Dodona's feedback table is followed closely.
#+CAPTION: The function arguments have been left out, as they are explained in Section\nbsp{}[[#subsec:techtestedserialization]].
#+NAME: lst:technicaltestedtestplan
#+ATTR_LATEX: :float t
#+BEGIN_SRC js
{
  "tabs": [
    {
      "name": "Feedback",
      "contexts": [
        {
          "testcases": [
            {
              "input": {
                "type": "function",
                "name": "rotate",
                "arguments": [
                  ...
                ]
              },
              "output": {
                "result": {
                  "value": {
                    ...
                  }
                }
              }
            },
            ...
          ]
        }
      ]
    }
  ]
}
#+END_SRC
|
||
|
||
*** Data serialization
|
||
:PROPERTIES:
|
||
:CREATED: [2024-01-02 Tue 10:50]
|
||
:CUSTOM_ID: subsec:techtestedserialization
|
||
:END:
|
||
|
||
As part of the test plan, we also need a way to generically describe values and their types.
|
||
This is what we will call the /serialization format/.
|
||
The serialization format should be able to represent all the basic data types we want to support in the programming language independent part of the test plan.
|
||
These data types are basic primitives like integers, reals (floating point numbers), booleans, and strings, but also more complex collection types like arrays (or lists), sets and mapping types (maps, dictionaries, and objects).
|
||
Note that the serialization format is also used on the side of the programming language, to receive (function) arguments and send back execution results.
|
||
|
||
Of course, a number of data serialization formats already exist, like =MessagePack=[fn:: https://msgpack.org/], =ProtoBuf=[fn:: https://protobuf.dev/],\nbsp{}...
Binary formats were excluded from the start, because they can't easily be embedded in our JSON test plan, but more importantly, because they can neither be written nor read by humans.
Other formats did not support all the types we wanted to support and could not be extended to do so.
Because of our goal of supporting many programming languages, the format also had to be either widely implemented or easily implementable.
None of the formats we investigated met all these requirements.
We therefore opted to express the serialization format in JSON as well.
Values are represented by objects containing the encoded value and the accompanying type.
Note that this is a recursive format: the values in a collection are themselves serialized according to this specification.
|
||
|
||
The types of values are split into three categories.
The first category consists of the basic types listed above.
The second category consists of the advanced types.
These are specialized versions of the basic types, used for example to specify the number of bits a number should have, or whether a collection should be a tuple or a list.
The final category of types can only be used to specify an expected type.
In addition to the other categories, =any= can be specified.
As the name suggests, =any= signifies that the expected type is unknown, and the student can therefore return any type.

The encoded expected return value of our example exercise can be seen in Listing\nbsp{}[[lst:technicaltestedtypes]].

#+CAPTION: A list encoded using TESTed's data serialization format.
#+CAPTION: The corresponding Python list would be ~[3, 4, 0, 1, 2]~.
#+NAME: lst:technicaltestedtypes
#+ATTR_LATEX: :float t
#+BEGIN_SRC js
{
  "type": "sequence",
  "data": [
    { "type": "integer", "data": 3 },
    { "type": "integer", "data": 4 },
    { "type": "integer", "data": 0 },
    { "type": "integer", "data": 1 },
    { "type": "integer", "data": 2 }
  ]
}
#+END_SRC

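To make the format more tangible, the sketch below shows how such a recursive encoder could look in Python.
The type names =integer= and =sequence= are taken from the listings in this section; the remaining names (=real=, =boolean=, =text=, =set=, =map=) are illustrative assumptions and do not necessarily match TESTed's exact vocabulary, nor is this TESTed's actual implementation.

#+BEGIN_SRC python
# Minimal sketch of an encoder for the serialization format described above.
# Only "integer" and "sequence" are taken from the listings; the other type
# names are assumptions made for this illustration.
def encode(value):
    if isinstance(value, bool):  # bool first, because bool is a subclass of int
        return {"type": "boolean", "data": value}
    if isinstance(value, int):
        return {"type": "integer", "data": value}
    if isinstance(value, float):
        return {"type": "real", "data": value}
    if isinstance(value, str):
        return {"type": "text", "data": value}
    if isinstance(value, (list, tuple)):
        # Recursive case: every element is serialized in the same way.
        return {"type": "sequence", "data": [encode(element) for element in value]}
    if isinstance(value, set):
        return {"type": "set", "data": [encode(element) for element in value]}
    if isinstance(value, dict):
        return {"type": "map", "data": [
            {"key": encode(key), "value": encode(val)} for key, val in value.items()
        ]}
    raise ValueError(f"unsupported type: {type(value)}")
#+END_SRC

Calling ~encode([3, 4, 0, 1, 2])~ would, for example, produce the structure shown in Listing\nbsp{}[[lst:technicaltestedtypes]].
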
*** Statements
:PROPERTIES:
:CREATED: [2024-01-03 Wed 17:09]
:CUSTOM_ID: subsec:techtestedstatements
:END:

There is more complexity hidden in the idea of creating a variable of a custom type.
It implies that we need to be able to create variables, instead of just capturing the result of function calls or other expressions.
To support this, specific structures were added to the test plan JSON schema.
Listing\nbsp{}[[lst:technicaltestedassignment]] shows what it would look like if we wanted to assign the function argument of our example exercise to a variable.

#+CAPTION: A TESTed testcase containing a statement.
#+CAPTION: The corresponding Python statement would be ~numbers01 = [0, 1, 2, 3, 4]~.
#+NAME: lst:technicaltestedassignment
#+ATTR_LATEX: :float t
#+BEGIN_SRC js
"testcases": [
  {
    "input": {
      "type": "sequence",
      "variable": "numbers01",
      "expression": {
        "type": "sequence",
        "data": [
          { "type": "integer", "data": 0 },
          { "type": "integer", "data": 1 },
          { "type": "integer", "data": 2 },
          { "type": "integer", "data": 3 },
          { "type": "integer", "data": 4 }
        ]
      }
    }
  }
]
#+END_SRC

*** Checking programming language support
:PROPERTIES:
:CREATED: [2024-01-04 Thu 09:16]
:CUSTOM_ID: subsec:techtestedsupport
:END:

We also need to make sure that the programming language of the submission under test is supported by the test plan of its exercise.
Two things are checked: whether the programming language supports all the types that are used, and whether it has all the necessary language constructs.
For example, if the test plan uses a =tuple=, but the language doesn't support it, it's obviously not possible to evaluate a submission in that language.
The same is true for overloaded functions: if a function needs to be callable with both a string and a number, a language like C will not be able to support this.
Collections are also not yet supported for C, since arrays and their lengths work quite differently in C than in most other languages.
Our example exercise will not work in C for this reason.

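The sketch below illustrates the idea behind this check, assuming that each language backend declares which types and language constructs it supports.
The class and attribute names are hypothetical and do not reflect TESTed's actual API.

#+BEGIN_SRC python
from dataclasses import dataclass, field

# Illustrative sketch of the support check described above; the names and
# the example configuration are assumptions, not TESTed's actual API.
@dataclass
class LanguageConfig:
    supported_types: set = field(default_factory=set)
    supported_constructs: set = field(default_factory=set)

def is_supported(config: LanguageConfig, required_types: set, required_constructs: set) -> bool:
    # A test plan can only be executed if every type and construct it uses
    # is available in the language backend.
    return (required_types <= config.supported_types
            and required_constructs <= config.supported_constructs)

# A C-like backend without collection types cannot run the example
# exercise, which needs sequences.
c_like = LanguageConfig(supported_types={"integer", "real", "text"},
                        supported_constructs={"function_call", "assignment"})
print(is_supported(c_like, {"integer", "sequence"}, {"function_call"}))  # False
#+END_SRC
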
*** Execution
:PROPERTIES:
:CREATED: [2024-01-04 Thu 09:43]
:CUSTOM_ID: subsec:techtestedexecution
:END:

To go from the generic test plan to something that can actually be executed in the given language, we need to generate test code.
This is done by way of a templating system.
For each programming language supported by TESTed, a few templates need to be defined.
The serialization format also needs to be implemented in the given programming language.
Because the serialization format is based on JSON and JSON is a widely used format, this requirement is usually pretty easy to fulfil.

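As an impression of what such generated test code could conceptually look like for the example exercise in Python, consider the sketch below.
The module name =submission= and the helper =send_value= are placeholders chosen for this illustration; the real templates also have to deal with exceptions, the different output channels and the communication of serialized results back to TESTed.

#+BEGIN_SRC python
# Hypothetical, heavily simplified impression of generated test code; the
# names "submission" and "send_value" are placeholders, not TESTed's API.
from submission import rotate  # the student's code

def send_value(value):
    # In reality the value would be encoded with the serialization format
    # and written to a result file that TESTed reads afterwards.
    print(value)

numbers01 = [0, 1, 2, 3, 4]
send_value(rotate(numbers01, 2))  # expected: [3, 4, 0, 1, 2]
send_value(rotate(numbers01, 1))  # expected: [4, 0, 1, 2, 3]
#+END_SRC
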
For some languages, the code needs to be compiled as well.
All test code is usually compiled into a single executable, since this requires only one call to the compiler (compilation is usually a relatively slow process).
There is one big drawback to this way of compiling code: if there is a compilation error (for example because a student has not yet implemented all requested functions), the compilation will fail for all contexts.
Because of this, TESTed will fall back to separate compilations for each context if a compilation error occurs.
Subsequently, the test code is executed and its results collected.

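The fallback logic can be summarized with the sketch below.
The helper functions passed in are placeholders for TESTed's internals (templating, per-language compilation commands and result collection).

#+BEGIN_SRC python
# Illustrative sketch of batch compilation with a per-context fallback; the
# callables passed in stand in for TESTed's internal machinery.
def compile_and_run(contexts, compile_batch, compile_single, run):
    success, errors = compile_batch(contexts)  # one (relatively slow) compiler call
    if success:
        return [run(context) for context in contexts]
    # Batch compilation failed (e.g. a requested function is missing), so
    # compile every context separately: only the affected contexts fail.
    results = []
    for context in contexts:
        success, errors = compile_single(context)
        results.append(run(context) if success else ("compilation error", errors))
    return results
#+END_SRC
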
*** Evaluation
:PROPERTIES:
:CREATED: [2024-01-04 Thu 10:45]
:CUSTOM_ID: subsec:techtestedevaluation
:END:

The generated output is usually evaluated by TESTed itself.
TESTed can, however, only evaluate the output in the ways it has been programmed to.
There are two other ways the results can be evaluated: programmed evaluation and programming-language specific evaluation.
With programmed evaluation, the results are passed to code written by a teacher.
For efficiency's sake, this code has to be written in Python (which means TESTed does not need to launch a new process for the evaluation).
This code then checks the results and generates appropriate feedback.
Programming-language specific evaluation is executed immediately after the test code, in the same process as the test code.
This can be used to evaluate programming-language specific concepts, for example the correct use of pointers in C.

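As a hypothetical example of programmed evaluation, a teacher could write a check along the following lines for the rotation exercise.
The signature and return convention are illustrative only; the exact interface TESTed expects from such evaluators is not shown here.

#+BEGIN_SRC python
# Hypothetical programmed evaluation for the rotation exercise; the
# signature and return convention are illustrative, not TESTed's actual
# evaluator interface.
def evaluate(expected, actual):
    if not isinstance(actual, list):
        return False, "Expected a list as the return value."
    if sorted(actual) != sorted(expected):
        return False, "The returned list does not contain the original elements."
    if actual != expected:
        return False, "All elements are present, but they are not rotated correctly."
    return True, "Correct!"
#+END_SRC
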
*** Linting
:PROPERTIES:
:CREATED: [2024-01-04 Thu 10:47]
:CUSTOM_ID: subsec:techtestedlinting
:END:

Besides correctness, style is also an important aspect of source code.
In a lot of contexts, linters are used to perform basic style checks.
Linting was therefore also implemented in TESTed.
For each supported programming language, both the linter to be used and how its output should be interpreted are specified.

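For Python submissions, for example, such a configuration could run Pylint and translate its machine-readable output into annotations, along the lines of the sketch below.
The use of Pylint's JSON output is an assumption made for this illustration, and the annotation dictionaries do not follow Dodona's exact schema.

#+BEGIN_SRC python
import json
import subprocess

# Illustrative sketch: run a linter on a submission and map its messages to
# simple annotations. The linter used and the interpretation of its output
# differ per supported programming language.
def lint_python(submission_path):
    completed = subprocess.run(
        ["pylint", "--output-format=json", submission_path],
        capture_output=True,
        text=True,
    )
    messages = json.loads(completed.stdout or "[]")
    return [
        {
            "row": message["line"],
            "text": message["message"],
            "severity": message["type"],  # e.g. "convention", "warning", "error"
        }
        for message in messages
    ]
#+END_SRC
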
*** DSL
:PROPERTIES:
:CREATED: [2023-12-11 Mon 17:22]
:CUSTOM_ID: subsec:techtesteddsl
:END:

As mentioned in Section\nbsp{}[[#subsec:techtestedtestplan]], exercise authors are not expected to write their test plans in JSON.
JSON is very verbose and error-prone to write by hand (trailing commas are not allowed, all object keys are strings and need to be written as such, etc.).
This aspect of usability was not the initial focus of TESTed, since most Dodona power users already use code to generate their test plans.
Because code is very good at producing an exact and verbose format like JSON, generating test plans programmatically avoids this drawback.
However, we wanted teachers in secondary education to be able to work with TESTed, and they mostly do not have enough programming experience themselves to generate a test plan.
To solve this problem, we wanted to integrate a domain-specific language (DSL) for describing TESTed test plans.

We first investigated whether we could use an existing format to do so.
The best option of these was PEML: the Programming Exercise Markup Language\nbsp{}[cite:@mishraProgrammingExerciseMarkup2023].
PEML is envisioned as a universal format for programming exercise descriptions, so its goals seemed to align with ours.
Unfortunately, PEML is not based on any existing format.
This means that there is little tooling around PEML.
Parsing it as part of TESTed would require a lot of implementation work, and IDEs or other editors don't offer syntax highlighting for it.
The format itself is also quite error-prone to write.
For these reasons, we discarded PEML and started working on our own DSL.

Our own DSL is based on YAML[fn:: https://yaml.org].
YAML is a superset of JSON and describes itself as "a human-friendly data serialization language for all programming languages".
The DSL structure is quite similar to the actual test plan, though it limits the amount of repetition required for common operations.
YAML's concise nature also contributes to the readability and writability of the test plans.

The main addition of the DSL is an abstract programming language, made to look somewhat like Python 3.
Note that this is not a full programming language: it only supports language constructs insofar as TESTed needs them.
Values are interpreted as basic types, but can be cast explicitly to one of the more advanced types.
The DSL version of the test plan for the example exercise can be seen in Listing\nbsp{}[[lst:technicaltesteddsl]].

#+CAPTION: DSL version of the test plan for the example exercise.
#+CAPTION: This version also demonstrates the use of an assignment.
#+NAME: lst:technicaltesteddsl
#+ATTR_LATEX: :float t
#+BEGIN_SRC yaml
- tab: "Feedback"
  contexts:
    - testcases:
        - statement: "numbers01 = [0, 1, 2, 3, 4]"
        - expression: "rotate(numbers01, 2)"
          return: [3, 4, 0, 1, 2]
        - expression: "rotate(numbers01, 1)"
          return: [4, 0, 1, 2, 3]
    - testcases:
        - statement: "numbers02 = [0, 1, 2, 3, 4, 5]"
        - expression: "rotate(numbers02, 2)"
          return: [4, 5, 0, 1, 2, 3]
        - expression: "rotate(numbers02, 1)"
          return: [5, 0, 1, 2, 3, 4]
#+END_SRC

* Pass/fail prediction in programming courses
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:50]
:CUSTOM_ID: chap:passfail
:END:

We now shift to the chapters where we make use of the data provided by Dodona to perform educational data mining research.
This chapter is based on\nbsp{}[cite/t:@vanpetegemPassFailPrediction2022], published in the Journal of Educational Computing Research, and also briefly discusses the work performed in\nbsp{}[cite/t:@zhidkikhReproducingPredictiveLearning2024], published in the Journal of Learning Analytics.

** Introduction
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:50]
:CUSTOM_ID: sec:passfailintro
:END:

A lot of educational opportunities are missed by keeping assessment separate from learning\nbsp{}[cite:@wiliamWhatAssessmentLearning2011; @blackAssessmentClassroomLearning1998].
Educational technology can bridge this divide by providing real-time data and feedback to help students learn better, teachers teach better, and educational systems become more effective\nbsp{}[cite:@oecdOECDDigitalEducation2021].
Earlier research demonstrated that the adoption of interactive platforms may lead to better learning outcomes\nbsp{}[cite:@khalifaWebbasedLearningEffects2002] and allows collecting rich data on student behaviour throughout the learning process in non-invasive ways.
Effectively using such data to extract knowledge and further improve the underlying processes, which is called educational data mining\nbsp{}[cite:@bakerStateEducationalData2009], is increasingly explored as a way to enhance learning and educational processes\nbsp{}[cite:@duttSystematicReviewEducational2017].

About one third of the students enrolled in introductory programming courses fail\nbsp{}[cite:@watsonFailureRatesIntroductory2014; @bennedsenFailureRatesIntroductory2007].
|
||
Such high failure rates are problematic in light of low enrolment numbers and high industrial demand for software engineering and data science profiles\nbsp{}[cite:@watsonFailureRatesIntroductory2014].
|
||
To remedy this situation, it is important to have detection systems for monitoring at-risk students, understand why they are failing, and develop preventive strategies.
|
||
Ideally, detection happens early on in the learning process to leave room for timely feedback and interventions that can help students increase their chances of passing a course.
|
||
|
||
Previous approaches for predicting performance on examinations either take into account prior knowledge such as educational history and socio-economic background of students or require extensive tracking of student behaviour.
|
||
Extensive behaviour tracking may directly impact the learning process itself.
|
||
[cite/t:@rountreeInteractingFactorsThat2004] used decision trees to find that the chance of failure strongly correlates with a combination of academic background, mathematical background, age, year of study, and expectation of a grade other than "A".
|
||
They conclude that students with a skewed view on workload and content are more likely to fail.
|
||
[cite/t:@kovacicPredictingStudentSuccess2012] used data mining techniques and logistic regression on enrolment data to conclude that ethnicity and curriculum are the most important factors for predicting student success.
|
||
They were able to predict success with 60% accuracy.
|
||
[cite/t:@asifAnalyzingUndergraduateStudents2017] combine examination results from the last two years in high school and the first two years in higher education to predict student performance in the remaining two years of their academic study program.
|
||
They used data from one cohort to train models and data from another cohort to test them, reaching a prediction accuracy of about 80%.
This evaluates their models in a scenario similar to the one in which they could be applied in practice.

A downside of the previous studies is that collecting uniform and complete data on student enrolment, educational history and socio-economic background is impractical for use in educational practice.
|
||
Data collection is time-consuming and the data itself can be considered privacy-sensitive.
|
||
Usability of predictive models therefore not only depends on their accuracy, but also on their dependency on findable, accessible, interoperable and reusable data\nbsp{}[cite:@wilkinsonFAIRGuidingPrinciples2016].
|
||
Predictions based on educational history and socio-economic background also raise ethical concerns.
|
||
Such background information definitely does not explain everything and lowers the perceived fairness of predictions\nbsp{}[cite:@grgic-hlacaCaseProcessFairness2018; @binnsItReducingHuman2018].
|
||
A student can also not change their background, so these items are not actionable for any corrective intervention.
|
||
|
||
It might be more convenient and acceptable if predictive models are restricted to data collected on student behaviour during the learning process of a single course.
|
||
An example of such an approach comes from\nbsp{}[cite/t:@vihavainenPredictingStudentsPerformance2013], using snapshots of source code written by students to capture their work attitude.
|
||
Students are actively monitored while writing source code and a snapshot is taken automatically each time they edit a document.
|
||
These snapshots undergo static and dynamic analysis to detect good practices and code smells, which are fed as features to a non-parametric Bayesian network classifier whose pass/fail predictions are 78% accurate by the end of the semester.
|
||
In a follow-up study they applied the same data and classifier to accurately predict learning outcomes for the same student cohort in another course\nbsp{}[cite:@vihavainenUsingStudentsProgramming2013].
|
||
In this case, their predictions were 98.1% accurate, although the sample size was rather small.
|
||
While this procedure does not rely on external background information, it has the drawback that data collection is more invasive and directly intervenes with the learning process.
|
||
Students can't work in their preferred programming environment and have to agree to extensive behaviour tracking.
|
||
|
||
Approaches that are not using machine learning also exist.
|
||
[cite/t:@feldmanAnsweringAmRight2019] try to answer the question "Am I on the right track?" on the level of individual exercises, by checking if the student's current progress can be used as a base to synthesize a correct program.
|
||
However, there is no clear way to transform this type of approach into an estimation of success on examinations.
|
||
[cite/t:@werthPredictingStudentPerformance1986] found significant (\(p < 0.05\)) correlations between students' college grades, the number of hours worked, the number of high school mathematics classes and the students' grades for an introductory programming course.
|
||
[cite/t:@gooldFactorsAffectingPerformance2000] also looked at learning style (surveyed using LSI2) as a factor in addition to demographics, academic ability, problem-solving ability and indicators of personal motivation.
|
||
The regressions in their study account for 42 to 65 percent of the variation in cohort performances.
|
||
|
||
In this chapter, we present an alternative framework (Figure\nbsp{}[[fig:passfailmethodoverview]]) to predict if students will pass or fail a course within the same context of learning to code.
|
||
The method only relies on submission behaviour for programming exercises to make accurate predictions and does not require any prior knowledge or intrusive behaviour tracking.
|
||
Interpretability of the resulting models was an important design goal to enable further investigation on learning habits.
|
||
We also focused on early detection of at-risk students, because predictive models are only effective for the cohort under investigation if remedial actions can be started long before students take their final exam.
|
||
|
||
#+CAPTION: Step-by-step process of the proposed pass/fail prediction framework for programming courses: 1) Collect metadata from student submissions during successive course editions.
|
||
#+CAPTION: 2) Align course editions by identifying corresponding time points and calculating snapshots at these time points.
|
||
#+CAPTION: A snapshot measures student performance only from metadata available in the course edition at the time the snapshot was taken.
|
||
#+CAPTION: 3) Train a machine learning model on snapshot data from previous course editions and predict which students will likely pass or fail the current course edition by applying the model on a snapshot of the current edition.
|
||
#+CAPTION: 4) Infer what learning behaviour has a positive or negative learning effect by interpreting feature weights of the machine learning model.
|
||
#+CAPTION: Teachers can use insights from both steps 3 and 4 to take actions in their teaching practice.
|
||
#+NAME: fig:passfailmethodoverview
|
||
[[./images/passfailmethodoverview.png]]
|
||
|
||
The chapter starts with a description of how data is collected, what data is used and which machine learning methods have been evaluated to make pass/fail predictions.
|
||
We evaluated the same models and features in multiple courses to test their robustness against differences in teaching styles and student backgrounds.
|
||
The results are discussed from a methodological and educational perspective with a focus on
|
||
#+ATTR_LATEX: :environment enumerate*
|
||
#+ATTR_LATEX: :options [label={\emph{\roman*)}}, itemjoin={{, }}, itemjoin*={{, and }}]
|
||
- accuracy (What machine learning algorithms yield the best predictions?)
|
||
- early detection (Can we already make accurate predictions early on in the semester?)
|
||
- interpretability (Are resulting models clear about which features are important? Can we explain why certain features are identified as important? How self-evident are important features?).
|
||
|
||
** Materials and methods
|
||
:PROPERTIES:
|
||
:CREATED: [2023-10-23 Mon 08:50]
|
||
:CUSTOM_ID: sec:passfailmaterials
|
||
:END:
|
||
|
||
*** Course structures
|
||
:PROPERTIES:
|
||
:CREATED: [2023-10-23 Mon 16:28]
|
||
:CUSTOM_ID: subsec:passfailstructures
|
||
:END:
|
||
|
||
This study uses data from two introductory programming courses, referenced as course A and course B, collected during 3 editions of each course in academic years 2016--2017, 2017--2018, and 2018--2019.
|
||
Course A is the course described in Section\nbsp{}[[#sec:usecasestudy]].
|
||
Course B is the introductory programming course taught at the Faculty of Engineering at Ghent University.
|
||
Both courses run once per academic year across a 12-week semester (September--December).
|
||
They have separate lecturers and teaching assistants, and are taken by students of different faculties.
|
||
The courses have their own structure, but each edition of a course follows the same structure.
|
||
Table\nbsp{}[[tab:passfailcoursestatistics]] summarizes some statistics on the course editions included in this study.
|
||
|
||
#+ATTR_LATEX: :float t
#+CAPTION: Statistics for course editions included in this study.
#+CAPTION: The courses are taken by different student cohorts at different faculties and differ in structure, lecturers and teaching assistants.
#+CAPTION: The number of tries is the average number of solutions submitted by a student per exercise they worked on (i.e. for which the student submitted at least one solution in the course edition).
#+NAME: tab:passfailcoursestatistics
| course | year       | students | # ex. | solutions       | tries | pass rate |
|--------+------------+----------+-------+-----------------+-------+-----------|
| <l>    | <l>        | <r>      | <r>   | <r>             | <r>   | <r>       |
| A      | 2016--2017 | 322      | 60    | 167\thinsp{}675 | 9.56  | 60.86%    |
| A      | 2017--2018 | 249      | 60    | 125\thinsp{}920 | 9.19  | 61.44%    |
| A      | 2018--2019 | 307      | 60    | 176\thinsp{}535 | 10.29 | 65.14%    |
| B      | 2016--2017 | 372      | 138   | 371\thinsp{}891 | 9.10  | 56.72%    |
| B      | 2017--2018 | 393      | 187   | 407\thinsp{}696 | 7.31  | 60.81%    |
| B      | 2018--2019 | 437      | 201   | 421\thinsp{}461 | 6.26  | 62.47%    |

Course A is subdivided into two successive instructional units that each cover five programming topics -- one topic per week -- followed by an evaluation about all topics covered in the unit.
|
||
Students must solve six programming exercises on each topic before a deadline one week later.
|
||
Submitted solutions for these mandatory exercises are automatically evaluated and considered correct if they pass all unit tests for the exercise.
|
||
Failing to submit a correct solution for a mandatory exercise has a small impact on the score for the evaluation at the end of the unit.
|
||
The final exam at the end of the semester evaluates all topics covered in the entire course.
|
||
Students need to solve new programming exercises during evaluations (2 exercises) and exams (3 exercises), where reviewers manually evaluate and grade submitted solutions based on correctness, programming style, the choice between different programming techniques, and the overall quality of the solution.
|
||
Each edition of the course is taken by about 300 students.
|
||
|
||
Course B has 20 lab sessions across the semester, with evaluations after the 10th and 17th lab session and a final exam at the end of the semester.
|
||
Each lab session comes with a set of exercises and has an indicative deadline for submitting solutions.
|
||
However, these exercises are not taken into account when computing the final score for the course, so students are completely free to work on exercises as a way to practice their coding skills.
|
||
Students need to solve new programming exercises during evaluations (3 exercises) and exams (4 exercises).
|
||
Solutions submitted during evaluations are automatically graded based on the number of passed unit tests for the exercise.
|
||
Solutions submitted during exams are manually graded in the same way as for course A.
|
||
Each edition of the course is taken by about 400 students.
|
||
|
||
We opted to use two different courses that are structured quite differently to make sure our framework is generally applicable in other courses where the same behavioural data can be collected.
|
||
|
||
*** Learning environment
|
||
:PROPERTIES:
|
||
:CREATED: [2023-10-23 Mon 16:28]
|
||
:CUSTOM_ID: subsec:passfaillearningenvironment
|
||
:END:
|
||
|
||
Both courses use Dodona as their online learning environment\nbsp{}[cite:@vanpetegemDodonaLearnCode2023].
|
||
Dodona promotes active learning through problem-solving\nbsp{}[cite:@princeDoesActiveLearning2004].
|
||
Each course edition has its own Dodona course, with a learning path that groups exercises in separate series (Figure\nbsp{}[[fig:passfailstudentcourse]]).
|
||
Course A has one series per covered programming topic (10 series in total) and course B has one series per lab session (20 series in total).
|
||
A submission deadline is set for each series.
|
||
Dodona is also used to take tests and exams, within series that are only accessible for participating students.
|
||
|
||
#+CAPTION: Student view of a course in Dodona, showing two series of six exercises in the learning path of course A.
|
||
#+CAPTION: Each series has its own deadline.
|
||
#+CAPTION: The status column shows a global status for each exercise based on the last solution submitted.
|
||
#+CAPTION: The class progress column visualizes global status for each exercise for all students subscribed in the course.
|
||
#+CAPTION: Icons on the left show a global status for each exercise based on the last submission submitted before the series deadline.
|
||
#+NAME: fig:passfailstudentcourse
|
||
[[./images/passfailstudentcourse.png]]
|
||
|
||
Throughout an edition of a course, students can continuously submit solutions for programming exercises and immediately receive feedback upon each submission, even during tests and exams.
|
||
This rich feedback is automatically generated by an online judge and unit tests linked to each exercise\nbsp{}[cite:@wasikSurveyOnlineJudge2018].
|
||
Guided by that feedback, students can track potential errors in their code, remedy them and submit an updated solution.
|
||
There is no restriction on the number of solutions that can be submitted per exercise, and students can continue to submit solutions after a series deadline.
|
||
All submitted solutions are stored, but only the last submission before the deadline is taken into account to determine the status (and grade) of an exercise for a student.
|
||
One of the effects of active learning, triggered by exercises with deadlines and automated feedback, is that most learning happens during the semester, as can be seen in the heatmap in Figure\nbsp{}[[fig:passfailheatmap]].
|
||
|
||
#+CAPTION: Heatmap showing the distribution per day of all 176\thinsp{}535 solutions submitted during the 2018--2019 edition of course A.
|
||
#+CAPTION: The darker the colour, the more submissions were made on that day.
|
||
#+CAPTION: A lighter red means there are few submissions on that day.
|
||
#+CAPTION: A light grey square means that no submissions were made that day.
|
||
#+CAPTION: Weekly lab sessions for different groups were organized on Monday afternoon, Friday morning and Friday afternoon.
|
||
#+CAPTION: Weekly deadlines for mandatory exercises were on Tuesdays at 22:00.
|
||
#+CAPTION: There were four exam sessions for different groups in January.
|
||
#+CAPTION: There is little activity in the exam periods, except for days on which there was an exam.
|
||
#+CAPTION: The course is not taught in the second semester, so there is very little activity there.
|
||
#+CAPTION: Two exam sessions were organized in August/September granting an extra chance to students who failed on their exam in January/February.
|
||
#+NAME: fig:passfailheatmap
|
||
[[./images/passfailheatmap.png]]
|
||
|
||
*** Submission data
|
||
:PROPERTIES:
|
||
:CREATED: [2023-10-23 Mon 16:38]
|
||
:CUSTOM_ID: subsec:passfaildata
|
||
:END:
|
||
|
||
We exported data from Dodona on all solutions submitted by students during each course edition included in the study.
|
||
Each solution has a submission timestamp with precision down to the second and is linked to a course edition, series in the learning path, exercise and student.
|
||
We did not use the actual source code submitted by students, but did use the status describing the global assessment made by the learning environment: correct, wrong, compilation error, runtime error, time limit exceeded, memory limit exceeded, or output limit exceeded.
|
||
|
||
Comparison of student behaviour between different editions of the same course is enabled by computing snapshots for each edition at series deadlines.
|
||
Because course editions follow the same structure, we can align their series and compare snapshots for corresponding series.
|
||
Corresponding snapshots represent student performance at intermediate points during the semester and their chronology also allows longitudinal analysis within the semester.
|
||
Course A has snapshots for the five series of the first unit (labelled S1--S5), a snapshot for the evaluation of the first unit (labelled E1), snapshots for the five series of the second unit (labelled S6--S10), a snapshot for the evaluation of the second unit (labelled E2) and a snapshot for the exam (labelled E3).
|
||
Course B has snapshots for the first ten lab sessions (labelled S1--S10), a snapshot for the first evaluation (labelled E1), snapshots for the next series of seven lab sessions (labelled S11--S17), a snapshot for the second evaluation (labelled E2), snapshots for the last three lab sessions (S18--S20) and a snapshot for the exam (labelled E3).
|
||
|
||
It is important to stress that a snapshot of a course edition measures student performance only using the information available at the time of the snapshot.
|
||
As a result, the snapshot does not take into account submissions after its timestamp.
|
||
The behaviour of a student can then be expressed as a set of features extracted from the raw submission data.
|
||
We identified different types of features (see Appendix\nbsp{}[[#chap:featuretypes]]) that indirectly quantify certain behavioural aspects of students practising their programming skills.
|
||
When and how long do students work on their exercises?
|
||
Can students correctly solve an exercise and how much feedback do they need to accomplish this?
|
||
What kinds of mistakes do students make while solving programming exercises?
|
||
Do students further optimize the quality of their solution after it passes all unit tests, based on automated feedback or publication of sample solutions?
|
||
Note that there is no one-on-one relationship between these behavioural aspects and feature types.
|
||
Some aspects will be covered by multiple feature types, and some feature types incorporate multiple behavioural aspects.
|
||
We will therefore need to take into account possible dependencies between feature types while making predictions.
|
||
|
||
A feature type essentially makes one observation per student per series.
Each feature type thus results in multiple features: one for each series in the course (excluding series for evaluations and exams).
In addition, the snapshot also contains a feature for the average of each feature type across all series.
We do not use observations per individual exercise, as the actual exercises might differ between course editions.
Snapshots taken at the deadline of an evaluation or later also contain the score a student obtained for the evaluation.
These features of the snapshot can be used to predict whether a student will ultimately pass or fail the course.
In addition, the snapshot contains a label indicating whether the student passed or failed, which is used during training and testing of the classification algorithms.
Students who did not take part in the final examination automatically fail the course.

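To illustrate, the sketch below derives two of these feature types from the raw submission metadata with =pandas=, assuming a table with one row per submission and columns =student=, =series= and =status=.
The column and feature names are shorthand chosen for this sketch; the full list of feature types is given in Appendix\nbsp{}[[#chap:featuretypes]].

#+BEGIN_SRC python
import pandas as pd

# Sketch of snapshot feature extraction from submission metadata; only two
# feature types are shown and the column names are assumptions made for
# this illustration.
def snapshot_features(submissions: pd.DataFrame) -> pd.DataFrame:
    per_series = submissions.groupby(["student", "series"]).agg(
        subm=("status", "size"),                           # number of submissions
        wrong=("status", lambda s: (s == "wrong").sum()),  # number of wrong submissions
    )
    # One column per feature type and series, plus the average over all series.
    features = per_series.unstack("series")
    for feature_type in ("subm", "wrong"):
        features[(feature_type, "mean")] = features[feature_type].mean(axis=1)
    return features
#+END_SRC
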
Since course B has no hard deadlines, we left out deadline-related features from its snapshots (=first_dl=, =last_dl= and =nr_dl=; see Appendix\nbsp{}[[#chap:featuretypes]]).
|
||
To investigate the impact of deadline-related features, we also made predictions for course A that ignore these features.
|
||
|
||
*** Classification algorithms
|
||
:PROPERTIES:
|
||
:CREATED: [2023-10-23 Mon 16:45]
|
||
:CUSTOM_ID: subsec:passfailclassification
|
||
:END:
|
||
|
||
We evaluated four classification algorithms to make pass/fail predictions from student behaviour: stochastic gradient descent\nbsp{}[cite:@fergusonInconsistentMaximumLikelihood1982], logistic regression\nbsp{}[cite:@kleinbaumIntroductionLogisticRegression1994], support vector machines\nbsp{}[cite:@cortesSupportVectorNetworks1995], and random forests\nbsp{}[cite:@svetnikRandomForestClassification2003].
|
||
We used implementations of these algorithms from =scikit-learn=\nbsp{}[cite:@pedregosaScikitlearnMachineLearning2011] and optimized model parameters for each algorithm by cross-validated grid-search over a parameter grid.
|
||
|
||
Readers unfamiliar with machine learning can think of these specific algorithms as black boxes, but we briefly explain the basic principles of classification for their understanding.
|
||
Supervised learning algorithms use a dataset that contains both inputs and desired outputs to build a model that can be used to predict the output associated with new inputs.
|
||
The dataset used to build the model is called the training set and consists of training examples, with each example represented as an array of input values (feature vector).
|
||
Classification is a specific case of supervised learning where the outputs are restricted to a limited set of values (labels), in contrast to, for example, all possible numerical values within a range.
|
||
Classification algorithms are validated by splitting a dataset of labelled feature vectors into a training set and a test set, building a model from the training set, and evaluating the accuracy of its predictions on the test set.
|
||
Keeping training and test data separate is crucial to avoid bias during validation.
|
||
A standard method to make unbiased predictions for all examples in a dataset is \(k\)-fold cross-validation: partition the dataset in \(k\) subsets and then perform \(k\) experiments that each take one subset for evaluation and the other \(k-1\) subsets for training the model.
|
||
|
||
Pass/fail prediction is a binary classification problem with two possible outputs: passing or failing a course.
|
||
We evaluated the accuracy of the predictions for each snapshot and each classification algorithm with three different types of training sets.
|
||
As we have data from three editions of each course, the largest possible training set to make predictions for the snapshot of a course edition combines the corresponding snapshots from the two remaining course editions.
|
||
We also made predictions for a snapshot using each of its corresponding snapshots as individual training sets to see if we can still make accurate predictions based on data from only one other course edition.
|
||
Finally, we also made predictions for a snapshot using 5-fold cross-validation to compare the quality of predictions based on data from the same or another cohort of students.
|
||
Note that the latter strategy is not applicable to make predictions in practice, because we will not have pass/fail results as training labels while taking snapshots during the semester.
|
||
In practice, to make predictions for a snapshot, we can rely only on corresponding snapshots from previous course editions.
|
||
However, because we can assume that different editions of the same course yield independent data, we also used snapshots from future course editions in our experiments.
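The sketch below shows what this setup could look like with =scikit-learn= for one of the four classifiers: a logistic regression model is fitted on snapshots from other course editions and then used to predict pass/fail for the corresponding snapshot of the current edition.
The parameter grid is purely illustrative and not the exact grid used in the study.

#+BEGIN_SRC python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_and_predict(X_train, y_train, X_test):
    """Fit on snapshots from other course editions, predict for the current one.

    The parameter grid below is an example, not the grid used in the study.
    """
    pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    search = GridSearchCV(
        pipeline,
        param_grid={"logisticregression__C": [0.01, 0.1, 1, 10]},
        cv=5,
    )
    search.fit(X_train, y_train)   # snapshot features and pass/fail labels
    return search.predict(X_test)  # predicted pass/fail for the current cohort
#+END_SRC
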
There are many metrics that can be used to evaluate how accurately a classifier predicted which students will pass or fail the course from the data in a given snapshot.
|
||
Predicting a student will pass the course is called a positive prediction, and predicting they will fail the course is called a negative prediction.
|
||
Predictions that correspond with the actual outcome are called true predictions, and predictions that differ from the actual outcome are called false predictions.
|
||
This results in four possible combinations of predictions: true positives (\(TP\)), true negatives (\(TN\)), false positives (\(FP\)) and false negatives (\(FN\)).
|
||
Two standard accuracy metrics used in information retrieval are precision (\(TP/(TP+FP)\)) and recall (\(TP/(TP+FN)\)).
|
||
The latter is also called sensitivity if used in combination with specificity (\(TN/(TN+FP)\)).
|
||
|
||
Many studies for pass/fail prediction use accuracy (\((TP+TN)/(TP+TN+FP+FN)\)) as a single performance metric.
|
||
However, this can yield misleading results.
|
||
For example, let's take a dummy classifier that always "predicts" students will pass, no matter what.
|
||
This is clearly a bad classifier, but it will nonetheless have an accuracy of 75% for a course where 75% of the students pass.
|
||
|
||
In our study, we will therefore use two more complex metrics that take these effects into account: balanced accuracy and the F_1-score.
Balanced accuracy is the average of sensitivity and specificity.
The F_1-score is the harmonic mean of precision and recall.
If we go back to our example, the optimistic classifier that consistently predicts that all students will pass the course, and thus fails to identify any failing student, will have a balanced accuracy of 50% and an F_1-score of about 86%.
Under the same circumstances, a pessimistic classifier that consistently predicts that all students will fail the course has a balanced accuracy of 50% and an F_1-score of 0%.

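The dummy classifier from the example above makes the difference between these metrics tangible; the snippet below reproduces it with =scikit-learn=.

#+BEGIN_SRC python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# The optimistic dummy classifier from the example: in a cohort where 75% of
# the students pass, always predicting "pass" looks fine on plain accuracy,
# but balanced accuracy exposes that no failing student is identified.
y_true = [1] * 75 + [0] * 25  # 1 = pass, 0 = fail
y_pred = [1] * 100            # always predict "pass"

print(accuracy_score(y_true, y_pred))           # 0.75
print(balanced_accuracy_score(y_true, y_pred))  # 0.5
print(f1_score(y_true, y_pred))                 # ~0.86
#+END_SRC
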
*** Pass/fail predictions
|
||
:PROPERTIES:
|
||
:CREATED: [2024-01-22 Mon 17:17]
|
||
:CUSTOM_ID: subsec:passfailmaterialspredictions
|
||
:END:
|
||
|
||
In summary, Figure\nbsp{}[[fig:passfailmethodoverview]] outlines the entire flow of the proposed pass/fail prediction framework.
|
||
It starts by extracting metadata for all submissions students made so far within a course (timestamp, status, student, exercise, series) and collecting their marks on intermediate tests and final exams (step 1).
|
||
In practice, applying the framework on a student cohort in the current course edition only requires submission metadata and pass/fail outcomes from student cohorts in previous course editions.
|
||
Successive course editions are then aligned by identifying fixed time points throughout the course where predictions are made, for example at submission deadlines, intermediate tests or final exams (step 2).
|
||
We conducted a longitudinal study to evaluate the accuracy of pass/fail predictions at successive stages of a course (step 3).
|
||
This is done by extracting features from the raw submission metadata of one or more course editions and training machine learning models that can identify at-risk students during other course editions.
|
||
Our scripts that implement this framework are provided as supplementary material.[fn::
https://github.com/dodona-edu/pass-fail-article
]
Teachers can also interpret the behaviour of students in their class by analysing the feature weights of the machine learning models (step 4).
|
||
|
||
** Results and discussion
|
||
:PROPERTIES:
|
||
:CREATED: [2023-10-23 Mon 16:55]
|
||
:CUSTOM_ID: sec:passfailresults
|
||
:END:
|
||
|
||
We evaluated the performance of four classification algorithms for pass/fail predictions in a longitudinal sequence of snapshots from course A and B: stochastic gradient descent (Figure\nbsp{}[[fig:passfailsgdresults]]), logistic regression (Figure\nbsp{}[[fig:passfaillrresults]]), support vector machines (Figure\nbsp{}[[fig:passfailsvmresults]]), and random forests (Figure\nbsp{}[[fig:passfailrfresults]]).
|
||
For each classifier, course and snapshot, we evaluated 12 predictions for the following combinations of training and test sets: train on one edition and test on another edition; train on two editions and test on the other edition; train and test on one edition using 5-fold cross validation.
|
||
In addition, we made predictions for course A using both the full set of features and a reduced feature set that ignores deadline-related features.
|
||
We discuss the results in terms of accuracy, potential for early detection, and interpretability.
|
||
|
||
#+CAPTION: Performance of stochastic gradient descent classifiers for pass/fail predictions in a longitudinal sequence of snapshots from courses A (all features and reduced set of features) and B, measured by balanced accuracy and F_1-score.
|
||
#+CAPTION: Dots represent performance of a single prediction, with 12 predictions for each group of corresponding snapshots (columns).
|
||
#+CAPTION: Solid line connects averages of the performances for each group of corresponding snapshots.
|
||
#+NAME: fig:passfailsgdresults
|
||
[[./images/passfailsgdresults.png]]
|
||
|
||
#+CAPTION: Performance of logistic regression classifiers for pass/fail predictions in a longitudinal sequence of snapshots from courses A (all features and reduced set of features) and B, measured by balanced accuracy and F_1-score.
|
||
#+CAPTION: Dots represent performance of a single prediction, with 12 predictions for each group of corresponding snapshots (columns).
|
||
#+CAPTION: Solid line connects averages of the performances for each group of corresponding snapshots.
|
||
#+NAME: fig:passfaillrresults
|
||
[[./images/passfaillrresults.png]]
|
||
|
||
#+CAPTION: Performance of support vector machine classifiers for pass/fail predictions in a longitudinal sequence of snapshots from courses A (all features and reduced set of features) and B, measured by balanced accuracy and F_1-score.
|
||
#+CAPTION: Dots represent performance of a single prediction, with 12 predictions for each group of corresponding snapshots (columns).
|
||
#+CAPTION: Solid line connects averages of the performances for each group of corresponding snapshots.
|
||
#+NAME: fig:passfailsvmresults
|
||
[[./images/passfailsvmresults.png]]
|
||
|
||
#+CAPTION: Performance of random forest classifiers for pass/fail predictions in a longitudinal sequence of snapshots from courses A (all features and reduced set of features) and B, measured by balanced accuracy and F_1-score.
|
||
#+CAPTION: Dots represent performance of a single prediction, with 12 predictions for each group of corresponding snapshots (columns).
|
||
#+CAPTION: Solid line connects averages of the performances for each group of corresponding snapshots.
|
||
#+NAME: fig:passfailrfresults
|
||
[[./images/passfailrfresults.png]]
|
||
|
||
*** Accuracy
|
||
:PROPERTIES:
|
||
:CREATED: [2023-10-23 Mon 17:03]
|
||
:CUSTOM_ID: subsec:passfailaccuracy
|
||
:END:
|
||
|
||
The overall conclusion from the longitudinal analysis is that indirectly measuring how students practice their coding skills by solving programming exercises (formative assessments) in combination with directly measuring how they perform on intermediate evaluations (summative assessments), allows us to predict with high accuracy if students will pass or fail a programming course.
|
||
The signals to make such predictions seem to be present in the data, as we come to the same conclusions irrespective of the course, classification algorithm, or performance metric evaluated in our study.
|
||
Overall, logistic regression was the best performing classifier for both courses, but the difference compared to the other classifiers is small.
|
||
|
||
When we compare the longitudinal trends of balanced accuracy for the predictions of both courses, we see that course A starts with a lower balanced accuracy at the first snapshot, but its accuracy increases faster and is slightly higher at the end of the semester.
|
||
At the start of the semester at snapshot S1, course A has an average balanced accuracy between 60% and 65% and course B around 70%.
|
||
Nearly halfway through the semester, before the first evaluation, we see an average balanced accuracy around 70% for course A at snapshot S5 and between 70% and 75% for course B at snapshot S8.
|
||
After the first evaluation, we can make predictions with a balanced accuracy between 75% and 80% for both courses.
|
||
The predictions for course B stay within this range for the rest of the semester, but for course A we can consistently make predictions with an average balanced accuracy of 80% near the end of the semester.
|
||
|
||
Compared to the accuracy results of\nbsp{}[cite/t:@kovacicPredictingStudentSuccess2012], we see a 15--20% increase in our balanced accuracy results.
|
||
Our balanced accuracy results are similar to the accuracy results of\nbsp{}[cite/t:@livierisPredictingSecondarySchool2019], who used semi-supervised machine learning.
|
||
[cite/t:@asifAnalyzingUndergraduateStudents2017] achieve an accuracy of about 80% when using one cohort of training and another cohort for testing, which is again similar to our balanced accuracy results.
|
||
All of these studies used prior academic history as the basis for their methods, which we do not use in our framework.
|
||
We also see similar results as compared to\nbsp{}[cite/t:@vihavainenPredictingStudentsPerformance2013] where we don't have to rely on data collection that interferes with the learning process.
|
||
Note that we are comparing the basic accuracy results of prior studies with the more reliable balanced accuracy results of our framework.
|
||
|
||
F_1-scores follow the same trend as balanced accuracy, but the increase is even more pronounced because it starts lower and ends higher.
|
||
It shows another sharp improvement of predictive performance for both courses when students practice their programming skills in preparation of the final exam (snapshot E3).
|
||
This underscores the need to keep organizing final summative assessments as catalysts of learning, even for courses with a strong focus on active learning.
|
||
|
||
The variation in predictive accuracy for a group of corresponding snapshots is higher for course A than for course B.
|
||
This might be explained by the fact that successive editions of course B use the same set of exercises, supplemented with evaluation and exam exercises from the previous edition, whereas each edition of course A uses a different selection of exercises.
|
||
|
||
Predictions made with training sets from the same student cohort (5-fold cross-validation) perform better than those with training sets from different cohorts (see supplementary material for details[fn::
https://github.com/dodona-edu/pass-fail-article
]).
This is more pronounced for F_1-scores than for balanced accuracy, but the differences are small enough that nothing prevents us from building classification models with historical data from previous student cohorts to make pass/fail predictions for the current cohort.
Making predictions with data from the same cohort would not even be possible in practice, as pass/fail information is needed during the training phase.
In addition, we found no significant performance differences for classification models using data from a single course edition or combining data from two course editions.
Given that cohort sizes are large enough, this tells us that accurate predictions can already be made in practice with historical data from a single course edition.
This is also relevant when the structure of a course changes, because we can only make predictions from historical data for course editions whose snapshots align.

The need to align snapshots is also the reason why we had to build separate models for courses A and B since both have differences in course structure.
|
||
The models, however, were built using the same set of feature types.
|
||
Because course B does not work with hard deadlines, deadline-related feature types could not be computed for its snapshots.
|
||
This missing data and associated features had no impact on the performance of the predictions.
|
||
Deliberately dropping the same feature types for course A also had no significant effect on the performance of predictions, illustrating that the training phase is where classification algorithms decide themselves how the individual features will contribute to the predictions.
|
||
This frees us from having to determine the importance of features beforehand, allows us to add new features that might contribute to predictions even if they correlate with other features, and makes it possible to investigate afterwards how important individual features are for a given classifier (see Section\nbsp{}[[#subsec:passfailinterpretability]]).
|
||
|
||
Even though the structure of the courses is quite different, our method achieves high accuracy for both courses.
The results for course A with the reduced feature set are also still accurate.
This indicates that the method should be generalizable to other courses where similar data can be collected, even if the structure is quite different or when some features are impossible to calculate due to the course structure.

*** Early detection
|
||
:PROPERTIES:
|
||
:CREATED: [2023-10-23 Mon 17:05]
|
||
:CUSTOM_ID: subsec:passfailearly
|
||
:END:
|
||
|
||
Accuracy of predictions systematically increases as we capture more of student behaviour during the semester.
|
||
But surprisingly we can already make quite accurate predictions early on in the semester, long before students take their first evaluation.
|
||
Because of the steady trend, predictions for course B at the start of the semester are already reliable enough to serve as input for student feedback or teacher interventions.
|
||
It takes some more time to identify at-risk students for course A, but from week four or five onwards the predictions may also become an instrument to design remedial actions for this course.
|
||
Hard deadlines and graded exercises strongly enforce active learning behaviour on the students of course A, and might somewhat mask how motivated students are to work on their programming skills.
This might explain why it takes a bit longer to properly measure student motivation for course A than for course B.
|
||
|
||
*** Interpretability
|
||
:PROPERTIES:
|
||
:CREATED: [2023-10-23 Mon 17:05]
|
||
:CUSTOM_ID: subsec:passfailinterpretability
|
||
:END:
|
||
|
||
So far, we have considered classification models as black boxes in our longitudinal analysis of pass/fail predictions.
|
||
However, many machine learning techniques can tell us something about the contribution of individual features to make the predictions.
|
||
In the case of our pass/fail predictions, looking at the importance of feature types and linking them to aspects of practising programming skills, might give us insights into what kind of behaviour promotes or inhibits learning, or has no or a minor effect on the learning process.
|
||
Temporal information can tell us what behaviour makes a steady contribution to learning or where we see shifts throughout the semester.
|
||
|
||
This interpretability was a considerable factor in our choice of the classification algorithms we investigated in this study.
|
||
Since we identified logistic regression as the best-performing classifier, we will have a closer look at feature contributions in its models.
|
||
These models are explained by the feature weights in the logistic regression equation, so we will express the importance of a feature as its actual weight in the model.
|
||
We use a temperature scale when plotting importances: white for zero importance, a red gradient for positive importance values and a blue gradient for negative importance values.
|
||
A feature importance \(w\) can be interpreted as follows for logistic regression models: an increase of the feature value by one standard deviation increases the odds of passing the course by a factor of \(e^w\) when all other feature values remain the same\nbsp{}[cite:@molnarInterpretableMachineLearning2019].
|
||
The absolute value of the importance determines the impact the feature has on predictions.
|
||
Features with zero importance have no impact because \(e^0 = 1\).
|
||
Features represented with a light colour have a weak impact and features represented with a dark colour have a strong impact.
|
||
As a reference, an importance of 0.7 doubles the odds for passing the course with each standard deviation increase of the feature value, because \(e^{0.7} \approx 2\).
|
||
The sign of the importance determines whether the feature promotes or inhibits the odds of passing the course.
|
||
Features with a positive importance (red colour) will increase the odds with increasing feature values, and features with a negative importance (blue colour) will decrease the odds with increasing feature values.
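In code, these importances can be read directly from a fitted model; the sketch below assumes a =scikit-learn= logistic regression trained on standardized features, as in the earlier training sketch.

#+BEGIN_SRC python
import numpy as np

# Turn the weights of a fitted LogisticRegression into odds ratios: e^w per
# feature, i.e. the factor by which the odds of passing change for a one
# standard deviation increase of that (standardized) feature.
def odds_ratios(model, feature_names):
    weights = model.coef_[0]  # one weight per feature (binary classification)
    return dict(zip(feature_names, np.exp(weights)))
#+END_SRC
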
To simulate that we want to make predictions for each course edition included in this study, we trained logistic regression models with data from the remaining two editions of the same course.
|
||
A label "edition 18--19" therefore means that we want to make predictions for the 2018--2019 edition of a course with a model built from the 2016--2017 and 2017--2018 editions of the course.
|
||
However, in this case we are not interested in the predictions themselves, but in the importance of the features in the models.
|
||
The importance of all features for each course edition can be found in the supplementary material.[fn::
|
||
https://github.com/dodona-edu/pass-fail-article
|
||
]
|
||
We will restrict our discussion by highlighting the importance of a selection of feature types for the two courses.
|
||
|
||
For course A, we look into the evaluation scores (Figure\nbsp{}[[fig:passfailfeaturesAevaluation]]) and the feature types =correct_after_15m= (Figure\nbsp{}[[fig:passfailfeaturesAcorrect]]) and =wrong= (Figure\nbsp{}[[fig:passfailfeaturesAwrong]]).
|
||
Evaluation scores have a very strong impact on predictions, and high evaluation scores increase the odds of passing the course.
|
||
This comes as no surprise, as both the evaluations and exams are summative assessments that are organized and graded in the same way.
|
||
Although the difficulty of evaluation exercises is lower than those of exam exercises, evaluation scores already are good predictors for exam scores.
|
||
Also note that these features only show up in snapshots taken at or after the corresponding evaluation.
|
||
They have zero impact on predictions for earlier snapshots, as the information is not available at the time these snapshots are taken.
|
||
|
||
#+CAPTION: Importance of evaluation scores in the logistic regression models for course A (full feature set).
|
||
#+CAPTION: Reds mean that a growth in the feature value for a student increases the odds of passing the course for that student.
|
||
#+CAPTION: The darker the colour the larger this increase will be.
|
||
#+NAME: fig:passfailfeaturesAevaluation
|
||
[[./images/passfailfeaturesAevaluation.png]]
|
||
|
||
The second feature type we want to highlight is =correct_after_15m=: the number of exercises in a series where the first correct submission was made within fifteen minutes after the first submission (Figure\nbsp{}[[fig:passfailfeaturesAcorrect]]).
Note that we cannot directly measure how long students work on an exercise, as they may write, run and test their solutions on their local machine before their first submission to Dodona.
Rather, this feature type measures how long it takes students to find and remedy errors in their code (debugging), after they start getting automatic feedback from Dodona.

For exercise series in the first unit of course A (series 1--5), we generally see that the features have a positive impact (red).
This means that students are more likely to pass the course if they can quickly remedy errors in their solutions to these exercises.
The first and fourth series are an exception here.
The fact that students need more time for the first series might reflect that learning something new is hard at the beginning, even if the exercises are still relatively easy.
Series 4 of course A covers strings as the first compound data type of Python, in combination with nested loops, whereas (non-nested) loops themselves are covered in series 3.
This complex combination might mean that students generally need more time to debug the exercises in series 4.

For the series of the second unit (series 6--10), we observe two different effects.
The impact of these features is zero for the first few snapshots (grey bottom left corner).
This is because the exercises from these series were not yet published at the time of those snapshots, whereas all series of the first unit were available from the start of the semester.
For the later snapshots, we generally see a negative (blue) weight associated with the features.
This might seem counterintuitive, as it contradicts the explanation given for the series of the first unit.
However, the exercises of the second unit are a lot more complex than those of the first unit.
This is true to the point that even good students find it hard to debug and correctly solve an exercise in only 15 minutes.
Students that need less than 15 minutes at this stage might be bypassing learning by copying solutions from fellow students instead of solving the exercises themselves.
An exception to this pattern is the few red squares forming a diagonal in the middle of the bottom half.
These squares correspond to exercises that are solved as soon as they become available, as opposed to right before the deadline.
A possible explanation for these few slightly positive weights is that these exercises are solved by highly motivated, top students.

#+CAPTION: Importance of feature type =correct_after_15m= (the number of exercises in a series where the first correct submission was made within fifteen minutes after the first submission) in logistic regression models for course A (full feature set).
#+CAPTION: Reds mean that a growth in the feature value for a student increases the odds of passing the course for that student.
#+CAPTION: The darker the colour the larger this increase will be.
#+CAPTION: Blues mean that a growth in the feature value decreases the odds.
#+CAPTION: The darker the colour the larger this decrease will be.
#+NAME: fig:passfailfeaturesAcorrect
[[./images/passfailfeaturesAcorrect.png]]

Finally, if we look at the feature type =wrong= (Figure\nbsp{}[[fig:passfailfeaturesAwrong]]), submitting a lot of submissions with logical errors mostly has a positive impact on the odds of passing the course.
This underscores the old adage that practice makes perfect, as real learning happens when students learn from their mistakes.
Exceptions to this rule are found for series 2, 3, 9 and 10.
The lecturer and teaching assistants identify the topics covered in series 2 and 9 as by far the easiest topics covered in the course, and the topics covered in series 3, 6 and 10 as the hardest.
However, it is not very intuitive that being stuck on logical errors longer than other students inhibits the odds of passing for topics that are extremely hard or extremely easy, while promoting the odds for topics of moderate difficulty.
This shows that interpreting the importance of feature types is not always straightforward.

#+CAPTION: Importance of feature type =wrong= (the number of wrong submissions in a series) in logistic regression models for course A (full feature set).
#+CAPTION: Reds mean that a growth in the feature value for a student increases the odds of passing the course for that student.
#+CAPTION: The darker the colour the larger this increase will be.
#+CAPTION: Blues mean that a growth in the feature value decreases the odds.
#+CAPTION: The darker the colour the larger this decrease will be.
#+NAME: fig:passfailfeaturesAwrong
[[./images/passfailfeaturesAwrong.png]]

For course B, we look into the evaluation scores (Figure\nbsp{}[[fig:passfailfeaturesBevaluation]]) and the feature types =comp_error= (Figure\nbsp{}[[fig:passfailfeaturesBcomp]]) and =wrong= (Figure\nbsp{}[[fig:passfailfeaturesBwrong]]).
The importance of evaluation scores is similar to that for course A, although their relative impact on the predictions is slightly lower.
The latter might be caused by the automatic grading of evaluation exercises, whereas exam exercises are graded by hand.
Because the second evaluation is scheduled a little earlier in the semester than for course A, pass/fail predictions for course B can also rely on this important feature earlier on.
However, we do not see a similar increase in the global performance metrics around the second evaluation of course B as we see for the first evaluation.

#+CAPTION: Importance of evaluation scores in the logistic regression models for course B.
#+CAPTION: Reds mean that a growth in the feature value for a student increases the odds of passing the course for that student.
#+CAPTION: The darker the colour the larger this increase will be.
#+NAME: fig:passfailfeaturesBevaluation
[[./images/passfailfeaturesBevaluation.png]]

Learning to code requires mastering two major competences:
#+ATTR_LATEX: :environment enumerate*
#+ATTR_LATEX: :options [label={\emph{\roman*)}}, itemjoin={{, }}, itemjoin*={{, and }}]
- getting familiar with the syntax rules of a programming language
  to express the steps for solving a problem in a formal way, so that
  the algorithm can be executed by a computer
- problem-solving itself.
As a result, we can make a distinction between different kinds of errors in source code.
Compilation errors are mistakes against the syntax of the programming language, whereas logical errors result from solving a problem with a wrong algorithm.
When comparing the importance of the number of compilation (Figure\nbsp{}[[fig:passfailfeaturesBcomp]]) and logical errors (Figure\nbsp{}[[fig:passfailfeaturesBwrong]]) students make while practising their coding skills, we see a clear difference.
Making a lot of compilation errors has a negative impact on the odds of passing the course (blue colour dominates in Figure\nbsp{}[[fig:passfailfeaturesBcomp]]), whereas making a lot of logical errors makes a positive contribution (red colour dominates in Figure\nbsp{}[[fig:passfailfeaturesBwrong]]).
This aligns with the claim of\nbsp{}[cite/t:@edwardsSeparationSyntaxProblem2018] that problem-solving is a higher-order learning task in the taxonomy of\nbsp{}[cite/t:@bloom1956handbook] (analysis and synthesis) than language syntax (knowledge, comprehension, and application).
Students that get stuck longer in the mechanics of a programming language will more likely fail the course, whereas students that make a lot of logical errors and properly learn from them will more likely pass the course.
So making mistakes is beneficial for learning, but it depends on the kind of mistakes.
We also looked at the number of solutions with logical errors while interpreting feature types for course A.
Although we hinted at the same conclusions there as for course B, the signals were less consistent.
This shows that interpreting feature importances always needs to take the educational context into account.
This can also be seen in Figure\nbsp{}[[fig:passfailfeaturesAcorrect]], where some weeks contribute positively and some negatively.
The reasons for these differences depend on the content of the course, and interpreting them correctly requires knowledge of that content.

#+CAPTION: Importance of feature type =comp_error= (the number of submissions with compilation errors in a series) in logistic regression models for course B.
#+CAPTION: Reds mean that a growth in the feature value for a student increases the odds of passing the course for that student.
#+CAPTION: The darker the colour the larger this increase will be.
#+CAPTION: Blues mean that a growth in the feature value decreases the odds.
#+CAPTION: The darker the colour the larger this decrease will be.
#+NAME: fig:passfailfeaturesBcomp
[[./images/passfailfeaturesBcomp.png]]

#+CAPTION: Importance of feature type =wrong= (the number of wrong submissions in a series) in logistic regression models for course B.
#+CAPTION: Reds mean that a growth in the feature value for a student increases the odds of passing the course for that student.
#+CAPTION: The darker the colour the larger this increase will be.
#+CAPTION: Blues mean that a growth in the feature value decreases the odds.
#+CAPTION: The darker the colour the larger this decrease will be.
#+NAME: fig:passfailfeaturesBwrong
[[./images/passfailfeaturesBwrong.png]]

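The feature importances visualized above are the weights of the fitted logistic regression models.
As a minimal, hypothetical sketch of how such weights can be inspected (assuming scikit-learn and placeholder snapshot data; this is not the analysis code used to produce the figures):

#+BEGIN_SRC python
# Minimal sketch with placeholder data: inspecting per-feature weights of a
# logistic regression model trained on snapshot features of an earlier cohort.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Placeholder feature names and values; in the real setting these come from
# the submission metadata (feature types per series) described above.
feature_names = ["wrong_series1", "correct_after_15m_series1", "evaluation1_score"]
X_train = rng.normal(size=(200, len(feature_names)))
y_train = (X_train[:, 2] + 0.5 * X_train[:, 1] + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression(max_iter=1000)
model.fit(StandardScaler().fit_transform(X_train), y_train)  # 1 = passed, 0 = failed

# Positive weights (reds in the figures) increase the odds of passing;
# negative weights (blues) decrease them.
for name, weight in zip(feature_names, model.coef_[0]):
    print(f"{name:>28}: {weight:+.3f}")
#+END_SRC
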
** Replication study at Jyväskylä University in Finland
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:50]
:CUSTOM_ID: sec:passfailfinland
:END:

After our original study, we collaborated with researchers from Jyväskylä University (JYU) in Finland on replicating the study in their introductory programming course\nbsp{}[cite:@zhidkikhReproducingPredictiveLearning2024].
There are, however, some notable differences from the study performed at Ghent University.
In the new study, self-reported data was added to the model to test whether this enhances its predictions.
Also, the focus shifted from pass/fail prediction to dropout prediction.
This shift was motivated by the different way the course at JYU is taught.
By performing well enough in all weekly exercises and a project, students can already receive a passing grade.
This is impossible in the courses at Ghent University, where most of the final marks are earned at the exam at the end of the semester.

Another important difference between the two studies is the data that was available to feed into the machine learning model.
Dodona keeps rich data about the evaluation results of a student's submission.
In TIM (the learning environment used at JYU), only a score is kept for each submission.
This score represents the underlying evaluation results (compilation error/mistakes in the output/...).
While it is possible to reverse engineer the score into some underlying status, some statuses that Dodona distinguishes cannot be derived from the scores in TIM.
This means that a different set of features had to be used in the study at JYU than the feature set used in the study at Ghent University.
The specific feature types left out of the study at JYU are =comp_error= and =runtime_error=.

The course at JYU had been taught in the same way since 2015, resulting in behavioural and survey data from 2\thinsp{}615 students from the 2015--2021 academic years.
The snapshots were made weekly as well, since the course also works with weekly assignments and deadlines.
The self-reported data consists of pre-course and midterm surveys that inquire about aptitude towards learning programming and motivation, including expectations about grades, prior programming experience, study year, attendance and the number of concurrent courses.

In the analysis, the same four classifiers as in the original study were tested.
In addition to this, the dropout analysis was done for three datasets:
#+ATTR_LATEX: :environment enumerate*
#+ATTR_LATEX: :options [label={\emph{\roman*)}}, itemjoin={{, }}, itemjoin*={{, and }}]
- behavioural data only
- self-reported data only
- combined behavioural and self-reported data.

The results obtained in the study at JYU are very similar to the results obtained at Ghent University.
Again, logistic regression was found to yield the best and most stable results.
Even though no data about midterm evaluations or examinations was used (since this data was not available), a similar jump in accuracy around the midterm of the course was also observed.
The jump in accuracy can be explained here by the fact that the period around the middle of the term is when most students drop out.
It was also observed that the first weeks of the course play an important role in reducing dropout.

The addition of the self-reported data to the snapshots resulted in a statistically significant improvement of predictions in the first four weeks of the course.
For the remaining weeks, the change in prediction performance was not statistically significant.
This again points to the conclusion that the first few weeks of a CS1 course play a significant role in student success.
The models trained only on self-reported data performed significantly worse than the other models.

The replication at JYU showed that our prediction strategy can be used in significantly different educational contexts.
Of course, adaptations to the method sometimes have to be made given differences in course structure and the learning environment used, but these adaptations did not lead to substantially different prediction results.

** Conclusions and future work
:PROPERTIES:
:CREATED: [2023-10-23 Mon 17:30]
:CUSTOM_ID: sec:passfailconclusions
:END:

In this chapter, we presented a classification framework for predicting whether students will likely pass or fail introductory programming courses.
The framework already yields high-accuracy predictions early on in the semester and is privacy-friendly because it only works with metadata from programming challenges solved by students while working on their programming skills.
Being able to identify at-risk students early on in the semester opens windows for remedial actions to improve the overall success rate of students.

We validated the framework by building separate classifiers for three courses because of differences in course structure, institute and learning platform, but using similar sets of features for training models.
The results showed that submission metadata from previous student cohorts can be used to make predictions about the current cohort of students, even if course editions use different sets of exercises, or the courses are structured differently.
Making predictions requires aligning snapshots between successive editions of a course, so that students have the same expected progress at corresponding snapshots.
Historical metadata from a single course edition suffices if group sizes are large enough.
Different classification algorithms can be plugged into the framework, but logistic regression resulted in the best-performing classifiers.

Apart from their application to make pass/fail predictions, an interesting side effect of classification models that map indirect measurements of learning behaviour onto mastery of programming skills is that they allow us to interpret what behavioural aspects contribute to learning to code.
Visualization of feature importance turned out to be a useful instrument for linking individual feature types with student behaviour that promotes or inhibits learning.
We applied this interpretability to some important feature types that popped up for the three courses included in this study.

Our study has several strengths and promising implications for future practice and research.
First, we were able to predict success based on historical metadata from earlier cohorts, and we are able to do so early on in the semester.
In addition, the accuracy of our predictions is similar to that of earlier efforts\nbsp{}[cite:@asifAnalyzingUndergraduateStudents2017; @vihavainenPredictingStudentsPerformance2013; @kovacicPredictingStudentSuccess2012], while we are not using prior academic history or interfering with the students' usual learning workflows.
However, there are also some limitations and work for the future.
While our visualizations of the features (Figures\nbsp{}[[fig:passfailfeaturesAevaluation]]\nbsp{}through\nbsp{}[[fig:passfailfeaturesBwrong]]) are helpful to indicate which features are important at which stage of the course in view of increasing versus decreasing the odds of passing the course, they must not be oversimplified and need to be carefully interpreted and placed into context.
This is where the expertise and experience of teachers comes in.
These visualizations can be interpreted by teachers and further contextualized towards the specific course objectives.
For example, teachers know the content and goals of every series of exercises, and they can use the information presented in our visualizations to investigate why certain series of exercises are more or less important in view of passing the course.
In addition, they may use the information to further redesign their course.

We can thus conclude that the proposed framework achieves the objectives set for accuracy, early prediction and interpretability.
Having this new framework at hand immediately raises some follow-up research questions that call for further exploration:
#+ATTR_LATEX: :environment enumerate*
#+ATTR_LATEX: :options [label={\emph{\roman*)}}, itemjoin={{ }}, itemjoin*={{ }}]
- Do we inform students about their odds of passing a course?
  How and when do we inform students about their performance in an educationally responsible way?
  What learning analytics do we use to present predictions to students, and do we only show results or also explain how the data led to the results?
  How do students react to the announcement of their chance of passing the course?
  How do we ensure that students are not demotivated?
- What actions could teachers take upon early insights into which students will likely fail the course?
  What recommendations could they make to increase the odds that more students will pass the course?
  How could interpretations of important behavioural features be translated into learning analytics that give teachers more insight into how students learn to code?
- Can we combine student progress (what programming skills does a student already have and at what level of mastery), student preferences (which skills does a student want to improve on), and intrinsic properties of programming exercises (what skills are needed to solve an exercise and how difficult is it) into dynamic learning paths that recommend exercises to optimize the learning effect for individual students?

* Automating manual feedback
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:51]
:CUSTOM_ID: chap:feedback
:END:

This chapter discusses the history of manual feedback in the programming course taught at the Faculty of Sciences at Ghent University (as described in the case study in Section\nbsp{}[[#sec:usecasestudy]]) and how it informed the development of evaluation and grading features within Dodona.
We then expand on further experiments with data mining techniques that we performed to try to further reduce the time spent on giving manual feedback.
Section\nbsp{}[[#sec:feedbackprediction]] is based on an article that is currently being prepared for submission.

** Phase 0: Paper-based assessment
:PROPERTIES:
:CREATED: [2023-11-20 Mon 13:04]
:CUSTOM_ID: sec:feedbackpaper
:END:

Since the academic year 2015--2016, the programming course has organized two open-book/open-internet evaluations in addition to the regular exam.[fn::
Before this, sessions were organized where students had to explain the code they submitted for an exercise.
This turned out not to be a great system, since it is far easier to explain code than to write it.
]
The first is a midterm and the other takes place at the end of the semester (but before the exam period).
The organization of these evaluations has been a learning process for everyone involved.
Although the basic idea has remained the same (solve two Python programming exercises in two hours, or three in 3.5 hours for the exam), almost every aspect surrounding this basic premise has changed.

To be able to give feedback, student solutions were initially printed at the end of the evaluation.
At first this was done by giving each student a USB stick on which they could find some initial files and to which they had to copy their solution.
Later, this was replaced by a submission platform developed at Ghent University (Indianio) that had support for printing in the evaluation rooms.
Indianio and its printing support were developed specifically to support courses in this format.
Students were then allowed to check their printed solutions to make sure that the correct code was graded.
This however means that the end of an evaluation takes a lot of time, since printing all these papers is a slow and badly parallelizable process (not to mention the environmental impact!).[fn::
The assignments themselves were also printed out and given to all students, which increased the amount of paper even more.
]

Paper-based assessment also has some important drawbacks while grading.
SPOJ (and later Dodona) was used to generate automated feedback on correctness.
This automated feedback was not available when assessing a student's source code on paper.
It therefore either takes more mental energy to work out whether the student's code would behave correctly for all inputs, or some hassle to look up a student's automated assessment results every time.
Another important drawback is that students have a much harder time seeing their feedback.
While their numerical grades were posted online or emailed to them, to see the comments graders wrote alongside their code they had to come to a hands-on session and ask the assistant there to be able to view the annotated version of their code (which could sometimes be hard to read, depending on the handwriting of the grader).[fn::
For the second evaluation, the feedback was also scanned and emailed, since there were no more hands-on sessions.
This was even the basis for a Dodona exercise: https://dodona.be/en/activities/235452497/.
]
Very few students did so.
There are a few possible explanations for this.
They might experience social barriers to asking for feedback on an evaluation they performed poorly on.
For students who performed well, it might not be worth the hassle of going to ask about feedback.
But maybe more importantly, a vicious cycle started to appear: because few students looked at their feedback, graders did not spend much effort on writing out clear and useful feedback.
Code that was too complex or plain wrong usually received little more than a strikethrough, instead of an explanation of why the student's method did not work.

** Phase 1: Adding comments via Dodona
:PROPERTIES:
:CREATED: [2023-11-20 Mon 13:32]
:CUSTOM_ID: sec:feedbackcomments
:END:

Seeing the amount of hassle that assessing these evaluations brought with it, we decided to build support for manual feedback and grading into Dodona.
The first step was the ability to add comments to code.
This work started in the academic year 2019--2020, and the onset of the COVID-19 pandemic brought a lot of momentum to it.
Suddenly, the idea of printing student submissions became impossible, since the evaluations had to be taken remotely by students and the graders were working from home as well.
Graders could now add comments to a student's code, which allowed the student to view the feedback remotely as well.
An example of such a comment can be seen in Figure\nbsp{}[[fig:feedbackfirstcomment]].
There were still a few drawbacks to this system for assessing and grading though:
- Knowing which submissions to grade was not always trivial.
  For most students, the existing deadline system worked, since the solution they submitted right before the deadline was the submission taken into account when grading.
  There are, however, also students who receive extra time based on a special status granted to them by Ghent University (due to e.g. a learning disability).
  For these students, graders had to manually search for the submission made right before the extended deadline these students receive.
  This means that students could not be graded anonymously.
  It also makes the process a lot more error-prone.
- Comment visibility could not yet be time-gated towards students.
  This meant that graders had to write their comments in a local file with some extra metadata about the assessment.
  Afterwards, this local file could be processed using some home-grown scripts to automatically add all comments at (nearly) the same time.
- Grades were added in external files, which was quite error-prone, since this involved manually looking up the correct student and entering their scores in a global spreadsheet.
  It is also less transparent towards students.
  While rubrics were made for every exercise that had to be graded, every grader had their preferred way of aggregating and entering these scores.
  This means that even though the rubrics exist, students had no way of seeing the different marks they received for the different rubrics.
It is obvious that this was not a great user experience, and not something we could roll out more widely beyond the Dodona developers who were also involved in teaching.

#+CAPTION: The first comment ever left on Dodona as part of a grading session.
#+NAME: fig:feedbackfirstcomment
[[./images/feedbackfirstcomment.png]]

We could already make some anecdotal observations about this new system.
One first observation that might seem counterintuitive is that graders did not feel like they spent less time grading.
If anything, they reported spending more time grading.
Another observation, however, is that graders gave more feedback and felt that the feedback they gave was of higher quality than before.
In the first trial of this system, the feedback was viewed by over 80% of students within 24 hours, which is something we had never observed when grading on paper.

** Phase 2: Evaluations
:PROPERTIES:
:CREATED: [2023-11-20 Mon 13:32]
:CUSTOM_ID: sec:feedbackevaluations
:END:

To streamline and automate the process of grading even more, the concept of an evaluation was added to Dodona.[fn::
See https://docs.dodona.be/en/guides/teachers/grading/ for the actual process of creating an evaluation.
]
Evaluations address two of the drawbacks identified above:
- Comments made within an evaluation are linked to that evaluation.
  They are only made visible to students once the feedback of the evaluation is released.
- Evaluations also add an overview of the submissions that need to receive feedback.
  Since the submissions are explicitly linked to the evaluation, changing the submissions for students who receive extra time is also a lot less error-prone, as it can be done before actually starting the assessment.
  Evaluations also have a specific UI to do this, where the timestamps are shown to teachers as accurately as Dodona stores them.
The addition of evaluations resulted in a subjective feeling of time being saved by the graders, at least in comparison with the previous system of adding comments.

To address the third concern mentioned above, another feature was implemented in Dodona.
We added rubrics and a user-friendly way of entering scores.
This means that students can view the scores they received for each rubric, right next to the feedback that was added manually.

** Phase 3: Feedback reuse
:PROPERTIES:
:CREATED: [2023-11-20 Mon 17:39]
:CUSTOM_ID: sec:feedbackreuse
:END:

Grading and giving feedback has always been a time-consuming process, and the move to digital grading did not improve this compared to grading on paper.
Even though the process itself was optimized, this optimization was used by graders to write out more and more comprehensive feedback.

Since evaluations consist of a few exercises solved by lots of students, there are usually a lot of mistakes that are common to many students.
This leads to graders giving the same feedback many times.
In fact, most graders maintained a list of commonly given feedback in a separate program or document.

We implemented the concept of feedback reuse to streamline giving commonly reused feedback.
When giving feedback, the grader has the option to save the annotation they are currently writing.
When they later encounter a situation where they want to give that same feedback, the only thing they have to do is write a few letters of the annotation in the saved annotation search box, and they can quickly insert the text written earlier.
An example of this can be seen in Figure\nbsp{}[[fig:feedbackreusesaved]].

#+CAPTION: An example of searching for a previously saved annotation.
#+NAME: fig:feedbackreusesaved
[[./images/feedbackreusesaved.png]]

While originally conceptualized mainly for the benefit of graders, students can actually benefit from this feature as well.
Graders only need to write out a detailed and clear message once and can then reuse that message over a lot of submissions, instead of writing a shorter message each time.
Because feedback is also added to a specific section of code, graders naturally write atomic feedback that is easier to reuse than monolithic sections of feedback\nbsp{}[cite:@moonsAtomicReusableFeedback2022].

** Phase 4: Feedback prediction
:PROPERTIES:
:CREATED: [2023-11-20 Mon 13:04]
:CUSTOM_ID: sec:feedbackprediction
:END:

Given that we now have a system for reusing earlier feedback, we can ask ourselves if we can do this in a smarter way.
Instead of teachers having to search for the annotation they want to use, what if we could predict which annotation they want to use?
This is exactly what we will explore in this section.

*** Introduction
:PROPERTIES:
:CREATED: [2024-01-19 Fri 15:47]
:CUSTOM_ID: subsec:feedbackpredictionintro
:END:

Feedback is a key factor in student learning\nbsp{}[cite:@hattiePowerFeedback2007; @blackAssessmentClassroomLearning1998].
In programming education, many steps have been taken to give feedback using automated assessment systems\nbsp{}[cite:@paivaAutomatedAssessmentComputer2022; @ihantolaReviewRecentSystems2010; @ala-mutkaSurveyAutomatedAssessment2005].
These automated assessment systems give feedback on correctness, and can give some feedback on style and best practices through the use of linters.
They are, however, generally not able to give the same high-level feedback on program design that an experienced programmer can give.
In many educational practices, automated assessment is therefore supplemented with manual feedback, especially when grading evaluations or exams\nbsp{}[cite:@debuseEducatorsPerceptionsAutomated2008].
This requires a significant time investment from teachers\nbsp{}[cite:@tuckFeedbackgivingSocialPractice2012].

As a result, many researchers have explored the use of AI to enhance the process of giving feedback.
[cite/t:@vittoriniAIBasedSystemFormative2021] used natural language processing to automate grading, and found that students who used the system during the semester were more likely to pass the course at the end of the semester.
[cite/t:@leeSupportingStudentsGeneration2023] used supervised learning with ensemble learning to enable students to perform peer and self-assessment.
In addition, [cite/t:@berniusMachineLearningBased2022] introduced a framework based on clustering text segments in textual exercises to reduce the grading workload.

The context of our work is our own assessment system Dodona, developed at Ghent University\nbsp{}[cite:@vanpetegemDodonaLearnCode2023].
Dodona gives automated feedback on each submission, but also has a module that allows teachers to give manual feedback on student submissions and assign scores to them, from within the platform.
The process of giving feedback on a solution to a programming exercise in Dodona is very similar to a code review, where errors or suggestions for improvements are annotated on the relevant line(s), as can be seen in Figure\nbsp{}[[fig:feedbackintroductionreview]].
In 2023, there were 3\thinsp{}663\thinsp{}749 submissions on our platform, of which 44\thinsp{}012 were assessed manually.
During these assessments, 22\thinsp{}888 annotations were added.

#+CAPTION: Manual assessment of a submission: a teacher gives feedback on the code by adding inline annotations and scores the submission by filling out the exercise-specific scoring rubric.
#+CAPTION: The teacher just searched for an annotation so that they could reuse it.
#+CAPTION: Automated assessment was already performed beforehand, where 22 test cases failed, as can be seen from the badge on the "Correctness" tab.
#+CAPTION: An automated annotation left by PyLint can be seen on line 22.
#+NAME: fig:feedbackintroductionreview
[[./images/feedbackintroductionreview.png]]

However, there is a crucial difference between traditional code reviews and those in an educational context: teachers often give feedback on numerous solutions to the same exercise.
Since students often make similar mistakes, it logically follows that teachers will repeatedly give the same feedback on multiple student submissions.
In response to this repetitive nature of feedback, Dodona has implemented a feature that allows teachers to save and later retrieve specific annotations.
This feature facilitates the reuse of feedback by allowing teachers to search for previously saved annotations.
In 2023, 777 annotations were saved by teachers on Dodona, and there were 7\thinsp{}180 instances of reuse of these annotations.
By using this functionality, we have generated data that we can use in this study: code submissions, where those submissions have been annotated on specific lines with annotations that are shared across those submissions.

In this section we answer the following research question: in the context of manually assessing code written by students during an evaluation, can we use previously added annotations to predict what annotation a reviewer will add on a particular line?

We present a machine learning method for suggesting reuse of previously given feedback.
We begin with a detailed explanation of the design of the method.
We then present and discuss the experimental results we obtained by testing our method on student submissions.
The dataset we use is based on real (Python) code written by students for exams.
First, we test our method by predicting automated PyLint annotations.
Second, we use manual annotations left by human reviewers during assessment.

*** Methodology
:PROPERTIES:
:CREATED: [2024-01-08 Mon 13:18]
:CUSTOM_ID: subsec:feedbackpredictionmethodology
:END:

The approach we present for predicting what feedback a reviewer will give on source code is based on mining patterns from trees.
This is a data mining technique for extracting frequently occurring patterns from data that can be represented as trees.
It was developed in the early 2000s\nbsp{}[cite:@zakiEfficientlyMiningFrequent2005; @asaiEfficientSubstructureDiscovery2004].
Program code can be represented as an abstract syntax tree (AST), where the nodes of the tree represent the language constructs used in the program.
Recent work has used this fact to investigate how these pattern mining algorithms can be used to efficiently find frequent patterns in source code\nbsp{}[cite:@phamMiningPatternsSource2019].
In an educational context, these techniques could then be used, for example, to find patterns common to solutions that failed a given exercise\nbsp{}[cite:@mensGoodBadUgly2021].
Other work has looked at automatically generating unit tests from mined patterns\nbsp{}[cite:@lienard2023extracting].

We start with a general overview of our method (Figure\nbsp{}[[fig:feedbackmethodoverview]]).
The first step is to use the tree-sitter library\nbsp{}[cite:@brunsfeldTreesitterTreesitterV02024] to generate ASTs for each submission.
Using tree-sitter should make our method independent of the programming language used, since it is a generic interface for generating syntax trees.
For each instance of an annotation, a constrained AST context surrounding the annotated line is extracted.
We then aggregate all subtrees of an annotation.
The collection of subtrees for each annotation is processed by the =TreeminerD= algorithm\nbsp{}[cite:@zakiEfficientlyMiningFrequent2005], yielding a set of frequently occurring patterns specific to that annotation.
We assign weights to these patterns based on their length and their frequency across the entire dataset of patterns for all annotations.
The result of these operations is our trained model.

The model can now be used to suggest matching patterns and thus annotations for a given code fragment.
In practice, the reviewer first selects a line of code in a given student's submission.
Next, the AST of the selected line and its surrounding context is generated.
For each annotation, each of its patterns is matched to the line, and a similarity score is calculated, given the previously determined weights.
This similarity score is used to rank the annotations and this ranking is displayed to the teacher.

A detailed explanation of this process follows, with a particular emphasis on operational efficiency.
Speed is a paramount concern throughout the model's lifecycle, from training to deployment in real-time reviewing contexts.
Given the continuous generation of training data during the reviewing process, the model's training time must be optimized to avoid significant delays, ensuring that the model remains practical for live reviewing situations.

#+CAPTION: Overview of our machine learning method for predicting annotation reuse.
#+CAPTION: Code of previously reviewed submissions is converted to its abstract syntax tree (AST) form.
#+CAPTION: Instances of the same annotation have the same colour.
#+CAPTION: For each annotation, the context of each instance is extracted and mined for patterns using the =TreeminerD= algorithm.
#+CAPTION: These patterns are then weighted to form our model.
#+CAPTION: When a teacher wants to place an annotation on a line of the submission they are currently reviewing, all previously given annotations are ranked based on the similarity determined for that line.
#+CAPTION: The teacher can then choose which annotation they want to place.
#+NAME: fig:feedbackmethodoverview
[[./diagrams/feedbackmethodoverview.svg]]

**** Extract subtree around a line
:PROPERTIES:
:CREATED: [2024-01-19 Fri 15:44]
:CUSTOM_ID: subsubsec:feedbackpredictionsubtree
:END:

Currently, the context around a line is extracted by taking all AST nodes from that line.
For example, Figure\nbsp{}[[fig:feedbacksubtree]] shows the subtree extracted for the code on line 3 of Listing\nbsp{}[[lst:feedbacksubtreesample]].
Note that the context we extract here is very limited.
Previous iterations considered all nodes that contained the relevant line (e.g. the function node for a line in a function), but these contexts proved too large to process in an acceptable time frame.

#+ATTR_LATEX: :float t
#+CAPTION: Example code that simply adds a number to the ASCII value of a character and converts it back to a character.
#+NAME: lst:feedbacksubtreesample
#+BEGIN_SRC python
def jump(alpha, n):
    alpha_number = ord(alpha)
    adjusted = alpha_number + n
    return chr(adjusted)
#+END_SRC

#+CAPTION: AST subtree corresponding to line 3 in Listing\nbsp{}[[lst:feedbacksubtreesample]].
#+NAME: fig:feedbacksubtree
[[./diagrams/feedbacksubtree.svg]]

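To make this extraction step more concrete, the snippet below sketches the idea of collecting all AST nodes that lie on a single line.
The actual implementation uses tree-sitter to stay language-independent; purely for illustration, this sketch uses Python's built-in =ast= module instead, and the helper name =nodes_on_line= is ours.

#+BEGIN_SRC python
# Illustration only: the real pipeline uses tree-sitter, but the idea of
# "all AST nodes on a given line" is the same with Python's ast module.
import ast

CODE = """def jump(alpha, n):
    alpha_number = ord(alpha)
    adjusted = alpha_number + n
    return chr(adjusted)
"""

def nodes_on_line(source, line):
    """Return the node types of all AST nodes that start on `line`."""
    tree = ast.parse(source)
    return [type(node).__name__
            for node in ast.walk(tree)
            if getattr(node, "lineno", None) == line]

# Line 3 is `adjusted = alpha_number + n`, cf. the listing above.
print(nodes_on_line(CODE, 3))
# e.g. ['Assign', 'Name', 'BinOp', 'Name', 'Name']
#+END_SRC
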
**** Find frequent patterns
:PROPERTIES:
:CREATED: [2023-11-20 Mon 13:33]
:CUSTOM_ID: subsubsec:feedbackpredictiontreeminer
:END:

=Treeminer=\nbsp{}[cite:@zakiEfficientlyMiningFrequent2005] is an algorithm for discovering frequently occurring subtrees in datasets of rooted, ordered and labelled trees.
It does this by starting with a list of frequently occurring nodes, and then iteratively expanding the frequently occurring patterns.

In the base =Treeminer= algorithm, frequently occurring means that the number of times the pattern occurs in all trees divided by the number of trees is greater than some predefined threshold.
This is called the =minimum support= parameter.

Patterns are embedded subtrees: the nodes in a pattern are a subset of the nodes of the tree, preserving the ancestor-descendant relationships and the left-to-right order of the nodes.

=TreeminerD= is a more efficient variant of the base =Treeminer= algorithm.
It achieves this efficiency by not counting occurrences of a frequent pattern within a tree.
Since we are not interested in this information for our method, it was an obvious choice to use the =TreeminerD= variant.

We use a custom Python implementation of the =TreeminerD= algorithm to find patterns in the AST subtrees for each annotation.
In our implementation, we set the =minimum support= parameter to 0.8.
This value was determined experimentally.

For example, one annotation in our real-world dataset had 92 instances placed on 47 student submissions.
For this annotation =TreeminerD= finds 105\thinsp{}718 patterns.

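As a small illustration of what the =minimum support= parameter means (a sketch of the selection criterion only, not of =TreeminerD= itself): a pattern is kept for an annotation when it occurs in at least 80% of the subtrees associated with that annotation.

#+BEGIN_SRC python
# Sketch of the minimum support criterion only (not of TreeminerD itself).
# `pattern_to_trees` is assumed to map each candidate pattern to the set of
# subtree ids in which it occurs.
def frequent_patterns(pattern_to_trees, n_trees, min_support=0.8):
    """Keep patterns that occur in at least `min_support` of the trees."""
    return {pattern
            for pattern, trees in pattern_to_trees.items()
            if len(trees) / n_trees >= min_support}

# Toy example: pattern "a" occurs in 4 of 5 trees, pattern "b" in 2 of 5.
candidates = {"a": {1, 2, 3, 4}, "b": {2, 5}}
print(frequent_patterns(candidates, n_trees=5))  # {'a'}
#+END_SRC
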
**** Assign weights to patterns
:PROPERTIES:
:CREATED: [2023-11-22 Wed 14:39]
:CUSTOM_ID: subsubsec:feedbackpredictionweights
:END:

Due to the iterative nature of =TreeminerD=, many patterns are (embedded) subtrees of other patterns.
We do not do any post-processing to remove these patterns, since they might be relevant to code we have not seen yet, but we do assign weights to them.
Weights are assigned using two criteria.

The first criterion is the size of the pattern (i.e., the number of nodes in the pattern), since a pattern with twenty nodes is much more specific than a pattern with only one node.
The second criterion is the number of occurrences of a pattern across all annotations.
If the pattern sets of all annotations contain a particular pattern, it cannot be used reliably to determine which annotation should be predicted and is therefore given a lower weight.
Weights are calculated using the formula below.
\[\operatorname{weight}(pattern) = \frac{\operatorname{len}(pattern)}{\operatorname{\#occurrences}(pattern)}\]

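The sketch below shows one way this weighting could be computed, assuming a pattern is represented as a tuple of node labels in the depth-first encoding used by =TreeminerD= (with =-1= marking a move back up the tree); the annotation names and helper functions are hypothetical.

#+BEGIN_SRC python
from collections import Counter

# Sketch of the weighting step; a pattern is assumed to be a tuple of node
# labels in depth-first order, with -1 marking a move back up the tree.
def pattern_size(pattern):
    """Number of nodes in a pattern (the -1 entries are not nodes)."""
    return sum(1 for label in pattern if label != -1)

def assign_weights(patterns_per_annotation):
    """patterns_per_annotation: annotation id -> set of mined patterns."""
    occurrences = Counter()
    for patterns in patterns_per_annotation.values():
        occurrences.update(patterns)
    return {
        annotation: {pattern: pattern_size(pattern) / occurrences[pattern]
                     for pattern in patterns}
        for annotation, patterns in patterns_per_annotation.items()
    }

# A pattern shared by both annotations gets half the weight of an
# equally sized pattern that is unique to one annotation.
model = assign_weights({
    "wrong-variable-name": {("assignment", "identifier", -1), ("identifier",)},
    "unused-import": {("import_statement",), ("identifier",)},
})
print(model["wrong-variable-name"])
#+END_SRC
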
**** Match patterns to subtrees
:PROPERTIES:
:CREATED: [2024-02-01 Thu 14:25]
:CUSTOM_ID: subsubsec:feedbackpredictionmatching
:END:

To check whether a given pattern matches a given subtree, we iterate over all the nodes in the subtree.
At the same time, we also iterate over the nodes in the pattern.
During the iteration, we also store the current depth, both in the pattern and the subtree.
We also keep a stack to store (some of) the depths of the subtree.
If the current label in the subtree and the pattern are the same, we store the current subtree depth on the stack and move to the next node in the pattern.
Moving up in the tree is more complicated.
If the current depth and the depth of the last match (stored on the stack) are the same, we can move forwards in the pattern (and the subtree).
If not, we need to check that we are still in the embedded subtree, otherwise we need to reset our position in the pattern to the start.
Since subtrees can contain the same label multiple times, we also need to make sure that we can backtrack.
Listings\nbsp{}[[lst:feedbackmatchingpseudocode1]]\nbsp{}and\nbsp{}[[lst:feedbackmatchingpseudocode2]] contain the full pseudocode for this algorithm.

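Both the patterns and the subtrees in the listings below are stored in the string encoding of\nbsp{}[cite/t:@zakiEfficientlyMiningFrequent2005]: node labels are listed in depth-first (preorder) order and =-1= marks a move back up to the parent.
The small example below (with illustrative labels) shows the idea; trailing backtracks to the root are conventionally omitted.

#+BEGIN_SRC python
# Illustration (made-up labels) of the depth-first string encoding used in
# the listings below, where -1 means "go back up one level in the tree".
#
#         assignment
#         /        \
#  identifier   binary_operator
#                 /         \
#           identifier   identifier
encoded_tree = [
    "assignment",
        "identifier", -1,
        "binary_operator",
            "identifier", -1,
            "identifier", -1,
        -1,
]

# An embedded subtree pattern only keeps ancestor-descendant relationships,
# e.g. "an assignment with two identifiers somewhere below it":
pattern = ["assignment", "identifier", -1, "identifier", -1]
#+END_SRC
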
#+ATTR_LATEX: :float t
#+CAPTION: Pseudocode for checking whether a pattern matches a subtree.
#+CAPTION: Note that both the pattern and the subtree are stored in the encoding described by\nbsp{}[cite/t:@zakiEfficientlyMiningFrequent2005].
#+CAPTION: The implementation of =find_in_subtree= can be found in Listing\nbsp{}[[lst:feedbackmatchingpseudocode2]].
#+NAME: lst:feedbackmatchingpseudocode1
#+BEGIN_SRC python
start, p_i, pattern_depth, depth = 0, 0, 0, 0
depth_stack, history = [], []

def subtree_matches(subtree, pattern):
    result = find_in_subtree(subtree, subtree)
    while not result and history is not empty:
        to_explore, to_explore_subtree = history.pop()
        while not result and to_explore is not empty:
            start, depth, depth_stack, p_i = to_explore.pop()
            new_subtree = to_explore_subtree[start:]
            start = 0
            if pattern_length - p_i <= len(new_subtree) and pattern[p_i:] is fully contained in new_subtree:
                result = find_in_subtree(subtree, new_subtree)
    return result
#+END_SRC

#+ATTR_LATEX: :float t
#+CAPTION: Continuation of Listing\nbsp{}[[lst:feedbackmatchingpseudocode1]].
#+NAME: lst:feedbackmatchingpseudocode2
#+BEGIN_SRC python
def find_in_subtree(subtree, current_subtree):
    local_history = []
    for i, item in enumerate(current_subtree):
        if item == -1:
            if depth_stack is not empty and depth - 1 == depth_stack.last:
                depth_stack.pop()
                if pattern[p_i] != -1:
                    p_i = 0
                    if depth_stack is empty:
                        history.append((local_history, current_subtree[:i + 1]))
                        local_history = []
                else:
                    p_i += 1
            depth -= 1
        else:
            if pattern[p_i] == item:
                local_history.append((start + i + 1, depth + 1, depth_stack, p_i))
                depth_stack.append(depth)
                p_i += 1
            depth += 1
        if p_i == pattern_length:
            return True
    if local_history is not empty:
        history.append((local_history, current_subtree))
    return False
#+END_SRC

Checking whether a pattern matches a subtree is an operation that needs to happen a lot.
For some annotations, there are many patterns, and all patterns of all annotations are checked.
An important optimization we added was to run the algorithm in Listings\nbsp{}[[lst:feedbackmatchingpseudocode1]]\nbsp{}and\nbsp{}[[lst:feedbackmatchingpseudocode2]] only if the set of labels in the pattern is a subset of the labels in the subtree.

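This pre-check is cheap because it only compares label sets, as sketched below (assuming the depth-first encoding introduced earlier, where =-1= entries are backtracks rather than labels).

#+BEGIN_SRC python
# Sketch of the label pre-check used before running the full matcher.
def could_match(pattern, subtree):
    """Cheap necessary condition: every pattern label occurs in the subtree."""
    pattern_labels = {label for label in pattern if label != -1}
    subtree_labels = {label for label in subtree if label != -1}
    return pattern_labels <= subtree_labels
#+END_SRC
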
**** Rank annotations
:PROPERTIES:
:CREATED: [2023-11-22 Wed 14:47]
:CUSTOM_ID: subsubsec:feedbackpredictionsimilarity
:END:

Given a model where we have weighted patterns for each annotation, and a method for matching patterns to subtrees, we can now put the two together to make a final ranking of the available annotations for a given line of code.
We compute a match score for each annotation using the formula below.
\[ \operatorname{score}(annotation) = \frac{\displaystyle\sum_{pattern \atop \in\, patterns} \begin{cases} \operatorname{weight}(pattern) & pattern \text{ matches} \\ 0 & \text{otherwise} \end{cases}}{\operatorname{len}(patterns)} \]
The annotations are ranked according to this score.

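Putting the weighted model and the matching procedure together, the ranking step can be sketched as follows; =matches= stands for the matching algorithm from the listings above and all helper names are illustrative.

#+BEGIN_SRC python
# Sketch of the ranking step. `model` maps an annotation to its weighted
# patterns ({pattern: weight}); `matches(pattern, subtree)` is any callable
# returning True when the pattern matches the subtree.
def rank_annotations(model, subtree, matches):
    scores = {}
    for annotation, weighted_patterns in model.items():
        matched_weight = sum(weight
                             for pattern, weight in weighted_patterns.items()
                             if matches(pattern, subtree))
        scores[annotation] = (matched_weight / len(weighted_patterns)
                              if weighted_patterns else 0.0)
    # Highest score first; this is the ranking shown to the reviewer.
    return sorted(scores, key=scores.get, reverse=True)
#+END_SRC
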
*** Results and discussion
:PROPERTIES:
:CREATED: [2024-01-08 Mon 13:18]
:CUSTOM_ID: subsec:feedbackpredictionresults
:END:

As a dataset to validate our method, we used real (Python) code written by students for exercises from (different) exams.
The dataset contains between 135 and 214 submissions per exercise.
We split the datasets evenly into a training set and a test set.
During the testing phase, we iterate over all instances of annotations in the test set.
The lines these instances occur on are the lines we feed to our model.
We evaluate whether the correct annotation is ranked first, or whether it is ranked in the top five.
This gives us a good idea of how useful the suggested ranking would be in practice: if an annotation is ranked outside the top five, we would expect the reviewer to have to search for it instead of directly selecting it from the suggested ranking.

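Concretely, the reported numbers are top-1 and top-5 accuracies over the test instances, which can be sketched as follows (reusing the illustrative =rank_annotations= helper from above):

#+BEGIN_SRC python
# Sketch of the evaluation loop: top-k accuracy over test instances.
# `test_instances` is assumed to be a list of (line_subtree, true_annotation)
# pairs; `rank_annotations` and `matches` are the helpers sketched above.
def top_k_accuracy(test_instances, model, matches, k=1):
    hits = 0
    for line_subtree, true_annotation in test_instances:
        ranking = rank_annotations(model, line_subtree, matches)
        if true_annotation in ranking[:k]:
            hits += 1
    return hits / len(test_instances)

# top1 = top_k_accuracy(test_instances, model, matches, k=1)
# top5 = top_k_accuracy(test_instances, model, matches, k=5)
#+END_SRC
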
We first ran PyLint on the student submissions, and used PyLint's machine annotations as our training and test data.
For a second evaluation, we used the manual annotations left by human reviewers on student code in Dodona.
In this case, we train and test per exercise, since the set of annotations used is also different for each exercise.
Exercises have between 55 and 469 instances of annotations.
The number of unique annotations varies between 11 and 92 per exercise.
Table\nbsp{}[[tab:feedbackresultsdataset]] gives an overview of some of the features of the dataset.

#+CAPTION: Statistics for the exercises used in this analysis.
#+CAPTION: Max is the maximum number of instances per annotation.
#+CAPTION: Avg is the average number of instances per annotation.
#+NAME: tab:feedbackresultsdataset
| Exercise | subm. | ann. | inst. | max | avg |
|-------------------------------------------------------------------------+-------+------+-------+-----+------|
| <l> | <r> | <r> | <r> | <r> | <r> |
| A last goodbye[fn:: https://dodona.be/en/activities/505886137/] | 135 | 35 | 334 | 92 | 9.5 |
| Symbolic[fn:: https://dodona.be/en/activities/933265977/] | 141 | 11 | 55 | 24 | 5.0 |
| Narcissus cipher[fn:: https://dodona.be/en/activities/1730686412/] | 144 | 24 | 193 | 53 | 8.0 |
| Cocktail bar[fn:: https://dodona.be/en/activities/1875043169/] | 211 | 92 | 469 | 200 | 5.09 |
| Anthropomorphic emoji[fn:: https://dodona.be/en/activities/2046492002/] | 214 | 84 | 322 | 37 | 3.83 |
| Hermit[fn:: https://dodona.be/en/activities/2146239081/] | 194 | 70 | 215 | 29 | 3.07 |

We distinguish between these two sources of annotations because we expect PyLint to be more consistent, both in when it places an instance of an annotation and in where it places that instance.
Most linting annotations are detected through explicit pattern matching in the AST, so we expect our implicit pattern matching to work fairly well.
However, we want to avoid this kind of explicit pattern matching because of the time required to assemble the patterns and the fact that annotations will often be specific to a particular exercise.
Therefore we also test on real-world data.
Real-world data is expected to be more inconsistent, because a human reviewer may miss a problem in one student's code that they annotated in another student's code, or they may not place an instance of a particular annotation in a consistent location.
The method by which human reviewers place an annotation is also much more implicit than PyLint's pattern matching.

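As an illustration of how such machine annotations can be collected (not necessarily the exact commands used in this study), PyLint can report its messages with file, line and message identifiers in JSON form:

#+BEGIN_SRC python
# Illustration: collecting PyLint messages per (file, line) pair.
import json
import subprocess
from collections import defaultdict

def pylint_annotations(files):
    """Map (path, line) to the list of PyLint message symbols reported there."""
    result = subprocess.run(
        ["pylint", "--output-format=json", *files],
        capture_output=True, text=True,
        check=False,  # PyLint exits with a non-zero status when it finds messages
    )
    annotations = defaultdict(list)
    for message in json.loads(result.stdout):
        annotations[(message["path"], message["line"])].append(message["symbol"])
    return annotations
#+END_SRC
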
**** Machine annotations (PyLint)
:PROPERTIES:
:CUSTOM_ID: subsubsec:feedbackpredictionresultspylint
:CREATED: [2023-11-20 Mon 13:33]
:END:

We will first discuss the results for the PyLint annotations.
Depending on the exercise, the actual annotation is ranked among the top five annotations for 50% to 80% of all test instances (Figure\nbsp{}[[fig:feedbackpredictionpylintglobal]]).
The annotation is even ranked first for 10% to 45% of all test instances.
Interestingly, the method performs worse when the instances for all exercises are combined.
This highlights the fact that our method is most useful in the context where similar code needs to be reviewed many times.
Training takes 1.5 to 50 seconds to process all submissions and instances in a training set, depending on the number of patterns found.
Testing takes 4 seconds to 22 minutes, again depending on the number of patterns.
Performance was measured on a consumer-grade MacBook Pro with a 1.4GHz Intel quad-core processor and 16 GB of RAM.

#+CAPTION: Predictive accuracy for suggesting instances of PyLint annotations.
#+NAME: fig:feedbackpredictionpylintglobal
[[./images/feedbackpredictionpylintglobal.png]]

We have selected some annotations for further inspection, some of which perform very well, and some of which perform worse (Figure\nbsp{}[[fig:feedbackpredictionpylintmessages]]).
The differences in performance can be explained by the content of the annotation and the underlying patterns PyLint is looking for.
For example, the annotation "too many branches"[fn:: https://pylint.pycqa.org/en/latest/user_guide/messages/refactor/too-many-branches.html] performs rather poorly.
This can be explained by the fact that we do not feed enough context to =TreeminerD= to find predictive patterns for this PyLint annotation.
There are also annotations that cannot be predicted at all, because no patterns are found.

Other annotations, like "consider using with"[fn:: https://pylint.pycqa.org/en/latest/user_guide/messages/refactor/consider-using-with.html], work very well.
For these annotations, =TreeminerD= does have enough context to pick up the underlying patterns.
The number of instances of an annotation in the training set also has an impact.
Annotations with only a few instances are generally predicted worse than those with lots of instances.

#+CAPTION: Predictive accuracy for a selection of PyLint annotations.
#+CAPTION: Each line corresponds to a PyLint annotation, with the number of instances in the training and test set denoted in brackets after the name of the annotation.
#+NAME: fig:feedbackpredictionpylintmessages
[[./images/feedbackpredictionpylintmessages.png]]

**** Human annotations
|
||
:PROPERTIES:
|
||
:CREATED: [2023-11-20 Mon 13:33]
|
||
:CUSTOM_ID: subsubsec:feedbackpredictionresultsrealworld
|
||
:END:
|
||
|
||
For the annotations added by human reviewers, we applied two different scenarios to evaluate our method.
|
||
Besides using the same 50/50 split as with the PyLint data, we also simulated how a human reviewer would use the method in practice by gradually increasing the training set and decreasing the test set as the reviewer progresses through the submissions during the assessment.
|
||
At the start of the assessment no annotations are available and the first instance of an annotation that applies to a reviewed submission cannot be predicted.
|
||
As more submissions have been reviewed, and more instances of annotations are placed on those submissions, the training set for modelling predictions on the next submission under review grows gradually.
|
||
|
||
If we evenly split submissions and the corresponding annotations from a human reviewer into a training and a test set, the predictive accuracy is similar or even slightly better compared to the PyLint annotations (Figure\nbsp{}[[fig:feedbackpredictionrealworldglobal]]).
|
||
The number of instances where the true annotation is ranked first is generally higher (between 20% and 62% depending on the exercise), and the number of instances where it is ranked in the top five is between 42.5% and 81% depending on the exercise.
|
||
However, there is quite some variance between exercises.
|
||
This can be explained by the quality of the data.
|
||
For example, for the exercise "Symbolic", very few instances were placed for most annotations, which makes it difficult to predict additional instances.
|
||
For this experiment, training took between 1.5 and 16 seconds depending on the exercise.
|
||
Testing took between 1.5 and 36 seconds depending on the exercise.
|
||
These evaluations were run on the same hardware as those for the machine annotations.
|
||
|
||
#+CAPTION: Prediction results for six exercises that were designed and used for an exam.
|
||
#+CAPTION: Models were trained on half of the submissions from the dataset and tested on the other half of the submissions from the dataset.
|
||
#+NAME: fig:feedbackpredictionrealworldglobal
|
||
[[./images/feedbackpredictionrealworldglobal.png]]
|
||
|
||
These results show that we can predict reuse with an accuracy that is quite high at the midpoint of a reviewing session for an exercise.
|
||
The accuracy depends on the amount of instances per annotation and the consistency of the reviewer.
|
||
Looking at the underlying data, we can also see that short, atomic messages can be predicted very well, as hinted by\nbsp{}[cite/t:@moonsAtomicReusableFeedback2022].
|
||
We will now look at the accuracy of our method over time, to test how the accuracy evolves as the reviewing session progresses.
|
||
|
||
For the next experiment, we introduce two specific categories of negative predictive outcomes, namely "Not yet seen" and "No patterns found".
"Not yet seen" means that the annotation corresponding to the true instance had no instances in the training set, and therefore could never have been predicted.
"No patterns found" means that =TreeminerD= was unable to find any frequent patterns for the set of subtrees.
We know beforehand that test instances of such annotations cannot be predicted.

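For clarity, the sketch below shows how a single prediction could be assigned to exactly one outcome category; the category labels and the =patterns= mapping are illustrative assumptions rather than the exact bookkeeping used in our experiments.

#+BEGIN_SRC python
def categorise(true_annotation, ranking, training_annotations, patterns, k=5):
    """Assign one prediction to a single outcome category (labels are illustrative)."""
    if true_annotation not in training_annotations:
        return "Not yet seen"           # no instance in the training set, prediction impossible
    if not patterns.get(true_annotation):
        return "No patterns found"      # TreeminerD found no frequent patterns for it
    if ranking and ranking[0] == true_annotation:
        return "Ranked first"
    if true_annotation in ranking[:k]:
        return "Ranked in top five"
    return "Not predicted"
#+END_SRC
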
Figures\nbsp{}[[fig:feedbackpredictionrealworldsimulation1]],\nbsp{}[[fig:feedbackpredictionrealworldsimulation2]],\nbsp{}[[fig:feedbackpredictionrealworldsimulation3]]\nbsp{}and\nbsp{}[[fig:feedbackpredictionrealworldsimulation4]] show the results of this experiment for four of the exercises we used in the previous experiments.
The exercises that yielded worse results in the previous experiment were not included in this experiment.
We also excluded submissions that received no annotations during the human review process, which explains the lower number of submissions compared to the numbers in Table\nbsp{}[[tab:feedbackresultsdataset]].
This experiment shows that while the review process requires some build-up before sufficient training instances are available, once a critical mass of training instances is reached, the accuracy of the suggestions for new instances of annotations reaches its maximum.
This critical mass is reached after about 20 to 30 reviews, which is quite early in the reviewing process.
The point at which the critical mass is reached will of course depend on the nature of the exercises and the consistency of the reviewer.

#+CAPTION: Progression of the predictive accuracy for the exercise "A last goodbye" throughout the review process.
#+CAPTION: Predictions for instances whose annotation had no instances in the training set are classified as "Not yet seen".
#+CAPTION: Predictions for instances whose annotation had no corresponding patterns in the model learned from the training set are classified as "No patterns found".
#+CAPTION: The graph on the right shows the number of annotations present with at least one instance in the training set.
#+NAME: fig:feedbackpredictionrealworldsimulation1
[[./images/feedbackpredictionrealworldsimulation1.png]]

#+CAPTION: Progression of the predictive accuracy for the exercise "Narcissus cipher" throughout the review process.
#+CAPTION: Predictions for instances whose annotation had no instances in the training set are classified as "Not yet seen".
#+CAPTION: Predictions for instances whose annotation had no corresponding patterns in the model learned from the training set are classified as "No patterns found".
#+CAPTION: The graph on the right shows the number of annotations present with at least one instance in the training set.
#+NAME: fig:feedbackpredictionrealworldsimulation2
[[./images/feedbackpredictionrealworldsimulation2.png]]

#+CAPTION: Progression of the predictive accuracy for the exercise "Cocktail bar" throughout the review process.
#+CAPTION: Predictions for instances whose annotation had no instances in the training set are classified as "Not yet seen".
#+CAPTION: Predictions for instances whose annotation had no corresponding patterns in the model learned from the training set are classified as "No patterns found".
#+CAPTION: The graph on the right shows the number of annotations present with at least one instance in the training set.
#+NAME: fig:feedbackpredictionrealworldsimulation3
[[./images/feedbackpredictionrealworldsimulation3.png]]

#+CAPTION: Progression of the predictive accuracy for the exercise "Anthropomorphic emoji" throughout the review process.
#+CAPTION: Predictions for instances whose annotation had no instances in the training set are classified as "Not yet seen".
#+CAPTION: Predictions for instances whose annotation had no corresponding patterns in the model learned from the training set are classified as "No patterns found".
#+CAPTION: The graph on the right shows the number of annotations present with at least one instance in the training set.
#+NAME: fig:feedbackpredictionrealworldsimulation4
[[./images/feedbackpredictionrealworldsimulation4.png]]

As mentioned before, we are working with a slightly inconsistent data set when using annotations by human reviewers.
Reviewers will sometimes miss an instance of an annotation, place it inconsistently, or create duplicate annotations.
If this system is used in practice, the predictions could be even better, since knowing about its existence might further motivate a reviewer to be consistent in their reviews.
The exercises were also reviewed by different people, which may also explain the differences in prediction accuracy between exercises.
For example, the reviewers of the exercises in Figures\nbsp{}[[fig:feedbackpredictionrealworldsimulation3]]\nbsp{}and\nbsp{}[[fig:feedbackpredictionrealworldsimulation4]] were still creating new annotations in the last few submissions, which obviously cannot be predicted.

*** Conclusions and future work
:PROPERTIES:
:CREATED: [2023-11-20 Mon 13:33]
:CUSTOM_ID: subsec:feedbackpredictionconclusion
:END:

We presented a prediction method to assist in giving feedback while reviewing students' submissions for an exercise, by reusing annotations.
Improving annotation reuse can both save time and improve the consistency with which feedback is given.
The latter might in turn also improve the accuracy of the predictions when the strategy is applied during the review process.

The method has already shown promising results.
We validated the framework both by predicting automated linting annotations, to establish a baseline, and by predicting annotations from human reviewers.
The method has about the same predictive accuracy for machine (PyLint) and human annotations.
Thus, we can give a positive answer to our research question: reuse of feedback previously given by a human reviewer can be predicted with high accuracy for a particular line of a new submission.

We can conclude that the proposed method has achieved the desired objective.
Having this method at hand immediately suggests some possible follow-up work.
Currently, the proposed model is reactive: we suggest a ranking of the most likely annotations when a reviewer wants to add an annotation to a particular line.
By introducing a confidence score, we could check beforehand whether we have a confident match for each line, and then immediately propose those suggestions to the reviewer.
Whether or not a reviewer accepts these suggestions could then also be used as an input to the model.
This would have an additional advantage, since it could help reviewers be more consistent in where and when they place an annotation.

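A minimal sketch of such a proactive mode is shown below; the confidence threshold and the =score_annotations= helper, which is assumed to return a score per candidate annotation for a line, are assumptions and not part of the current implementation.

#+BEGIN_SRC python
CONFIDENCE_THRESHOLD = 0.8   # assumed value; would have to be tuned empirically

def proactive_suggestions(model, submission):
    """Collect one confident suggestion per line, to show before the reviewer clicks."""
    suggestions = {}
    for line in submission.lines:
        scores = score_annotations(model, submission, line)   # hypothetical helper
        if not scores:
            continue
        annotation, confidence = max(scores.items(), key=lambda item: item[1])
        if confidence >= CONFIDENCE_THRESHOLD:
            suggestions[line] = annotation
    return suggestions
#+END_SRC
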
Annotations that do not lend themselves well to prediction also need further investigation.
The context used could be expanded, although the important caveat here is that the method still needs to maintain its speed.
We could also consider applying some of the source code pattern mining techniques proposed by\nbsp{}[cite/t:@phamMiningPatternsSource2019] to achieve further speed improvements.
Another important aspect that was explicitly left outside the scope of this chapter is the integration of the method into a learning platform and the accompanying user testing.

Of course, alternative methods could also be considered.
One cannot overlook the rise of Large Language Models (LLMs) and the way they could contribute to solving this problem.
LLMs can also generate feedback for students, based on their code and a well-chosen system prompt.
Fine-tuning a model on feedback that has already been given could also be considered.

* Looking ahead: opportunities and challenges
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:51]
:CUSTOM_ID: chap:discussion
:END:

It feels safe to say that Dodona is a successful automated assessment platform with a considerable societal impact.
{{{num_users}}} users is quite a lot, and the fact that it is actively used in every university in Flanders, a number of colleges, and many secondary schools is a feat that few similar platforms have achieved.

As we have tried to show in this dissertation, its development has also led to interesting opportunities for new research.
Dodona generates a lot of data simply by being used, and we have shown that educational data mining techniques can be applied to this data.
This data can even be used to develop new educational data mining techniques that are applicable elsewhere.
The work is, however, never finished.
There are still possibilities for interesting computer science and educational research.

** Research opportunities
:PROPERTIES:
:CREATED: [2024-02-16 Fri 10:50]
:END:

A big question, left open in this work, is what to do with the results we obtained in Chapter\nbsp{}[[#chap:passfail]].
Teachers can use the results to figure out which aspects of their course students are struggling with, and take general measures to deal with this.
But should we communicate predictions to individual students, and if so, /how/?
And what other interventions with students should we consider?

Chapter\nbsp{}[[#chap:feedback]] also suggests a number of improvements that could still be worked on.
It gives us a framework for suggesting the feedback a teacher probably wants to give when selecting a line, but we could also try to come up with a confidence score and use that to suggest feedback before the teacher has even selected a line.
Another interesting (more educational) line of research that this work suggests is building the method into an actual assessment platform and studying its effects on feedback consistency and quality, the time saved by teachers,\nbsp{}...

A new idea for research using Dodona's data would be skill estimation.
There are a few ways we could try to infer which skills are being tested by an exercise: we could use the model solution, or the labels assigned to the exercise in Dodona.
Using those skills, we could then try to estimate a student's mastery of them, based on their submissions.

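As a rough illustration, the sketch below keeps a running mastery estimate per skill as an exponentially weighted success rate over a student's submissions; the skill labels attached to each submission and the weighting scheme are assumptions made purely for the sake of the example.

#+BEGIN_SRC python
from collections import defaultdict

def estimate_mastery(submissions, learning_rate=0.2):
    """Exponentially weighted success rate per skill, between 0 and 1."""
    mastery = defaultdict(lambda: 0.5)       # uninformative prior for unseen skills
    for submission in submissions:           # assumed to be in chronological order
        outcome = 1.0 if submission.correct else 0.0
        for skill in submission.skills:      # assumed: skills inferred from exercise labels
            mastery[skill] += learning_rate * (outcome - mastery[skill])
    return dict(mastery)
#+END_SRC
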
This leads right into another possibility for future research: exercise recommendation.
Right now, learning paths in Dodona are static, determined by the teacher of the course the student is following.
Dodona has a rich library of extra exercises, which some courses point to as opportunities for extra practice, but it is not always easy for students to know which exercises would be good for them.
The research from Chapter\nbsp{}[[#chap:passfail]] could also be used to help solve this problem.
If we know that a student has a higher chance of failing the course, we might want to recommend some easier exercises.
Conversely, if a student has a higher chance of passing, we could suggest harder exercises, so they can keep up their good progress.

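A very simple recommender driven by the predicted pass probability could look like the sketch below; the probability threshold and the per-exercise difficulty scores are hypothetical.

#+BEGIN_SRC python
def recommend_exercises(pass_probability, library, n=5):
    """Pick n extra exercises whose difficulty matches the predicted pass probability."""
    if pass_probability < 0.5:
        target = 0.3    # students at risk of failing get easier exercises
    else:
        target = 0.7    # students on track get more challenging ones
    ranked = sorted(library, key=lambda exercise: abs(exercise.difficulty - target))
    return ranked[:n]
#+END_SRC
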
The use of LLMs in Dodona could also be an opportunity.
As mentioned in Section\nbsp{}[[#subsec:feedbackpredictionconclusion]], a possibility for using LLMs could be to generate feedback while grading.
Another option is to integrate an LLM as an AI tutor (as, for example, Khan Academy has done with Khanmigo[fn:: https://www.khanmigo.ai/]).
This way, it could interactively help students while they are learning.
The final possibility we will present here is to prepare suggestions for answers to student questions on Dodona.
At first glance, LLMs should be quite good at this.
If we use the LLM output as a suggestion for what the teacher could answer, this should be a big time-saver.
However, there are some issues around data quality.
Questions are sometimes asked on a specific line, but the question does not necessarily have anything to do with that line.
Sometimes the question also needs context that is hard to pass on to the LLM.
For example, if the question is just "I don't know what's wrong.", a human might look at the failed test cases and be able to answer the "question" in that way.
Passing on the failed test cases to the LLM is a harder problem to solve.
The actual assignment also needs to be passed on, but depending on its size this might present a problem given the token limits and cost per token of some models.
Another important aspect of this research would be figuring out how to evaluate the quality of the suggestions.

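To make these data-quality concerns concrete, the sketch below shows roughly how such a suggestion pipeline would assemble the question, the student code, the assignment, and the failed test cases into a single prompt; =call_llm= is a hypothetical wrapper around whichever model is used, and the truncation limits are arbitrary placeholders.

#+BEGIN_SRC python
MAX_CHARS = 8_000   # arbitrary budget to stay within the model's token limit

def suggest_answer(question, code, assignment, failed_tests):
    """Draft an answer suggestion for the teacher to review; call_llm() is hypothetical."""
    context = "\n\n".join([
        "Assignment:\n" + assignment[:MAX_CHARS // 2],
        "Student code around the selected line:\n" + code,
        "Failed test cases:\n" + "\n".join(failed_tests[:5]),
        "Student question:\n" + question,
    ])
    prompt = (
        "You are a teaching assistant. Draft a short, helpful answer to the student "
        "question below. The teacher will review your draft before sending it.\n\n"
        + context
    )
    return call_llm(prompt)   # the draft is only ever shown to the teacher as a suggestion
#+END_SRC
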
** Challenges for the future
:PROPERTIES:
:CREATED: [2024-02-16 Fri 10:50]
:END:

Even though Dodona is a successful project with exciting possibilities for further research, it also faces some challenges.

The most important of these challenges is the sustainability of the project.
Dodona was started in the spare time of some researchers.
After a few years, somebody started working on it full-time.
However, the funding for a full-time developer was always, and still is, temporary.
PhD students who can devote some of their time to it are recruited, and grants are applied for (and sometimes awarded), but there is no stable source of funding.
We have the advantage that we can make use of Ghent University's data centre, resulting in very low operational costs.
A full-time developer, which a project of Dodona's size needs, is expensive though.
This puts Dodona's future in a precarious situation, where there is a constant need to look for new funding opportunities.

As much as generative AI can be an asset for Dodona, it is also a threat.
Most exercises in Dodona can be solved by LLMs without issues.[fn:: Or at least with some nudging.]
This has some troubling implications for Dodona.
Students using ChatGPT or GitHub Copilot when solving their exercises might not learn as much as students who do the work fully on their own (just like students who plagiarize have a lower chance of passing their courses, as seen in Chapter\nbsp{}[[#chap:passfail]]).
Another aspect is the fairness and integrity of evaluations using Dodona.
The case study in Chapter\nbsp{}[[#chap:use]] details the use of open-book/open-internet evaluations.
If students can use generative AI during these evaluations (either locally or via a web service), and knowing that LLMs can solve most exercises on Dodona, these evaluations will test the students' abilities less and less.
The way to solve these issues is not clear.
It seems like LLMs are here to stay, and just like the calculator is a commonplace tool these days, the same could be true for LLMs in the future.[fn::
The IMEC digimeter (a yearly survey on technology use in Flanders) showed that 18% of Flemish people used generative AI at least monthly in 2023.
]
Instead of banning the use of LLMs, teachers could integrate them into their courses.
On the other hand, when children first learn to count and add, they do not use calculators.
The same might be necessary when learning to program: to learn the basics, students might need to do a lot of things themselves, to really get a feel for what they are doing.

#+LATEX: \appendix
* Pass/fail prediction feature types
:PROPERTIES:
:CREATED: [2023-10-23 Mon 18:09]
:CUSTOM_ID: chap:featuretypes
:APPENDIX: t
:END:

- =subm= :: number of submissions by student in series
- =nosubm= :: number of exercises student did not submit any solutions for in series
- =first_dl= :: time difference in seconds between student's first submission in series and deadline of series
- =last_dl= :: time difference in seconds between student's last submission in series before deadline and deadline of series
- =nr_dl= :: number of correct submissions in series by student before series' deadline
- =correct= :: number of correct submissions in series by student
- =after_correct= :: number of submissions by student after their first correct submission in the series
- =before_correct= :: number of submissions by student before their first correct submission in the series
- =time_series= :: time difference in seconds between the student's first and last submission in the series
- =time_correct= :: time difference in seconds between the student's first submission in the series and their first correct submission in the series
- =wrong= :: number of submissions by student in series with logical errors
- =comp_error= :: number of submissions by student in series with compilation errors
- =runtime_error= :: number of submissions by student in series with runtime errors
- =correct_after_5m= :: number of exercises where first correct submission by student was made within five minutes after first submission
- =correct_after_15m= :: number of exercises where first correct submission by student was made within fifteen minutes after first submission
- =correct_after_2h= :: number of exercises where first correct submission by student was made within two hours after first submission
- =correct_after_24h= :: number of exercises where first correct submission by student was made within twenty-four hours after first submission

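To make these definitions concrete, the sketch below computes a few of the features (=subm=, =correct= and =first_dl=) for a single student and series from a list of submission records; the record fields are hypothetical and merely mirror the descriptions above.

#+BEGIN_SRC python
def series_features(submissions, deadline):
    """Compute a few feature types for one student and one series.

    submissions: this student's submissions in this series, in chronological order,
    each with a datetime `timestamp` and a boolean `correct` (assumed fields).
    """
    features = {}
    features["subm"] = len(submissions)
    features["correct"] = sum(1 for s in submissions if s.correct)
    if submissions:
        # seconds between the first submission and the series deadline
        features["first_dl"] = (deadline - submissions[0].timestamp).total_seconds()
    return features
#+END_SRC
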
* Bibliography
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:59]
:CUSTOM_ID: chap:bibliography
:UNNUMBERED: t
:END:

#+LATEX: {\setlength{\emergencystretch}{2em}
#+print_bibliography:
#+LATEX: }