#+TITLE: Dodona
#+AUTHOR: Charlotte Van Petegem
#+LATEX_CLASS: book
#+LATEX_CLASS_OPTIONS: [11pt,paper=240mm:170mm,paper=portrait]
#+LATEX_COMPILER: lualatex
#+LATEX_HEADER: \usepackage[inline]{enumitem}
#+LATEX_HEADER: \usepackage{listings}
#+LATEX_HEADER: \usepackage{color}
#+LATEX_HEADER: \usepackage[type=report]{ugent2016-title}
#+LATEX_HEADER: \usepackage[final]{microtype}
#+LATEX_HEADER: \usepackage[defaultlines=2,all]{nowidow}
#+LATEX_HEADER: \academicyear{2023–2024}
#+LATEX_HEADER: \subtitle{Learn to code with a data-driven platform}
#+LATEX_HEADER: \titletext{A dissertation submitted to Ghent University in partial fulfilment of the requirements for the degree of Doctor of Computer Science.}
#+LATEX_HEADER: \promotors{%
#+LATEX_HEADER: Supervisors:\\
#+LATEX_HEADER: Prof.\ Dr.\ Peter Dawyndt\\
#+LATEX_HEADER: Prof.\ Dr.\ Ir.\ Bart Mesuere\\
#+LATEX_HEADER: Prof.\ Dr.\ Bram De Wever
#+LATEX_HEADER: }
#+OPTIONS: toc:nil
#+OPTIONS: ':t
#+OPTIONS: H:4
#+cite_export: csl citation-style.csl
#+bibliography: bibliography.bib

#+LATEX: \frontmatter

* Table of Contents
:PROPERTIES:
:CREATED: [2023-10-23 Mon 14:10]
:CUSTOM_ID: chap:toc
:UNNUMBERED: t
:END:
#+LATEX: \listoftoc*{toc}

* Acknowledgements
:PROPERTIES:
:CREATED: [2023-10-23 Mon 09:25]
:CUSTOM_ID: chap:ack
:UNNUMBERED: t
:END:

* Summaries
:PROPERTIES:
:CREATED: [2023-10-23 Mon 17:56]
:CUSTOM_ID: chap:summ
:UNNUMBERED: t
:END:

** Summary in English
:PROPERTIES:
:CREATED: [2023-10-23 Mon 17:54]
:CUSTOM_ID: sec:summen
:END:

** Nederlandstalige samenvatting
:PROPERTIES:
:CREATED: [2023-10-23 Mon 17:54]
:CUSTOM_ID: sec:summnl
:END:

#+LATEX: \mainmatter

* Introduction
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:47]
:CUSTOM_ID: chap:intro
:END:

#+BEGIN_COMMENT
History of automated assessment
#+END_COMMENT

Learning how to solve problems with computer programs requires practice, and programming assignments are the main way in which such practice is generated [cite:@gibbsConditionsWhichAssessment2005].
Because of its potential to provide feedback loops that are scalable and responsive enough for an active learning environment, automated source code assessment has become a driving force in programming courses.
This has resulted in a proliferation of educational programming platforms [cite:@paivaAutomatedAssessmentComputer2022; @ihantolaReviewRecentSystems2010; @douceAutomaticTestbasedAssessment2005; @ala-mutkaSurveyAutomatedAssessment2005].
Automated assessment was introduced into programming education in the early 1960s [cite:@hollingsworthAutomaticGradersProgramming1960] and allows students to receive immediate and personalized feedback on each submitted solution without the need for human intervention.
[cite/t:@cheangAutomatedGradingProgramming2003] identified the labor-intensive nature of assessing programming assignments as the main reason why students are given few such assignments when ideally they should be given many more.
While almost all platforms support automated assessment of code submitted by students, contemporary platforms usually offer additional features such as gamification in the FPGE platform [cite:@paivaManagingGamifiedProgramming2022], integration of full-fledged editors in iWeb-TD [cite:@fonsecaWebbasedPlatformMethodology2023], exercise recommendations in PLearn [cite:@vasyliukDesignImplementationUkrainianLanguage2023], automatic grading with JavAssess [cite:@insaAutomaticAssessmentJava2018], assessment of test suites using test coverage measures in Web-CAT [cite:@edwardsWebCATAutomaticallyGrading2008] and automatic hint generation in GradeIT [cite:@pariharAutomaticGradingFeedback2017].

* What is Dodona?
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:47]
:CUSTOM_ID: chap:what
:END:

** Students
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:48]
:CUSTOM_ID: sec:whatstudents
:END:

** Teachers
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:48]
:CUSTOM_ID: sec:whatteachers
:END:

*** Classroom management
:PROPERTIES:
:CREATED: [2023-10-24 Tue 09:31]
:CUSTOM_ID: subsec:whatclassroom
:END:

In Dodona, a *course* is where teachers and instructors effectively manage a learning environment by instructing, monitoring and evaluating their students and interacting with them, either individually or as a group.
A Dodona user who created a course becomes its first administrator and can promote other registered users to *course administrators*.
In what follows, we will also use the generic term teacher as a synonym for course administrator if this Dodona-specific interpretation is clear from the context, but keep in mind that courses may have multiple administrators.

The course itself is laid out as a *learning path* that consists of course units called *series*, each containing a sequence of *learning activities* (Figure\nbsp{}[[fig:whatcourse]]).
Among the learning activities, we differentiate between *reading activities* that can be marked as read and *programming assignments* with support for automated assessment of submitted solutions.
Learning paths are composed as a recommended sequence of learning activities to build knowledge progressively, allowing students to monitor their own progress at any point in time.
Courses can either be created from scratch or by copying an existing course and making additions, deletions and rearrangements to its learning path.

#+CAPTION: Main course page (administrator view) showing some series with deadlines, reading activities and programming assignments in its learning path.
#+CAPTION: At any point in time, students can see their own progress through the learning path of the course.
#+CAPTION: Teachers have some additional icons in the navigation bar (top) that lead to an overview of all students and their progress, an overview of all submissions for programming assignments, general learning analytics about the course, course management and a dashboard with questions from students in various stages of being answered (Figure\nbsp{}[[fig:whatquestions]]).
#+CAPTION: The red dot on the latter icon indicates that some student questions are still pending.
#+NAME: fig:whatcourse
[[./images/whatcourse.png]]

Students can *self-register* for courses in order to avoid unnecessary user management.
A course can either be announced in the public overview of Dodona for everyone to see, or be limited in visibility to students from a certain educational institution.
Alternatively, students can be invited to a hidden course by sharing a secret link.
Independent of course visibility, registration for a course can either be open to everyone, restricted to users from the institution the course is associated with, or disabled altogether.
Registrations are either approved automatically or require explicit approval by a teacher.
Registered users can be tagged with one or more labels to create subgroups that may play a role in learning analytics and reporting.

Students and teachers more or less see the same course page, except for some management features and learning analytics that are reserved for teachers.
Teachers can make content in the learning path temporarily inaccessible and/or invisible to students.
Content is typically made inaccessible when it is still in preparation or if it will be used for evaluating students during a specific period.
A token link can be used to grant access to invisible content, e.g.\nbsp{}when a subgroup of students takes a test or exam.

Students can only mark reading activities as read once, but there is no restriction on the number of solutions they can submit for programming assignments.
Submitted solutions are automatically assessed and students receive immediate feedback as soon as the assessment has completed, usually within a few seconds.
Dodona stores all submissions, along with submission metadata and generated feedback, such that the submission and feedback history can be retrieved at any time.
On top of automated assessment, student submissions may be further assessed and graded manually by a teacher.

Series can have a *deadline*.
Passed deadlines do not prevent students from marking reading activities as read or from submitting solutions for programming assignments in their series.
However, learning analytics, reports and exports usually only take into account submissions made before the deadline.
Because of the importance of deadlines and to avoid discussions with students about missed deadlines, series deadlines are not only announced on the course page.
The student's home page highlights upcoming deadlines for individual courses and across all courses.
While working on a programming assignment, students also see a clear warning from ten minutes before the deadline onwards.
Courses also provide an iCalendar link that students can use to publish course deadlines in their personal calendar application.

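
Such a feed boils down to a small, standardized text document per deadline.
The sketch below generates a minimal iCalendar event for one series deadline; the helper, identifiers and sample values are hypothetical and only illustrate the kind of feed a course could expose.

#+BEGIN_SRC python
from datetime import datetime, timezone

def deadline_to_ics(course: str, series: str, deadline: datetime) -> str:
    """Render a single series deadline as a minimal iCalendar document.

    Hypothetical helper for illustration only; it does not mirror Dodona's own feed.
    """
    stamp = deadline.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return "\r\n".join([
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//example//course-deadlines//EN",
        "BEGIN:VEVENT",
        f"UID:{course}-{series}@example.org",
        f"DTSTAMP:{stamp}",
        f"DTSTART:{stamp}",
        f"SUMMARY:Deadline {course} - {series}",
        "END:VEVENT",
        "END:VCALENDAR",
    ])

print(deadline_to_ics("Programming", "Series 5",
                      datetime(2021, 10, 12, 20, 0, tzinfo=timezone.utc)))
#+END_SRC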

Because Dodona logs all student submissions and their metadata, including feedback and grades from automated and manual assessment, we use that data to integrate reports and learning analytics in the course page [cite:@fergusonLearningAnalyticsDrivers2012].
We also provide export wizards that enable the extraction of raw and aggregated data in CSV format for downstream processing and educational data mining [cite:@romeroEducationalDataMining2010; @bakerStateEducationalData2009].
This allows teachers to better understand student behavior, progress and knowledge, and might give deeper insight into the underlying factors that contribute to student actions [cite:@ihantolaReviewRecentSystems2010].
This understanding, knowledge and insight can be used to make informed decisions about courses and their pedagogy, to increase student engagement, and to identify at-risk students [cite:@vanpetegemPassFailPrediction2022].

*** User management
:PROPERTIES:
:CREATED: [2023-10-24 Tue 09:44]
:CUSTOM_ID: subsec:whatuser
:END:

Instead of providing its own authentication and authorization, Dodona delegates authentication to external identity providers (e.g.\nbsp{}educational and research institutions) through SAML [cite:@farrellAssertionsProtocolOASIS2002], OAuth [cite:@leibaOAuthWebAuthorization2012; @hardtOAuthAuthorizationFramework2012] and OpenID Connect [cite:@sakimuraOpenidConnectCore2014].
This support for *decentralized authentication* allows users to benefit from single sign-on when using their institutional account across multiple platforms and allows teachers to trust their students’ identities when they take high-stakes tests and exams in Dodona.

Dodona automatically creates user accounts upon successful authentication and uses the association with external identity providers to assign an institution to users.
By default, newly created users are assigned a student role.
Teachers and instructors who wish to create content (courses, learning activities and judges) must first request teacher rights using a streamlined form.

*** Automated assessment
:PROPERTIES:
:CREATED: [2023-10-24 Tue 10:16]
:CUSTOM_ID: subsec:whatassessment
:END:

The range of approaches, techniques and tools for software testing that may underpin assessing the quality of software under test is incredibly diverse.
Static testing directly analyzes the syntax, structure and data flow of source code, whereas dynamic testing involves running the code with a given set of test cases [cite:@oberkampfVerificationValidationScientific2010; @grahamFoundationsSoftwareTesting2021].
Black-box testing uses test cases that examine functionality exposed to end-users without looking at the actual source code, whereas white-box testing hooks test cases onto the internal structure of the code to test specific paths within a single unit, between units during integration, or between subsystems [cite:@nidhraBlackBoxWhite2012].
Broadly speaking, there are three levels of white-box testing: unit testing, integration testing and system testing [cite:@wiegersCreatingSoftwareEngineering1996; @dooleySoftwareDevelopmentProfessional2011].
Source code submitted by students can therefore be verified and validated against a multitude of criteria: functional completeness and correctness, architectural design, usability, performance and scalability in terms of speed, concurrency and memory footprint, security, readability (programming style), maintainability (test quality) and reliability [cite:@staubitzPracticalProgrammingExercises2015].
This is also reflected by the fact that a diverse range of metrics for measuring software quality has emerged, such as cohesion/coupling [cite:@yourdonStructuredDesignFundamentals1979; @stevensStructuredDesign1999], cyclomatic complexity [cite:@mccabeComplexityMeasure1976] and test coverage [cite:@millerSystematicMistakeAnalysis1963].

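
To make the first of these levels concrete, the sketch below shows how functional correctness of a hypothetical student function could be checked with unit tests that compare generated against expected return values.
The function name and test values are illustrative and not taken from an actual assignment.

#+BEGIN_SRC python
import unittest

def isbn_check_digit(digits: list[int]) -> int:
    """Hypothetical reference solution: ISBN-10 check digit of a 9-digit prefix."""
    return sum((i + 1) * d for i, d in enumerate(digits)) % 11

class IsbnCheckDigitTest(unittest.TestCase):
    """Unit tests comparing generated return values against expected ones."""

    def test_regular_prefix(self):
        self.assertEqual(isbn_check_digit([9, 7, 8, 0, 1, 3, 4, 6, 8]), 9)

    def test_remainder_ten(self):
        # A remainder of 10 corresponds to the check symbol "X" in a printed ISBN.
        self.assertEqual(isbn_check_digit([1, 2, 3, 4, 5, 6, 7, 8, 9]), 10)

if __name__ == "__main__":
    unittest.main()
#+END_SRC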

To cope with such a diversity of software testing alternatives, Dodona is centered around a generic infrastructure for *programming assignments that support automated assessment*.
Assessment of a student submission for an assignment comprises three loosely coupled components: containers, judges and assignment-specific assessment configurations.
More information on this underlying mechanism can be found in Chapter\nbsp{}[[Technical description]].

Where automatic assessment and feedback generation are outsourced to the judge linked to an assignment, Dodona itself takes up the responsibility for rendering the feedback.
This frees judge developers from putting effort into feedback rendering and gives a coherent look-and-feel even for students who solve programming assignments assessed by different judges.
Because the way feedback is presented is very important [cite:@maniBetterFeedbackEducational2014], we took great care in designing how feedback is displayed to make its interpretation as easy as possible (Figure\nbsp{}[[fig:whatfeedback]]).
Differences between generated and expected output are automatically highlighted for each failed test [cite:@myersAnONDDifference1986], and users can switch between displaying the output lines side-by-side or interleaved to make differences easier to compare.
We even provide specific support for highlighting differences between tabular data such as CSV files, database tables and dataframes.
Users have the option to dynamically hide contexts whose test cases all succeeded, allowing them to immediately pinpoint reported mistakes in feedback that contains many successful test cases.
To ease debugging the source code of submissions for Python assignments, the Python Tutor [cite:@guoOnlinePythonTutor2013] can be launched directly from any context with a combination of the submitted source code and the test code from the context.
Students typically report this as one of the most useful features of Dodona.

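
As a rough illustration of how such differences can be computed, the sketch below uses Python's standard ~difflib~ module to mark changed lines of output.
This is a simplification that stands in for the Myers-style diff highlighting described above, and the sample outputs are made up.

#+BEGIN_SRC python
import difflib

expected = ["3.14", "2.72", "1.62"]
generated = ["3.14", "2.71", "1.62"]

# unified_diff yields lines prefixed with "-" (expected only), "+" (generated only)
# or " " (identical), which a renderer could turn into the highlighted
# side-by-side or interleaved views mentioned above.
for line in difflib.unified_diff(expected, generated,
                                 fromfile="expected", tofile="generated",
                                 lineterm=""):
    print(line)
#+END_SRC

The same idea extends to tabular data by comparing cell by cell instead of line by line.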

#+CAPTION: Dodona rendering of feedback generated for a submission of the Python programming assignment "Curling".
#+CAPTION: The feedback is split across three tabs: ~isinside~, ~isvalid~ and ~score~.
#+CAPTION: 48 tests under the ~score~ tab failed, as can be seen from the badge in the tab header.
#+CAPTION: The "Code" tab displays the source code of the submission with annotations added during automatic and/or manual assessment (Figure\nbsp{}[[fig:whatannotations]]).
#+CAPTION: The differences between the generated and expected return values were automatically highlighted and the judge used HTML snippets to add a graphical representation (SVG) of the problem for the failed test cases.
#+CAPTION: In addition to highlighting differences between the generated and expected return values of the first (failed) test case, the judge also added a text snippet that points the user to a type error.
#+NAME: fig:whatfeedback
[[./images/whatfeedback.png]]

*** Content management
:PROPERTIES:
:CREATED: [2023-10-24 Tue 10:47]
:CUSTOM_ID: subsec:whatcontent
:END:

Where courses are created and managed in Dodona itself, other content is managed in external git *repositories* (Figure\nbsp{}[[fig:whatrepositories]]).
In this distributed content management model, a repository either contains a single judge or a collection of learning activities: reading activities and/or programming assignments.
Setting up a *webhook* for the repository guarantees that any changes pushed to its default branch are automatically and immediately synchronized with Dodona.
This even works without the need to make repositories public, as they may contain information that should not be disclosed, such as programming assignments that are under construction, contain model solutions, or will be used during tests or exams.
Instead, a *Dodona service account* must be granted push/pull access to the repository.
Some settings of a learning activity can be modified through the web interface of Dodona, but any changes are always pushed back to the repository in which the learning activity is configured, so that the repository always remains the master copy.

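
The webhook mechanism itself is a plain HTTP callback: the git host POSTs a notification on every push, and the receiving endpoint updates its copy of the repository.
The minimal sketch below illustrates that idea with Flask and a local clone; the route, payload field and repository path are hypothetical and do not describe Dodona's actual endpoint.

#+BEGIN_SRC python
import subprocess
from flask import Flask, request

app = Flask(__name__)
REPO_PATH = "/srv/exercise-repo"  # hypothetical local clone of a content repository

@app.route("/webhook", methods=["POST"])
def handle_push():
    payload = request.get_json(silent=True) or {}
    # Only react to pushes on the default branch ("ref" as sent by common git hosts).
    if payload.get("ref") == "refs/heads/main":
        subprocess.run(["git", "-C", REPO_PATH, "pull", "--ff-only"], check=True)
        # A real platform would now re-read the changed learning activities.
    return "", 204

if __name__ == "__main__":
    app.run(port=8080)
#+END_SRC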

#+CAPTION: Distributed content management model that allows custom learning activities (reading activities and programming assignments with support for automated assessment) and judges (frameworks for automated assessment) to be seamlessly integrated into Dodona.
#+CAPTION: Content creators manage their content in external git repositories, keep ownership over their content, control who can co-create, and set up webhooks to automatically synchronize any changes with the content as published on Dodona.
#+NAME: fig:whatrepositories
[[./images/whatrepositories.png]]

Due to the distributed nature of content management, creators also keep ownership over their content and control who may co-create.
After all, access to a repository is completely independent of access to its learning activities that are published in Dodona.
The latter is part of the configuration of learning activities, with the option either to share learning activities so that all teachers can include them in their courses or to restrict their inclusion to courses that are explicitly granted access.
Dodona automatically stores metadata about all learning activities, such as content type, natural language, programming language and repository, to increase their findability in our large collection.
Learning activities may also be tagged with additional labels as part of their configuration.

Any repository containing learning activities must have a predefined directory structure.
Directories that contain a learning activity also have their own internal directory structure that includes a *description* in HTML or Markdown.
Descriptions may reference data files and multimedia content included in the repository, and such content can be shared across all learning activities in the repository.
Embedded images are automatically encapsulated in a responsive lightbox to improve readability.
Mathematical formulas in descriptions are supported through MathJax [cite:@cervoneMathJaxPlatformMathematics2012].

While reading activities only consist of descriptions, programming assignments need an additional *assessment configuration* that sets a programming language and a judge.
The configuration may also set a Docker image, a time limit, a memory limit and whether the container instantiated from the image is granted Internet access, but these settings have sensible default values.
Judges, for example, have a default image that is used if the configuration of a programming assignment does not specify one explicitly.
Dodona builds the available images from Dockerfiles specified in a separate git repository.
The configuration might also provide additional *assessment resources*: files made accessible to the judge during assessment.
The specification of how these resources must be structured and how they are used during assessment is completely up to the judge developers.
Finally, the configuration might also contain *boilerplate code*: a skeleton that is provided in the code editor along with the description and that students can use to start their implementation.

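
To give a feel for how compact such an assessment configuration typically is, the sketch below models one as a plain Python dictionary.
The field names are illustrative stand-ins for the settings described above and not the literal keys of Dodona's configuration format.

#+BEGIN_SRC python
# Illustrative assessment configuration for one programming assignment.
# Field names are hypothetical; they mirror the settings described above,
# not the exact schema used in a Dodona exercise repository.
assessment_config = {
    "programming_language": "python",
    "judge": "python-judge",            # which judge assesses submissions
    "docker_image": None,               # None: fall back to the judge's default image
    "time_limit": 30,                   # seconds per assessment run
    "memory_limit": 256 * 1024 * 1024,  # bytes available to the container
    "internet_access": False,           # containers are offline by default
    "resources": ["suite.yaml", "data/cities.csv"],  # files handed to the judge
    "boilerplate": "def solve(text):\n    # write your solution here\n    pass\n",
}

def effective_image(config, default_image="python-judge:latest"):
    """Pick the Docker image: the assignment's own choice or the judge's default."""
    return config["docker_image"] or default_image

print(effective_image(assessment_config))
#+END_SRC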

*** Internationalization and localization
:PROPERTIES:
:CREATED: [2023-10-24 Tue 10:55]
:CUSTOM_ID: subsec:whati18n
:END:
*Internationalization* (i18n) is a shared responsibility between Dodona, learning activities and judges.
All boilerplate text in the user interface that comes from Dodona itself is supported in English and Dutch, and users can select their preferred language.
Content creators can specify descriptions of learning activities in both languages, and Dodona will render a learning activity in the user’s preferred language if available.
When users submit solutions for a programming assignment, their preferred language is passed as submission metadata to the judge.
It’s then up to the judge to take this information into account while generating feedback.

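
A minimal sketch of what this could look like on the judge side is given below, assuming the metadata exposes the preferred language under a ~natural_language~ key; the key name and message catalogue are hypothetical.

#+BEGIN_SRC python
# Hypothetical illustration of a judge localizing one feedback message.
MESSAGES = {
    "en": "Expected {expected}, but got {actual}.",
    "nl": "Verwachtte {expected}, maar kreeg {actual}.",
}

def wrong_answer_message(metadata: dict, expected, actual) -> str:
    """Format a feedback message in the submitter's preferred language."""
    language = metadata.get("natural_language", "en")
    template = MESSAGES.get(language, MESSAGES["en"])  # fall back to English
    return template.format(expected=expected, actual=actual)

print(wrong_answer_message({"natural_language": "nl"}, 42, 41))
#+END_SRC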

Dodona always displays *localized deadlines* based on a time zone setting in the user profile, and users are warned when the current time zone detected by their browser differs from the one in their profile.

*** Questions, answers and code reviews
:PROPERTIES:
:CREATED: [2023-10-24 Tue 10:56]
:CUSTOM_ID: subsec:whatqa
:END:

A downside of using discussion forums in programming courses is that students can ask questions about programming assignments that are either disconnected from their current implementation or contain code snippets that may give away (part of) the solution to other students [cite:@nandiEvaluatingQualityInteraction2012].
Dodona therefore allows students to address teachers with questions they directly attach to their submitted source code.
We support both general questions and questions linked to specific lines of their submission (Figure\nbsp{}[[fig:whatquestion]]).
Questions are written in Markdown (e.g., to include markup, tables, syntax-highlighted code snippets or multimedia), with support for MathJax (e.g., to include mathematical formulas).

#+CAPTION: A student (Matilda) previously asked a question that has already been answered by her teacher (Miss Honey).
#+CAPTION: Based on this response, the student is now asking a follow-up question that can be formatted using Markdown.
#+NAME: fig:whatquestion
[[./images/whatquestion.png]]

Teachers are notified whenever there are pending questions (Figure\nbsp{}[[fig:whatcourse]]).
They can process these questions from a dedicated dashboard with live updates (Figure\nbsp{}[[fig:whatquestions]]).
The dashboard immediately guides them from an incoming question to the location in the source code of the submission it relates to, where they can answer the question in much the same way as students ask questions.
To avoid questions being inadvertently handled simultaneously by multiple teachers, questions have a three-state lifecycle: pending, in progress and answered.
In addition to teachers changing question states while answering them, students can also mark their own questions as being answered.
The latter might reflect the rubber duck debugging [cite:@huntPragmaticProgrammer1999] effect that is triggered when students are forced to explain a problem to someone else while asking questions in Dodona.
Teachers can (temporarily) disable the option for students to ask questions in a course, e.g.\nbsp{}when a course is over or during hands-on sessions or exams when students are expected to ask questions face-to-face rather than online.

#+CAPTION: Live-updated dashboard showing all incoming questions in a course while asking questions is enabled.
#+CAPTION: Questions are grouped into three categories: unanswered, in progress and answered.
#+NAME: fig:whatquestions
[[./images/whatquestions.png]]

Manual source code annotations from students (questions) and teachers (answers) are rendered in the same way as source code annotations resulting from automated assessment.
They are mixed into the source code displayed in the "Code" tab, showing their complementary nature.
It is not required that students take the initiative for the conversation.
Teachers can also start adding source code annotations while reviewing a submission.
Such *code reviews* will be used as a building block for manual assessment.

*** Manual assessment
:PROPERTIES:
:CREATED: [2023-10-24 Tue 11:01]
:CUSTOM_ID: subsec:whateval
:END:

Teachers can create an *evaluation* for a series to manually assess student submissions for its programming assignments after a specific period, typically following the deadline of some homework, an intermediate test or a final exam.
The evaluation comprises all programming assignments in the series and a group of students who submitted solutions for these assignments.
Because a student may have submitted multiple solutions for the same assignment, the last submission before a given deadline is automatically selected for each student and each assignment in the evaluation.
This automatic selection can be manually overruled afterwards.
The evaluation deadline defaults to the deadline set for the associated series, if any, but an alternative deadline can be selected as well.

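
The selection rule itself is simple: for every student and assignment, take the submission with the latest timestamp that does not exceed the evaluation deadline.
The sketch below spells this out on a plain list of submission records; the record fields and sample data are hypothetical and only serve to illustrate the rule.

#+BEGIN_SRC python
from datetime import datetime

# Hypothetical submission records: (student, assignment, submitted_at)
submissions = [
    ("matilda", "curling", datetime(2021, 10, 12, 21, 15)),
    ("matilda", "curling", datetime(2021, 10, 12, 21, 58)),
    ("matilda", "curling", datetime(2021, 10, 12, 22, 30)),  # after the deadline
    ("bruce", "curling", datetime(2021, 10, 11, 14, 2)),
]
deadline = datetime(2021, 10, 12, 22, 0)

def select_for_evaluation(submissions, deadline):
    """Keep, per (student, assignment), the last submission at or before the deadline."""
    selected = {}
    for student, assignment, submitted_at in submissions:
        if submitted_at > deadline:
            continue
        key = (student, assignment)
        if key not in selected or submitted_at > selected[key]:
            selected[key] = submitted_at
    return selected

print(select_for_evaluation(submissions, deadline))
#+END_SRC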

Evaluations support *two-way navigation* through all selected submissions: per assignment and per student.
For evaluations with multiple assignments, it is generally recommended to assess per assignment and not per student, as students can build a reputation throughout an assessment [cite:@malouffBiasGradingMetaanalysis2016].
As a result, they might be rated more favorably for a moderate solution if they had excellent solutions for assignments that were assessed previously, and vice versa [cite:@malouffRiskHaloBias2013].
Assessment per assignment breaks this reputation effect, as assessors are influenced less by the quality of previously assessed assignments from the same student.
Possible bias from sequence effects is further reduced during assessment per assignment, as students are visited in random order for each assignment in the evaluation.
In addition, *anonymous mode* can be activated as a measure to eliminate the actual or perceived halo effect conveyed through seeing a student’s name during assessment [cite:@lebudaTellMeYour2013].
While anonymous mode is active, all students are automatically pseudonymized.
Anonymous mode is not restricted to the context of assessment and can be used across Dodona, for example while giving in-class demos.

When reviewing a selected submission from a student, assessors have direct access to the feedback that was previously generated during automated assessment: source code annotations in the "Code" tab and other structured and unstructured feedback in the remaining tabs.
Moreover, next to the feedback that was made available to the student, the assignment may also be configured so that the judge generates feedback that is only visible to the assessor.
Assessors might then complement the assessment made by the judge by adding *source code annotations* as formative feedback and by *grading* the evaluative criteria in a scoring rubric as summative feedback (Figure\nbsp{}[[fig:whatannotations]]).
Previous annotations can be reused to speed up the code review process, because remarks or suggestions tend to recur frequently when reviewing submissions for the same assignment.
Grading requires setting up a specific *scoring rubric* for each assignment in the evaluation, as guidance for evaluating the quality of submissions [cite:@dawsonAssessmentRubricsClearer2017; @pophamWhatWrongWhat1997].
The evaluation tracks which submissions have been manually assessed, so that analytics about the assessment progress can be displayed and multiple assessors can work simultaneously on the same evaluation, for example each handling (part of) a programming assignment.

#+CAPTION: Manual assessment of a submission: a teacher (Miss Honey) is giving feedback on the source code by adding inline annotations and is grading the submission by filling in the scoring rubric that was set up for the programming assignment "The Feynman ciphers".
#+NAME: fig:whatannotations
[[./images/whatannotations.png]]

** Related projects
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:48]
:CUSTOM_ID: sec:whatrelated
:END:

*** Dolos
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:48]
:CUSTOM_ID: subsec:whatdolos
:END:

*** TESTed
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:48]
:CUSTOM_ID: subsec:whattested
:END:

* Use
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:48]
:CUSTOM_ID: chap:use
:END:

** University level
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:48]
:CUSTOM_ID: sec:useuni
:END:

*** FWE
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:48]
:CUSTOM_ID: subsec:usefwe
:END:

Since the academic year 2011-2012, we have organized an introductory Python course at Ghent University (Belgium) with a strong focus on active and online learning.
Initially, the course was offered twice a year in the first and second term, but from the academic year 2014-2015 onwards it has only been offered in the first term.
The course is taken by a mix of undergraduate, graduate, and postgraduate students enrolled in various study programmes (mainly formal and natural sciences, but not computer science), with 442 students enrolled for the 2021-2022 edition.

**** Course structure
:PROPERTIES:
:CREATED: [2023-10-24 Tue 11:47]
:CUSTOM_ID: subsubsec:usecourse
:END:

Each course edition has a fixed structure, with 13 weeks of educational activities subdivided into two successive instructional units that each cover five topics of the Python programming language -- one topic per week -- followed by a graded test about all topics covered in the unit (Figure\nbsp{}[[fig:usefwecoursestructure]]).
The final exam at the end of the term evaluates all topics covered in the entire course.
Students who fail the first exam in January can take a resit exam in August/September, which gives them a second chance to pass the course.

#+CAPTION: *Top*: Structure of the Python course that runs each academic year across a 13-week term (September-December).
#+CAPTION: Programming assignments from the same Dodona series are stacked vertically.
#+CAPTION: Students submit solutions for ten series of six mandatory assignments, two tests with two assignments and an exam with three assignments.
#+CAPTION: There is also a resit exam with three assignments in August/September for students who failed the first exam in January.
#+CAPTION: *Bottom*: Heatmap from the Dodona learning analytics page showing the distribution per day of all 331\thinsp{}734 solutions submitted during the 2021-2022 edition of the course (442 students).
#+CAPTION: The darker the color, the more solutions were submitted that day.
#+CAPTION: A light gray square means no solutions were submitted that day.
#+CAPTION: Darker squares reveal the weekly lab sessions for different groups on Monday afternoon, Friday morning and Friday afternoon.
#+CAPTION: Weekly deadlines for mandatory assignments on Tuesdays at 22:00.
#+CAPTION: Three exam sessions for different groups in January.
#+CAPTION: Low activity in exam periods, except for days on which an exam was taken.
#+CAPTION: The course is not taught in the second term, so this low-activity period was collapsed.
#+CAPTION: Two more exam sessions for different groups in August/September, granting an extra chance to students who failed their exam in January.
#+NAME: fig:usefwecoursestructure
[[./images/usefwecoursestructure.png]]

Each week in which a new programming topic is covered, students must try to solve six programming assignments on that topic before a deadline one week later.
That results in 60 mandatory assignments across the semester.
Following the flipped classroom strategy [cite:@bishopFlippedClassroomSurvey2013; @akcayirFlippedClassroomReview2018], students prepare to achieve this goal by reading the textbook chapters covering the topic.
Lectures are interactive programming sessions that aim to bridge the initial gap between theory and practice, advance concepts, and engage students in collaborative learning [cite:@tuckerFlippedClassroom2012].
Along the same lines, the first assignment for each topic is an ISBN-themed programming challenge whose model solution is shared with the students, together with an instructional video that works step-by-step towards the model solution.
As soon as students feel they have a sufficient understanding of the topic, they can start working on the five remaining mandatory assignments.
Students can work on their programming assignments during weekly computer labs, where they can collaborate in small groups and ask teaching assistants for help.
They can also work on their assignments and submit solutions outside lab sessions.
In addition to the mandatory assignments, students can further hone their programming skills by tackling additional programming exercises they select from a pool of over 850 exercises linked to the ten programming topics.
Submissions for these additional exercises are not taken into account in the final grade.

**** Assessment, feedback and grading
:PROPERTIES:
:CREATED: [2023-10-24 Tue 11:47]
:CUSTOM_ID: subsubsec:useassessment
:END:

We use the online learning environment Dodona to promote active learning through problem solving [cite:@princeDoesActiveLearning2004].
Each course edition has its own dedicated course in Dodona, with a learning path containing all mandatory, test and exam assignments, grouped into series with corresponding deadlines.
Mandatory assignments for the first unit are published at the start of the semester, and those for the second unit after the test of the first unit.
For each test and exam, we organize multiple sessions for different groups of students.
Assignments for test and exam sessions are provided in a hidden series that is only accessible, through a shared token link, to students participating in the session.
The test and exam assignments are published afterwards for all students, when grades are announced.
Students can see class progress when working on their mandatory assignments, which nudges them to avoid procrastination.
Only teachers can see class progress for test and exam series, so as not to accidentally stress out students.
For the same reason, we intentionally organize tests and exams following exactly the same procedure, so that students can take high-stakes exams in a familiar context and adjust their approach based on previous experiences.
The only difference is that test assignments are not as hard as exam assignments, as students are still in the midst of learning programming skills when tests are taken.

Students are encouraged to use an integrated development environment (IDE) to work on their programming assignments.
IDEs bundle a battery of programming tools to support today’s generation of software developers in writing, building, running, testing and debugging software.
Working with such tools can be a true blessing for both seasoned and novice programmers, but there is no silver bullet [cite:@brooksNoSilverBullet1987].
Learning to code remains inherently hard [cite:@kelleherAlice2ProgrammingSyntax2002] and poses challenges that differ from those of reading and learning natural languages [cite:@fincherWhatAreWe1999].
As an additional aid, students can continuously submit (intermediate) solutions for their programming assignments and immediately receive automatically generated feedback upon each submission, even during tests and exams.
Guided by that feedback, they can track down potential errors in their code, remedy them and submit updated solutions.
There is no restriction on the number of solutions that can be submitted per assignment.
All submitted solutions are stored, but for each assignment only the last submission before the deadline is taken into account to grade students.
This allows students to update their solutions after the deadline (i.e.\nbsp{}after model solutions are published) without impacting their grades, as a way to further practice their programming skills.
One effect of active learning, triggered by mandatory assignments with weekly deadlines and intermediate tests, is that most learning happens during the term (Figure\nbsp{}[[fig:usefwecoursestructure]]).
In contrast to other courses, students do not spend a lot of time practicing their coding skills for this course in the days before an exam.
We explicitly want to encourage this behavior, because we strongly believe that one cannot learn to code in a few days' time [cite:@peternorvigTeachYourselfProgramming2001].

For the assessment of tests and exams, we follow the line of thought that human expert feedback through source code annotations is a valuable complement to feedback coming from automated assessment, and that human interpretation is an absolute necessity when it comes to grading [cite:@staubitzPracticalProgrammingExercises2015; @jacksonGradingStudentPrograms1997; @ala-mutkaSurveyAutomatedAssessment2005].
We shifted from paper-based to digital code reviews and grading when support for manual assessment was released in version 3.7 of Dodona (summer 2020).
Although online reviewing positively impacted our productivity, the biggest gain did not come from an immediate speed-up in the process of generating feedback and grades compared to the paper-based approach.
While time-on-task remained about the same, our online source code reviews were much more elaborate than what we produced before on printed copies of student submissions.
This was triggered by the improved reusability of digital annotations and the prospect of streamlined feedback delivery.
Where delivering custom feedback only requires a single click once the assessment of an evaluation has been completed in Dodona, it previously took us much more effort to distribute our paper-based feedback.
Students were direct beneficiaries of more and richer feedback, as observed from the fact that 75% of our students looked at their personalized feedback within 24 hours after it had been released, before we even published grades in Dodona.
What did not change is that we complement personalized feedback with collective feedback sessions in which we discuss model solutions for test and exam assignments, and that we receive few questions from students about their personalized feedback.
As a future development, we hope to reduce the time spent on manual assessment through improved computer-assisted reuse of digital source code annotations in Dodona.

We accept that we primarily rely on automated assessment as a first step in providing formative feedback while students work on their mandatory assignments.
After all, a back-of-the-envelope calculation tells us it would take us 72 full-time equivalents (FTE) to generate equivalent amounts of manual feedback for mandatory assignments compared to what we do for tests and exams.
In addition to volume, automated assessment also yields the responsiveness needed to establish an interactive feedback loop throughout the iterative software development process, while it still matters to students and in time for them to pay attention to further learning or receive further assistance [cite:@gibbsConditionsWhichAssessment2005].
Automated assessment thus allows us to motivate students to work through enough programming assignments and to stimulate their self-monitoring and self-regulated learning [cite:@schunkSelfregulationLearningPerformance1994; @pintrichUnderstandingSelfregulatedLearning1995].
It also triggers additional questions from students, which we manage to respond to with one-to-one personalized human tutoring, either synchronously during hands-on sessions or asynchronously through Dodona's Q&A module.
We observe that individual students seem to have a strong bias towards either asking for face-to-face help during hands-on sessions or asking questions online.
This could be influenced by the time when they mainly work on their assignments, by the way they collaborate on assignments, or by reservations because of perceived threats to self-esteem or social embarrassment [cite:@newmanStudentsPerceptionsTeacher1993; @karabenickRelationshipAcademicHelp1991].

In computing a final score for the course, we try to find an appropriate balance between stimulating students to find solutions for programming assignments themselves and encouraging them to collaborate with and learn from peers, instructors and teachers while working on those assignments.
The final score is computed as the sum of a score obtained for the exam (80%) and a score for each unit that combines the student’s performance on the mandatory and test assignments (10% per unit).
We use Dodona's grading module to determine scores for tests and exams based on correctness, programming style, the choice between different programming techniques and the overall quality of the implementation.
The score for a unit is calculated as the score $s$ for the two test assignments multiplied by the fraction $f$ of mandatory assignments the student has solved correctly.
A solution for a mandatory assignment is considered correct if it passes all unit tests.
Evaluating mandatory assignments therefore doesn’t require any human intervention, except for writing unit tests when designing the assignments, and is performed entirely by our Python judge.
In our experience, most students traditionally perform much better on mandatory assignments than on test and exam assignments [cite:@glassFewerStudentsAre2022], given the possibilities for collaboration on mandatory assignments.

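
Written out, the final score is $0.8 \cdot e + 0.1 \cdot f_1 s_1 + 0.1 \cdot f_2 s_2$, with $e$ the exam score and $f_i$, $s_i$ the mandatory-assignment fraction and test score of unit $i$, assuming all components are expressed on a scale from 0 to 1.
The sketch below spells this out; the weights follow the description above and the sample numbers are made up.

#+BEGIN_SRC python
def unit_score(test_score: float, fraction_solved: float) -> float:
    """Score for one unit: test score s weighted by the fraction f of
    correctly solved mandatory assignments (both between 0 and 1)."""
    return test_score * fraction_solved

def final_score(exam: float, units: list[tuple[float, float]]) -> float:
    """Final course score: 80% exam plus 10% per unit, on a 0-1 scale."""
    return 0.8 * exam + sum(0.1 * unit_score(s, f) for s, f in units)

# Made-up example: strong on mandatory assignments, weaker on tests and the exam.
print(final_score(exam=0.55, units=[(0.60, 1.0), (0.70, 0.9)]))  # approximately 0.563
#+END_SRC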

**** Open and collaborative learning environment
:PROPERTIES:
:CREATED: [2023-10-24 Tue 11:59]
:CUSTOM_ID: subsubsec:useopen
:END:

We strongly believe that effective collaboration among small groups of students is beneficial for learning [cite:@princeDoesActiveLearning2004], and encourage students to collaborate and to ask tutors and other students questions during and outside lab sessions.
We also demonstrate how they can embrace collaborative coding and pair programming services provided by modern integrated development environments [cite:@williamsSupportPairProgramming2002; @hanksPairProgrammingEducation2011].
But we recommend that they collaborate in groups of no more than three students, and that they exchange and discuss ideas and strategies for solving assignments rather than sharing literal code with each other.
After all, our main reason for working with mandatory assignments is to give students sufficient opportunity to learn topic-oriented programming skills by applying them in practice, and shared solutions spoil the learning experience.
The factor $f$ in the score for a unit encourages students to keep fine-tuning their solutions for programming assignments until all test cases succeed before the deadline passes.
But maximizing that factor without properly learning programming skills will likely yield a low test score $s$ and thus an overall low score for the unit, even if many mandatory exercises were solved correctly.

Fostering an open collaboration environment for mandatory assignments with strict deadlines, while taking those assignments into account for the final score, potentially promotes plagiarism.
However, using the fraction of correctly solved mandatory assignments as a weight factor for the test score, rather than as an independent score item, should promote learning by ensuring that plagiarism itself is not rewarded.
It takes some effort to properly explain this to students.
We initially used Moss [cite:@schleimerWinnowingLocalAlgorithms2003] and now use Dolos [cite:@maertensDolosLanguageagnosticPlagiarism2022] to monitor submitted solutions for mandatory assignments, both before and at the deadline.
The solution space for the first few mandatory assignments is too small for linking high similarity to plagiarism: submitted solutions only contain a few lines of code and the diversity of implementation strategies is small.
But at some point, as the solution space broadens, we start to see highly similar solutions that are reliable signals of code exchange among larger groups of students.
Strikingly, this usually happens among students enrolled in the same study programme (Figure\nbsp{}[[fig:usefweplagiarism]]).
As soon as this happens -- typically in week 3 or 4 of the course -- plagiarism is discussed during the next lecture.
Usually this is a lecture about working with the string data type, so we can introduce plagiarism detection as a possible application of string processing.

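
As a rough illustration of the underlying idea, the sketch below flags pairs of submissions whose similarity exceeds a threshold of 0.8, the same threshold used in the plagiarism graphs below.
It uses Python's ~difflib~ ratio as a crude stand-in for the fingerprint-based similarity that Moss and Dolos actually compute, and the submissions are made up.

#+BEGIN_SRC python
from difflib import SequenceMatcher
from itertools import combinations

# Made-up submissions; real detection runs on hundreds of files per assignment.
submissions = {
    "student_a": "def area(r):\n    return 3.14159 * r * r\n",
    "student_b": "def area(radius):\n    return 3.14159 * radius * radius\n",
    "student_c": "import math\n\ndef area(r):\n    return math.pi * r ** 2\n",
}

THRESHOLD = 0.8  # pairs above this similarity become edges in the plagiarism graph

def similarity(code_1: str, code_2: str) -> float:
    """Crude textual similarity; Dolos uses language-aware fingerprinting instead."""
    return SequenceMatcher(None, code_1, code_2).ratio()

for (name_1, code_1), (name_2, code_2) in combinations(submissions.items(), 2):
    score = similarity(code_1, code_2)
    if score >= THRESHOLD:
        print(f"{name_1} -- {name_2}: similarity {score:.2f}")
#+END_SRC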

#+CAPTION: Dolos plagiarism graphs for the Python programming assignment "\pi{}-ramidal constants" that was created and used for a test of the 2020-2021 edition of the course (left) and reused as a mandatory assignment in the 2021-2022 edition (right).
#+CAPTION: Graphs constructed from the last submission before the deadline of 142 and 382 students, respectively.
#+CAPTION: The color of each node represents the student's study programme.
#+CAPTION: Edges connect highly similar pairs of submissions, with the similarity threshold set to 0.8 in both graphs.
#+CAPTION: Edge directions are based on submission timestamps in Dodona.
#+CAPTION: Clusters of connected nodes are highlighted with a distinct background color and have one node with a solid border that indicates the first correct submission among all submissions in that cluster.
#+CAPTION: All students submitted unique solutions during the test, except for two students who confessed they exchanged a solution during the test.
#+CAPTION: Submissions for the mandatory assignment show that most students work either individually or in groups of two or three students, but we also observe some clusters of four or more students who exchanged solutions and submitted them with hardly any modifications.
#+NAME: fig:usefweplagiarism
[[./images/usefweplagiarism.png]]

In an announcement entitled "copy-paste \neq{} learn to code", we show students some pseudonymized Dolos plagiarism graphs that act as mirrors to make them reflect upon which node in the graph they could be (Figure\nbsp{}[[fig:usefweplagiarism]]).
We stress that the learning effect dramatically drops in groups of four or more students.
Typically, we notice that in such a group only one or a few students make the effort to learn to code, while the other students usually piggyback by copy-pasting solutions.
We make students aware that understanding someone else's code for programming assignments is a lot easier than coming up with a solution themselves.
Over the years, we have experienced that a lot of students fall into the trap of genuinely believing that being able to understand code is the same as being able to write code that solves a problem, until they take a test at the end of a unit.
That's where the test score $s$ comes into play.
After all, the goal of summative tests is to evaluate whether individual students have acquired the skills to solve programming challenges on their own.

When talking to students about plagiarism, we also point out that the plagiarism graphs are directed graphs, indicating which student is the potential source of exchanging a solution among a cluster of students.
We specifically address these students by pointing out that they are probably good at programming and might want to exchange their solutions with other students as a way to help their peers.
But instead of really helping them out, they actually take away learning opportunities from their fellow students by giving away the solution as a spoiler.
Stated differently, they help maximize the factor $f$ but effectively also reduce the test score $s$, while both factors need to be high to yield a high score for the unit.
After this lecture, we usually notice a stark decline in the number of plagiarized solutions.

The goal of plagiarism detection at this stage is prevention rather than penalisation, because we want students to take responsibility for their learning.
The combination of realizing that teachers and instructors can easily detect plagiarism and an upcoming test that evaluates whether students can solve programming challenges on their own usually has an immediate and persistent effect: cluster sizes in the plagiarism graphs drop to at most three students.
At the same time, this signals that plagiarism detection is one of the tools we have to detect fraud during tests and exams.
The entire group of students is only addressed once about plagiarism, without going into detail about how plagiarism detection itself works, because we believe that overemphasizing this topic is not very effective.
Moreover, explaining how it works might drive students towards spending time thinking about how they could bypass the detection process, which is time they'd better spend on learning to code.
Every three or four years, we see a persistent cluster of students exchanging code for mandatory assignments over multiple weeks.
If this is the case, we individually address these students to remind them of their responsibilities, again differentiating between students who share their solution and students who receive solutions from others.

Tests and exams, on the other hand, are taken on-campus under human surveillance and allow no communication with fellow students or other persons.
Students can work on their personal computers and get exactly two hours to solve two programming assignments during a test, and three hours and thirty minutes to solve three programming assignments during an exam.
Tests and exams are "open book/open Internet", so any hard-copy and digital resources can be consulted while solving test or exam assignments.
Students are instructed that they can only be passive users of the Internet: all information available on the Internet at the start of a test or exam can be consulted, but no new information can be added.
When taking over code fragments from the Internet, students have to add a proper citation as a comment in their submitted source code.
After each test and exam, we again use Moss/Dolos to detect and inspect highly similar code snippets among submitted solutions and to find convincing evidence that they result from the exchange of code or other forms of interpersonal communication (Figure\nbsp{}[[fig:usefweplagiarism]]).
If we catalog cases as plagiarism beyond reasonable doubt, the examination board is informed to take further action [cite:@maertensDolosLanguageagnosticPlagiarism2022].

**** Workload for running a course edition
|
||
:PROPERTIES:
|
||
:CREATED: [2023-10-24 Tue 13:46]
|
||
:CUSTOM_ID: subsubsec:useworkload
|
||
:END:
|
||
|
||
To organize "open book/open Internet" tests and exams that are valid and reliable, we always create new assignments and avoid assignments whose solutions or parts thereof are readily available online.
|
||
At the start of a test or exam, we share a token link that gives students access to the assignments in a hidden series on Dodona.
|
||
|
||
For each edition of the course, mandatory assignments were initially a combination of selected test and exam exercises reused from the previous edition of the course and newly designed exercises.
|
||
The former to give students an idea about the level of exercises they can expect during tests and exams, and the latter to avoid solution slippage.
|
||
As feedback for the students we publish sample solutions for all mandatory exercises after the weekly deadline has passed.
|
||
This also indicates that students must strictly adhere to deadlines, because sample solutions are available afterwards.
|
||
As deadlines are very clear and adjusted to timezone settings in Dodona, we never experience discussions with students about deadlines.
|
||
|
||
After nine editions of the course, we felt we had a large enough portfolio of exercises to start reusing mandatory exercises from four or more years ago instead of designing new exercises for each edition.
|
||
However, we still continue to design new exercises for each test and exam.
|
||
After each test and exam, exercises are published and students receive manual reviews on the code they submitted, on top of the automated feedback they already got during the test or exam.
|
||
But in contrast to mandatory exercises we do not publish sample solutions for test and exam exercises, so that these exercises can be reused during the next edition of the course.
|
||
When students ask for sample solutions of test or exam exercises, we explain that we want to give the next generation of students the same learning opportunities they had.
|
||
|
||
So far, we have created more than 850 programming assignments for this introductory Python course alone.
|
||
All these assignments are publicly shared on Dodona as open educational resources [cite:@hylenOpenEducationalResources2021; @tuomiOpenEducationalResources2013; @wileyOpenEducationalResources2014; @downesModelsSustainableOpen2007; @caswellOpenEducationalResources2008].
|
||
They are used in many other courses on Dodona (on average 10.8 courses per assignment) and by many students (on average 503.7 students and 4801.5 submitted solutions per assignment).
|
||
We estimate that it takes about 10 person-hours on average to create a new assignment for a test or an exam: 2 hours for ideation, 30 minutes for implementing and tweaking a sample solution that meets the educational goals of the assignment and can be used to generate a test suite for automated assessment, 4 hours for describing the assignment (including background research), 30 minutes for translating the description from Dutch into English, one hour to configure support for automated assessment, and another 2 hours for reviewing the result by some extra pair of eyes.
|
||
|
||
Generating a test suite usually takes 30 to 60 minutes for assignments that can rely on basic test and feedback generation features that are built into the judge.
|
||
The configuration for automated assessment might take 2 to 3 hours for assignments that require more elaborate test generation or that need to extend the judge with custom components for dedicated forms of assessment (e.g.\nbsp{}assessing non-deterministic behavior) or feedback generation (e.g.\nbsp{}generating visual feedback).
|
||
[cite/t:@keuningSystematicLiteratureReview2018] found that publications rarely describe how difficult and time-consuming it is to add assignments to automated assessment platforms, or even if this is possible at all.
|
||
The ease of extending Dodona with new programming assignments is reflected by more than 10 thousand assignments that have been added to the platform so far.
|
||
Our experience is that configuring support for automated assessment takes only a fraction of the total time spent designing and implementing assignments for our programming course, and in absolute numbers remains far below the one person-week reported for adding assignments to Bridge [cite:@bonarBridgeIntelligentTutoring1988].
Because the automated assessment infrastructure of Dodona provides common resources and functionality through a Docker container and a judge, the assignment-specific configuration usually remains lightweight.
|
||
Only around 5% of the assignments need extensions on top of the built-in test and feedback generation features of the judge.
|
||
|
||
So how much effort does it cost us to run one edition of our programming course?
|
||
For the most recent 2021-2022 edition we estimate about 34 person-weeks in total (Table\nbsp{}[[tab:usefweworkload]]), the bulk of which is spent on on-campus tutoring of students during hands-on sessions (30%), manual assessment and grading (22%), and creating new assignments (21%).
|
||
About half of the workload (53%) is devoted to summative feedback through tests and exams: creating assignments, supervision, manual assessment and grading.
|
||
Most of the other work (42%) goes into providing formative feedback through on-campus and online assistance while students work on their mandatory assignments.
|
||
Out of 2215 questions that students asked through Dodona’s online Q&A module, 1983 (90%) were answered by teaching assistants and 232 (10%) were marked as answered by the student who originally asked the question.
|
||
Because automated assessment provides first-line support, the need for human tutoring is already heavily reduced.
|
||
We have drastically cut the time we initially spent on mandatory assignments: we now reuse existing assignments and the Python judge has become stable enough to require hardly any maintenance or further development.

#+CAPTION: Estimated workload to run the 2021-2022 edition of the introductory Python programming course for 442 students with 1 lecturer, 7 teaching assistants and 3 undergraduate students who serve as teaching assistants [cite:@gordonUndergraduateTeachingAssistants2013].
#+NAME: tab:usefweworkload
| Task                                | Estimated workload (hours) |
|-------------------------------------+----------------------------|
| Lectures                            |                         60 |
|-------------------------------------+----------------------------|
| Mandatory assignments               |                        540 |
| \emsp{} Select assignments          |                         10 |
| \emsp{} Review selected assignments |                         30 |
| \emsp{} Tips & tricks               |                         10 |
| \emsp{} Automated assessment        |                          0 |
| \emsp{} Hands-on sessions           |                        390 |
| \emsp{} Answering online questions  |                        100 |
|-------------------------------------+----------------------------|
| Tests & exams                       |                        690 |
| \emsp{} Create new assignments      |                        270 |
| \emsp{} Supervise tests and exams   |                        130 |
| \emsp{} Automated assessment        |                          0 |
| \emsp{} Manual assessment           |                        288 |
| \emsp{} Plagiarism detection        |                          2 |
|-------------------------------------+----------------------------|
| Total                               |              1\thinsp{}290 |

**** Learning analytics and educational data mining
:PROPERTIES:
:CREATED: [2023-10-24 Tue 14:04]
:CUSTOM_ID: subsubsec:uselearninganalytics
:END:

A longitudinal analysis of student submissions across the term shows that most learning happens during the 13 weeks of educational activities and that students don't have to catch up practicing their programming skills during the exam period (Figure\nbsp{}[[fig:usefwecoursestructure]]).
Active learning thus effectively avoids procrastination.
We observe that students submit solutions every day of the week and show increased activity around hands-on sessions and in the run-up to the weekly deadlines (Figure\nbsp{}[[fig:usefwepunchcard]]).
Weekends are also used to work further on programming assignments, but students do seem to value a good night's sleep.

#+CAPTION: Punchcard from the Dodona learning analytics page showing the distribution per weekday and per hour of all 331\thinsp{}734 solutions submitted during the 2021-2022 edition of the course (442 students).
#+NAME: fig:usefwepunchcard
[[./images/usefwepunchcard.png]]

Throughout a course edition, we use Dodona's series analytics to monitor how students perform on our selection of programming assignments (Figures\nbsp{}[[fig:usefweanalyticssubmissions]],\nbsp{}[[fig:usefweanalyticsstatuses]],\nbsp{}and\nbsp{}[[fig:usefweanalyticscorrect]]).
This allows us to make informed decisions and appropriate interventions, for example when students experience issues with the automated assessment configuration of a particular assignment or if the original order of assignments in a series does not seem to align with our design goal to present them in increasing order of difficulty.
The first students to start working on assignments are usually strong performers.
Seeing these early birds having trouble solving one of the assignments may give an early warning that action is needed, such as improving the problem specification, adding extra tips & tricks, or better explaining certain programming concepts to all students during lectures or hands-on sessions.
Conversely, observing that many students postpone working on their assignments until just before the deadline might indicate that some assignments are simply too hard at that point in the students' learning pathway, or that completing the collection of programming assignments interferes with the workload from other courses.
Such "deadline hugging" patterns are also a breeding ground for students to resort to exchanging solutions among each other.

#+CAPTION: Distribution of the number of student submissions per programming assignment.
|
||
#+CAPTION: The larger the zone, the more students submitted a particular number of solutions.
|
||
#+CAPTION: Black dot indicates the average number of submissions per student.
|
||
#+NAME: fig:usefweanalyticssubmissions
|
||
[[./images/usefweanalyticssubmissions.png]]
|
||
|
||
#+CAPTION: Distribution of top-level submission statuses per programming assignment.
|
||
#+NAME: fig:usefweanalyticsstatuses
|
||
[[./images/usefweanalyticsstatuses.png]]
|
||
|
||
#+CAPTION: Progression over time of the percentage of students that correctly solved each assignment.
|
||
#+NAME: fig:usefweanalyticscorrect
|
||
[[./images/usefweanalyticscorrect.png]]
|
||
|
||
Using educational data mining techniques on historical data exported from several editions of the course, we further investigated what aspects of practicing programming skills promote or inhibit learning, or have no or minor effect on the learning process [cite:@vanpetegemPassFailPrediction2022].
|
||
It won't come as a surprise that mid-term test scores are good predictors for a student's final grade, because tests and exams are both summative assessments that are organized and graded in the same way.
|
||
However, we found that organizing a final end-of-term exam is still a catalyst for learning, even for courses with a strong focus on active learning during the weeks of educational activities.

In evaluating if students gain deeper understanding when learning from their mistakes while working progressively on their programming assignments, we found the old adage that practice makes perfect to depend on what kind of mistakes students make.
|
||
Learning to code requires mastering two major competences:
#+ATTR_LATEX: :environment enumerate*
#+ATTR_LATEX: :options [label={\emph{\roman*)}}, itemjoin={{, }}, itemjoin*={{, and }}]
- getting familiar with the syntax and semantics of a programming language to express the steps for solving a problem in a formal way, so that the algorithm can be executed by a computer
- problem solving itself.
It turns out that staying stuck longer on compilation errors (mistakes against the syntax of the programming language) inhibits learning, whereas taking progressively more time to get rid of logical errors (reflective of solving a problem with a wrong algorithm) as assignments get more complex actually promotes learning.
|
||
After all, time spent in discovering solution strategies while thinking about logical errors can be reclaimed multifold when confronted with similar issues in later assignments [cite:@glassFewerStudentsAre2022].
|
||
|
||
These findings neatly align with the claim of [cite/t:@edwardsSeparationSyntaxProblem2018] that problem solving is a higher-order learning task in Bloom's Taxonomy (analysis and synthesis) than language syntax (knowledge, comprehension, and application).
|
||
|
||
Using historical data from previous course editions, we can also make highly accurate predictions about which students will pass or fail the current course edition [cite:@vanpetegemPassFailPrediction2022].
This can already be done after a few weeks into the course, so remedial actions for at-risk students can be started well in time.
|
||
The approach is privacy-friendly as we only need to process metadata on student submissions for programming assignments and results from automated and manual assessment extracted from Dodona.
|
||
Given that cohort sizes are large enough, historical data from a single course edition are already enough to make accurate predictions.
|
||
|
||
*** FEA
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:48]
:CUSTOM_ID: subsec:usefea
:END:

*** Others
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:48]
:CUSTOM_ID: subsec:useothers
:END:

** Secondary schools
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:49]
:CUSTOM_ID: sec:usesecondary
:END:

* Technical description
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:49]
:CUSTOM_ID: chap:technical
:END:

** Dodona
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:49]
:CUSTOM_ID: sec:techdodona
:END:

For proper virtualization we use Docker containers [cite:@pevelerComparingJailedSandboxes2019] that use OS-level containerization technologies and define runtime environments in which all data and executable software (e.g., scripts, compilers, interpreters, linters, database systems) are provided and executed.
These resources are typically pre-installed in the image of the container.
Prior to launching the actual assessment, the container is extended with the submission, the judge and the resources included in the assessment configuration (Figure\nbsp{}[[fig:technicaloutline]]).
Additional resources can be downloaded and/or installed during the assessment itself, provided that Internet access is granted to the container.
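
As an illustration only (Dodona itself is a Ruby on Rails application, see below), the following sketch shows how such an assessment container could be launched with the Docker SDK for Python; the image name, mount points and resource limits are invented for the example and are not Dodona's actual configuration.
#+BEGIN_SRC python
import docker  # Docker SDK for Python

client = docker.from_env()

# Launch a throwaway container from a (hypothetical) judge image, mount the
# submission and assessment resources read-only, and restrict memory and
# network access.  The judge writes its feedback to standard output.
feedback = client.containers.run(
    image="example/python-judge:latest",   # illustrative image name
    command="/judge/run",                  # the judge's run executable (see below)
    volumes={
        "/srv/submission": {"bind": "/mnt/submission", "mode": "ro"},
        "/srv/resources": {"bind": "/mnt/resources", "mode": "ro"},
    },
    mem_limit="512m",
    network_disabled=True,  # only when no Internet access is granted
    remove=True,            # clean up the container after the assessment
)
#+END_SRC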
#+CAPTION: Outline of the procedure to automatically assess a student submission for a programming assignment.
#+CAPTION: Dodona instantiates a Docker container (1) from the image linked to the assignment (or from the default image linked to the judge of the assignment) and loads the submission and its metadata (2), the judge linked to the assignment (3) and the assessment resources of the assignment (4) into the container.
#+CAPTION: Dodona then launches the actual assessment, collects and bundles the generated feedback (5), and stores it into a database along with the submission and its metadata.
#+NAME: fig:technicaloutline
[[./images/technicaloutline.png]]

The actual assessment of the student submission is done by a software component called a *judge* [cite:@wasikSurveyOnlineJudge2018].
The judge must be robust enough to provide feedback on all possible submissions for the assignment, especially submissions that are incorrect or that deliberately try to tamper with the automatic assessment procedure [cite:@forisekSuitabilityProgrammingTasks2006].
Following the principles of software reuse, the judge is ideally also a generic framework that can be used to assess submissions for multiple assignments.
This is enabled by the submission metadata that is passed when calling the judge, which includes the path to the source code of the submission, the path to the assessment resources of the assignment and other metadata such as programming language, natural language, time limit and memory limit.

Rather than providing a fixed set of judges, Dodona adopts a minimalistic interface that allows third parties to create new judges: automatic assessment is bootstrapped by launching the judge’s run executable that can fetch the JSON formatted submission metadata from standard input and must generate JSON formatted feedback on standard output.
The feedback has a standardized hierarchical structure that is specified in a JSON schema.
At the lowest level, *tests* are a form of structured feedback expressed as a pair of generated and expected results.
They typically compare some observed behavior of the submitted code against its expected behavior.
Tests can have a brief description and snippets of unstructured feedback called messages.
Descriptions and messages can be formatted as plain text, HTML (including images), Markdown, or source code.
Tests can be grouped into *test cases*, which in turn can be grouped into *contexts* and eventually into *tabs*.
All these hierarchical levels can have descriptions and messages of their own and serve no other purpose than visually grouping tests in the user interface.
At the top level, a submission has a fine-grained status that reflects the overall assessment of the submission: =compilation error= (the submitted code did not compile), =runtime error= (executing the submitted code failed during assessment), =memory limit exceeded= (memory limit was exceeded during assessment), =time limit exceeded= (assessment did not complete within the given time), =output limit exceeded= (too much output was generated during assessment), =wrong= (assessment completed but not all strict requirements were fulfilled), or =correct= (assessment completed and all strict requirements were fulfilled).
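
To make this interface concrete, the sketch below shows what a toy judge could look like in Python; the metadata and feedback field names used here are illustrative assumptions, the authoritative names and nesting are fixed by the JSON schema mentioned above.
#+BEGIN_SRC python
#!/usr/bin/env python3
"""Toy judge: reads submission metadata from stdin, writes feedback to stdout."""
import json
import sys

metadata = json.load(sys.stdin)    # paths, languages, time and memory limits, ...
source_path = metadata["source"]   # a real judge would now assess this file

# One structured test: a pair of generated and expected results.
test = {"description": "output of the submission",
        "generated": "42",
        "expected": "42"}

feedback = {
    # Top-level status summarising the whole assessment.
    "status": "correct" if test["generated"] == test["expected"] else "wrong",
    # Tests are grouped into test cases, contexts and tabs.
    "tabs": [{
        "description": "Behaviour",
        "contexts": [{
            "testcases": [{
                "description": "example test case",
                "tests": [test],
                "messages": [{"format": "markdown", "description": "Well done!"}],
            }],
        }],
    }],
}

json.dump(feedback, sys.stdout)
#+END_SRC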
Taken together, a Docker image, a judge and a programming assignment configuration (including both a description and an assessment configuration) constitute a *task package* as defined by [cite:@verhoeffProgrammingTaskPackages2008]: a unit Dodona uses to render the description of the assignment and to automatically assess its submissions.
|
||
However, Dodona’s layered design embodies the separation of concerns [cite:@laplanteWhatEveryEngineer2007] needed to develop, update and maintain the three modules in isolation and to maximize their reuse: multiple judges can use the same Docker image and multiple programming assignments can use the same judge.
Related to this, an explicit design goal for judges is to make the assessment configuration for individual assignments as lightweight as possible.
|
||
After all, minimal configurations reduce the time and effort teachers and instructors need to create programming assignments that support automated assessment.
|
||
Sharing of data files and multimedia content among the programming assignments in a repository also implements the inheritance mechanism for *bundle packages* as hinted by [cite:@verhoeffProgrammingTaskPackages2008].
|
||
Another form of inheritance is specifying default assessment configurations at the directory level, which takes advantage of the hierarchical grouping of learning activities in a repository to share common settings.
|
||
|
||
To ensure that the system is robust to sudden increases in workload and when serving hundreds of concurrent users, Dodona has a multi-tier service architecture that delegates different parts of the application to different servers running Ubuntu 22.04 LTS.
|
||
More specifically, the web server, database (MySQL 8), caching system (Memcached 1.6.14) and Python Tutor each run on their own machine.
|
||
In addition, a scalable pool of interchangeable worker servers is available to automatically assess incoming student submissions.
The web server is the only public-facing part of Dodona, running a Ruby on Rails web application (Ruby 3.1, Rails 7.0) that is available on GitHub under the permissive MIT open-source license.
|
||
|
||
Dodona needs to operate in a challenging environment where students simultaneously submit untrusted code to be executed on its servers ("remote code execution by design") and expect automatically generated feedback, ideally within a few seconds.
|
||
Many design decisions are therefore aimed at maintaining and improving the reliability and security of its systems.
|
||
|
||
#+BEGIN_COMMENT
|
||
Development
|
||
Setup
|
||
Deployment
|
||
Testing
|
||
...
|
||
#+END_COMMENT
|
||
|
||
** Judges
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:49]
:CUSTOM_ID: sec:techjudges
:END:

*** R
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:49]
:CUSTOM_ID: subsec:techr
:END:

*** TESTed
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:49]
:CUSTOM_ID: subsec:techtested
:END:

* Pass/fail prediction
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:50]
:CUSTOM_ID: chap:passfail
:END:

** Introduction
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:50]
:CUSTOM_ID: sec:passfailintro
:END:

A lot of educational opportunities are missed by keeping assessment separate from learning [cite:@wiliamWhatAssessmentLearning2011; @blackAssessmentClassroomLearning1998].
|
||
Educational technology can bridge this divide by providing real-time data and feedback to help students learn better, teachers teach better, and education systems become more effective [cite:@oecdOECDDigitalEducation2021].
|
||
Earlier research demonstrated that the adoption of interactive platforms may lead to better learning outcomes [cite:@khalifaWebbasedLearningEffects2002] and makes it possible to collect rich data on student behaviour throughout the learning process in non-invasive ways.
Effectively using such data to extract knowledge and further improve the underlying processes, which is called educational data mining [cite:@bakerStateEducationalData2009], is increasingly explored as a way to enhance learning and educational processes [cite:@duttSystematicReviewEducational2017].
|
||
About one third of the students enrolled in introductory programming courses fail [cite:@watsonFailureRatesIntroductory2014; @bennedsenFailureRatesIntroductory2007].
|
||
Such high failure rates are problematic in light of low enrolment numbers and high industrial demand for software engineering and data science profiles [cite:@watsonFailureRatesIntroductory2014].
|
||
To remedy this situation, it is important to have detection systems for monitoring at-risk students, to understand why they are failing, and to develop preventive strategies.
Ideally, detection happens early on in the learning process to leave room for timely feedback and interventions that can help students increase their chances of passing a course.
|
||
Previous approaches for predicting performance on examinations either take into account prior knowledge such as educational history and socio-economic background of students or require extensive tracking of student behaviour.
|
||
Extensive behaviour tracking may directly impact the learning process itself.
|
||
[cite/t:@rountreeInteractingFactorsThat2004] used decision trees to find that the chance of failure strongly correlates with a combination of academic background, mathematical background, age, year of study, and expectation of a grade other than "A".
|
||
They conclude that students with a skewed view on workload and content are more likely to fail.
|
||
[cite/t:@kovacicPredictingStudentSuccess2012] used data mining techniques and logistic regression on enrolment data to conclude that ethnicity and curriculum are the most important factors for predicting student success.
|
||
They were able to predict success with 60% accuracy.
|
||
[cite/t:@asifAnalyzingUndergraduateStudents2017] combine examination results from the last two years in high school and the first two years in higher education to predict student performance in the remaining two years of their academic study program.
|
||
They used data from one cohort to train models and data from another cohort to test them, showing that the accuracy of their predictions is about 80%.
This evaluates their models in a similar scenario in which they could be applied in practice.
|
||
A downside of the previous studies is that collecting uniform and complete data on student enrolment, educational history and socio-economic background is impractical for use in educational practice.
|
||
Data collection is time-consuming and the data itself can be considered privacy sensitive.
|
||
Usability of predictive models therefore not only depends on their accuracy, but also on their dependency on findable, accessible, interoperable and reusable data [cite:@wilkinsonFAIRGuidingPrinciples2016].
|
||
Predictions based on educational history and socio-economic background also raise ethical concerns.
|
||
Such background information definitely does not explain everything and lowers the perceived fairness of predictions [cite:@grgic-hlacaCaseProcessFairness2018; @binnsItReducingHuman2018].
|
||
Students also cannot change their background, so these items are not actionable for any corrective intervention.
|
||
|
||
It might be more convenient and acceptable if predictive models are restricted to data collected on student behaviour during the learning process of a single course.
|
||
An example of such an approach comes from [cite/t:@vihavainenPredictingStudentsPerformance2013], using snapshots of source code written by students to capture their work attitude.
|
||
Students are actively monitored while writing source code and a snapshot is taken automatically each time they edit a document.
|
||
These snapshots undergo static and dynamic analysis to detect good practices and code smells, which are fed as features to a nonparametric Bayesian network classifier whose pass/fail predictions are 78% accurate by the end of the semester.
|
||
In a follow-up study they applied the same data and classifier to accurately predict learning outcomes for the same student cohort in another course [cite:@vihavainenUsingStudentsProgramming2013].
|
||
In this case, their predictions were 98.1% accurate, although the sample size was rather small.
|
||
While this procedure does not rely on external background information, it has the drawback that data collection is more invasive and directly interferes with the learning process.
Students can’t work in their preferred programming environment and have to agree with extensive behaviour tracking.
|
||
|
||
In this chapter, we present an alternative framework to predict if students will pass or fail a course within the same context of learning to code.
|
||
The method only relies on submission behaviour for programming exercises to make accurate predictions and does not require any prior knowledge or intrusive behaviour tracking.
|
||
Interpretability of the resulting models was an important design goal to enable further investigation on learning habits.
|
||
We also focused on early detection of at-risk students, because predictive models are only effective for the cohort under investigation if remedial actions can be started long before students take their final exam.
|
||
|
||
The chapter starts with a description of how data is collected, what data is used and which machine learning methods have been evaluated to make pass/fail predictions.
|
||
We evaluated the same models and features in multiple courses to test their robustness against differences in teaching styles and student backgrounds.
|
||
The results are discussed from a methodological and educational perspective with a focus on
#+ATTR_LATEX: :environment enumerate*
#+ATTR_LATEX: :options [label={\emph{\roman*)}}, itemjoin={{, }}, itemjoin*={{, and }}]
- accuracy (What machine learning algorithms yield the best predictions?)
- early detection (Can we already make accurate predictions early on in the semester?)
- interpretability (Are resulting models clear about which features are important? Can we explain why certain features are identified as important? How self-evident are important features?).

** Materials and methods
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:50]
:CUSTOM_ID: sec:passfailmaterials
:END:

*** Course structures
:PROPERTIES:
:CREATED: [2023-10-23 Mon 16:28]
:CUSTOM_ID: subsec:passfailstructures
:END:

This study uses data from two introductory programming courses (referenced as course A and course B) collected during 3 editions of each course in academic years 2016-2017, 2017-2018 and 2018-2019.
|
||
Both courses run once per academic year across a 12-week semester (September-December).
|
||
They have separate lecturers and teaching assistants, and are taken by students of different faculties.
|
||
The courses have their own structure, but each edition of a course follows the same structure.
|
||
Table\nbsp{}[[tab:passfailcoursestatistics]] summarizes some statistics on the course editions included in this study.
|
||
|
||
#+ATTR_LATEX: :float sideways
#+CAPTION: Statistics for course editions included in this study.
#+CAPTION: The number of attempts is the average number of solutions submitted by a student per exercise they worked on (i.e. for which the student submitted at least one solution in the course edition).
#+NAME: tab:passfailcoursestatistics
| course | academic  | students | series | exercises | mandatory | submitted       | attempts | pass rate |
|        | year      |          |        |           | exercises | solutions       |          |           |
|--------+-----------+----------+--------+-----------+-----------+-----------------+----------+-----------|
| A      | 2016-2017 |      322 |     10 |        60 | yes       | 167\thinsp{}675 |     9.56 |    60.86% |
| A      | 2017-2018 |      249 |     10 |        60 | yes       | 125\thinsp{}920 |     9.19 |    61.44% |
| A      | 2018-2019 |      307 |     10 |        60 | yes       | 176\thinsp{}535 |    10.29 |    65.14% |
| B      | 2016-2017 |      372 |     20 |       138 | no        | 371\thinsp{}891 |     9.10 |    56.72% |
| B      | 2017-2018 |      393 |     20 |       187 | no        | 407\thinsp{}696 |     7.31 |    60.81% |
| B      | 2018-2019 |      437 |     20 |       201 | no        | 421\thinsp{}461 |     6.26 |    62.47% |

Course A is subdivided into two successive instructional units that each cover five programming topics -- one topic per week -- followed by an evaluation about all topics covered in the unit.
|
||
Students must solve six programming exercises on each topic before a deadline one week later.
|
||
Submitted solutions for these mandatory exercises are automatically evaluated and considered correct if they pass all unit tests for the exercise.
|
||
Failing to submit a correct solution for a mandatory exercise has a small impact on the score for the evaluation at the end of the unit.
|
||
The final exam at the end of the semester evaluates all topics covered in the entire course.
|
||
Students need to solve new programming exercises during evaluations (2 exercises) and exams (3 exercises), where reviewers manually evaluate and grade submitted solutions based on correctness, programming style, the choice between different programming techniques, and the overall quality of the solution.
Each edition of the course is taken by about 300 students.
|
||
|
||
Course B has 20 lab sessions across the semester, with evaluations after the 10th and 17th lab session and a final exam at the end of the semester.
|
||
Each lab session comes with a set of exercises and has an indicative deadline for submitting solutions.
|
||
However, these exercises are not taken into account when computing the final score for the course, so students are completely free to work on exercises as a way to practice their coding skills.
|
||
Students need to solve new programming exercises during evaluations (3 exercises) and exams (4 exercises).
|
||
Solutions submitted during evaluations are automatically graded based on the number of passed unit tests for the exercise.
|
||
Solutions submitted during exams are manually graded in the same way as for course A.
|
||
Each edition of the course is taken by about 400 students.
|
||
|
||
*** Learning environment
:PROPERTIES:
:CREATED: [2023-10-23 Mon 16:28]
:CUSTOM_ID: subsec:passfaillearningenvironment
:END:

Both courses use the same in-house online learning environment to promote active learning through problem solving [cite:@princeDoesActiveLearning2004].
|
||
Each course edition has its own module, with a learning path that groups exercises in separate series (Figure\nbsp{}[[fig:passfailstudentcourse]]).
|
||
Course A has one series per covered programming topic (10 series in total) and course B has one series per lab session (20 series in total).
|
||
A submission deadline is set for each series.
|
||
The learning environment is also used to take tests and exams, within series that are only accessible for participating students.
|
||
|
||
#+CAPTION: Student view of a module in the online learning environment, showing two series of six exercises in the learning path of course A.
|
||
#+CAPTION: Each series has its own deadline.
|
||
#+CAPTION: The status column shows a global status for each exercise based on the last solution submitted.
|
||
#+CAPTION: The class progress column visualizes global status for each exercise for all students subscribed in the course.
|
||
#+CAPTION: Icons on the left show a global status for each exercise based on the last solution submitted before the series deadline.
#+NAME: fig:passfailstudentcourse
|
||
[[./images/passfailstudentcourse.png]]
|
||
|
||
Throughout an edition of a course, students can continuously submit solutions for programming exercises and immediately receive feedback upon each submission, even during tests and exams.
|
||
This rich feedback is automatically generated by an online judge and unit tests linked to each exercise [cite:@wasikSurveyOnlineJudge2018].
|
||
Guided by that feedback, students can track potential errors in their code, remedy them and submit an updated solution.
|
||
There is no restriction on the number of solutions that can be submitted per exercise, and students can continue to submit solutions after a series deadline.
|
||
All submitted solutions are stored, but only the last submission before the deadline is taken into account to determine the status (and grade) of an exercise for a student.
|
||
One of the effects of active learning, triggered by exercises with deadlines and automated feedback, is that most learning happens during the semester as can be seen on the heatmap in Figure\nbsp{}[[fig:passfailheatmap]].
|
||
|
||
#+CAPTION: Heatmap showing the distribution per day of all 176\thinsp{}535 solutions submitted during the 2018-2019 edition of course A.
#+CAPTION: Weekly lab sessions for different groups on Monday afternoon, Friday morning and Friday afternoon.
|
||
#+CAPTION: Weekly deadlines for mandatory exercises on Tuesdays at 22:00.
|
||
#+CAPTION: Four exam sessions for different groups in January.
|
||
#+CAPTION: Two resit exam sessions for different groups in August and September.
|
||
#+NAME: fig:passfailheatmap
|
||
[[./images/passfailheatmap.png]]
|
||
|
||
*** Submission data
:PROPERTIES:
:CREATED: [2023-10-23 Mon 16:38]
:CUSTOM_ID: subsec:passfaildata
:END:

We exported data from the learning environment on all solutions submitted by students during each course edition included in the study.
|
||
Each solution has a submission timestamp with precision down to the second and is linked to a course edition, series in the learning path, exercise and student.
|
||
We did not use the actual source code submitted by students, but the status describing the global assessment made by the learning environment: correct, wrong, compilation error, runtime error, time limit exceeded, memory limit exceeded, or output limit exceeded.
|
||
|
||
Comparison of student behaviour between different editions of the same course is enabled by computing snapshots for each edition at series deadlines.
|
||
Because course editions follow the same structure, we can align their series and compare snapshots for corresponding series.
|
||
Corresponding snapshots represent student performance at intermediate points during the semester and their chronology also allows longitudinal analysis.
|
||
Course A has snapshots for the five series on topics covered in the first unit (labelled S1-S5), a snapshot for the evaluation of the first unit (labelled E1), snapshots for the five series on topics covered in the second unit (labelled S6-S10), a snapshot for the evaluation of the second unit (labelled E2) and a snapshot for the exam (labelled E3).
|
||
Course B has snapshots for the first ten lab sessions (labelled S1-S10), a snapshot for the first evaluation (labelled E1), snapshots for the next series of seven lab sessions (labelled S11-S17), a snapshot for the second evaluation (labelled E2), snapshots for the last three lab sessions (S18-S20) and a snapshot for the exam (labelled E3).
|
||
|
||
A snapshot of a course edition measures student performance only from information available when the snapshot was taken.
|
||
As a result, the snapshot does not take into account submissions after its timestamp.
|
||
Note that the last snapshot taken at the deadline of the final exam takes into account all submissions during the course edition.
|
||
The learning behaviour of a student is expressed as a set of features extracted from the raw submission data.
|
||
We identified different types of features (see appendix [[Feature types]]) that indirectly quantify certain behavioural aspects of students practicing their programming skills.
|
||
When and how long do students work on their exercises?
|
||
Can students correctly solve an exercise and how much feedback do they need to accomplish this?
|
||
What kinds of mistakes do students make while solving programming exercises?
|
||
Do students further optimize the quality of their solution after it passes all unit tests, based on automated feedback or publication of sample solutions?
|
||
Note that there is no one-to-one relationship between these behavioural aspects and feature types.
Some aspects will be covered by multiple feature types, and some feature types incorporate multiple behavioural aspects.
|
||
We will therefore need to take into account possible dependencies between feature types while making predictions.
|
||
|
||
A feature type essentially makes one observation per student per series.
|
||
Each feature type thus results in multiple features: one for each series in the course (excluding series for evaluations and exams).
|
||
In addition, the snapshot also contains a feature for the average of each feature type across all series.
|
||
We do not use observations per individual exercise, as the actual exercises might differ between course editions.
|
||
Snapshots taken at the deadline of an evaluation or later, also contain the score a student obtained for the evaluation.
|
||
These features of the snapshot can be used to predict whether a student will finally pass/fail the course.
|
||
The snapshot also contains a binary value with the actual outcome that is used as a label during training and testing of classification algorithms.
|
||
Students who did not take part in the final examination automatically fail the course.
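
As an illustration, the sketch below computes one such feature type (the number of wrong submissions per series, plus its average across series) from a table of exported submissions with the columns described above; the column names and the chosen feature type are illustrative, not the exact implementation used in the study.
#+BEGIN_SRC python
import pandas as pd

def snapshot_features(submissions: pd.DataFrame, deadline: pd.Timestamp) -> pd.DataFrame:
    """Turn raw submission metadata into one feature vector per student.

    `submissions` is assumed to have columns: student, series, exercise,
    timestamp and status.  Only one illustrative feature type is computed
    here: the number of wrong submissions per series.
    """
    # A snapshot only uses submissions made before its timestamp.
    before = submissions[submissions["timestamp"] <= deadline]
    wrong = (
        before[before["status"] == "wrong"]
        .groupby(["student", "series"])
        .size()
        .unstack(fill_value=0)         # one column per series
        .add_prefix("wrong_series_")
    )
    wrong["wrong_mean"] = wrong.mean(axis=1)  # average across all series
    return wrong
#+END_SRC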
Since course B has no hard deadlines, we left out deadline-related features from its snapshots (=first_dl=, =last_dl= and =nr_dl=; see appendix [[Feature types]]).
|
||
To investigate the impact of deadline-related features, we also made predictions for course A that ignore these features.
|
||
|
||
*** Classification algorithms
:PROPERTIES:
:CREATED: [2023-10-23 Mon 16:45]
:CUSTOM_ID: subsec:passfailclassification
:END:

We evaluated four classification algorithms to make pass/fail predictions from student behaviour: stochastic gradient descent [cite:@fergusonInconsistentMaximumLikelihood1982], logistic regression [cite:@kleinbaumIntroductionLogisticRegression1994], support vector machines [cite:@cortesSupportVectorNetworks1995], and random forests [cite:@svetnikRandomForestClassification2003].
|
||
We used implementations of the algorithms from scikit-learn [cite:@pedregosaScikitlearnMachineLearning2011] and optimized model parameters for each algorithm by cross-validated grid-search over a parameter grid.
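
A minimal sketch of this model selection step with scikit-learn is shown below; the parameter grids are illustrative and not the grids used in the study, and =X= and =y= stand for the snapshot feature matrix and the pass/fail labels.
#+BEGIN_SRC python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative parameter grids for the four evaluated algorithms.
classifiers = {
    "stochastic gradient descent": (SGDClassifier(), {"alpha": [1e-4, 1e-3, 1e-2]}),
    "logistic regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
    "support vector machine": (SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}),
    "random forest": (RandomForestClassifier(), {"n_estimators": [100, 500]}),
}

def tune(X, y):
    """Cross-validated grid search over a parameter grid for each algorithm."""
    best = {}
    for name, (estimator, grid) in classifiers.items():
        search = GridSearchCV(estimator, grid, scoring="balanced_accuracy", cv=5)
        search.fit(X, y)
        best[name] = search.best_estimator_
    return best
#+END_SRC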
Readers unfamiliar with machine learning can think of these specific algorithms as black boxes, but we briefly explain the basic principles of classification for their understanding.
|
||
Supervised learning algorithms use a dataset that contains both inputs and desired outputs to build a model that can be used to predict the output associated with new inputs.
|
||
The dataset used to build the model is called the training set and consists of training examples, with each example represented as an array of input values (feature vector).
|
||
Classification is a specific case of supervised learning where the outputs are restricted to a limited set of values (labels), in contrast to, for example, all possible numerical values within a range.
Classification algorithms are validated by splitting a dataset of labelled feature vectors into a training set and a test set, building a model from the training set, and evaluating the accuracy of its predictions on the test set.
|
||
Keeping training and test data separate is crucial to avoid bias during validation.
|
||
A standard method to make unbiased predictions for all examples in a dataset is \(k\)-fold cross-validation: partition the dataset in $k$ subsets and then perform $k$ experiments that each take one subset for evaluation and the other $k-1$ subsets for training the model.
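
The short sketch below illustrates this procedure with scikit-learn on synthetic stand-in data; =cross_val_predict= applies such a \(k\)-fold scheme under the hood.
#+BEGIN_SRC python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))      # stand-in for snapshot feature vectors
y = rng.integers(0, 2, size=100)   # stand-in for pass/fail labels

# k = 5: each example is predicted by a model trained on the other four folds,
# so no example is ever used to train the model that predicts it.
predictions = cross_val_predict(LogisticRegression(), X, y, cv=5)
#+END_SRC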
Pass/fail prediction is a binary classification problem with two possible outputs: passing or failing a course.
|
||
We evaluated the accuracy of the predictions for each snapshot and each classification algorithm with three different types of training sets.
|
||
As we have data from three editions of each course, the largest possible training set to make predictions for the snapshot of a course edition combines the corresponding snapshots from the two remaining course editions.
|
||
We also made predictions for a snapshot using each of its corresponding snapshots as individual training sets to see if we can still make accurate predictions based on data from only one other course edition.
|
||
Finally, we also made predictions for a snapshot using 5-fold cross-validation to compare the quality of predictions based on data from the same or another cohort of students.
|
||
Note that the latter strategy cannot be applied to make predictions in practice, because we will not have pass/fail results as training labels while taking snapshots during the semester.
To make predictions for a snapshot, we can in practice rely only on corresponding snapshots from previous course editions.
|
||
However, because we can assume that different editions of the same course yield independent data, we also used snapshots from future course editions in our experiments.
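
The following sketch outlines how these cross-edition training sets can be assembled; it is a sketch of the protocol, not the exact implementation used in the study, and assumes a mapping from edition labels to snapshot data.
#+BEGIN_SRC python
import numpy as np
from sklearn.base import clone

def cross_edition_predictions(snapshots, classifier):
    """Cross-edition evaluation for one group of corresponding snapshots.

    `snapshots` maps an edition label to a (features, labels) pair,
    e.g. {"16-17": (X1, y1), "17-18": (X2, y2), "18-19": (X3, y3)}.
    """
    results = {}
    for test_edition, (X_test, _) in snapshots.items():
        others = [edition for edition in snapshots if edition != test_edition]
        # Train on each single remaining edition separately.
        for train_edition in others:
            model = clone(classifier).fit(*snapshots[train_edition])
            results[(train_edition, test_edition)] = model.predict(X_test)
        # Train on the two remaining editions combined.
        X_train = np.vstack([snapshots[edition][0] for edition in others])
        y_train = np.concatenate([snapshots[edition][1] for edition in others])
        model = clone(classifier).fit(X_train, y_train)
        results[(tuple(others), test_edition)] = model.predict(X_test)
    return results
#+END_SRC
With three editions per course this yields nine cross-edition predictions per group of corresponding snapshots; together with 5-fold cross-validation within each of the three editions, this gives the twelve predictions per snapshot group evaluated in the results.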
There are many metrics that can be used to evaluate how accurately a classifier predicted which students will pass or fail the course from the data in a given snapshot.
|
||
Predicting a student will pass the course is called a positive prediction, and predicting they will fail the course is called a negative prediction.
|
||
Predictions that correspond with the actual outcome are called true predictions, and predictions that differ from the actual outcome are called false predictions.
|
||
This results in four possible combinations of predictions: true positives ($TP$), true negatives ($TN$), false positives ($FP$) and false negatives ($FN$).
|
||
Two standard accuracy metrics used in information retrieval are precision ($TP/(TP+FP)$) and recall ($TP/(TP+FN)$).
|
||
The latter is also called sensitivity if used in combination with specificity ($TN/(TN+FP)$).
|
||
|
||
Many studies for pass/fail prediction use accuracy ($(TP+TN)/(TP+TN+FP+FN)$) as a single performance metric.
|
||
However, this can yield misleading results.
|
||
For example, let’s take a dummy classifier that always "predicts" students will pass, no matter what.
|
||
This is clearly a bad classifier, but it will nonetheless have an accuracy of 75% for a course where 75% of the students pass.
|
||
In our study, we will therefore use two more complex metrics that take these effects into account: balanced accuracy and F_1-score.
|
||
Balanced accuracy is the average of sensitivity and specificity.
|
||
The F_1-score is the harmonic mean of precision and recall.
|
||
If we go back to our example, the optimistic classifier that consistently predicts that all students will pass the course and thus fails to identify any failing student will have a balanced accuracy of 50% and an F_1-score of about 86%.
Under the same circumstances, a pessimistic classifier that consistently predicts that all students will fail the course has a balanced accuracy of 50% and an F_1-score of 0%.
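
These numbers are easy to verify with scikit-learn's implementations of both metrics:
#+BEGIN_SRC python
from sklearn.metrics import balanced_accuracy_score, f1_score

# 100 students of which 75 pass (1) and 25 fail (0), as in the example above.
actual = [1] * 75 + [0] * 25
optimistic = [1] * 100   # always predicts "pass"
pessimistic = [0] * 100  # always predicts "fail"

print(balanced_accuracy_score(actual, optimistic))   # 0.5
print(f1_score(actual, optimistic))                  # ~0.857
print(balanced_accuracy_score(actual, pessimistic))  # 0.5
print(f1_score(actual, pessimistic))                 # 0.0 (with a warning)
#+END_SRC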
** Results and discussion
:PROPERTIES:
:CREATED: [2023-10-23 Mon 16:55]
:CUSTOM_ID: sec:passfailresults
:END:

We evaluated the performance of four classification algorithms for pass/fail predictions in a longitudinal sequence of snapshots from course A and B: stochastic gradient descent (Figure\nbsp{}[[fig:passfailsgdresults]]), logistic regression (Figure\nbsp{}[[fig:passfaillrresults]]), support vector machines (Figure\nbsp{}[[fig:passfailsvmresults]]), and random forests (Figure\nbsp{}[[fig:passfailrfresults]]).
|
||
For each classifier, course and snapshot, we evaluated 12 predictions for the following combinations of training and test sets: train on one edition and test on another edition; train on two editions and test on the remaining edition; train and test on one edition using 5-fold cross-validation.
In addition, we made predictions for course A using both the full set of features and a reduced feature set that ignores deadline-related features.
|
||
We discuss the results in terms of accuracy, potential for early detection, and interpretability.
|
||
|
||
#+CAPTION: Performance of stochastic gradient descent classifiers for pass/fail predictions in a longitudinal sequence of snapshots from courses A (all features and reduced set of features) and B, measured by balanced accuracy and F_1-score.
|
||
#+CAPTION: Dots represent performance of a single prediction, with 12 predictions for each group of corresponding snapshots (columns).
|
||
#+CAPTION: Solid line connects averages of the performances for each group of corresponding snapshots.
|
||
#+NAME: fig:passfailsgdresults
|
||
[[./images/passfailsgdresults.png]]
|
||
|
||
|
||
#+CAPTION: Performance of logistic regression classifiers for pass/fail predictions in a longitudinal sequence of snapshots from courses A (all features and reduced set of features) and B, measured by balanced accuracy and F_1-score.
|
||
#+CAPTION: Dots represent performance of a single prediction, with 12 predictions for each group of corresponding snapshots (columns).
|
||
#+CAPTION: Solid line connects averages of the performances for each group of corresponding snapshots.
|
||
#+NAME: fig:passfaillrresults
|
||
[[./images/passfaillrresults.png]]
|
||
|
||
#+CAPTION: Performance of support vector machine classifiers for pass/fail predictions in a longitudinal sequence of snapshots from courses A (all features and reduced set of features) and B, measured by balanced accuracy and F_1-score.
|
||
#+CAPTION: Dots represent performance of a single prediction, with 12 predictions for each group of corresponding snapshots (columns).
|
||
#+CAPTION: Solid line connects averages of the performances for each group of corresponding snapshots.
|
||
#+NAME: fig:passfailsvmresults
|
||
[[./images/passfailsvmresults.png]]
|
||
|
||
#+CAPTION: Performance of random forest classifiers for pass/fail predictions in a longitudinal sequence of snapshots from courses A (all features and reduced set of features) and B, measured by balanced accuracy and F_1-score.
|
||
#+CAPTION: Dots represent performance of a single prediction, with 12 predictions for each group of corresponding snapshots (columns).
|
||
#+CAPTION: Solid line connects averages of the performances for each group of corresponding snapshots.
|
||
#+NAME: fig:passfailrfresults
|
||
[[./images/passfailrfresults.png]]
|
||
|
||
*** Accuracy
:PROPERTIES:
:CREATED: [2023-10-23 Mon 17:03]
:CUSTOM_ID: subsec:passfailaccuracy
:END:

The overall conclusion from the longitudinal analysis is that indirectly measuring how students practice their coding skills by solving programming exercises (formative assessments), combined with directly measuring how they perform on intermediate evaluations (summative assessments), allows us to predict with high accuracy whether students will pass or fail a programming course.
The signals to make such predictions seem to be present in the data, as we come to the same conclusions irrespective of the course, classification algorithm, or performance metric evaluated in our study.
|
||
Overall, logistic regression was the best performing classifier for both courses, but the difference compared to the other classifiers is small.
|
||
|
||
When we compare the longitudinal trends of balanced accuracy for the predictions of both courses, we see that course A starts with a lower balanced accuracy at the first snapshot, but its accuracy increases faster and is slightly higher at the end of the semester.
|
||
At the start of the semester at snapshot S1, course A has an average balanced accuracy between 60% and 65% and course B around 70%.
|
||
Nearly halfway through the semester, before the first evaluation, we see an average balanced accuracy around 70% for course A at snapshot S5 and between 70% and 75% for course B at snapshot S8.
|
||
After the first evaluation, we can make predictions with a balanced accuracy between 75% and 80% for both courses.
|
||
The predictions for course B stay within this range for the rest of the semester, but for course A we can consistently make predictions with an average balanced accuracy of 80% near the end of the semester.
|
||
|
||
F_1-scores follow the same trend as balanced accuracy, but the upward trend is even more pronounced because it starts lower and ends higher.
They show another sharp improvement in predictive performance for both courses when students practice their programming skills in preparation for the final exam (snapshot E3).
This underscores the need to keep organizing final summative assessments as catalysts of learning, even for courses with a strong focus on active learning.
|
||
|
||
The variation in predictive accuracy for a group of corresponding snapshots is higher for course A than for course B.
|
||
This might be explained by the fact that successive editions of course B use the same set of exercises, supplemented with evaluation and exam exercises from the previous edition, whereas each edition of course A uses a different selection of exercises.
|
||
|
||
Predictions made with training sets from the same student cohort (5-fold cross-validation) perform better than those with training sets from different cohorts (see supplementary material for details).
|
||
This is more pronounced for F_1-scores than for balanced accuracy, but the differences are small enough that nothing prevents us from building classification models with historical data from previous student cohorts to make pass/fail predictions for the current cohort; this is something that cannot be done in practice with data from the same cohort, as pass/fail information is needed during the training phase.
In addition, we found no significant performance differences for classification models using data from a single course edition or combining data from two course editions.
|
||
Given that cohort sizes are large enough, this tells us that accurate predictions can already be made in practice with historical data from a single course edition.
|
||
This is also relevant when the structure of a course changes, because we can only make predictions from historical data for course editions whose snapshots align.
|
||
|
||
The need to align snapshots is also the reason why we had to build separate models for courses A and B since both have differences in course structure.
|
||
The models, however, were built using the same set of feature types.
|
||
Because course B does not work with hard deadlines, deadline-related feature types could not be computed for its snapshots.
|
||
This missing data and associated features had no impact on the performance of the predictions.
|
||
Deliberately dropping the same feature types for course A also had no significant effect on the performance of predictions, illustrating that the training phase is where classification algorithms decide themselves how the individual features will contribute to the predictions.
|
||
This frees us from having to determine the importance of features beforehand, allows us to add new features that might contribute to predictions even if they correlate with other features, and makes it possible to investigate afterwards how important individual features are for a given classifier (see section [[Interpretability]]).
|
||
|
||
*** Early detection
:PROPERTIES:
:CREATED: [2023-10-23 Mon 17:05]
:CUSTOM_ID: subsec:passfailearly
:END:

Accuracy of predictions systematically increases as we capture more of student behaviour during the semester.
|
||
But surprisingly we can already make quite accurate predictions early on in the semester, long before students take their first evaluation.
|
||
Because of the steady trend, predictions for course B at the start of the semester are already reliable enough to serve as input for student feedback or teacher interventions.
|
||
It takes some more time to identify at-risk students for course A, but from week four or five onwards the predictions may also become an instrument to design remedial actions for this course.
|
||
Hard deadlines and graded exercises strongly enforce active learning behaviour on the students of course A, and might somewhat mask the intrinsic motivation of students to work on their programming skills.
This might explain why it takes a bit longer to properly measure student motivation for course A than for course B.
|
||
|
||
*** Interpretability
:PROPERTIES:
:CREATED: [2023-10-23 Mon 17:05]
:CUSTOM_ID: subsec:passfailinterpretability
:END:

So far, we have considered classification models as black boxes in our longitudinal analysis of pass/fail predictions.
|
||
However, many machine learning techniques can tell us something about the contribution of individual features to make the predictions.
|
||
In the case of our pass/fail predictions, looking at the importance of feature types and linking them to aspects of practicing programming skills, might give us insights into what kind of behaviour promotes or inhibits learning, or has no or a minor effect on the learning process.
|
||
Temporal information can tell us what behaviour makes a steady contribution to learning or where we see shifts throughout the semester.
|
||
|
||
This interpretability was a considerable factor in our choice of the classification algorithms we investigated in this study.
|
||
Since we identified logistic regression as the best-performing classifier, we will take a closer look at feature contributions in its models.
|
||
These models are explained by the feature weights in the logistic regression equation, so we will express the importance of a feature as its actual weight in the model.
|
||
We use a temperature scale when plotting importances: white for zero importance, a red gradient for positive importance values and a blue gradient for negative importance values.
|
||
A feature importance $w$ can be interpreted as follows for logistic regression models: an increase of the feature value by one standard deviation increases the odds of passing the course by a factor of $e^w$ when all other feature values remain the same [cite:@molnarInterpretableMachineLearning2019].
The absolute value of the importance determines the impact the feature has on predictions.
|
||
Features with zero importance have no impact because $e^0 = 1$.
|
||
Features represented with a light colour have a weak impact and features represented with a dark colour have a strong impact.
|
||
As a reference, an importance of 0.7 doubles the odds for passing the course with each standard deviation increase of the feature value, because $e^{0.7} \sim 2$.
|
||
The sign of the importance determines whether the feature promotes or inhibits the odds of passing the course.
|
||
Features with a positive importance (red colour) will increase the odds with increasing feature values, and features with a negative importance (blue colour) will decrease the odds with increasing feature values.
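
A minimal sketch of how such importances can be extracted is given below; it assumes the features are standardized so that each weight corresponds to a one-standard-deviation change, and it uses synthetic stand-in data rather than the actual snapshot data.
#+BEGIN_SRC python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                          # stand-in for snapshot features
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)   # stand-in pass/fail labels

# Standardizing first makes each weight w correspond to a change of one
# standard deviation in the feature value, matching the interpretation above.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

weights = model.named_steps["logisticregression"].coef_[0]
odds_ratios = np.exp(weights)   # a weight of 0.7 roughly doubles the odds
#+END_SRC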
To simulate that we want to make predictions for each course edition included in this study, we trained logistic regression models with data from the remaining two editions of the same course.
|
||
A label "edition 18-19" therefore means that we want to make predictions for the 2018-2019 edition of a course with a model built from the 2016-2017 and 2017-2018 editions of the course.
|
||
However, in this case we are not interested in the predictions themselves, but in the importance of the features in the models.
|
||
The importance of all features for each course edition can be found in the supplementary material.
|
||
We will restrict our discussion by highlighting the importance of a selection of feature types for the two courses.
|
||
|
||
For course A, we look into the evaluation scores (Figure\nbsp{}[[fig:passfailfeaturesAevaluation]]) and the feature types =correct_after_15m= (Figure\nbsp{}[[fig:passfailfeaturesAcorrect]]) and =wrong= (Figure\nbsp{}[[fig:passfailfeaturesAwrong]]).
|
||
Evaluation scores have a very strong impact on predictions, and high evaluation scores increase the odds of passing the course.
|
||
This comes as no surprise, as both the evaluations and exams are summative assessments that are organized and graded in the same way.
|
||
Although the difficulty of evaluation exercises is lower than that of exam exercises, evaluation scores already are good predictors for exam scores.
Also note that these features only show up in snapshots taken at or after the corresponding evaluation.
|
||
They have zero impact on predictions for earlier snapshots, as the information is not available at the time these snapshots are taken.
|
||
|
||
#+CAPTION: Importance of evaluation scores in the logistic regression models for course A.
|
||
#+NAME: fig:passfailfeaturesAevaluation
|
||
[[./images/passfailfeaturesAevaluation.png]]
|
||
|
||
The second feature type we want to highlight is =correct_after_15m=: the number of exercises in a series where the first correct submission was made within fifteen minutes after the first submission (Figure\nbsp{}[[fig:passfailfeaturesAcorrect]]).
|
||
Note that we can’t directly measure how long students work on an exercise, as they may write, run and test their solutions on their local machine before their first submission to the learning platform.
|
||
Rather, this feature type measures how long it takes students to find and remedy errors in their code (debugging), after they start getting automatic feedback from the learning platform.
|
||
|
||
For exercise series in the first unit of course A (series 1-5), we generally see that the features have a positive impact (red).
|
||
This means that students will more likely pass the course if they are able to quickly remedy errors in their solutions for these exercises.
|
||
The first and fourth series are an exception here.
|
||
The fact that students need more time for the first series might reflect that learning something new is hard at the beginning, even if the exercises are still relatively easy.
|
||
Series 4 of course A covers strings as the first compound data type of Python in combination with nested loops, where (unnested) loops themselves are covered in series 3.
|
||
This complex combination might mean that students generally need more time to debug the exercises in series 4.
|
||
|
||
For the series of the second unit (series 6-10), we observe two different effects.
|
||
The impact of these features is zero for the first few snapshots (grey bottom left corner).
|
||
This is because the exercises from these series were not yet published at the time of those snapshots, whereas all series of the first unit were available from the start of the semester.
For the later snapshots, we generally see a negative (blue) weight associated with the features.
|
||
It might seem counterintuitive and contradicts the explanation given for the series of the first unit.
|
||
However, the exercises of the second unit are a lot more complex than those of the first unit.
|
||
This up to a point that even for good students it is hard to debug and correctly solve an exercise in only 15 minutes.
|
||
Students that need less than 15 minutes at this stage might be bypassing learning by copying solutions from fellow students instead of solving the exercises themselves.
|
||
An exception to this pattern are the few red squares forming a diagonal in the middle of the bottom half.
|
||
These squares correspond with exercises that are solved as soon as they become available as opposed to waiting for the deadline.
|
||
A possible explanation for these few slightly positive weights is that these exercises are solved by highly-motivated, top students.
|
||
|
||
#+CAPTION: Importance of feature type =correct_after_15m= (the number of exercises in a series where the first correct submission was made within fifteen minutes after the first submission) in logistic regression models for course A.
#+NAME: fig:passfailfeaturesAcorrect
[[./images/passfailfeaturesAcorrect.png]]

Finally, if we look at the feature type =wrong= (Figure\nbsp{}[[fig:passfailfeaturesAwrong]]), submitting many solutions with logical errors mostly has a positive impact on the odds of passing the course.
This underscores the old adage that practice makes perfect, as real learning happens when students learn from their mistakes.
Exceptions to this rule are found for series 2, 3, 9 and 10.
The lecturer and teaching assistants identify the topics covered in series 2 and 9 as by far the easiest in the course, and those covered in series 3, 6 and 10 as the hardest.
However, it is not very intuitive that being stuck with logical errors longer than other students inhibits the odds of passing for topics that are extremely hard or extremely easy, while promoting those odds for topics of moderate difficulty.
This shows that interpreting the importance of feature types is not always straightforward.

#+CAPTION: Importance of feature type =wrong= (the number of wrong submissions in a series) in logistic regression models for course A.
#+NAME: fig:passfailfeaturesAwrong
[[./images/passfailfeaturesAwrong.png]]

For course B, we look into the evaluation scores (Figure\nbsp{}[[fig:passfailfeaturesBevaluation]]) and the feature types =comp_error= (Figure\nbsp{}[[fig:passfailfeaturesBcomp]]) and =wrong= (Figure\nbsp{}[[fig:passfailfeaturesBwrong]]).
The importance of evaluation scores is similar to that for course A, although their relative impact on the predictions is slightly lower.
The latter might be caused by the automatic grading of evaluation exercises, whereas exam exercises are graded by hand.
Because the second evaluation is scheduled slightly earlier in the semester than for course A, pass/fail predictions for course B can also rely on this important feature earlier.
However, we do not see a similar increase in the global performance metrics around the second evaluation of course B as we see for the first evaluation.

#+CAPTION: Importance of evaluation scores in the logistic regression models for course B.
#+NAME: fig:passfailfeaturesBevaluation
[[./images/passfailfeaturesBevaluation.png]]

Learning to code requires mastering two major competences:
#+ATTR_LATEX: :environment enumerate*
#+ATTR_LATEX: :options [label={\emph{\roman*)}}, itemjoin={{, }}, itemjoin*={{, and }}]
- getting familiar with the syntax rules of a programming language to express the steps for solving a problem in a formal way, so that the algorithm can be executed by a computer
- problem solving itself.

As a result, we can make a distinction between different kinds of errors in source code.
Compilation errors are violations of the syntax rules of the programming language, whereas logical errors result from solving a problem with a wrong algorithm.

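To make this distinction concrete, here are two buggy attempts at the same trivial task (an illustrative example in Python, not an exercise taken from either course):

#+BEGIN_SRC python
# A compilation (syntax) error: the parser rejects the code before it can run,
# for example because the colon after the function header is missing.
#
#     def average(numbers)
#         return sum(numbers) / len(numbers)

# A logical error: the code is syntactically valid and runs, but the algorithm
# is wrong, so the automated tests mark the submission as wrong.
def average(numbers):
    return sum(numbers) / 2  # should divide by len(numbers)

print(average([1, 2, 3]))  # prints 3.0, but the expected average is 2.0
#+END_SRC
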
When comparing the importance of the number of compilation errors (Figure\nbsp{}[[fig:passfailfeaturesBcomp]]) and logical errors (Figure\nbsp{}[[fig:passfailfeaturesBwrong]]) students make while practicing their coding skills, we see a clear difference.
Making a lot of compilation errors has a negative impact on the odds of passing the course (blue colour dominates in Figure\nbsp{}[[fig:passfailfeaturesBcomp]]), whereas making a lot of logical errors makes a positive contribution (red colour dominates in Figure\nbsp{}[[fig:passfailfeaturesBwrong]]).
This aligns with the claim of [cite/t:@edwardsSeparationSyntaxProblem2018] that problem solving is a higher-order learning task in Bloom's Taxonomy (analysis and synthesis) than language syntax (knowledge, comprehension, and application).
Students who get stuck longer in the mechanics of a programming language are more likely to fail the course, whereas students who make a lot of logical errors and properly learn from them are more likely to pass it.
So making mistakes is beneficial for learning, but it depends on the kind of mistakes.
We also looked at the number of solutions with logical errors while interpreting feature types for course A.
Although we hinted at the same conclusions there as for course B, the signals were less consistent.
This shows that interpreting feature importances always needs to take the educational context into account.

#+CAPTION: Importance of feature type =comp_error= (the number of submissions with compilation errors in a series) in logistic regression models for course B.
#+NAME: fig:passfailfeaturesBcomp
[[./images/passfailfeaturesBcomp.png]]

#+CAPTION: Importance of feature type =wrong= (the number of wrong submissions in a series) in logistic regression models for course B.
#+NAME: fig:passfailfeaturesBwrong
[[./images/passfailfeaturesBwrong.png]]

** Conclusions and future work
:PROPERTIES:
:CREATED: [2023-10-23 Mon 17:30]
:CUSTOM_ID: sec:passfailconclusions
:END:

In this chapter, we presented a classification framework for predicting whether students are likely to pass or fail introductory programming courses.
The framework already yields high-accuracy predictions early in the semester and is privacy-friendly, because it only uses metadata from programming challenges solved by students while working on their programming skills.
Being able to identify at-risk students early in the semester opens a window for remedial actions to improve the overall success rate of students.

We validated the framework by building separate classifiers for the two courses because of differences in course structure, but used the same set of features for training the models.
The results showed that metadata from previous student cohorts can be used to make predictions about the current cohort of students, even if course editions use different sets of exercises.
Making predictions requires aligning snapshots between successive editions of a course, so that students have the same expected progress at corresponding snapshots.
Historical metadata from a single course edition suffices if group sizes are large enough.
Different classification algorithms can be plugged into the framework, but logistic regression resulted in the best-performing classifiers.

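As an illustration of this pluggability (a minimal sketch with a synthetic feature matrix, not the evaluation code used in this chapter), any scikit-learn classifier can be swapped into the same pipeline and compared with cross-validation:

#+BEGIN_SRC python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))   # snapshot features (synthetic placeholder)
y = rng.integers(0, 2, 400)     # pass/fail labels (synthetic placeholder)

classifiers = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=200),
    "support vector machine": SVC(),
}

for name, classifier in classifiers.items():
    pipeline = make_pipeline(StandardScaler(), classifier)
    scores = cross_val_score(pipeline, X, y, cv=5, scoring="balanced_accuracy")
    print(f"{name}: balanced accuracy {scores.mean():.3f} +/- {scores.std():.3f}")
#+END_SRC
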
Apart from their use in making pass/fail predictions, an interesting side effect of classification models that map indirect measurements of learning behaviour onto mastery of programming skills is that they allow us to interpret which behavioural aspects contribute to learning to code.
Visualisation of feature importance turned out to be a useful instrument for linking individual feature types with student behaviour that promotes or inhibits learning.
We applied this interpretability to some important feature types that stood out for the two courses included in this study.

We can thus conclude that the proposed framework achieves the objectives set for accuracy, early prediction and interpretability.
Having this new framework at hand immediately raises some follow-up research questions that call for further exploration:
#+ATTR_LATEX: :environment enumerate*
#+ATTR_LATEX: :options [label={\emph{\roman*)}}, itemjoin={{ }}, itemjoin*={{ }}]
- Do we inform students about their odds of passing a course?
  How and when do we inform students about their performance in an educationally responsible way?
  What learning analytics do we use to present predictions to students, and do we only show results or also explain how the data led to those results?
  How do students react to the announcement of their chances of passing the course?
  How do we ensure that students are not demotivated?
- What actions could teachers take upon early insights into which students will likely fail the course?
  What recommendations could they make to increase the odds that more students will pass the course?
  How could interpretations of important behavioural features be translated into learning analytics that give teachers more insight into how students learn to code?
- Can we combine student progress (what programming skills does a student already have and at what level of mastery), student preferences (what skills does a student want to improve on), and intrinsic properties of programming exercises (what skills are needed to solve an exercise and how difficult is it) into dynamic learning paths that recommend exercises to optimize the learning effect for individual students?

** Future work/Replication in Finland
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:50]
:CUSTOM_ID: sec:passfailfinland
:END:

#+BEGIN_COMMENT
Extract new info from article; present here
#+END_COMMENT

* Feedback prediction
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:51]
:CUSTOM_ID: chap:prediction
:END:

* Discussion and future work
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:51]
:CUSTOM_ID: chap:discussion
:END:

* Bibliography
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:59]
:CUSTOM_ID: chap:bibliography
:UNNUMBERED: t
:END:

#+LATEX: {\setlength{\emergencystretch}{2em}
#+print_bibliography:
#+LATEX: }

#+LATEX: \appendix
* Feature types
:PROPERTIES:
:CREATED: [2023-10-23 Mon 18:09]
:CUSTOM_ID: chap:featuretypes
:APPENDIX: t
:END:

- =subm= :: number of submissions by student in series
- =nosubm= :: number of exercises student did not submit any solutions for in series
- =first_dl= :: time difference in seconds between student’s first submission in series and deadline of series
- =last_dl= :: time difference in seconds between student’s last submission in series before deadline and deadline of series
- =nr_dl= :: number of correct submissions in series by student before series’ deadline
- =correct= :: number of correct submissions in series by student
- =after_correct= :: number of submissions by student after their first correct submission in the series
- =before_correct= :: number of submissions by student before their first correct submission in the series
- =time_series= :: time difference in seconds between the student’s first and last submission in the series
- =time_correct= :: time difference in seconds between the student’s first submission in the series and their first correct submission in the series
- =wrong= :: number of submissions by student in series with logical errors
- =comp_error= :: number of submissions by student in series with compilation errors
- =runtime_error= :: number of submissions by student in series with runtime errors
- =correct_after_5m= :: number of exercises where first correct submission by student was made within five minutes after first submission
- =correct_after_15m= :: number of exercises where first correct submission by student was made within fifteen minutes after first submission
- =correct_after_2h= :: number of exercises where first correct submission by student was made within two hours after first submission
- =correct_after_24h= :: number of exercises where first correct submission by student was made within twenty-four hours after first submission
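
To illustrate how feature types like these can be computed (a simplified sketch with hypothetical column names; the platform's real submission export may differ), the snippet below derives a few of them from a flat log of submissions:

#+BEGIN_SRC python
import pandas as pd

# Hypothetical submission log: one row per submission to the platform.
submissions = pd.DataFrame({
    "student":    ["alice", "alice", "bob", "bob", "bob"],
    "series":     [1, 1, 1, 1, 1],
    "status":     ["compilation error", "correct", "wrong", "wrong", "correct"],
    "created_at": pd.to_datetime([
        "2019-03-01 10:00", "2019-03-01 10:20",
        "2019-03-02 09:00", "2019-03-02 09:30", "2019-03-02 10:00",
    ]),
})
deadline = pd.Timestamp("2019-03-08 12:00")  # deadline of this series

per_student = submissions.groupby(["student", "series"])
features = pd.DataFrame({
    # subm: number of submissions by student in series
    "subm": per_student.size(),
    # correct / wrong / comp_error: counts per submission status
    "correct": per_student["status"].apply(lambda s: (s == "correct").sum()),
    "wrong": per_student["status"].apply(lambda s: (s == "wrong").sum()),
    "comp_error": per_student["status"].apply(lambda s: (s == "compilation error").sum()),
    # first_dl: seconds between first submission in series and the series deadline
    "first_dl": per_student["created_at"].apply(lambda t: (deadline - t.min()).total_seconds()),
})
print(features)
#+END_SRC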