Textual fixes in chapter 4

This commit is contained in:
Charlotte Van Petegem 2024-01-08 11:57:21 +01:00
parent 984dc6d81f
commit f820dd6733
No known key found for this signature in database
GPG key ID: 019E764B7184435A


@@ -48,7 +48,7 @@ Because of this the `\frontmatter` statement needs to be part of the `org-latex-
Include history of automated assessment
-*** TODO Write [[Dolos]]
+*** TODO Write [[#sec:whatdolos]]
:PROPERTIES:
:CREATED: [2023-11-20 Mon 17:24]
:END:
@@ -65,7 +65,7 @@ This will have to be a lot shorter than the FWE section, since I'm far less know
:CREATED: [2023-11-30 Thu 10:39]
:END:
-*** TODO Finish chapter [[Summative feedback]]
+*** TODO Finish chapter [[#chap:feedback]]
:PROPERTIES:
:CREATED: [2023-11-20 Mon 17:20]
:END:
@@ -77,7 +77,7 @@ Remaining text should probably be written in the form of an article, so we can t
:CREATED: [2023-11-20 Mon 18:00]
:END:
-*** TODO Write [[Discussion and future work]]
+*** TODO Write [[#chap:discussion]]
:PROPERTIES:
:CREATED: [2023-11-20 Mon 17:20]
:END:
@@ -398,6 +398,7 @@ The evaluation tracks which submissions have been manually assessed, so that ana
** Dolos
:PROPERTIES:
:CREATED: [2023-11-24 Fri 14:03]
+:CUSTOM_ID: sec:whatdolos
:END:
Dolos is not (yet) integrated into Dodona, but it is an important element of the educational practice around Dodona.
@@ -746,7 +747,7 @@ Given that cohort sizes are large enough, historical data from a single course e
:CUSTOM_ID: chap:technical
:END:
-Dodona and its related software comprises a lot of code.
+Dodona and all software related to it comprise a lot of code.
This chapter discusses the technical background of Dodona itself\nbsp{}[cite:@vanpetegemDodonaLearnCode2023] and a stand-alone online code editor, Papyros (\url{https://papyros.dodona.be}), that was integrated into Dodona\nbsp{}[cite:@deridderPapyrosSchrijvenUitvoeren2022].
I will also discuss two judges whose development I was involved in.
The R judge was written entirely by myself\nbsp{}[cite:@nustRockerversePackagesApplications2020].
@ -759,7 +760,7 @@ The TESTed judge came forth out of a prototype I built in my master's thesis\nbs
:END:
Dodona is developed as a modern web application.
-In this section we will go over the inner workings of Dodona (both implementation and deployment) and how it adheres to modern standards of software development.
+In this section I will go over the inner workings of Dodona (both implementation and deployment) and how it adheres to modern standards of software development.
*** Implementation
:PROPERTIES:
@@ -787,33 +788,33 @@ This is also the case for network access, where even if the container is allowed
#+NAME: fig:technicaloutline
[[./images/technicaloutline.png]]
-The actual assessment of the student submission is done by a software component called a *judge*\nbsp{}[cite:@wasikSurveyOnlineJudge2018].
+The actual assessment of the student submission is done by a software component called a /judge/\nbsp{}[cite:@wasikSurveyOnlineJudge2018].
The judge must be robust enough to provide feedback on all possible submissions for the assignment, especially submissions that are incorrect or that deliberately attempt to tamper with the automatic assessment procedure\nbsp{}[cite:@forisekSuitabilityProgrammingTasks2006].
Following the principles of software reuse, the judge is ideally also a generic framework that can be used to assess submissions for multiple assignments.
This is enabled by the submission metadata that is passed when calling the judge, which includes the path to the source code of the submission, the path to the assessment resources of the assignment and other metadata such as programming language, natural language, time limit and memory limit.
Rather than providing a fixed set of judges, Dodona adopts a minimalistic interface that allows third parties to create new judges: automatic assessment is bootstrapped by launching the judge's run executable that can fetch the JSON formatted submission metadata from standard input and must generate JSON formatted feedback on standard output.
The feedback has a standardized hierarchical structure that is specified in a JSON schema.
-At the lowest level, *tests* are a form of structured feedback expressed as a pair of generated and expected results.
+At the lowest level, /tests/ are a form of structured feedback expressed as a pair of generated and expected results.
They typically test some behaviour of the submitted code against expected behaviour.
Tests can have a brief description and snippets of unstructured feedback called messages.
Descriptions and messages can be formatted as plain text, HTML (including images), Markdown, or source code.
-Tests can be grouped into *test cases*, which in turn can be grouped into *contexts* and eventually into *tabs*.
+Tests can be grouped into /test cases/, which in turn can be grouped into /contexts/ and eventually into /tabs/.
All these hierarchical levels can have descriptions and messages of their own and serve no other purpose than visually grouping tests in the user interface.
At the top level, a submission has a fine-grained status that reflects the overall assessment of the submission: =compilation error= (the submitted code did not compile), =runtime error= (executing the submitted code failed during assessment), =memory limit exceeded= (memory limit was exceeded during assessment), =time limit exceeded= (assessment did not complete within the given time), =output limit exceeded= (too much output was generated during assessment), =wrong= (assessment completed but not all strict requirements were fulfilled), or =correct= (assessment completed, and all strict requirements were fulfilled).
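The judge interface and feedback hierarchy described above can be sketched as follows. This is a hedged illustration only: the field names are hypothetical, and Dodona's actual JSON schema defines the real structure.

```python
import json
import sys

# Hypothetical sketch of a judge's "run" executable: it reads JSON
# submission metadata from standard input and writes JSON feedback to
# standard output. Field names are illustrative, not Dodona's schema.
def assess(metadata: dict) -> dict:
    # A test pairs a generated result with an expected result.
    test = {"generated": "42", "expected": "42"}
    # Tests group into test cases, which group into contexts and tabs;
    # every level can carry its own description and messages.
    testcase = {"description": "answer()", "tests": [test]}
    tab = {"description": "Basic tests",
           "contexts": [{"testcases": [testcase]}]}
    # The top-level status reflects the overall assessment.
    status = "correct" if test["generated"] == test["expected"] else "wrong"
    return {"status": status, "tabs": [tab]}

def main() -> None:
    # Metadata includes the submission's source path, language, limits, ...
    metadata = json.load(sys.stdin)
    json.dump(assess(metadata), sys.stdout)
```

The stdin/stdout contract is what keeps the interface minimal: any executable that speaks this protocol can serve as a judge.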
-Taken together, a Docker image, a judge and a programming assignment configuration (including both a description and an assessment configuration) constitute a *task package* as defined by\nbsp{}[cite:@verhoeffProgrammingTaskPackages2008]: a unit Dodona uses to render the description of the assignment and to automatically assess its submissions.
+Taken together, a Docker image, a judge and a programming assignment configuration (including both a description and an assessment configuration) constitute a /task package/ as defined by\nbsp{}[cite:@verhoeffProgrammingTaskPackages2008]: a unit Dodona uses to render the description of the assignment and to automatically assess its submissions.
However, Dodona's layered design embodies the separation of concerns\nbsp{}[cite:@laplanteWhatEveryEngineer2007] needed to develop, update and maintain the three modules in isolation and to maximize their reuse: multiple judges can use the same Docker image and multiple programming assignments can use the same judge.
Related to this, an explicit design goal for judges is to make the assessment configuration for individual assignments as lightweight as possible.
After all, minimal configurations reduce the time and effort teachers and instructors need to create programming assignments that support automated assessment.
-Sharing of data files and multimedia content among the programming assignments in a repository also implements the inheritance mechanism for *bundle packages* as hinted by\nbsp{}[cite/t:@verhoeffProgrammingTaskPackages2008].
+Sharing of data files and multimedia content among the programming assignments in a repository also implements the inheritance mechanism for /bundle packages/ as hinted by\nbsp{}[cite/t:@verhoeffProgrammingTaskPackages2008].
Another form of inheritance is specifying default assessment configurations at the directory level, which takes advantage of the hierarchical grouping of learning activities in a repository to share common settings.
Since Dodona grew from being used mostly by teachers we knew personally to being used in secondary schools all over Flanders, we went from fully trusting exercise authors to having that trust reduced (as it is impossible for a team of our size to vet all the people we grant teacher's rights in Dodona).
This meant that our threat model and therefore the security measures we had to take also changed over the years.
Once Dodona was opened up to more and more teachers, we gradually locked down what teachers could do with e.g. their exercise descriptions.
Content where teachers can inject raw HTML into Dodona was moved to iframes; this ensures that teachers can still be as creative as they want when writing exercises, while not allowing them to execute JavaScript in a session where users are logged in.
-For user content where this creative freedom is not as necessary (e.g. series or descriptions), but some Markdown/HTML content is still wanted, we sanitize the (generated) HTML so that it can only include HTML elements and attributes that are specifically allowed.
+For user content where this creative freedom is not as necessary (e.g. series or course descriptions), but some Markdown/HTML content is still wanted, we sanitize the (generated) HTML so that it can only include HTML elements and attributes that are specifically allowed.
One of the most important components of Dodona is the feedback table.
It has, therefore, seen a lot of security, optimization and UI work over the years.
@@ -821,7 +822,7 @@ Since teachers can determine a lot of the content that eventually ends up in the
The increase in teachers adding exercises to Dodona also meant that the variety in the feedback given grew, sometimes resulting in a huge volume of test cases and long outputs.
Optimization work was needed to cope with this volume of feedback.
-When Dodona was first written, the library used for diffing generated and expected results actually shelled out to the GNU =diff= command.
+When Dodona was first written, the library used for creating diffs of the generated and expected results actually shelled out to the GNU =diff= command.
This output was parsed and changed into HTML by the library using find and replace operations.
As one can expect, starting a new process and doing a lot of string operations every time outputs had to be diffed resulted in very slow loading times for the feedback table.
The library was replaced with a pure Ruby library (=diff-lcs=), and its outputs were built into HTML using Rails' efficient =Builder= class.
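The move from shelling out to GNU =diff= to in-process, line-by-line diffing can be illustrated with Python's standard =difflib= module; Dodona itself uses the Ruby =diff-lcs= library, so this is a stand-in sketch of the idea, not the actual implementation.

```python
import difflib

# Compare generated and expected output line by line, entirely
# in-process, instead of spawning GNU diff and parsing its output.
def diff_lines(generated: str, expected: str) -> list[str]:
    matcher = difflib.SequenceMatcher(
        a=generated.splitlines(), b=expected.splitlines()
    )
    rows: list[str] = []
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op == "equal":
            # Unchanged lines need no diff markup at all.
            rows += ["  " + line for line in matcher.a[a1:a2]]
        else:
            # Only changed lines are marked, keeping the output small.
            rows += ["- " + line for line in matcher.a[a1:a2]]
            rows += ["+ " + line for line in matcher.b[b1:b2]]
    return rows
```

Diffing whole lines rather than individual characters keeps both the computation and the amount of generated markup manageable for very long outputs.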
@@ -829,10 +830,10 @@ This change of diffing method also fixed a number of bugs we were experiencing a
Even this was not enough to handle the most extreme of exercises though.
Diffing hundreds of lines hundreds of times still takes a long time, even if done in-process while optimized by a JIT.
-The resulting feedback tables also contained so much HTML that the browser on our development machines (which are pretty powerful machines) noticeably slowed down when loading and rendering them.
+The resulting feedback tables also contained so much HTML that the browsers on our development machines (which are pretty powerful machines) noticeably slowed down when loading and rendering them.
To handle these cases, we needed to do less work and needed to output less HTML.
-We decided to only diff line-by-line (instead of character-by-character) in most cases and to not diff at all in the most extreme cases, reducing the amount of HTML required to render them as well.
-This was also motivated by a usability perspective.
+We decided to only diff line-by-line (instead of character-by-character) in most of these cases and to not diff at all in the most extreme cases, reducing the amount of HTML required to render them as well.
+This was also motivated by usability.
If there are lots of small differences between a very long generated and expected output, the diff view in the feedback table could also become visually overwhelming for students.
*** Development
@@ -858,10 +859,11 @@ Since we are the only deployment of Dodona, releasing every pull request immedia
To ensure that the system is robust to sudden increases in workload and when serving hundreds of concurrent users, Dodona has a multi-tier service architecture that delegates different parts of the application to different servers.
More specifically, the web server, database (MySQL) and caching system (Memcached) each run on their own machine.
In addition, a scalable pool of interchangeable worker servers are available to automatically assess incoming student submissions.
+The deployment of the Python Tutor also saw a number of changes over the years.
The Python Tutor is written in Python, so it could not be part of Dodona itself.
It started out as a Docker container on the same server as the main Dodona web application.
-Because it is used mainly by students who made mistakes, the service responsible for running student code could become overwhelmed and in extreme cases even make the entire server unresponsive.
+Because it is used mainly by students who want to figure out their mistakes, the service responsible for running student code could become overwhelmed and in extreme cases even make the entire server unresponsive.
After we identified this issue, the Python Tutor was moved to its own server.
However, this did not prevent the Tutor itself from becoming overwhelmed, which meant that students who depended on it were sometimes unable to use it.
This of course happened more during periods where the Tutor was being used a lot, such as evaluations and exams.
@@ -899,10 +901,9 @@ In the educational practice that Dodona was born out of, this was an explicit de
We wanted to guide students to use an IDE locally instead of programming in Dodona directly, since if they needed to program later in life, they would not have Dodona available to program in.
This same goal is not present in secondary education.
In that context, the challenge of programming is already big enough, without complicating things by installing a real IDE with a lot of buttons and menus that students will never use.
-Students might also be working on devices that they don't own (PC's in the school), where installing an IDE might not even be possible.
-Solutions like Repl.it provided a simple online IDE, why could Dodona not do so?
+Students might also be working on devices that they don't own (PCs in the school), where installing an IDE might not even be possible.
-Well, there are a few reasons why we were not able to do this.
+There are a few reasons why we could not initially offer a simple online IDE.
Even though we can use a lot of the infrastructure very graciously offered by Ghent University, these resources are not limitless.
The extra (interactive) evaluation of student code was something we did not have the resources for, nor did we have any architectural components in place to easily integrate this into Dodona.
The main goal of this work was thus to provide a client-side Python execution environment we could then include in Dodona.
@@ -977,8 +978,8 @@ CodeMirror is a modern editor made for the web, and not linked to any specific p
It is also extensible and has modular syntax highlighting and linting support.
In contrast with Ace and Monaco, it has very good support for mobile devices.
Its documentation is also very clear and extensive.
Given the clear advantages, we decided to use CodeMirror for Papyros.
The two other main components of Papyros are the output window and the input window.
The output window is a simple read-only textarea.
The input window is a text area that has two modes: interactive mode and batch input.
@@ -1006,7 +1007,7 @@ To work correctly with shared memory, synchronization primitives have to be used
After loading Pyodide, we load a Python script that overwrites certain functions with our versions.
For example, base Pyodide will overwrite =input= with a function that calls into JavaScript-land and executes =prompt=.
-Since we're running Pyodide in a web worker, =prompt= is not available (and in any case, we want to implement custom input handling).
+Since we're running Pyodide in a web worker, =prompt= is not available (and we want to implement custom input handling anyway).
For =input= we actually run into another problem: =input= is synchronous in Python.
In a normal Python environment, =input= only returns a value once the user has entered a line on the command line.
We don't want to edit user code (to make it asynchronous) because that process is error-prone and fragile.
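The function-overriding step can be sketched as follows. This is a hedged illustration: in Papyros the replacement for =input= blocks the web worker (via shared memory and synchronization primitives) until the main thread supplies a line, whereas here a plain in-memory queue stands in for that machinery, and all names are made up.

```python
import builtins

# Queue standing in for the shared-memory channel Papyros actually uses.
_pending_lines: list[str] = []

def provide_input(line: str) -> None:
    # Called by the host environment when the user submits a line.
    _pending_lines.append(line)

def _patched_input(prompt: str = "") -> str:
    # Replacement for the built-in input(): synchronous from the point
    # of view of the running Python code.
    if not _pending_lines:
        # In Papyros this is where the worker would block until the
        # main thread writes a line into shared memory.
        raise RuntimeError("no input available yet")
    return _pending_lines.pop(0)

# Overwrite the built-in, just like the loaded Python script does.
builtins.input = _patched_input
```

The key point is that user code keeps calling a synchronous =input= unchanged; only its implementation is swapped out underneath.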
@@ -1165,7 +1166,7 @@ TESTed was developed to solve two major drawbacks with the current judge system
The goal of TESTed was to implement a judge so that exercises only have to be created once to be available in all programming languages TESTed supports.
An exercise should also not have to be changed when support for a new programming language is added.
As a secondary goal, we also wanted to make it as easy as possible to create new exercises.
-Teachers who have not used Dodona before should be able to create a basic new exercise without too much issues.
+Teachers who have not used Dodona before should be able to create a basic new exercise without too many issues.
*** Overview
:PROPERTIES:
@@ -1220,7 +1221,7 @@ Because of our goal in supporting many programming languages, the format also ha
None of the formats we investigated met all these requirements.
We opted to build the serialization format on JSON as well.
Values are represented by objects containing the encoded value and the accompanying type.
-Note that this is a recursive format: the values of a collection are also serialized according to this specification.
+Note that this is a recursive format: the values in a collection are also serialized according to this specification.
The types that values can have are split into three categories.
The first category consists of the basic types listed above.
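A minimal sketch of such a recursive serialization, with hypothetical type and field names (the exact names in TESTed's JSON schema may differ):

```python
# Every value is encoded as an object carrying its type and its data;
# collections are serialized recursively. Type/field names are made up.
def serialize(value):
    if isinstance(value, bool):
        # bool must be checked before int (bool is a subclass of int).
        return {"type": "boolean", "data": value}
    if isinstance(value, int):
        return {"type": "integer", "data": value}
    if isinstance(value, str):
        return {"type": "text", "data": value}
    if isinstance(value, (list, tuple)):
        kind = "sequence" if isinstance(value, list) else "tuple"
        # Recursion: each element is itself serialized.
        return {"type": kind, "data": [serialize(v) for v in value]}
    raise TypeError(f"unsupported type: {type(value).__name__}")
```

Because every element carries its own type tag, a receiving language runtime can reconstruct the value without guessing types from the raw JSON.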
@@ -1248,7 +1249,7 @@ To support this, specific structures were added to the test plan JSON schema.
We also need to make sure that the given test plan supports the programming language the submission will be executed in.
The two things that are checked are whether a programming language supports all the types that are used and whether the language has all the necessary language constructs.
-For example, if the test plan uses a =tuple=, but the language doesn't support it, it's obviously not possible to evaluate a submission in the at language.
+For example, if the test plan uses a =tuple=, but the language doesn't support it, it's obviously not possible to evaluate a submission in that language.
The same is true for overloaded functions: if it is necessary that a function can be called with a string and with a number, a language like C will not be able to support this.
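A hedged sketch of this compatibility check, with made-up capability tables (TESTed's real configuration of per-language types and constructs looks different):

```python
# Hypothetical per-language capabilities: which serialized types and
# which language constructs each language can handle.
SUPPORTED = {
    "python": {"types": {"integer", "text", "tuple"},
               "constructs": {"overloading"}},
    # C has no tuple type and cannot call one function with both a
    # string and a number.
    "c": {"types": {"integer", "text"}, "constructs": set()},
}

def plan_supported(language: str, used_types: set,
                   used_constructs: set) -> bool:
    # A plan runs in a language only if every used type and construct
    # is supported there (subset checks on both capability sets).
    caps = SUPPORTED[language]
    return (used_types <= caps["types"]
            and used_constructs <= caps["constructs"])
```

Running this check up front lets TESTed reject an incompatible language before any code generation or execution happens.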
*** Execution
@@ -1835,7 +1836,7 @@ Of course sometimes adaptations have to be made given differences in course stru
* Summative feedback
:PROPERTIES:
:CREATED: [2023-10-23 Mon 08:51]
-:CUSTOM_ID: chap:grading
+:CUSTOM_ID: chap:feedback
:END:
This chapter will discuss the history of giving summative feedback in the programming course taught at the Faculty of Sciences at Ghent University and how it informed the development of grading and manual feedback features within Dodona.