Apparently I didn't use the latest version of the pass/fail article :s
parent 4aab8a4887
commit dc74c4385f
3 changed files with 143 additions and 40 deletions
123 book.org
@@ -1544,11 +1544,27 @@ In this case, their predictions were 98.1% accurate, although the sample size wa
While this procedure does not rely on external background information, it has the drawback that data collection is more invasive and directly intervenes with the learning process.
Students cannot work in their preferred programming environment and have to agree to extensive behaviour tracking.

Approaches that do not use machine learning also exist.
[cite/t:@feldmanAnsweringAmRight2019] try to answer the question "Am I on the right track?" on the level of individual exercises, by checking if the student’s current progress can be used as a basis to synthesise a correct program.
However, there is no clear way to transform this type of approach into an estimation of success on examinations.
[cite/t:@werthPredictingStudentPerformance1986] found significant (\(p < 0.05\)) correlations between students' college grades, the number of hours worked, the number of high school mathematics classes and the students' grades for an introductory programming course.
[cite/t:@gooldFactorsAffectingPerformance2000] also looked at learning style (surveyed using LSI2) as a factor in addition to demographics, academic ability, problem-solving ability and indicators of personal motivation.
The regressions in their study account for 42 to 65 percent of the variation in cohort performances.
In this chapter, we present an alternative framework (Figure\nbsp{}[[fig:passfailmethodoverview]]) to predict if students will pass or fail a course within the same context of learning to code.
The method relies only on submission behaviour for programming exercises to make accurate predictions and does not require any prior knowledge or intrusive behaviour tracking.
Interpretability of the resulting models was an important design goal to enable further investigation of learning habits.
We also focused on early detection of at-risk students, because predictive models are only effective for the cohort under investigation if remedial actions can be started long before students take their final exam.
#+CAPTION: Step-by-step process of the proposed pass/fail prediction framework for programming courses: 1) Collect metadata from student submissions during successive course editions.
#+CAPTION: 2) Align course editions by identifying corresponding time points and calculating snapshots at these time points.
#+CAPTION: A snapshot measures student performance only from metadata available in the course edition at the time the snapshot was taken.
#+CAPTION: 3) Train a machine learning model on snapshot data from previous course editions and predict which students will likely pass or fail the current course edition by applying the model on a snapshot of the current edition.
#+CAPTION: 4) Infer what learning behaviour has a positive or negative learning effect by interpreting feature weights of the machine learning model.
#+CAPTION: Teachers can use insights from both steps 3 and 4 to take actions in their teaching practice.
#+NAME: fig:passfailmethodoverview
[[./images/passfailmethodoverview.png]]
The chapter starts with a description of how data is collected, what data is used and which machine learning methods have been evaluated to make pass/fail predictions.
We evaluated the same models and features in multiple courses to test their robustness against differences in teaching styles and student backgrounds.
The results are discussed from a methodological and educational perspective with a focus on

@@ -1578,6 +1594,8 @@ Table\nbsp{}[[tab:passfailcoursestatistics]] summarizes some statistics on the c
#+ATTR_LATEX: :float sideways
#+CAPTION: Statistics for course editions included in this study.
#+CAPTION: The courses are taken by different student cohorts at different faculties and differ in structure, lecturers and teaching assistants.
#+CAPTION: A series is a collection of exercises typically handled in one week/lab session.
#+CAPTION: The number of attempts is the average number of solutions submitted by a student per exercise they worked on (i.e. for which the student submitted at least one solution in the course edition).
#+NAME: tab:passfailcoursestatistics
| course | academic | students | series | exercises | mandatory | submitted | attempts | pass rate |

@@ -1606,19 +1624,22 @@ Solutions submitted during evaluations are automatically graded based on the num
Solutions submitted during exams are manually graded in the same way as for course A.
Each edition of the course is taken by about 400 students.

We opted to use two courses that are structured quite differently to make sure our framework is generally applicable to other courses where the same behavioural data can be collected.
*** Learning environment
:PROPERTIES:
:CREATED: [2023-10-23 Mon 16:28]
:CUSTOM_ID: subsec:passfaillearningenvironment
:END:

Both courses use the same in-house online learning environment.
This online learning environment promotes active learning through problem-solving\nbsp{}[cite:@princeDoesActiveLearning2004].
Each course edition has its own module, with a learning path that groups exercises in separate series (Figure\nbsp{}[[fig:passfailstudentcourse]]).
Course A has one series per covered programming topic (10 series in total) and course B has one series per lab session (20 series in total).
A submission deadline is set for each series.
The learning environment is also used to take tests and exams, within series that are only accessible to participating students.

#+CAPTION: Student view of a module in the online learning environment from which we collected our data, showing two series of six exercises in the learning path of course A.
#+CAPTION: Each series has its own deadline.
#+CAPTION: The status column shows a global status for each exercise based on the last solution submitted.
#+CAPTION: The class progress column visualizes the global status of each exercise for all students subscribed to the course.

@@ -1634,10 +1655,14 @@ All submitted solutions are stored, but only the last submission before the dead
One of the effects of active learning, triggered by exercises with deadlines and automated feedback, is that most learning happens during the semester, as can be seen in the heatmap in Figure\nbsp{}[[fig:passfailheatmap]].

#+CAPTION: Heatmap showing the distribution per day of all 176535 solutions submitted during the 2018--2019 edition of course A.
#+CAPTION: The darker the colour, the more submissions were made on that day. A lighter blue means there were few submissions on that day.
#+CAPTION: A light grey square means that no submissions were made that day.
#+CAPTION: Weekly lab sessions for different groups were organized on Monday afternoon, Friday morning and Friday afternoon.
#+CAPTION: Weekly deadlines for mandatory exercises were on Tuesdays at 22:00.
#+CAPTION: There were four exam sessions for different groups in January.
#+CAPTION: There is little activity in the exam periods, except for days on which there was an exam.
#+CAPTION: The course is not taught in the second semester, so there is very little activity there.
#+CAPTION: Two exam sessions were organized in August/September, granting an extra chance to students who failed their exam in January/February.
#+NAME: fig:passfailheatmap
[[./images/passfailheatmap.png]]

@@ -1649,18 +1674,17 @@ One of the effects of active learning, triggered by exercises with deadlines and
We exported data from the learning environment on all solutions submitted by students during each course edition included in the study.
Each solution has a submission timestamp with precision down to the second and is linked to a course edition, series in the learning path, exercise and student.
We did not use the actual source code submitted by students, but did use the status describing the global assessment made by the learning environment: correct, wrong, compilation error, runtime error, time limit exceeded, memory limit exceeded, or output limit exceeded.
Comparison of student behaviour between different editions of the same course is enabled by computing snapshots for each edition at series deadlines.
Because course editions follow the same structure, we can align their series and compare snapshots for corresponding series.
Corresponding snapshots represent student performance at intermediate points during the semester and their chronology also allows longitudinal analysis within the semester.
Course A has snapshots for the five series of the first unit (labelled S1--S5), a snapshot for the evaluation of the first unit (labelled E1), snapshots for the five series of the second unit (labelled S6--S10), a snapshot for the evaluation of the second unit (labelled E2) and a snapshot for the exam (labelled E3).
Course B has snapshots for the first ten lab sessions (labelled S1--S10), a snapshot for the first evaluation (labelled E1), snapshots for the next series of seven lab sessions (labelled S11--S17), a snapshot for the second evaluation (labelled E2), snapshots for the last three lab sessions (S18--S20) and a snapshot for the exam (labelled E3).
It is important to stress that a snapshot of a course edition measures student performance only using the information available at the time of the snapshot.
As a result, the snapshot does not take into account submissions after its timestamp.
Note that the last snapshot taken at the deadline of the final exam takes into account all submissions during the course edition.
The behaviour of a student can then be expressed as a set of features extracted from the raw submission data.
We identified different types of features (see Appendix\nbsp{}[[Feature types]]) that indirectly quantify certain behavioural aspects of students practising their programming skills.
When and how long do students work on their exercises?
Can students correctly solve an exercise and how much feedback do they need to accomplish this?
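
To make this concrete, the sketch below shows how such feature values could be computed from the raw submission metadata; the data frame layout and the feature type =subm= are illustrative assumptions, not the exact implementation, and only =wrong= corresponds to a feature type discussed later in this chapter.

#+BEGIN_SRC python
import pandas as pd

# Hypothetical export of submission metadata: one row per submitted solution.
submissions = pd.DataFrame({
    "student":   ["alice", "alice", "bob", "bob", "bob"],
    "series":    ["S1", "S1", "S1", "S2", "S2"],
    "status":    ["wrong", "correct", "correct", "runtime error", "wrong"],
    "timestamp": pd.to_datetime([
        "2018-09-24 13:05", "2018-09-24 13:12", "2018-09-25 20:30",
        "2018-10-02 21:55", "2018-10-09 11:00",
    ]),
})

def snapshot_features(submissions, snapshot_time):
    """Compute per-student feature values per series, using only the
    submissions made before the snapshot was taken."""
    before = submissions[submissions["timestamp"] <= snapshot_time]
    grouped = before.groupby(["student", "series"])
    features = pd.DataFrame({
        # number of submissions in the series (illustrative feature type)
        "subm": grouped.size(),
        # number of wrong submissions in the series (feature type `wrong`)
        "wrong": grouped["status"].agg(lambda s: (s == "wrong").sum()),
    })
    # one row per student, one column per (feature type, series) pair
    return features.unstack("series", fill_value=0)

print(snapshot_features(submissions, pd.Timestamp("2018-10-02 22:00")))
#+END_SRC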

@@ -1676,7 +1700,7 @@ In addition, the snapshot also contains a feature for the average of each featur
We do not use observations per individual exercise, as the actual exercises might differ between course editions.
Snapshots taken at the deadline of an evaluation or later also contain the score a student obtained for the evaluation.
These features of the snapshot can be used to predict whether a student will ultimately pass or fail the course.
In addition, the snapshot also contains a label indicating whether the student passed or failed, which is used during training and testing of the classification algorithms.
Students who did not take part in the final examination automatically fail the course.
Since course B has no hard deadlines, we left out deadline-related features from its snapshots (=first_dl=, =last_dl= and =nr_dl=; see Appendix\nbsp{}[[Feature types]]).
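
In the implementation this simply amounts to dropping the corresponding columns from the feature matrix before training; a minimal sketch, assuming a pandas data frame =X= whose column names start with the feature type:

#+BEGIN_SRC python
# Deadline-related features are meaningless without hard deadlines, so we
# drop them (e.g. columns named "first_dl_S1", ...) for course B.
deadline_types = ("first_dl", "last_dl", "nr_dl")
X_course_b = X.drop(columns=[c for c in X.columns if c.startswith(deadline_types)])
#+END_SRC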

@@ -1689,7 +1713,7 @@ To investigate the impact of deadline-related features, we also made predictions
:END:
We evaluated four classification algorithms to make pass/fail predictions from student behaviour: stochastic gradient descent\nbsp{}[cite:@fergusonInconsistentMaximumLikelihood1982], logistic regression [cite:@kleinbaumIntroductionLogisticRegression1994], support vector machines [cite:@cortesSupportVectorNetworks1995], and random forests [cite:@svetnikRandomForestClassification2003].
We used implementations of these algorithms from =scikit-learn=\nbsp{}[cite:@pedregosaScikitlearnMachineLearning2011] and optimized model parameters for each algorithm by cross-validated grid search over a parameter grid.
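
As an illustration of this optimization step, a cross-validated grid search in =scikit-learn= could look as follows; the parameter grid and the =X_train=/=y_train= matrices are placeholders, not the exact setup we used:

#+BEGIN_SRC python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Tune the regularization strength of logistic regression with a
# cross-validated grid search; X_train holds snapshot features and
# y_train the pass/fail labels of previous course editions.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
search = GridSearchCV(
    pipeline,
    param_grid={"logisticregression__C": [0.01, 0.1, 1, 10]},
    scoring="balanced_accuracy",
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
#+END_SRC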

Readers unfamiliar with machine learning can think of these specific algorithms as black boxes, but we briefly explain the basic principles of classification to aid their understanding.
Supervised learning algorithms use a dataset that contains both inputs and desired outputs to build a model that can be used to predict the output associated with new inputs.

@@ -1705,7 +1729,7 @@ As we have data from three editions of each course, the largest possible trainin
We also made predictions for a snapshot using each of its corresponding snapshots as individual training sets to see if we can still make accurate predictions based on data from only one other course edition.
Finally, we also made predictions for a snapshot using 5-fold cross-validation to compare the quality of predictions based on data from the same or another cohort of students.
Note that the latter strategy cannot be used to make predictions in practice, because pass/fail results are not yet available as training labels when snapshots are taken during the semester.
In practice, to make predictions for a snapshot, we can rely only on corresponding snapshots from previous course editions.
However, because we can assume that different editions of the same course yield independent data, we also used snapshots from future course editions in our experiments.
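
The two training strategies can be sketched as follows, assuming aligned feature matrices and pass/fail labels per course edition (the variable names are hypothetical):

#+BEGIN_SRC python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

model = LogisticRegression(max_iter=1000)

# Cross-edition prediction: train on a snapshot of another edition and
# predict the corresponding snapshot of the current edition.
model.fit(X_edition_2017, y_edition_2017)
predictions = model.predict(X_edition_2018)

# 5-fold cross-validation within one edition: only a reference point,
# since pass/fail labels are not yet known during the semester.
scores = cross_val_score(model, X_edition_2018, y_edition_2018,
                         cv=5, scoring="balanced_accuracy")
#+END_SRC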
There are many metrics that can be used to evaluate how accurately a classifier predicted which students will pass or fail the course from the data in a given snapshot.

@@ -1725,6 +1749,20 @@ The F_1-score is the harmonic mean of precision and recall.
If we go back to our example, the optimistic classifier that consistently predicts that all students will pass the course and thus fails to identify any failing student will have a balanced accuracy of 50% and an F_1-score of 75%.
Under the same circumstances, a pessimistic classifier that consistently predicts that all students will fail the course has a balanced accuracy of 50% and an F_1-score of 0%.
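
Both baselines are easy to verify with =scikit-learn=; a minimal sketch with a hypothetical cohort of ten students of which six pass (a 60% pass rate, consistent with the 75% F_1-score mentioned above):

#+BEGIN_SRC python
from sklearn.metrics import balanced_accuracy_score, f1_score

y_true = [1] * 6 + [0] * 4   # 1 = pass, 0 = fail
optimistic = [1] * 10        # predicts that every student passes
pessimistic = [0] * 10       # predicts that every student fails

print(balanced_accuracy_score(y_true, optimistic))    # 0.5
print(f1_score(y_true, optimistic))                   # 0.75
print(balanced_accuracy_score(y_true, pessimistic))   # 0.5
print(f1_score(y_true, pessimistic, zero_division=0)) # 0.0
#+END_SRC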
*** Pass/fail predictions
:PROPERTIES:
:CREATED: [2024-01-22 Mon 17:17]
:END:

In summary, Figure\nbsp{}[[fig:passfailmethodoverview]] outlines the entire flow of the proposed pass/fail prediction framework.
It starts by extracting metadata for all submissions students made so far within a course (timestamp, status, student, exercise, series) and collecting their marks on intermediate tests and final exams (step 1).
In practice, applying the framework on a student cohort in the current course edition only requires submission metadata and pass/fail outcomes from student cohorts in previous course editions.
Successive course editions are then aligned by identifying fixed time points throughout the course where predictions are made, for example at submission deadlines, intermediate tests or final exams (step 2).
We conducted a longitudinal study to evaluate the accuracy of pass/fail predictions at successive stages of a course (step 3).
This is done by extracting features from the raw submission metadata of one or more course editions and training machine learning models that can identify at-risk students during other course editions.
Our scripts that implement this framework are provided as supplementary material.
Teachers can also interpret the behaviour of students in their class by analysing the feature weights of the machine learning models (step 4).
** Results and discussion
:PROPERTIES:
:CREATED: [2023-10-23 Mon 16:55]

@@ -1777,6 +1815,13 @@ Nearly halfway through the semester, before the first evaluation, we see an aver
After the first evaluation, we can make predictions with a balanced accuracy between 75% and 80% for both courses.
The predictions for course B stay within this range for the rest of the semester, but for course A we can consistently make predictions with an average balanced accuracy of 80% near the end of the semester.

Compared to the accuracy results of\nbsp{}[cite/t:@kovacicPredictingStudentSuccess2012], we see a 15--20% increase for our balanced accuracy results.
Our balanced accuracy results are similar to the accuracy results of\nbsp{}[cite/t:@livierisPredictingSecondarySchool2019], who used semi-supervised machine learning.
[cite/t:@asifAnalyzingUndergraduateStudents2017] achieve an accuracy of about 80% when using one cohort for training and another cohort for testing, which is again similar to our balanced accuracy results.
All of these studies used prior academic history as the basis for their methods, which we do not use in our framework.
Our results are also similar to those of\nbsp{}[cite/t:@vihavainenPredictingStudentsPerformance2013], but we do not have to rely on data collection that interferes with the learning process.
Note that we are comparing the basic accuracy results of prior studies with the more reliable balanced accuracy results of our framework.

F_1-scores follow the same trend as balanced accuracy, but the trend is even more pronounced: they start lower and end higher.
They show another sharp improvement of predictive performance for both courses when students practise their programming skills in preparation of the final exam (snapshot E3).
This underscores the need to keep organizing final summative assessments as catalysts of learning, even for courses with a strong focus on active learning.

@@ -1797,6 +1842,10 @@ This missing data and associated features had no impact on the performance of th
Deliberately dropping the same feature types for course A also had no significant effect on the performance of predictions, illustrating that it is during the training phase that classification algorithms decide for themselves how individual features contribute to the predictions.
This frees us from having to determine the importance of features beforehand, allows us to add new features that might contribute to predictions even if they correlate with other features, and makes it possible to investigate afterwards how important individual features are for a given classifier (see Section\nbsp{}[[Interpretability]]).
Even though the structure of the courses is quite different, our method achieves high accuracy results for both courses.
The predictions for course A with the reduced feature set are also still accurate.
This indicates that the method should be generalizable to other courses where similar data can be collected, even if the structure is quite different or some features cannot be calculated due to the course structure.
*** Early detection
:PROPERTIES:
:CREATED: [2023-10-23 Mon 17:05]

@@ -1807,7 +1856,7 @@ Accuracy of predictions systematically increases as we capture more of student b
But surprisingly, we can already make quite accurate predictions early in the semester, long before students take their first evaluation.
Because of the steady trend, predictions for course B at the start of the semester are already reliable enough to serve as input for student feedback or teacher interventions.
It takes some more time to identify at-risk students for course A, but from week four or five onwards the predictions may also become an instrument to design remedial actions for this course.
Hard deadlines and graded exercises strongly enforce active learning behaviour on the students of course A, and might somewhat disguise students' motivation to work on their programming skills.
This might explain why it takes a bit longer to properly measure student motivation for course A than for course B.
*** Interpretability

@@ -1846,7 +1895,9 @@ Although the difficulty of evaluation exercises is lower than those of exam exer
Also note that these features only show up in snapshots taken at or after the corresponding evaluation.
They have zero impact on predictions for earlier snapshots, as the information is not available at the time these snapshots are taken.
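
As a sketch of how this interpretation works: in a logistic regression model, each coefficient is the change in log-odds of passing per unit increase of the corresponding (standardized) feature, so exponentiating a weight gives the multiplicative change in the odds. Assuming a fitted =scikit-learn= model =model= and a matching list =feature_names= (both hypothetical names):

#+BEGIN_SRC python
import numpy as np

# Pair every feature with its learned weight; positive weights (reds in
# the figures below) increase the odds of passing, negative weights
# (blues) decrease them.
weights = dict(zip(feature_names, model.coef_[0]))
for name, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{name:>25}  odds ratio: {np.exp(weight):.2f}")
#+END_SRC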
#+CAPTION: Importance of evaluation scores in the logistic regression models for course A (full feature set).
#+CAPTION: Reds mean that a growth in the feature value for a student increases the odds of passing the course for that student.
#+CAPTION: The darker the colour, the larger this increase will be.
#+NAME: fig:passfailfeaturesAevaluation
[[./images/passfailfeaturesAevaluation.png]]

@@ -1873,7 +1924,11 @@ An exception to this pattern are the few red squares forming a diagonal in the m
These squares correspond to exercises that are solved as soon as they become available, as opposed to waiting for the deadline.
A possible explanation for these few slightly positive weights is that these exercises are solved by highly motivated, top students.
#+CAPTION: Importance of feature type =correct_after_15m= (the number of exercises in a series where the first correct submission was made within fifteen minutes after the first submission) in logistic regression models for course A (full feature set).
#+CAPTION: Reds mean that a growth in the feature value for a student increases the odds of passing the course for that student.
#+CAPTION: The darker the colour, the larger this increase will be.
#+CAPTION: Blues mean that a growth in the feature value decreases the odds.
#+CAPTION: The darker the colour, the larger this decrease will be.
#+NAME: fig:passfailfeaturesAcorrect
[[./images/passfailfeaturesAcorrect.png]]

@@ -1884,7 +1939,11 @@ The lecturer and teaching assistants identify the topics covered in series 2 and
However, it does not feel very intuitive that being stuck with logical exercises longer than other students either inhibits the odds for passing on topics that are extremely hard or easy, or promotes the odds on topics with moderate difficulty.
This shows that interpreting the importance of feature types is not always straightforward.
#+CAPTION: Importance of feature type =wrong= (the number of wrong submissions in a series) in logistic regression models for course A (full feature set).
#+CAPTION: Reds mean that a growth in the feature value for a student increases the odds of passing the course for that student.
#+CAPTION: The darker the colour, the larger this increase will be.
#+CAPTION: Blues mean that a growth in the feature value decreases the odds.
#+CAPTION: The darker the colour, the larger this decrease will be.
#+NAME: fig:passfailfeaturesAwrong
[[./images/passfailfeaturesAwrong.png]]

@@ -1895,6 +1954,8 @@ The fact that the second evaluation is scheduled a little bit earlier in the sem
However, we do not see a similar increase of the global performance metrics around the second evaluation of course B, as we see for the first evaluation.

#+CAPTION: Importance of evaluation scores in the logistic regression models for course B.
#+CAPTION: Reds mean that a growth in the feature value for a student increases the odds of passing the course for that student.
#+CAPTION: The darker the colour, the larger this increase will be.
#+NAME: fig:passfailfeaturesBevaluation
[[./images/passfailfeaturesBevaluation.png]]

@@ -1915,12 +1976,22 @@ So making mistakes is beneficial for learning, but it depends on what kind of mi
|
Although we hinted at the same conclusions there as for course B, the signals were less consistent.
|
|
|
||||
The reasons for these differences depend on the content of the course, which requires knowledge of the course contents to interpret correctly.
#+CAPTION: Importance of feature type =comp_error= (the number of submissions with compilation errors in a series) in logistic regression models for course B.
#+CAPTION: Reds mean that a growth in the feature value for a student increases the odds of passing the course for that student.
#+CAPTION: The darker the colour, the larger this increase will be.
#+CAPTION: Blues mean that a growth in the feature value decreases the odds.
#+CAPTION: The darker the colour, the larger this decrease will be.
#+NAME: fig:passfailfeaturesBcomp
[[./images/passfailfeaturesBcomp.png]]
#+CAPTION: Importance of feature type =wrong= (the number of wrong submissions in a series) in logistic regression models for course B.
#+CAPTION: Reds mean that a growth in the feature value for a student increases the odds of passing the course for that student.
#+CAPTION: The darker the colour, the larger this increase will be.
#+CAPTION: Blues mean that a growth in the feature value decreases the odds.
#+CAPTION: The darker the colour, the larger this decrease will be.
#+NAME: fig:passfailfeaturesBwrong
[[./images/passfailfeaturesBwrong.png]]

@@ -1935,7 +2006,7 @@ The framework already yields high-accuracy predictions early on in the semester
Being able to identify at-risk students early on in the semester opens windows for remedial actions to improve the overall success rate of students.

Because of the differences in course structure, we validated the framework by building separate classifiers for the two courses, but we used the same set of features for training the models.
The results showed that submission metadata from previous student cohorts can be used to make predictions about the current cohort of students, even if course editions use different sets of exercises, or the courses are structured differently.
Making predictions requires aligning snapshots between successive editions of a course, where students have the same expected progress at corresponding snapshots.
Historical metadata from a single course edition suffices if group sizes are large enough.
Different classification algorithms can be plugged into the framework, but logistic regression resulted in the best-performing classifiers.

@@ -1944,6 +2015,16 @@ Apart from their application to make pass/fail predictions, an interesting side
Visualization of feature importance turned out to be a useful instrument for linking individual feature types with student behaviour that promotes or inhibits learning.
We applied this interpretability to some important feature types that emerged for the two courses included in this study.
Our study has several strengths and promising implications for future practice and research.
First, we were able to predict success based on historical metadata from earlier cohorts, and we are able to do so early in the semester.
In addition, the accuracy of our predictions is similar to that of earlier efforts\nbsp{}[cite:@asifAnalyzingUndergraduateStudents2017; @vihavainenPredictingStudentsPerformance2013; @kovacicPredictingStudentSuccess2012], while we do not use prior academic history or interfere with the students’ usual learning workflows.
However, there are also some limitations and work for the future.
While our visualizations of the features (Figures\nbsp{}[[fig:passfailfeaturesAcorrect]]\nbsp{}through\nbsp{}[[fig:passfailfeaturesBwrong]]) are helpful to indicate which features are important at which stage of the course in view of increasing versus decreasing the odds of passing the course, they must not be oversimplified and need to be carefully interpreted and placed into context.
This is where the expertise and experience of teachers come in.
These visualizations can be interpreted by teachers and further contextualized towards the specific course objectives.
For example, teachers know the content and goals of every series of exercises, and they can use the information presented in our visualizations in order to investigate why certain series of exercises are more or less important in view of passing the course.
In addition, they may use the information to further redesign their course.
We can thus conclude that the proposed framework achieves the objectives set for accuracy, early prediction and interpretability.
Having this new framework at hand immediately raises some follow-up research questions that call for further exploration:
#+ATTR_LATEX: :environment enumerate*