In addition to the basic algorithmic and mathematical elements of scientific visualization, this class also strives to teach some aspects of software development, particularly in the setting of using a low-level language like C for computationally intensive problems. How to grade this kind of code is its own challenge. This page outlines the approach this class has settled on, with the goals of fairness (everyone is graded the same), meaningfulness (grading measures significant not accidental code properties), relevance (what's learned can be used outside this class), and scalability (process works for large classes).
Note that in 2023 and earlier offerings of the class, this information was conveyed via one of the pre-recorded videos ("2A-CodeGrade"), but this page is faster to read and easier to update.
Overview
Code grading in this class is organized into three aspects of consideration, which seek to answer useful questions about your code:
- Correctness: Did your code compute correct results, within a reasonable amount of time? This is based on running the grade/go automated tests on CSIL Macs. Additional correctness tests may be quantitatively timed, meaning that the execution time itself is graded.
- Cleanliness: How cleanly does your code compile and run? In general, your code won't live on an island: it will be used by other people, in bigger contexts. Cleanliness tries to measure how well your code plays with others. Memory leaks are one example of a cleanliness deduction. This is evaluated by running the grade/clean automated tests on linux.cs.uchicago.edu.
- Thoughtfulness: Is this code the product of your own careful and intentional work? Uncommented code has low thoughtfulness, as does code generated by hurried copy-and-pasting. This is evaluated by a human reading your code and editing the grade/thought file, in which you can read the items being considered.
These three components are combined into the project grade as follows:
- If correctness is below 60%, the grade is simply correctness. This happens when you weren't able to start the assignment soon enough to complete all the important parts.
- If the correctness is 100%, the grade is 70% correctness, 15% cleanliness, and 15% thoughtfulness.
- Else, the grade is a linear blend (via lerp) between the 60% and 100% cases, as sketched in the code below.
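To make the blend concrete, here is a minimal sketch in C; the function and variable names are hypothetical, and the actual grading scripts may compute this differently:

```c
#include <stdio.h>

/* linear interpolation between a (at t=0) and b (at t=1) */
static double lerp(double aa, double bb, double tt) {
    return (1 - tt)*aa + tt*bb;
}

/* Hypothetical sketch of the grade blend described above; crct, clean,
   and thgt are fractions in [0,1], and all names here are made up. */
static double projectGrade(double crct, double clean, double thgt) {
    if (crct < 0.6) {
        return crct;  /* below 60% correctness: grade is correctness alone */
    }
    /* the 70/15/15 weighting used when correctness is 100% */
    double full = 0.70*crct + 0.15*clean + 0.15*thgt;
    /* blend parameter: 0 at 60% correctness, 1 at 100% */
    double tt = (crct - 0.6)/(1.0 - 0.6);
    return lerp(crct, full, tt);
}

int main(void) {
    printf("grade = %g\n", projectGrade(0.85, 0.9, 1.0));
    return 0;
}
```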
It is educational to scrutinize and evaluate the execution of your own low-level code, on your own laptop, without having to do everything on some remote server. To facilitate this, pre-compiled reference code is provided to you, both for x86-based Linux and for Macs with x86 or ARM (e.g. M2) chips, so that work on improving your correctness grade can be done on your own laptop. Windows itself is not supported, but students in the past have successfully used Windows Subsystem for Linux (WSL). It is very likely that if you have a laptop, you can use it for most of the development work of each project (writing the initial versions of your code, and making sure it basically works), and you are encouraged to do so. Still, so that there is a consistent reference platform used for grading everyone, correctness will be evaluated on CSIL Macs, and cleanliness will be evaluated on linux.cs.uchicago.edu. It is good experience to learn how compilation works in those two different settings, and to be able to manage checkouts of your code in multiple places, including your local machine.
Every coding project comes with a grade sub-directory, which contains everything about code grading. For correctness grading a new executable is created within grade, which calls into both your code and the reference code, to compare results and execution times. By design, none of the given files that control grading have a "." in the filename, but the many files generated by grading do have ".". So, to clean up all the grading results, you can do something as bold as:
rm -rf grade/*.*

or

rm -rf grade/plotr grade/*.*

to also remove the (for example, in Project 1) grade/plotr grading executable. The "grade/" prefix is important; it limits the file removal to things in the grade subdirectory. You should not svn add any new files in grade (our grading will ignore them).
Correctness Details
A compiled language like C is a powerful tool for fast and efficient computation, but it is unforgiving and hence requires practice to use well. This class gives you a lot of practice using C. Still, if your code doesn't compile when you hand it in, we can't grade what your code does, and you may get a zero on correctness (for all tests), and hence a zero on the whole assignment. There is a 0-nothanks.sh script at the top level of every programming project that may help detect compilation failures; the name of the script can be interpreted as "Would I like a 0 on my project? No thanks!". Run it with: ./0-nothanks.sh
If a test makes your code crash (i.e. it has a segmentation fault or a bus error), you will get negative correctness points on that test. Crashes are detected and clearly reported by the grading framework to help steer you towards writing code that doesn't crash.
If your code works great, running grade/go on CSIL Macs versus linux.cs.u.e (versus your own machine) will probably give the same results. However, buggy code often manifests different problems on different platforms, including possibly crashing on one platform but not the other (hence the need for reference platforms for grading). Also, the quantitative timing tests that are usually part of correctness (from grade/go time) may have a hard time finishing on linux.cs.u.e. The higher contention for those machines' CPUs means that it is harder to reliably time the code execution.
Nearly all correctness tests give partial credit: useful work toward a correct result should be rewarded. How correctness is assessed for each test depends on the kind of the test's computational result. The variety of computational outputs in our class code (integers, floating point scalars and vectors, polylines, RGB and RGBA images, etc.) motivated the creation of the grading framework we use. If the computational result is a single floating point number, the count of ULPs between your result and the reference result determines correctness. Vectors of numbers may be assessed by the maximum per-component difference. If the computational result is an image, correctness grading produces triptychs that you can inspect to compare your result with the reference result.
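As an illustration of the ULP-based comparison, here is a sketch of one common way to count the ULPs between two floats; this shows the general technique, not necessarily the exact code the grading framework uses:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Map a float's bit pattern onto a line of integers that increases
   monotonically with the float's value; -0.0 and +0.0 both map to 0. */
static int64_t orderedBits(float ff) {
    int32_t ii;
    memcpy(&ii, &ff, sizeof(ii)); /* reinterpret bits without UB */
    return (ii < 0) ? (int64_t)INT32_MIN - ii : (int64_t)ii;
}

/* ULP distance: how many representable floats lie between aa and bb */
static int64_t ulpDist(float aa, float bb) {
    return llabs(orderedBits(aa) - orderedBits(bb));
}
```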
Correctness test sets
The correctness tests are grouped into sets organized around program functionality. For Project 1, the test sets include lerp, util, and plot. The test sets are listed in the final column of the todo command (e.g. ./plotr todo, also in ./plotr about). The correctness score is the sum of correctness over all test sets. Every effort has been made to make the test sets ordered and cumulative, so that getting full points on earlier test sets means that a failure in a later test set indicates a bug in the code specific to that later test set, instead of in the code covered by earlier test sets. As projects become more complicated, however, this is harder to ensure: attentiveness and patience are always required for coding and debugging. Note that some projects may include an extra "reserved" or "hidden" correctness test set, which is not distributed to the students with the project. The functionality exercised in the extra test set will, however, be described and exercised in the project web page (which is a reason to read through and understand the project web page, and run the commands therein with your code).
When you run grade/go it first creates the grading executable (e.g. grade/plotr), and then uses this executable to run all the commands in all the test sets (except for the quantitatively timed tests, see next section). Or, you can run a single test set with, e.g. grade/go lerp.
Test set foo is represented by test file grade/T-foo (no file extension). When grade/go executes the commands in grade/T-foo, it prints some feedback to the terminal, and saves information in some files. The terminal output might look like:
> ./grade/go plot
> ./grade/go: running "plot" tests
.....:!!..::::::::!!!!.......:..
::!!::..:..::::..:..::::..:..::..

What the single characters mean:
- "." = test passed
- ":" = close but not completely passing; partial credit
- "!" = test failed (too far from correct); no credit
- "X" = crashed (e.g. segfault); negative credit
- "_" = faster than reference; a little extra credit
Work on an assignment should start with reading through the project web page. Then work typically involves repeated runs of grade/go, likely limited to the single test set for whichever functions you are coding, and then looking at the start of the -sum.txt file to see the first test that is not passing (i.e. which command highlights a difference between your code and the reference code). You can copy and paste these commands from the -sum.txt file into your terminal to run your code in isolation on that particular test.
Test time-out versus quantitatively timed tests
All correctness tests have a (generous) time-out: if your code just hangs without ever generating a result, or if it takes multiple seconds to run while the reference code can do it (on the same platform) in a fraction of a second, then you get zero points for correctness on that test.
Most projects also have a separate set of quantitatively timed tests, in the grade/T-time test file, and accessed via "grade/go time". In addition to the basic requirement that the code finish within a reasonable time, here the execution time is itself graded. The test code is thus run many, many times to assess the execution time to within some precision (we use the median execution time of the fastest N runs, once the spread of those N runs falls below some threshold). This is done for both reference code and student code, on the same platform, to determine the ratio of the execution time of your code, versus that of the reference code.
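Here is a hedged sketch of that kind of timing protocol; the names, the value of N, and the spread threshold are made-up stand-ins, not the framework's actual values:

```c
#include <stdlib.h>
#include <time.h>

#define NFAST 5      /* how many fastest runs must agree (hypothetical) */
#define MAXRUNS 200  /* give up on convergence after this many runs */
#define SPREAD 0.05  /* required relative spread of the NFAST fastest */

static int cmpDouble(const void *aa, const void *bb) {
    double dd = *(const double *)aa - *(const double *)bb;
    return (dd > 0) - (dd < 0);
}

/* Time work() repeatedly; return the median of the NFAST fastest runs
   once their spread (relative to the fastest) falls below SPREAD. */
static double medianTime(void (*work)(void)) {
    double times[MAXRUNS];
    int runs = 0;
    while (runs < MAXRUNS) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        work();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        times[runs++] = (t1.tv_sec - t0.tv_sec)
                        + (t1.tv_nsec - t0.tv_nsec)/1e9;
        if (runs >= NFAST) {
            qsort(times, runs, sizeof(double), cmpDouble);
            if (times[0] > 0
                && (times[NFAST-1] - times[0])/times[0] < SPREAD) {
                break;
            }
        }
    }
    return times[NFAST/2];  /* median of the NFAST fastest runs */
}
```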
Measuring the execution time ratio of your versus reference code is itself a slow process, which would get annoying if it happened every time you ran grade/go. So the quantitatively timed tests are only run with grade/go time (just the timed tests), or grade/go all (the non-timed tests and the timed tests).
The execution time ratio determines a scaling of the non-timed correctness test score, according to a graphed function with three regions (also sketched in code after this list):
- The flat portion of the graph: there is no deduction if your code is as fast as the reference code, or is a little slower (e.g. up to 2.5 times longer execution time)
- The long slope down to the right: if your code is significantly slower, (e.g. more than 2.5 times slower), the points decrease exponentially (until hitting the time-out, with 0 points)
- The short ramp on the left: If your code is getting the correct result, and is appreciably faster than the reference code, then you get a little bit of extra credit. Implementation decisions in the reference code prioritize correctness, not speed, which can provide a low bar for your code to be faster. But given how few points are actually involved, earning this extra credit is more about bragging rights than any significant difference in your course grade. There are likely more important things to focus on, inside and outside of this class.
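Here is a hypothetical sketch of such a scaling function; the 2.5 knee matches the description above, but the bonus size and decay rate are invented for illustration:

```c
#include <math.h>

/* Scale factor applied to the non-timed correctness score, as a
   function of ratio = (your time)/(reference time). The 5% bonus and
   the decay rate are made-up stand-ins, not the grading code's values. */
static double timeScale(double ratio) {
    if (ratio < 1.0) {
        /* short ramp on the left: a little extra credit for faster code */
        return 1.0 + 0.05*(1.0 - ratio);
    } else if (ratio <= 2.5) {
        return 1.0;               /* flat portion: no deduction */
    } else {
        return exp(2.5 - ratio);  /* exponential drop-off to the right */
    }
}
```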
Cleanliness Details
The cleanliness grade component is assessed with the grade/clean script, running on linux.cs.uchicago.edu. There "which cc" says "/usr/bin/cc" and "cc --version" says "cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0". As with correctness, there are different tests, but for cleanliness everything is self-contained in grade/clean. The typical cleanliness tests are:
- twsp: Do you have lines of code with trailing white space?
- warn: Does your code compile without warnings?
- symb: Does your code create new linkable symbols when it shouldn't, or linkable symbols not prefixed by the library name? (See the sketch after this list.)
- risd: Does your code avoid trying to increase numerical precision by specifically using the double type, and does it still work when the real typedef is changed to double instead of the default float?
- impf: Does your code compute things in double-precision via the use of implicit conversions to double-precision? This is specific to this SciVis class, and ensures that we are all operating under the same limits of floating-point precision.
- mchk: This uses valgrind --tool=memcheck to monitor how your code uses memory, to flag a variety of issues, including memory leaks. Periodically throughout your coding work, and immediately after you notice either a segfault or any non-determinism, you should be using grade/clean mchk to see if there are issues that Valgrind can see.
- hlgd: For only those projects that involve pthreads, this uses valgrind --tool=helgrind to monitor how your code uses threads.
- cnum: Do you hard-code C enum values that should instead be left as legible names? By the way this operates (compiling and running your code with different numerical values for the enum values, and checking that the results are the same), this test can also become a red flag for some other non-determinism that has eluded the mchk test.
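To make a few of these concrete, here is a hedged C sketch of the conventions that the symb, risd, impf, and cnum tests reward; the "plt" prefix, the function names, and the enum are hypothetical stand-ins for actual project identifiers:

```c
/* risd: compute with the "real" type, which grading may re-typedef to
   double; don't hard-code double just to buy extra precision */
typedef float real;

/* cnum: refer to enum values by name; never hard-code the integer 1
   where pltKernelBox is meant */
typedef enum {
    pltKernelUnknown,
    pltKernelBox,
    pltKernelTent
} pltKernel;

/* symb: file-local helpers are static, so they create no new linkable
   symbol that could collide with someone else's code */
static real sq(real xx) { return xx*xx; }

/* symb: externally visible functions carry the library prefix */
real pltNorm2(real xx, real yy) { return sq(xx) + sq(yy); }

/* impf: use float literals (2.0f, not 2.0) so that arithmetic is not
   silently promoted to double-precision */
real pltHalf(real xx) { return xx/2.0f; }
```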
All results of cleanliness tests are saved in a single file: grade/clean-sum.txt, and this includes information or rationale about the test. For twsp, for example, it says:
99 = twsp points: 1 line has trailing whitespace, as follows:
poly.c:36:    ret = 3; $
Figure out how to highlight trailing whitespace in your editor so that you can remove it as soon as it is accidentally created, or use an editor extension that automatically removes it upon saving. Trailing whitespace is bad because changing it creates cryptic file differences when using any version control system, which wastes your colleagues' time and good will.
Thoughtfulness Details
This is the only part of the grading that is based on a human reading your code (though we will also be reading your code to help answer questions you have about it, prior to the deadline). The items of thoughtfulness consideration are listed in grade/thought, which is itself a simple script that produces a grade/thought-sum.txt file when finished. You should read through grade/thought to see what we have in mind when we read your code. Some of the items will be specific to each project, often to verify that you followed some particular instructions for a function (where it would be more cumbersome to try to assess that conformance via an automated test).
Useful descriptive comments are an important thoughtfulness item. Different classes (and different employers) may have their own standards for what constitutes good commenting, but this class has arrived at the following description:
- Imagine some "average" student in this SciVis class, as they were prior to starting this assignment. The student has about the same math and coding knowledge as others, but is not aware of the purpose of this project, or the technical elements (structs, functions) in this code.
- This student is sitting next to you, eagerly awaiting insight.
- At each major step of the code, what is the most concise thing you can say to help this fellow student understand both your code and your thinking behind it?
As a very rough guideline, aim for about 1 comment line for every 4-10 lines of code. Providing a blow-by-blow but mindless narration of the progress of your code, turning every line of code into an equivalent English sentence, is not helpful (and it could just as well have been written by a simple AI). The goal is to reflect your own thoughtful understanding that allowed you to write this code, with an emphasis more on how than what. Detailed multi-paragraph expositions are also less helpful, because they aren't concise. Writing good comments requires leaving time for editing and re-writing them. Short little clarifications within a line (with /* */) or at the end of a line (with //) can be helpful.
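For example (a hypothetical snippet; the names are invented), the difference between narration and insight might look like:

```c
/* Mindless narration restates the code without adding understanding:
     // subtract xmin from xx, divide by spacing, cast to int
   A thoughtful comment instead records the reasoning behind the line: */

/* which grid cell contains xx: subtracting the grid origin before the
   (int) truncation makes the rounding go toward the lower cell edge */
static int cellIndex(float xx, float xmin, float spacing) {
    return (int)((xx - xmin)/spacing);
}
```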
Avoiding copy-pasta is also important for thoughtfulness. Copy-pasta refers to sections of code where some functionality has been created by literal repetition of the code, with only minor alterations between repetitions. This is fast to write, but annoying for others to read, and too fragile to maintain in the long term. Copy-pasta can arise from multi-line blocks of code, or from long expressions (e.g. ctx->sub->img->data.ss[i]). Use helper functions, or new local variables, to isolate the common elements, so that the parts that change are more legible.
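For instance, here is a hypothetical before-and-after; the struct chain and names are invented to mimic the kind of expression above:

```c
/* hypothetical types, mimicking a nested pointer chain like
   ctx->sub->img->data.ss in real project code */
typedef struct { short ss[3]; } Data;
typedef struct { Data data; } Img;
typedef struct { Img *img; } Sub;
typedef struct { Sub *sub; } Ctx;

/* Copy-pasta version (typo-prone; imagine ten of these lines):
     out[0] = ctx->sub->img->data.ss[0]*scl + off;
     out[1] = ctx->sub->img->data.ss[1]*scl + off;
     out[2] = ctx->sub->img->data.ss[2]*scl + off;
   Repaired: a local variable isolates the common long expression,
   and a loop makes the varying part explicit. */
static void scaleAll(float *out, const Ctx *ctx, float scl, float off) {
    const short *ss = ctx->sub->img->data.ss;
    for (unsigned int ii = 0; ii < 3; ii++) {
        out[ii] = ss[ii]*scl + off;
    }
}
```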