Fairness

Seele lets users set runtime and memory usage limits for judge programs and reports the actual runtime and memory usage once a program finishes. Users grade student submissions based on this data, so Seele must keep the measured runtime and memory usage as stable as possible to ensure fairness. If the same judge program, run multiple times, yields significantly fluctuating runtime or memory figures, the judge system can hardly be considered fair.

CPU Performance

Linux is a preemptively scheduled operating system: it allocates CPU time slices to processes according to their priority, and on multi-core systems a process may be moved between CPU cores over time. When the process running on a core is switched out, the CPU cache is likely to be refilled with the data of the newly scheduled process. When the original process resumes, or is scheduled onto another core, it suffers cache misses and needs extra time to fetch its data from memory again.

To mitigate these cache misses, we use the cpuset controller provided by cgroup to restrict the judge process to a single CPU core, preventing the Linux kernel's scheduler from migrating it elsewhere. In addition, we constrain the CPU cores available to the other threads on the system so that none of them runs on the core reserved for the judge process. B. Merry's experiment shows that when the judge process is pinned to one CPU core, its runtime stability is significantly better than when it is not.
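As an illustrative sketch (the helper name and cgroup path are hypothetical, not Seele's actual code), pinning a cgroup to one core under cgroup v2 comes down to writing the core index into the group's `cpuset.cpus` interface file:

```python
import os

def pin_cgroup_to_cpu(cgroup_dir: str, cpu: int) -> None:
    # Under cgroup v2, writing a core index to cpuset.cpus restricts
    # every process in the group to that core (the cpuset controller
    # must be enabled in the parent's cgroup.subtree_control).
    with open(os.path.join(cgroup_dir, "cpuset.cpus"), "w") as f:
        f.write(str(cpu))

# Real usage requires root, e.g.:
# pin_cgroup_to_cpu("/sys/fs/cgroup/judge", 3)
```

Every process later attached to that cgroup inherits the restriction, so the kernel scheduler can no longer migrate the judge process to another core.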

At startup, Seele uses cgroup to bind each of its threads (the main thread and the auxiliary threads) to a different CPU core. Whenever the secure sandbox starts, Seele runs it on the core of one auxiliary thread and puts that thread to sleep to yield the core, so the judge process has a CPU core entirely to itself.
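The effect of binding a process to one core can also be illustrated with the kernel's affinity API directly (this sketch uses `sched_setaffinity` rather than Seele's cgroup-based mechanism, but the resulting restriction is the same):

```python
import os

# Pick one core out of the process's current affinity mask and bind
# the process to it; cgroup's cpuset controller achieves the same
# restriction for an entire control group.
allowed = os.sched_getaffinity(0)
target = min(allowed)
os.sched_setaffinity(0, {target})

# The process now runs only on `target`.
```

`os.sched_setaffinity` is Linux-specific; the per-thread equivalent is what keeps Seele's auxiliary threads off the judge process's core.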

Memory Access Performance

Many professional servers carry two or more CPUs on the motherboard and run with NUMA enabled, which lets each processor access memory attached to the other processors. However, accessing such remote memory is usually slower than accessing local memory, so if the judge process's memory accesses cross NUMA nodes, its runtime fluctuates noticeably. Memory swapping causes similar fluctuations: once part of a process's memory is swapped to disk, accessing those pages becomes drastically slower, dragging down the runtime.

To avoid cross-NUMA accesses, Seele uses the cpuset controller provided by cgroup to confine the judge process's memory allocations to a single memory node, so the process never touches non-local memory. Meanwhile, Seele's secure sandbox uses the memory controller provided by cgroup to prevent the Linux kernel from swapping the judge process's memory to disk.
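A minimal sketch of the NUMA restriction (hypothetical helper, not Seele's code): under cgroup v2, `cpuset.mems` lists the memory nodes the group may allocate from, so writing a single node index keeps all allocations local:

```python
import os

def pin_cgroup_to_mem_node(cgroup_dir: str, node: int) -> None:
    # cpuset.mems lists the NUMA nodes the group may allocate from;
    # confining it to one node means the judge process never pays
    # the latency penalty of remote memory accesses.
    with open(os.path.join(cgroup_dir, "cpuset.mems"), "w") as f:
        f.write(str(node))

# Real usage requires root, e.g.:
# pin_cgroup_to_mem_node("/sys/fs/cgroup/judge", 0)
```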

Disk Access Performance

The Linux kernel uses the page cache to speed up disk reads and writes. If the user-provided files happen to be in the page cache, the judge process reads them far faster than it would otherwise; the page cache likewise affects how fast the judge process writes data to files. Below the page cache, the disk I/O scheduling algorithm and the behavior of the disk itself also affect the judge process's file I/O speed.

To address these problems, Seele stores the files the judge process needs on a tmpfs filesystem provided by the Linux kernel. tmpfs uses memory as its storage medium, so its read and write operations involve neither the page cache nor the I/O scheduler, which largely avoids their negative impact on fairness.
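Mounting such a tmpfs is a one-line operation; the mount point and size below are hypothetical examples, not Seele's actual configuration:

```python
import subprocess

def tmpfs_mount_cmd(target: str, size: str = "256m") -> list[str]:
    # tmpfs is backed by memory, so reads and writes on it never
    # reach a physical disk.
    return ["mount", "-t", "tmpfs", "-o", f"size={size}", "tmpfs", target]

# Real usage requires root, e.g.:
# subprocess.run(tmpfs_mount_cmd("/tmp/judge-files"), check=True)
```

A size limit is worth setting: tmpfs consumes RAM, so an unbounded mount would let a misbehaving submission pressure the rest of the system.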

Memory Usage Calculation Method

Online judge systems mainly measure memory usage either via ru_maxrss or via memory.max_usage_in_bytes provided by cgroup v1. This secure sandbox instead uses memory.peak provided by cgroup v2 to obtain this data. At the Linux kernel level, the cgroup v1 and v2 values are both taken from the same process memory accounting structure.

The Linux kernel's memory management includes a swap mechanism: under memory pressure, it writes pages used by processes out to disk and evicts them from memory. RSS does not count memory that has been swapped out, but cgroup may still count it via SwapCache. To keep memory access performance stable and avoid skew between these accounting methods, we disable swapping for the judge process so that the SwapCache counted by cgroup stays at 0. In cgroup v2, this is achieved by setting memory.swap.max to 0.
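In cgroup v2 this comes down to two interface files, sketched below with hypothetical helpers (real usage requires root against an actual cgroup directory):

```python
import os

def disable_swap(cgroup_dir: str) -> None:
    # memory.swap.max = 0 forbids swapping the group's pages out,
    # so the SwapCache component of its accounting stays at 0.
    with open(os.path.join(cgroup_dir, "memory.swap.max"), "w") as f:
        f.write("0")

def read_peak_bytes(cgroup_dir: str) -> int:
    # memory.peak records the historical maximum of the group's
    # memory usage, in bytes.
    with open(os.path.join(cgroup_dir, "memory.peak")) as f:
        return int(f.read())
```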

Besides the aforementioned SwapCache, both memory.max_usage_in_bytes in cgroup v1 and memory.peak in cgroup v2 count the process's RSS, page cache, and shared pages. Shared pages are accounted as follows: if a process actively uses shared pages, the pages it uses are charged to it. A process's shared pages mostly come from the dynamic libraries it loads, such as libc, and this dynamic-library portion is not included in RSS. cgroup is therefore a more accurate memory accounting method than ru_maxrss. The cgroup v1 documentation notes, however, that the value read from memory.usage_in_bytes at any given moment may be inexact, and it is reasonable to infer that memory.peak can fluctuate as well. Seele's secure sandbox therefore combines ru_maxrss and memory.peak to determine the maximum memory usage of the process.
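One way to combine the two readings is to report the larger of them; this is a sketch of the idea only, and Seele's actual combination policy may differ:

```python
import resource

def child_max_rss_bytes() -> int:
    # ru_maxrss of waited-for child processes; the Linux kernel
    # reports this value in kilobytes.
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss * 1024

def cgroup_peak_bytes(cgroup_dir: str) -> int:
    # memory.peak of the cgroup the judge process ran in, in bytes.
    with open(f"{cgroup_dir}/memory.peak") as f:
        return int(f.read())

def max_memory_usage(cgroup_dir: str) -> int:
    # Take the larger of the two measurements as the reported maximum,
    # guarding against either source under-reporting.
    return max(child_max_rss_bytes(), cgroup_peak_bytes(cgroup_dir))
```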