----------------------------------------------------------------------------
                   Execution Timing, Linux and the IOI

              by Martin Mares, Charles University, Prague
----------------------------------------------------------------------------

This short pamphlet attempts to summarize the possibilities of measuring
program execution time and their implications for judging programs at the
IOI. It focuses on Linux, as it's the system currently used for judging and
also the one I'm most familiar with; other multi-tasking systems will
probably behave in a very similar way, but I didn't study them deeply
enough.

1. The timers
~~~~~~~~~~~~~

Linux offers these ways to measure execution time (a small illustrative
sketch of each method follows after the list):

(a) Classical Unix process accounting: for each process, the kernel keeps
    two tick counters: a user counter and a system counter. When the system
    timer ticks and generates an interrupt, the kernel looks at which
    process is currently running and increases one of its counters (which
    one depends on whether the process is currently executing kernel code
    or not).

    Basic problems: since we only sample the running processes instead of
    tracking them explicitly, the result is necessarily only an
    approximation of the real value, and the more processes there are and
    the shorter the time intervals, the coarser the approximation gets.
    What is worse, the sampling is not random, hence there can be huge
    systematic errors (for example, it's easy to construct a process which
    synchronizes itself with the timer so that although it consumes almost
    all processor time, the meter shows almost nothing). Also, when a
    process waits for disk I/O, it isn't running, so the time spent in
    blocking disk operations is not accounted for.

(b) Using `wall clock' time: that is, instead of focusing on the timing of
    a single process, measure how much the system clock advanced between
    the start and the end of the tested process.

    Advantages: The wall clock is very accurate (even with sub-tick
    precision) and it doesn't interfere with anything except badly written
    drivers which disable interrupts for too long (but drivers for common
    hardware don't fall into this category). Blocking I/O is accounted for
    properly.

    Disadvantages: One disadvantage is obvious: this method measures not
    only the process in question, but also everything else running at the
    same time, hence it can be used only when the load caused by the other
    processes is known to be very low. (Actually, it would be sufficient
    for the load to be predictable, but in practice only a very low load is
    predictable enough.)

(c) Using precise tracking of process time -- this feature is not a part of
    the standard Linux kernel yet, it exists only as an experimental patch
    (the curious ones can look at the diff against Linux 2.4.14 at
    ftp://atrey.karlin.mff.cuni.cz/pub/local/mj/linux/ucw-profile-0.patch).
    The principle of this patch is to use the cycle counter of the CPU (the
    current version works only on PCs with a Pentium or newer CPU) and,
    each time the kernel switches processes or a process enters or exits
    the kernel, update the per-process times according to the counter.

    Advantages: Very accurate even under extreme load and not suffering
    from the sampling problems of (a). Also, it can be extended to exclude
    system interrupts and other stuff from being accounted to processes.

    Disadvantages: Needs a kernel patch (although a simple one, and it will
    probably be accepted into the kernel after it starts working on non-PC
    machines, too). Doesn't account for disk I/O.
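Method (a) is what the standard interfaces report. The following is a
minimal sketch (not taken from any judging system) of where the sampled
counters become visible to user space: getrusage() returns the accumulated
user and system time of the calling process; times(2) reports the same
values in clock ticks.

    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rusage ru;
        volatile unsigned long x = 0;
        unsigned long i;

        for (i = 0; i < 100000000UL; i++)       /* burn some CPU time */
            x += i;

        if (getrusage(RUSAGE_SELF, &ru) < 0) {
            perror("getrusage");
            return 1;
        }
        /* The resolution is limited by the timer frequency (one tick is
         * typically 10 ms), and the values suffer from all the sampling
         * problems described above. */
        printf("user   %ld.%06ld s\n",
               (long) ru.ru_utime.tv_sec, (long) ru.ru_utime.tv_usec);
        printf("system %ld.%06ld s\n",
               (long) ru.ru_stime.tv_sec, (long) ru.ru_stime.tv_usec);
        return 0;
    }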
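Method (b) in its simplest form is just a fork/exec of the tested program
bracketed by two readings of the system clock. A minimal sketch, assuming a
hypothetical ./solution binary and omitting input/output redirection:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(void)
    {
        struct timeval start, stop;
        double elapsed;
        pid_t pid;
        int status;

        gettimeofday(&start, NULL);
        pid = fork();
        if (pid < 0) {
            perror("fork");
            return 1;
        }
        if (!pid) {
            execl("./solution", "solution", (char *) NULL);
            perror("execl");
            _exit(127);
        }
        waitpid(pid, &status, 0);
        gettimeofday(&stop, NULL);

        /* This interval includes everything else that ran in the meantime,
         * hence it is meaningful only on an otherwise idle machine. */
        elapsed = (stop.tv_sec - start.tv_sec)
                + (stop.tv_usec - start.tv_usec) / 1e6;
        printf("wall clock: %.6f s\n", elapsed);
        return 0;
    }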
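The patch from (c) does its accounting inside the kernel at every context
switch, which I cannot reproduce here, but the cycle counter it is built on
can also be read from user space. The sketch below only shows how that
counter is read on a Pentium-class CPU with gcc; read this way, the
difference still includes interrupts and whatever else ran in between,
which is precisely what the in-kernel accounting avoids.

    #include <stdio.h>

    static unsigned long long rdtsc(void)
    {
        unsigned int lo, hi;
        __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
        return ((unsigned long long) hi << 32) | lo;
    }

    int main(void)
    {
        unsigned long long before, after;

        before = rdtsc();
        /* ... the code to be measured would go here ... */
        after = rdtsc();

        /* The result is in CPU cycles, not seconds; divide by the CPU
         * clock frequency to convert it to time. */
        printf("elapsed cycles: %llu\n", after - before);
        return 0;
    }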
2. What do we want to measure?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

What a strange question! Everybody knows that we want to measure the exact
time spent by the given program (a contestant's solution) with as good a
precision as possible. But beware: on any multi-tasking system, there
really is no such thing as an isolated execution time of a process. The
processes interact with each other in very intricate ways -- for instance,
there are lots of caches (not only the disk cache, but also the L1/L2
caches of the CPU) which are shared by all the processes, so the other
processes can either evict the data of the tested process (slowing it down
as it needs to re-fetch the data) or even preload them if they accidentally
use the same data (speeding up our process; this is not as uncommon as it
might look at first sight, as it usually happens with shared libraries) and
so on. And it needn't be processes running at the same time -- a previous
invocation of the same program can have such an effect as well.

Hence the best we can hope for is to get a good "definition" of the
execution time, that is, one which doesn't vary too much between tests (at
least if we're able to keep the system load low; if we aren't, it probably
isn't possible at all) and which at least roughly corresponds to the wall
clock time on an unloaded computer.

3. Common caveats
~~~~~~~~~~~~~~~~~

Two problems arose at the latest IOI and were discussed a lot at the jury
meetings without reaching a reasonable consensus. Let's take a look at
them:

- Very short time limits: all three measuring methods shown above can
  potentially give results with sub-second precision (up to 10ms for (a),
  up to 1us for (b) and (c)), but the value of such results is highly
  questionable, as the errors caused by inter-process interactions,
  especially during program loading (caching of the program itself and of
  the shared libraries it uses), are expected to be of a similar magnitude.
  Starting the timer after the program is loaded (for example with the
  first syscall made by the process) to minimize these effects is not going
  to work, as Linux uses on-demand loading of executables. It could
  theoretically be accomplished by running the process two times and
  measuring only the second run, hoping that everything got cached by the
  first run, or maybe by pre-reading all the files before running the
  program, but these work-arounds are probably not going to be reliable
  anyway. I'd recommend avoiding extremely short time limits and rather
  selecting tasks and constructing test data in a way that good and bad
  solutions can be easily separated even with a +/- 1s additive error in
  timing.

- Timing of interactive tasks: when an interactive task is tested by
  communicating with a testing program, we need to separate the time spent
  by the program from the time spent by the tester. Due to frequent context
  switches, the sampling problems of (1a) can be expected. There are two
  solutions:

  o Tell the contestants that in such tasks, we will measure the total time
    spent by both the program and the tester. This gives reasonable values
    even with (a), as the sampling errors don't affect the total time too
    much, and it's roughly an equivalent of (b). (A sketch of this option
    follows after the list.)

  o Measure the time of the tested program precisely using (c).
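A minimal sketch of the first option: if the evaluator is the parent of
both the solution and the tester, getrusage(RUSAGE_CHILDREN) sums the CPU
time of all children that have been waited for, so it charges the
contestant for both processes together. The ./solution and ./tester names
are placeholders, and the pipes connecting the two processes are omitted.

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/time.h>
    #include <sys/resource.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    static pid_t spawn(const char *path)
    {
        pid_t pid = fork();
        if (!pid) {
            execl(path, path, (char *) NULL);
            _exit(127);
        }
        return pid;
    }

    int main(void)
    {
        struct rusage ru;
        double total;
        pid_t solution, tester;

        solution = spawn("./solution");
        tester = spawn("./tester");
        waitpid(solution, NULL, 0);
        waitpid(tester, NULL, 0);

        /* User + system time of both children, with the usual tick
         * granularity of method (a). */
        getrusage(RUSAGE_CHILDREN, &ru);
        total = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6
              + ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
        printf("combined CPU time: %.2f s\n", total);
        return 0;
    }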
Also, we need to decide what to do with blocking I/O. If the tasks are not
going to involve large inputs/outputs/temporary files, we can probably
safely ignore the problem; otherwise we're bound to use the wall clock time
with all its problems, or to invent yet another trick :-) [the approach (c)
could theoretically be extended to cover blocking I/O waits as well, but
they are so intertwined with the activities of other processes that it
would give results comparable to the wall clock].

Another problematic item at this year's jury meetings was the limiting of
time and memory used by compilation, but I will skip it as, in my opinion,
it's just a matter of deciding how to update the competition rules, not a
technical problem.

4. Remarks
~~~~~~~~~~

One example of a possible approach to these problems is the tester used at
the Czech national olympiad (the current version is available at
ftp://atrey.karlin.mff.cuni.cz/pub/local/mj/priv/mo-eval-011228.tar.gz; the
interesting file is src/box.c), which runs the tested programs in a tightly
controlled environment with various kinds of limits on time, memory and
system calls. A heavily simplified sketch of the basic idea is appended
after the closing note below.

If you have any questions or comments about either this text or our tester,
feel free to contact me.
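The appended sketch is not box.c itself, only an illustration of the
resource-limit part of such a controlled environment: the child sets hard
limits on CPU time and address space before executing the tested program,
and the parent reports how the program ended. The ./solution name and the
concrete limits are placeholders; a real evaluator also filters system
calls, redirects input and output, and measures the time by one of the
methods from section 1.

    #include <stdio.h>
    #include <signal.h>
    #include <unistd.h>
    #include <sys/resource.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(void)
    {
        pid_t pid;
        int status;

        pid = fork();
        if (!pid) {
            struct rlimit cpu = { 5, 5 };               /* 5 s of CPU time */
            struct rlimit mem = { 64 << 20, 64 << 20 }; /* 64 MB of memory */
            setrlimit(RLIMIT_CPU, &cpu);
            setrlimit(RLIMIT_AS, &mem);
            execl("./solution", "solution", (char *) NULL);
            _exit(127);
        }
        waitpid(pid, &status, 0);

        if (WIFSIGNALED(status) && WTERMSIG(status) == SIGXCPU)
            printf("time limit exceeded\n");
        else if (WIFSIGNALED(status))
            printf("killed by signal %d\n", WTERMSIG(status));
        else
            printf("exited with status %d\n", WEXITSTATUS(status));
        return 0;
    }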