Cooperative Multithreading for Fun & Profit

Cooperative multitasking was the staple of early operating systems – as were system lockups due to poorly programmed applications not yielding control of the processor. While modern systems instead implement preemptive multitasking to sidestep these issues, cooperative multitasking still shines in certain applications.

Background

If you have no idea what the hell cooperative multitasking1 is, or why we don’t use it anymore… I hope that you, my dear reader, are ready for a little history lesson. And if you do, feel free to skip ahead to the gory implementation details that you really clicked on this post for…

Hop on into your Time Machine.app and spin the dial to somewhere around the late 70s. The state of the art in home computing is the apple ][, with the Commodore 64 coming a few years later. These machines were awesome, but because of the relatively slow processors and minuscule amounts of RAM, you’d have them doing one thing at a time. There simply wasn’t space to do anything else, and processor cycles were an extremely scarce resource.

Fast forward a few years later to the IBM PC, which had up to 640K of RAM – enough to almost start doing multiple things at the same time. Someone came up with the third version of this thing called Windows, which let you run multiple programs simultaneously with pretty graphics, despite the limited processing power of the PC. Neat!

The way this worked on such a limited system, with a slow processor was actually relatively simple. You can probably guess where I’m going with this…

Cooperative Multitasking

Windows 3.0 implemented cooperative multitasking to share processor time between Windows processes2. Since the expectation was that Windows programs spend their entire lives in the Windows event loop, this worked well – calls to save the task’s state and relinquish the CPU could simply be inserted in those functions, and Windows programs would appear to run simultaneously.

Since most of this was handled inside the system libraries, programmers didn’t really need to concern themselves with this; things just worked. That however changed if you wanted to do something other than process GUI events: for example, background processing work.

The system provided functions that allowed checking whether a message was pending in the event queue, without retrieving it and without yielding the processor; this could be used to multiplex event processing and time-intensive background work. And this background work could also periodically relinquish control of the processor so that other programs could run, and the machine, while sluggish, would still be usable.

Awesome! We now have multiple independent programs sharing the same processor, without the extra overhead of having to maintain a timer and take the costly interrupts into the kernel.

Problems

Cooperative multitasking places a lot of trust in programs to be well behaved. The reality is that many, many programs… simply weren’t. And even the most well-behaved programs aren’t immune to bugs – especially the kind that could lead to pesky infinite loops. Whoops.

Another more fundamental issue is that the applications had to be explicitly aware – or in some cases, the fact that they were aware3 – that they would only allow other apps to run if they explicitly yielded processor time. It should be relatively clear by now why preemptive multitasking, usually driven by a periodic timer to interrupt the current task, is what we’re using today.

Now, if cooperative multitasking sucks so bad and was, for the most part, left behind in the 90s, why am I talking about it in 2021? Because you can make it really, really fast, without any operating system support. This means it can be implemented in userspace!

Userspace Multithreading

If the mere title of this section sounds cursed… well, that’s because it kind of is. But before operating systems themselves had native support for threads, userspace thread libraries were the way to go if you wanted to take advantage of multithreading before widespread system support. This means there’s a good bit of prior art out there in how something like this can be implemented.

So what use does userspace multithreading have today, where every desktop system, and even most embedded systems, support proper kernel threads? Simply put, most overhead from kernel level multithreading is avoided.

Kernel Threads

While the kernel level multithreading provided by most systems today is wonderful and simplifies things greatly, one design decision apparent from the name results in most performance issues: the fact that switching between two user threads requires an additional context switch into kernel space, and then back to userspace.

Processors really don’t like that, because it can thrash caches and introduce lengthy pipeline stalls; and it’s significantly more code that has to get executed just to switch between two threads that might be in the same process. The kernel also has no choice but to take the slower, safer approach to the context switch: for example, it must preserve all general purpose and floating point registers when switching threads4.

For the vast majority of applications, this performance hit is relatively minor – they don’t perform more than a few hundred or thousand context switches a second.

Use Cases

Servers

As alluded to above, servers often spawn a new thread for each request. For high-performance servers, the overhead of actually allocating these threads and performing the context switches can be significant. While one solution is to use thread pools, to reuse threads, cooperative multithreading fits this use case well.

Each request is a self-contained entity, and most requests will execute start to finish without having to do anything else, like block. Paying the overhead of constantly going into the kernel – which also has no idea about our workload’s nature – wastes a lot of processor time that could be used doing productive work.

By taking advantage of asynchronous IO, a feature offered in some form or another by all worthwhile server OSes, these cooperative threads can even handle IO and be blocked/unblocked with appropriate application support.

Emulation

Besides the performance benefits of cooperative threading, there are actually a few instances where it can make code significantly cleaner and easier to follow: the bonus performance is just the cherry on top. One of my current projects is developing a cycle-accurate emulator – it was what spurred this entire adventure – and cooperative threading has simplified much of its code significantly.

Imagine a system with multiple processors, each of which truly executes in parallel. Now build an emulator for this system: easy if you don’t care about the timing, but much, much more difficult if you do. The naive approach is to create a kernel thread for each of these processors and execute them in lockstep to try to take advantage of the multiple cores on basically any processor. While this works… it’s slow. Really slow.

Modern processors absolutely do not like to have to synchronize between cores, because this means that big long pipelines need to be flushed and a significant performance degradation results. Instead, take these processors and execute them on the same kernel thread. A possible (and quite acceptable) implementation of this is to implement the processor as a state machine and expose some method to step this state machine one cycle5 at a time.

That ends up being pretty hard to read code, especially when those state machines are deeply nested as they end up being for all but the most trivial devices. This also means that the device needs to maintain significant external state, and pays the overhead of setting up and tearing down all stack frames every single cycle.

Instead of popping all stack frames, returning to the caller, and repeating all that over again whenever one of these indivisible cycles has finished, the device runs in an infinite loop on a cooperative thread, and yields processor time after every internal cycle. Its thread context is saved, a new thread is selected by the application to run, and its context is restored and execution resumes. State that had to be maintained externally becomes implicit as part of function calls and local variables.

I was going to go a lot further with this explanation, but while researching for this article I found an article about this exact use case that does an excellent job going more in-depth. I suggest giving it a read if you’re curious about how emulators are a good use case for cooperative multithreading.

A Solution

It’s clear we can benefit from performing context switching in userspace here: we don’t need many of the services the kernel provides. And by avoiding any blocking system calls in the emulation loops, we can effectively multiplex multiple threads into one kernel thread without any side effects.

Thankfully, the C library comes in and saves the day yet again! Most POSIX-y C libraries will provide some variation of the setcontext family of methods. These allow callers to construct processor state in a machine-independent and portable way. Sounds great, right?

Unfortunately, these are far from universally supported. As expected, they’re a no-go on Windows. And even on platforms where they exist, the performance is… nothing to write home about. While C++20 introduced coroutines, these are limited to trivial tasks, since they don’t come with their own permanent stack.

While there are many userspace threading libraries out there, most are either positively ancient, poorly documented, or simply didn’t have a nice C++ interface I liked. I ended up writing a C++ cooperative threading library, which does all of its context switching in userspace.

It has very fast context switches on all currently popular (x86, amd64, aarch64) architectures, easily scaling to over 100 million context switches a second. Other platforms are supported at reduced performance by relying on setjmp or setcontext, depending on which are available.

Conclusion

You can find more information about the cothread library I wrote for this project at its project page and its source is available on GitHub under the terms of the ISC license. There’s also the autogenerated documentation that describes the platform implementations, and the public library interface.

The emulator I alluded to in the example use case is being developed here. The first system I’m working to support is the Sega Mega Drive/Genesis; in part because it has a 68000 as its main processor – but also because it’s the system I first started doing bare-metal programming on, over a decade ago, by hacking existing games and eventually writing my own software for the system.

Until next time, maybe actually sooner than a month from now since it seems like I can actually remember to write stuff here. Famous last words… 🙃

  1. I’m using cooperative multitasking and multithreading to refer to the same concept; the idea that some unit of execution must explicitly yield processor time. Whether this is done by the operating system itself, or by the program in the context of a kernel thread is irrelevant. 

  2. If you’re curious, you can check out this page for a bunch of additional cursed design decisions driven by running on equally cursed hardware, the IBM PC. The gist of it is that cooperative multitasking was one of the least screwy things that OS did. 

  3. In this model, tasks did not have to worry about locking or critical sections: whenever they were running, there would be no other apps executing until it yielded processor time. 

  4. Most systems have optimizations in place to reduce the amount of data that needs to be saved on context switches. One such implementation uses native processor instructions to save floating point context only when modified; but this can be simulated by disabling the FPU and trapping floating point operations. 

  5. This does not have to be as fine resolution as a single clock cycle. Most processors, and many other devices, operate on indivisible microcycles multiple clocks long. The Motorola 68000, for example, processes a microinstruction every two cycles, and this is the smallest unit of work it can do, and the smallest time unit that needs to be emulated.