Threadless mode in Mitogen 0.3

Mitogen has been explicitly multi-threaded since the design was first conceived. This choice is hard to regret, as it aligns well with the needs of operating systems like Windows, makes background tasks like proxying possible, and allows painless integration with existing programs where the user doesn't have to care how communication is implemented. Easy blocking APIs simply work as documented from any context, and magical timeouts, file transfers and routing happen in the background without effort.

The story has for the most part played out well, but as work on the Ansible extension revealed, this thread-centric worldview is more than somewhat idealized, and scenarios exist where background threads are not only problematic, but a serious hazard that works against us.

For that reason a new operating mode will hopefully soon be included, one where relatively minor structural restrictions are traded for no background thread at all. This article documents the reasoning behind threadless mode, and a strange set of circumstances that allow such a major feature to be supported with the same blocking API as exists today, and surprisingly minimal disruption to existing code.

Recap

Above is a rough view of Mitogen's process model, revealing a desirable symmetry as it currently exists. In the master program and replicated children, the user's code maintains full control of the main thread, with library communication requirements handled by a background thread using an identical implementation in every process.

Keeping the user in control of the main thread is important, as it possesses certain magical privileges. In Python it is the only thread from which signal handlers can be installed or executed, and on Linux some niche system interfaces require its participation.

When a method like remote_host.call(myfunc) is invoked, an outgoing message is constructed and enqueued with the Broker thread, and a callback handler is installed to cause any return value response message to be posted to another queue created especially to receive it. Meanwhile the thread that invoked Context.call(..) sleeps waiting for a message on the call's dedicated reply queue.

Latches

Those queues aren't simply Queue.Queue, but a custom reimplementation added early during Ansible extension development, as deficiencies in Python 2.x threading began to manifest. Python 2 permits the choice between up to 50 ms latency added to each Queue.get(), or for waits to execute with UNIX signals masked, thus preventing CTRL+C from interrupting the program. Given these options a reimplementation made plentiful sense.

The custom queue is called Latch, a name chosen simply because it was short and vaguely fitting. To say its existence is a great discomfort would be an understatement: reimplementing synchronization was never desired, even if just by leveraging OS facilities. True to tribal wisdom, the folly of Latch has been a vast time sink, costing many days hunting races and subtle misbehaviours, yet without it, good performance and usability is not possible on Python 2, and so it remains.

Due to this, when any thread blocks waiting for a result from a remote process, it always does so within Latch, a detail that will soon become important.

The Broker

Threading requirements are mostly due to Broker, a thread that has often changed role over time. Today its main function is to run an I/O multiplexer, like Twisted or asyncio. Except for some local file IO in master processes, broker thread code is asynchronous, regardless of whether it is communicating with a remote machine via an SSH subprocess or a local thread via a Latch.

When a user's thread is blocked on a reply queue, that thread isn't really blocked on a remote process - it is waiting for the broker thread to receive and decode any reply, then post it to the queue (or Latch) the thread is sleeping on.

Performance

Having a dedicated IO thread in a multi-threaded environment simplifies reasoning about communication, as events like unexpected disconnection always occur in a consistent location far from user code. But as is evident, it means every IO requires interaction of two threads in the local process, and when that communication is with a remote Mitogen process, a further two in the remote process.

It may come as no surprise that poor interaction with the OS scheduler often manifests, where load balancing pushes related communicating threads out across distinct cores, where their execution schedule bears no resemblance to the inherent lock-step communication pattern caused by the request-reply structure of RPCs, and between threads of the same process due to the Global Interpreter Lock. The range of undesirable effects defies simple description, it is sufficient to say that poor behaviour here can be disastrous.

To cope with this, the Ansible extension introduced CPU pinning. This feature locks related threads to one core, so that as a user thread enters a wait on the broker after sending it a message, the broker has much higher chance of being scheduled expediently, and for its use of shared resources (like the GIL) to be uncontended and exist in the cache of the CPU it runs on.

Runs of tests/bench/roundtrip.py with and without pinning.
Pinned?	Round-trip delay
No	960 usec	Average 848 usec ± 111 usec
	782 usec
	803 usec
Yes	198 usec	Average 197 usec ± 1 usec
	197 usec
	197 usec

It is hard to overstate the value of pinning, as revealed by the 20% speedup visible in this stress test, but enabling it is a double-edged sword, as the scheduler loses the freedom to migrate processes to balance load, and no general pinning strategy is possible that does not approach the complexity of an entirely new scheduler. As a simple example, if two uncooperative processes (such as Ansible and, say, a database server) were to pin their busiest workers to the same CPU, both will suffer disastrous contention for resources that a scheduler could alleviate if it were permitted.

While performance loss due to scheduling could be considered a scheduler bug, it could be argued that expecting consistently low latency lock-step communication between arbitrary threads is unreasonable, and so it is desirable that threading rather than scheduling be considered at fault, especially as one and not the other is within our control.

The desire is not to remove threading entirely, but instead provide an option to disable it where it makes sense. For example in Ansible, it is possible to almost halve the running threads if worker processes were switched to a threadless implementation, since there is no benefit in the otherwise single-threaded WorkerProcess from having a distinct broker thread.

UNIX fork()

In its UNIX manifestation, fork() is a defective abstraction surviving through symbolism and dogma, conceived at a time long predating the 1984 actualization of the problem it failed to solve. It has remained obsolete ever since. A full description of this exceeds any one paragraph, and an article in drafting since October already in excess of 8,000 words has not yet succeeded in fully capturing it.

For our purposes it is sufficient to know that, as when mixed with most UNIX facilities, mixing fork() with threads is extremely unsafe, but many UNIX programs presently rely on it, such as in Ansible's forking of per-task worker processes. For that reason in the Ansible extension, Mitogen cannot be permanently active in the top-level process, but only after fork within a "connection multiplexer" subprocess, and within the per-task workers.

In upcoming work, there is a renewed desire for a broker to be active in the top-level process, but this is extremely difficult while remaining compatible with Ansible's existing forking model. A threadless mode would be immediately helpful there.

Python 2.4

Another manifestation of fork() trouble comes in Python 2.4, where the youthful implementation makes no attempt to repair its threading state after fork, leading to incurable deadlocks across the board. For this reason when running on Python 2.4, the Ansible extension disables its internal use of fork for isolation of certain tasks, but it is not enough, as deadlocks while starting subprocesses are also possible.

A common idea would be to forget about Python 2.4 as it is too old, much as it is tempting to imagine HTTP 0.9 does not exist, but as in that case, Python is treated not just as a language runtime, but as an established network protocol that must be implemented in order to communicate with infrastructure that will continue to exist long into the future.

Implementation Approach

Recall it is not possible for a user thread to block without waiting on a Latch. With threadless mode, we can instead reinterpret the presence of a waiting Latch as the user's indication some network IO is pending, and since the user cannot become unblocked until that IO is complete, and has given up forward execution in favour of waiting, Latch.get() becomes the only location where the IO loop must run, and only until the Latch that caused it to run has some result posted to it by the previous iteration.

@mitogen.main(threadless=True)
def main(router):
    host1 = router.ssh(hostname='a.b.c')
    host2 = router.ssh(hostname='c.b.a')

    call1 = host1.call_async(os.system, 'hostname')
    call2 = host2.call_async(os.system, 'hostname')

    print call1.get().unpickle()
    print call2.get().unpickle()

In the example, after the (presently blocking) connection procedure completes, neither call_async() wakes any broker thread, as none exists. Instead they enqueue messages for the broker to run, but the broker implementation does not start execution until call1.get(), where get() is internally synchronized using Latch.

The broker loop ceases after a result becomes available for the Latch that is executing it, only to be restarted again for call2.get(), where it again runs until its result is available. In this way asynchronous execution progresses opportunistically, and only when the calling thread indicated it cannot progress until a result is available.

Owing to the inconvenient existence of Latch, an initial prototype was functional with only a 30 line change. In this way, an ugly and undesirable custom synchronization primitive has accidentally become the centrepiece of an important new feature.

Size Benefit

The intention is that threadless mode will become the new default in a future version. As it has much lower synchronization requirements, it becomes possible to move large pieces of code out of the bootstrap, including any relating to implementing the UNIX self-pipe trick, as required by Latch, and to wake the broker thread from user threads.

Instead this code can be moved to a new mitogen.threads module, where it can progressively upgrade an existing threadless mitogen.core, much like mitogen.parent already progressively upgrades it with an industrial-strength Poller as required.

Any code that can be removed from the bootstrap has an immediate benefit on cold start performance with large numbers of targets, as the bottleneck during cold start is often a restriction on bandwidth.

Performance Benefit

Threadless mode tallies in well with existing desires to lower latency and resource consumption, such as the plan to reduce context switches.

Runs of tests/bench/roundtrip.py with and without threadless
	Threaded+Pinned	Threadless
Average Round-trip Time	201 usec	131 usec (-34.82%)
Elapsed Time	4.220 sec	3.243 sec (-23.15%)
Context Switches	304,330	40,037 (-86.84%)
Instructions	10,663,813,051	8,876,096,105 (-16.76%)
Branches	2,146,781,967	1,784,930,498 (-15.85%)
Page Faults	6,412	17,529 (+173.37%)

Because no broker thread exists, no system calls are required to wake it when a message is enqueued, nor are any necessary to wake the user thread when a reply is received, nor any futex() calls due to one just-woke thread contending on a GIL that has not yet been released by a just-about-to-sleep peer. The effect across two communicating processes is a huge reduction in kernel/user mode switches, contributing to vastly reduced round-trip latency.

In the table an as-yet undiagnosed jump in page faults is visible. One possibility is that either the Python or C library allocator employs a different strategy in the absence of threads, the other is that a memory leak exists in the prototype.

Restrictions

Naturally this will place some restraints on execution. Transparent routing will no longer be quite so transparent, as it is not possible to execute a function call in a remote process that is also acting as a proxy to another process: proxying will not run while Dispatcher is busy executing the function call.

One simple solution is to start an additional child of the proxying process in which function calls will run, leaving its parent dedicated just to routing, i.e. exclusively dedicated to running what was previously the broker thread. It is expected this will require only a few lines of additional code to support in the Ansible extension.

For children of a threadless master, import statements will hang while the master is otherwise busy, but this is not much of a problem, since import statements usually happen once shortly after the first parent->child call, when the master will be waiting in a Latch.

For threadless children, no background thread exists to notice a parent has disconnected, and to ensure the process shuts down gracefully in case the main thread has hung. Some options are possible, including starting a subprocess for the task, or supporting SIGIO-based asynchronous IO, so the broker thread can run from the signal handler and notice the parent is gone.

Another restriction is that when threadless mode is enabled, Mitogen primitives cannot be used from multiple threads. After some consideration, while possible to support, it does not seem worth the complexity, and would prevent the aforementioned reduction of bootstrap code size.

Ongoing Work

Mitogen has quite an ugly concept of Services, added in a hurry during the initial Ansible extension development. Services represent a bundle of a callable method exposed to the network, a security policy determining who may call it, and an execution policy governing its concurrency requirements. Service execution always happens in a background thread pool, and is used to implement things like file transfer in the Ansible extension.

Despite heavy use, it has always been an ugly feature as it partially duplicates the normal parent->child function call mechanism. Looking at services from the perspective of threadless mode reveals some notion of a "threadless service", and how such a threadless service looks even more similar to a function call than previously.

It is possible that as part of the threadless work, the unification of function calls and services may finally happen, although no design for it is certain yet.

Summary

There are doubtlessly many edge cases left to discover, but threadless mode looks very doable, and promises to make Mitogen suitable in even more scenarios than before.

Until next time!

Just tuning in?

2017-09-15: Mitogen, an infrastructure code baseline that sucks less
2018-03-06: Quadrupling Ansible performance with Mitogen
2018-07-10: Mitogen released!