An early goal for Mitogen was to make it simple to retrofit, avoiding any "opinionated" choices likely to force needless or impossible changes in downstream code. Despite being internally asynchronous, it exposes a blocking and mostly thread-safe API, with management of the asynchrony punted to a thread, making integration with a deployment script hopefully as easy as with a GUI.
This choice leaves some rough edges, such as struggles with subprocess reaping, but based on experience working on Mitogen for Ansible, and ignoring complexities unique to that environment, the design appears to mostly function as intended.
"Mostly" is the operative word: due to the API choice, and despite gains already witnessed in the extension, some internals remain overly simplistic. As has been the lesson throughout, that naturally means inefficient: horrifyingly, crying-in-the-shower inefficient.
While recently attacking some of Ansible's grosser naiveties, now that the excesses of continuous forking are gone, dirty laundry is again visible on Mitogen's side. This post describes one offender: message transmission and routing, how it looks today, why it is a tragedy, and how things will improve.
Overview
To recap, communication with a bootstrapped child is message-oriented to escape the limitations of stream-oriented IO. When an application makes a call, the sending thread enqueues a message with the broker thread, which is responsible for all IO, then sleeps waiting for the broker to deliver a reply.
This has many benefits: mutually ignorant threads can share a child without coordination, since a central broker exists behind the scenes. Errors can only occur on the broker thread, so handling is not spread throughout user code.
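In sketch form (illustrative names, not Mitogen's actual classes), the pattern amounts to a caller handing the broker a message plus a private reply queue, then sleeping on that queue until the broker's IO thread posts the answer:

```python
import queue
import threading

class Broker(object):
    """Sketch only: one thread owns all IO, draining a queue of
    (message, reply_queue) pairs and posting each reply back."""
    def __init__(self):
        self._todo = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            msg, reply_q = self._todo.get()
            reply_q.put(self._do_io(msg))      # all network IO stays on this thread

    def _do_io(self, msg):
        return 'reply to %r' % (msg,)          # stand-in for the real round trip

    def call(self, msg):
        """Blocking API usable from any thread."""
        reply_q = queue.Queue(maxsize=1)
        self._todo.put((msg, reply_q))
        return reply_q.get()                   # sending thread sleeps until the reply lands

broker = Broker()
print(broker.call('run uname -a'))
```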
Message Transmission
Examining just the Mitogen aspects of transmission for an SSH-connected Ansible target, below are the rough steps repeated for every message in the stable branch.
Despite removing most system calls to fit things in one diagram, there is still plenty to absorb, and clearly many parts to what is conceptually a simple task. A component called Waker abstracts waking the broker thread. It implements a variant of the UNIX self-pipe trick, waking the broker by writing to a pipe it is sleeping on.
When the broker wakes, it calls the waker's on_receive handler, causing any deferred functions to execute on its thread. Here the asynchronous half of the router runs, picking a stream on which to forward the message.
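Sketched against a bare poll() loop, with names invented for illustration rather than taken from the real implementation, the waker and deferral mechanism amount to something like:

```python
import os
import select
import threading

class Waker(object):
    """Self-pipe sketch: the broker sleeps in poll() on rfd; any thread can
    wake it by writing a byte to wfd. Not Mitogen's actual implementation."""
    def __init__(self):
        self.rfd, self.wfd = os.pipe()
        self._lock = threading.Lock()
        self._deferred = []

    def defer(self, func, *args):
        """Called from any thread: queue a function and wake the broker."""
        with self._lock:
            self._deferred.append((func, args))
        os.write(self.wfd, b'\x00')

    def on_receive(self):
        """Runs on the broker thread once poll() reports rfd readable."""
        os.read(self.rfd, 4096)                  # drain the wake-up bytes
        with self._lock:
            work, self._deferred = self._deferred, []
        for func, args in work:
            func(*args)                          # deferred functions execute here

# Minimal broker loop to exercise the sketch:
waker = Waker()
poller = select.poll()
poller.register(waker.rfd, select.POLLIN)
waker.defer(print, 'routed on the broker thread')
for fd, _ in poller.poll():
    if fd == waker.rfd:
        waker.on_receive()
```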
The stream responds by asking the broker to tell it when the SSH stream becomes writeable, which is implemented differently depending on OS, but in most cases it entails yet more system calls.
Since usually the SSH input buffer is empty, the broker immediately wakes again to call the stream's on_transmit handler, finally passing the message to SSH before marking the stream unwriteable again. At this point execution moves to SSH, for little more than to read from a socket, do some crypto, and write to another socket.
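From the stream's perspective the stable-branch write path looks roughly like the sketch below; the real Stream is considerably more involved, but the two poller reconfigurations and the extra loop iteration are the important part:

```python
import os
import select

poller = select.poll()
streams = {}

class Stream(object):
    """Sketch of the stable-branch write path, not Mitogen's actual code."""
    def __init__(self, fd):
        self.fd = fd
        self._buf = b''
        streams[fd] = self

    def send(self, data):
        """Runs on the broker thread via defer(): just ask to be woken again."""
        self._buf += data
        poller.register(self.fd, select.POLLOUT)   # poller reconfiguration #1

    def on_transmit(self):
        """Runs on the next loop iteration, once the fd is reported writeable."""
        n = os.write(self.fd, self._buf)           # finally hand bytes to SSH's stdin
        self._buf = self._buf[n:]
        if not self._buf:
            poller.unregister(self.fd)             # poller reconfiguration #2

# Broker loop: an extra iteration per message just to learn "yes, you may write".
# for fd, events in poller.poll():
#     if events & select.POLLOUT:
#         streams[fd].on_transmit()
```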
Better Message Transmission
In total, transmission on the stable branch requires at least 2 task switches, 2 loop iterations, at least 5 reads/writes, and 2 poller reconfigurations.
While superficially logical, one problem is already obvious: transmitting always entails waking a thread, a nontrivial operation on UNIX. Another is that the biggest performance bottleneck, the IO loop, is forced to iterate twice for every transmission, in part to cope with the possibility that the SSH input buffer is full.
What if we were more optimistic: assume no error will occur, and that the SSH input buffer probably has space. Since we aren't expecting to clean up after a failure, there is no reason to involve the broker either. The new sequence:
Coordination is replaced with a lock, and the sending thread writes directly to SSH. We no longer check for writeability: simply try the write, and if it fails or buffered data already exists, defer to the broker as before.
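A sketch of that path, again illustrative rather than lifted from the branch: take the lock, attempt a non-blocking write from the sending thread, and fall back to the broker only when the write would block or older data is still queued.

```python
import errno
import os
import threading

class OptimisticStream(object):
    """Sketch of the optimistic path; broker.defer() is assumed to exist
    as in the earlier sketches, and the real branch differs in detail."""
    def __init__(self, fd, broker):
        self.fd = fd                     # non-blocking fd feeding SSH's stdin
        self.broker = broker
        self._lock = threading.Lock()
        self._buf = b''

    def send(self, data):
        with self._lock:
            if self._buf:                        # older data queued: preserve ordering,
                self._buf += data                # let the broker keep draining it
                return
            try:
                n = os.write(self.fd, data)      # optimistic write; no thread wake-up
            except OSError as e:
                if e.errno not in (errno.EAGAIN, errno.EWOULDBLOCK):
                    raise
                n = 0
            if n < len(data):                    # kernel buffer full after all:
                self._buf = data[n:]             # fall back to the broker as before
                self.broker.defer(self._start_transmit)

    def _start_transmit(self):
        pass   # register for writeability and drain self._buf, exactly as on the stable branch
```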
Now we have 1 task switch, 0 loop iterations, 2 lock operations, 3 reads/writes, and 0 poller reconfigurations, but still there is that unsightly task switch.
Even Better Message Transmission
The Ansible extension and new strategy work both offer something Ansible previously relied on SSH multiplexing to provide: a process where connection state persists during a run. As persistence is under our control, one final step becomes possible. Simply move SSH in-process:
Now we have 0 task switches, 0 loop iterations, 2 lock operations, 1 write, and 0 poller reconfigurations, or simply put, the minimum possible to support a threaded program communicating via SSH.
Some exciting possibilities emerge: passwords can be typed without allocating a PTY. Since Linux usually supports only 4,096 PTYs, this raises the scalability upper bound while reducing resource usage. Much better buffering becomes possible, eliminating Mitogen's own buffer, and SSH socket buffers can be sized optimally to support file transfers.
Of course downsides exist: unlike libssh or libssh2, OpenSSH is part of a typical workflow, supports every authentication style, and it is common to stash configuration in ~/.ssh/config. Although libssh supports SSH configuration parsing, it's unclear how well it works in practice, and at least the author of ParallelSSH (and of wrappers for both libraries) appears to have chosen libssh2 over it for reasons I'd like to discover.
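To make "SSH in-process" concrete, this is roughly what it looks like from Python. paramiko stands in here purely for illustration, since the post is weighing libssh and libssh2, and the host name and user are made up:

```python
import paramiko

client = paramiko.SSHClient()
client.load_system_host_keys()
client.set_missing_host_key_policy(paramiko.WarningPolicy())
client.connect('target-host', username='deploy')     # placeholder host and user

# No child ssh process and no PTY: the transport lives in this process, so a
# sending thread can write to the channel directly while holding a lock.
stdin, stdout, stderr = client.exec_command('uname -a')
print(stdout.read().decode())
client.close()
```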
Routing
For completeness, and since the diagrams exist already, here is routing between two SSH children from the context of their parent on the stable branch:
While internal switching is avoided, those nasty loop iterations are visible, as are the surrounding task switches. Optimistic sending benefits routing too:
Now the loop iterates once. Finally, with an in-process SSH client:
A single thread is woken, receives the message to be forwarded, delivers it, and sleeps all on one stack.
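A sketch of that forwarding step with optimistic, in-process transports, names again invented for illustration: the router looks up the stream registered for the destination context and hands the serialized message straight to it, on the thread that received it.

```python
class Router(object):
    """Minimal routing sketch: destination context ID -> stream."""
    def __init__(self):
        self._routes = {}

    def add_route(self, dst_id, stream):
        self._routes[dst_id] = stream

    def route(self, msg):
        stream = self._routes.get(msg.dst_id)
        if stream is None:
            raise LookupError('no route to context %d' % msg.dst_id)
        stream.send(msg.pack())     # optimistic write: no broker wake-up, no loop iteration
```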
Summary
Complexity is fractal, but shying away from it just leads to mediocre software. Both improvements exist as branches, and both will be supported by the Ansible extension in addition to the new work.
Until next time!