Allegedly on site as a developer, two summers ago I found myself in a situation you are no doubt familiar with: regardless of preference, unrelated problems inevitably gravitate towards whoever can deal with them. Following an exhausting day spent watching a dog-slow Ansible job fail repeatedly, one evening I dusted off a personal aid to help me relax: an ancient, perpetually unfinished hobby project whose sole function until then had simply been to remind me that things can always improve.
Something of a miracle had struck by the early hours of the next morning: almost every outstanding issue had been solved, and to my disbelief the code ran reliably. Eighteen months later, and for the first time in living memory, I am excited to report delivery of that project, one of sufficient complexity as to have warranted extreme persistence: in this case more than a decade from concept to implementation.
The miracle? It comes in the form of Mitogen, a tiny Python library you won't have heard of, but one I hope you, as an Ansible user, will soon be eternally glad for, on discovering ansible-playbook now completes in very reasonable time even in the face of deeply unreasonable operating conditions.
Mitogen is a library for writing distributed programs that require zero deployment, specifically designed to fit the needs of infrastructure software like Ansible. Without upfront configuration it supports any UNIX machine with an installed Python interpreter, which is to say almost all of them. While the concept is hard to explain, even to fellow engineers, its value is easy to grasp:
This trace shows two Ansible runs of a basic 100-step playbook over a 1 ms latency network against a single target host. The first run employs SSH pipelining, currently Ansible's fastest configuration, where it consumes almost 4.5 Mbytes of network bandwidth in a running time of 59 seconds.
The second uses the prototype Mitogen extension for Ansible, with a far more reasonable 90 Kbytes consumed in 8.1 seconds. An unmodified playbook executes over 7 times faster while consuming 50x less bandwidth.
Less than half the CPU time was consumed on the host machine, meaning that by one metric it should handle at least twice as many targets. Crucially, no changes were required to the target machine: no new software, and no nasty on-disk caches to contend with.
While only pure overhead is measured above, the benefits very much extend to real-world scenarios. See the documentation (1.75x time) and issue #85 (4.2x time, 3.1x CPU) for examples.
How is this possible?
Mitogen is perhaps most easily described as a kind of network-capable fork() on steroids. It allows programs to establish lazily-loaded duplicates on remote hosts, without requiring any upfront remote disk writes, and to communicate with those copies once they exist. The copies can in turn recursively split to produce further children - with bidirectional message routing between every copy handled automatically.
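A minimal sketch of what that looks like from the caller's side, going from memory of Mitogen's documented router and context API; the hostname and the helper function are invented for illustration:

```python
import socket
import mitogen

def get_hostname():
    # Runs inside whichever child it is sent to; this module travels over
    # the wire on demand, and nothing is written to the remote disk.
    return socket.gethostname()

@mitogen.main()
def main(router):
    # "Fork" onto a remote machine over SSH.
    web1 = router.ssh(hostname='web1.example.com')
    # Recursively split again: a root child reached via the SSH child.
    web1_root = router.sudo(via=web1)
    # Talk to both copies; message routing between them is automatic.
    print(web1.call(get_hostname))
    print(web1_root.call(get_hostname))
```

Each child can itself act as a parent for further children, which is the recursion the rest of this piece leans on.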
In the context of Ansible: SSH pipelining requires up to one SSH invocation, one sudo invocation and one script compilation for every playbook step, with all scripts re-uploaded at each step. With Mitogen, only one of each exists per target for the duration of the playbook run, and all code is cached in RAM between steps. Absolutely everything is reused, saving 300-800 ms on every step.
The extension represents around a week’s work, replaces hundreds of lines of horrid shell-related code in Ansible, and is already at the point where on one real-world playbook, Ansible is only 2% slower than equivalent SSH commands. Presently connection establishment is single-threaded, so the prototype is only good for a few hosts, but rest assured this limitation’s days are numbered.
Not just a speed up, a paradigm shift you’ll adore
If this already seems impressive and beyond improvement, prepare for some deep shocks. You can think of the extension not just as a performance improvement, but as something of a surreptitious beachhead from which I intend to thoroughly assault your sense of reality.
This performance is a side effect of a far more interesting property: Ansible is no longer running on just the host machine, but temporarily distributed throughout the target network for the duration of the run, with bidirectional communication between all pieces, and you won’t believe the crazy functionality this enables.
What if I told you it were possible not only to eliminate that final 2%, but turn it sharply negative, while simultaneously reducing resource consumption? “Surely Ansible can’t execute faster than equivalent raw SSH commands?” You bet it can! And if you care about such things, this could be yours by Autumn. Read on..
Pushing brains into the ether, no evil agents required
As I teased last year, Ansible takes its name from a faster-than-light communication device from science fiction, yet despite these improvements it is still fundamentally bound by the speed with which information physically propagates. Pull and agent-based tooling is strongly advantageous here: control flow occurs at the same point as the measurements necessary to inform that flow, and no penalty is incurred for traversing the network.
Today, reducing latency in Ansible means running it within the target network, or in pull mode, where the playbook is stored on the target alongside, for example, secrets for decrypting any vaults, plus the hairy mechanics required to keep all of that in sync and executing when appropriate. This is a far cry from the simplicity of tapping ansible-playbook live.yml on your laptop, and so it is an option of last resort.
What would be amazing is some hybrid: the performance and scalability benefits of pull, combined with the stateless simplicity of push, without introducing dedicated hosts or permanent caches and agents running on the target machines, which amount to persistent intermediate state and introduce huge headaches of their own, and all without sacrificing the fabulous ability to shut everything down with a simple CTRL+C.
The opening volley: connection delegation
As a first step to exploiting previously impossible functionality, I will enhance the extension to support delegating connection establishment to a machine on the target network, avoiding the cost of establishing hundreds of SSH connections over a low throughput, high latency network link.
Unlike with SSH proxying, this has the huge benefit of caching and serving Ansible code from RAM on the intermediary, avoiding uploading approximately 50KiB of code for every playbook step, and ensuring those cached responses are delivered over the low latency LAN fabric on the target network. For 100 target machines, this replaces the transmission of 5 Mbytes of data for every playbook step with on the order of kilobytes' worth of tiny remote procedure calls.
All the Mitogen-side infrastructure for this exists today, and is already used to implement become support. It could be flipped on with a few lines of code in the Ansible extension, but there are a few more importer bugs to fix before it’ll work perfectly.
Finally, as a reminder: since Mitogen operates recursively, delegation also operates recursively, with code caching and connection establishment happening at each hop. Not only is this useful for navigating slow links and complicated firewall setups, as we'll see, it enables some exciting new scenarios.
Ansible is intended to manage many machines simultaneously, and while the extension's improvements presently work well for single-machine playbooks, for many users that is little more than a niche application.
With the newfound ability to delegate connection establishment to an intermediary on the target network, far away from our laptop's high latency 3G connection, and to further sub-delegate from that intermediary, we can implement a divide and conquer strategy: form a large tree spanning the final network of target machines for the playbook run, with responsibility for caching and connection multiplexing divided evenly across the tree, neatly avoiding any single resource bottleneck.
I will rewrite Mitogen’s connection establishment to be asynchronous: creation of many downstream connections can be scheduled in parallel, with the ability to enqueue commands prior to completion, including recursive commands that would cause those connections to in turn be used as intermediaries.
The cost of establishing connections should become only the cost of code upload (~50KiB) and the latency of a single SSH connection per tree layer, as connections at each layer occur in parallel. For an imaginary 1,700 node cluster split into quarters of 17 racks and 25 nodes per rack, connection via a 300 ms 3G network should complete in well under 15 seconds.
Topology-aware file synchronization
So you have a playbook on your laptop deploying a Django application via the synchronize module, to 100 Ubuntu machines running in a datacentre 300 ms away. Each run of the playbook entails a groan followed by a long walk, as a 3.8 second rsync run is invoked 100 times via your 3G connection, just to synchronize a 3 Mbyte asset the design team won’t stop tweaking. Not only are there 6 minutes of roundtrips buried in those invocations, but that puny 3G connection is forced to send a total of 300 Mbytes toward the target network.
What is the point of continually re-sending that file to the same set of machines in some far-off network? What if it could be uploaded exactly once, then automatically cached and redistributed within the target network, producing exactly one upload per layer in the hierarchy?
Why stop at delegating connection establishment and module caching? Now that we have a partial copy of Ansible within the network, nothing prevents implementing all kinds of smarts. Here is another feature that is a cinch to build once bidirectional communication exists between topology-aware code, which the prototype extension already provides today.
After a brutal 4 hour meeting involving 10 executives our hero Bob, Senior Disaster Architect III, emerges bloodstained yet victorious against the tyrannical security team, as his backends can talk with impunity to the entire Internet just so apt-get can reach packages.debian.org for the 15 seconds Bob’s daily Ansible CI job requires.
That evening, having regaled his giddy betrothed (HR Coordinator II) with his heroic story of war, Bob catches a brief yet chilling glimmer of doubt for all that transpired. “Was there another way?” he sleepily ponders, before succumbing to a cosier battle waged by those fatigued and heavy eyelids. Suddenly aware again, Bob emerges bathed in a mysterious utopian dreamscape where CI jobs executed infinitely quickly, war and poverty did not exist, and the impossible had always been possible.
Building on Mitogen’s message routing, forwarding all kinds of pipes and network sockets becomes trivial, including schemes that would allow exposing a transient, locked down HTTP proxy to Bob’s apt-get invocation only for as long as necessary, all with a few lines of YAML in a playbook.
While this is already possible with SSH forwarding, the hand-configuration involved is messy, and becomes extremely hairy when the target of the forward is not the host machine. My initial goal is to support forwarding of UNIX and TCP sockets, as they cover all use cases I have in mind. Speaking of which..
Topology-aware Git pull
Another common security fail seen in Ansible playbooks is calling Git directly from target machines, which entails granting those machines access to a Git server. This is a horrid violation: even read-only access implies the machine needs permanent firewall rules that shouldn't exist, just for the scant moments a pull is in progress. Grant backends access to a site as complex as GitHub.com and you may as well abandon all outbound firewalling, as this is enough for even the puniest script kiddy to exfiltrate a production database.
What if Git could run with the permissions of the local Ansible user, on the user’s own machine, and be served efficiently to the target machines only for the duration of the push, faster than 100 machines talking to GitHub.com, and only to the single read-only repository intended?
Building on generalized forwarding, topology-aware Git repeats all the caching and single-upload tricks of file synchronization, but this time implementing the Git protocol between each node.
In the scheme I will implement, a single round-trip is necessary for git-fetch-pack to pull just the changed objects from the laptop over the high latency 3G link, before propagating at LAN speeds throughout the target network, with git-ls-remote output delivered as part of the message that initiates the pull. Not only is the result more efficient than a normal git-pull, but backends no longer require network access to Git.
The final word: Inversion of control
Remember we talked about making Ansible run faster than equivalent SSH commands? Well, today Ansible requires one network round-trip per playbook step, so just like SSH, it must pay the penalty for every round-trip unless something gives, and that something is the partial delegation of control to the target machine itself.
With inversion of control, the role of ansible-playbook simply becomes that of shipping code and selective chunks of data to target machines, where those machines can execute and make control decisions without necessitating a conversation with the master after each step, just to figure out what to execute next.
Ansible has all the framework to enable implementing this today, by significantly extending the prototype extension’s existing strategy plug-in, and teaching it how to automatically send and wait on batches of tasks, rather than on single tasks at a time.
Aside from improved performance, the semantics of the existing linear strategy will be preserved, and playbooks need not be changed to cope: on the target machine tasks will not suddenly begin running concurrently, or in any order different to previously.
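To make the difference concrete, here is a toy sketch, nothing like Ansible's real strategy API: run_task stands in for whatever executes a single step on the target, and the only point is that the batch crosses the network once rather than once per step.

```python
import subprocess

def run_task(task):
    # Stand-in for executing one playbook step on the target.
    return subprocess.call(task, shell=True)

def run_batch(tasks):
    # Ships once to the target, which then walks the whole batch locally;
    # only the list of results crosses the network on the way back.
    return [run_task(task) for task in tasks]

# Today's linear strategy, one round-trip per step:
#     for task in tasks:
#         results.append(target.call(run_task, task))
# With partial inversion of control, a single round-trip:
#     results = target.call(run_batch, tasks)
```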
App-level connection persistence
As a final battle against latency during playbook development and debugging, I will support detaching the connection tree from ansible-playbook on exit, and teach the extension to reuse it at startup. This will reduce the overhead of repeat runs, especially against many targets, to the order of hundreds of milliseconds, as no new SSH connections, module compilations or code uploads are required.
Connection persistence opens the floodgates for adding sweet new tooling, although I’m not sure how desirable it is to expose an implementation detail like this forever, while also extending the interface provided by Ansible itself. As a simple example, we could provide an ansible-ssh tool that reuses the connection tree along with Ansible’s tunnelling, delegation, dynamic inventory and authentication configuration to forward a pipe to a remote shell.
The cost of slow tooling
Ansible has over 28,500 stars on GitHub, representing just those users who have a GitHub account and ever thought to star it, and appears to grow by 150 stars per week. Around London the going rate to hire one user is $100/hour, and conservatively, we could expect that user is trotting out a 15 minute run of ansible-playbook live.yml at least once per week.
We can expect that if Ansible is running merely twice as slowly as necessary, 7.5 minutes of that run is lost productivity, and across those 28,500 users, the economic cost is in the region of $356,250 per invocation or $17,100,000 per year. In reality the average user is running Ansible far more often, including thousands of times per minute under various CI systems worldwide, and those runs often last far longer than 15 minutes, but I’d recommend that mental guesstimation is left as an exercise to readers who are already blind drunk.
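For anyone sober enough to check the arithmetic, the guesstimate above boils down to a few lines (the 48 working weeks per year are my assumption):

```python
users = 28500            # GitHub stargazers standing in for the user base
rate = 100.0             # $/hour
minutes_lost = 7.5       # half of a 15 minute weekly run
weeks_per_year = 48      # assumed working weeks

per_week = users * (minutes_lost / 60) * rate
print(per_week)                    # 356250.0
print(per_week * weeks_per_year)   # 17100000.0
```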
The future is beautiful if you want it to be
My name is David, and nothing jinxes my day quite like slow tooling. I have poured easily 500 hours in some form into this project over a decade and on my own time. The project has now reached an inflection point where the fun part is over, the science is done and the effect is real, and only a small, highly predictable set of milestones remain to deliver what I hope you agree is a much brighter future.
Before reading this, I doubt you would have believed it possible to provide the features described without complex infrastructure running in the target network; now I hope you'll join me in disproving one final impossibility.
While everything here will exist in time, it cannot exist in 2018 without your support, and that’s why I’d like to try something crazy, that would allow me to devote myself to delivering a vastly improved daily routine for thousands of people just like you and me.
You may have guessed already: I want you to crowdfund awesome tooling.
What value would you place on an extra productive hour every working week? In the UK that’s an easy question: it’s around $4,800 per year. And what risk is there to contributing $100 to an already proven component? I hope you’ll agree this too is a no-brainer, both for you and your employer.
To encourage success I’m offering a unique permanent placement of your brand on the GitHub repository and documentation. Funds will be returned if the minimum goal cannot be reached, however just 3 weeks are sufficient to ensure a well tested extension, with my full attention given to every bug, ready to save many hours right on time to enjoy the early sunlight of Spring.
Totalling much less than the economic damage caused by a single run of today’s Ansible, the grand plan is divided into incrementally related stretch goals. I cannot imagine this will achieve full funding, but if it does, as a finale I’ll deliver a feature built on Ansible that you never dreamed possible.
Deployment tooling is a young area, exposed to the ebb and flow of the software industry far more than most, and unexpected disruption happens continuously. Without ongoing evolution, exposure to buggy and unfamiliar new tooling is all but guaranteed, with benefits barely justifying the cost of their integration. As we know all too well, rational ideas like cost/benefit rarely win the hearts of buzzword-hungry and youthful infrastructure teams, so counterarguments must be presented another way.
As a recent example there is growing love for mgmt, which is designed from the outset as an agent-based reactive distributed system, much as Mitogen nudges Ansible towards. However unlike mgmt, Ansible preserves its zero-install and agentless nature, while laying a sound framework for significantly more exciting features. If that alone does not win loyalty, we’re at least guaranteed that every migration-triggering new feature implemented in such systems can be headed off with minimal effort, long into the foreseeable future.
After a long winter break from recreational programming, over the past days I finally built up steam and broke a chunk of new ground on Mitogen, this time growing its puny module forwarder into a bona fide beast, ready to handle almost any network condition and user code thrown at it.
Mitogen is a library for executing parts of a Python program in a remote context, primarily over sudo and SSH connections, and establishing bidirectional communication with those parts. Targeting infrastructure applications, it requires no upfront configuration of target machines, aside from an SSH daemon and Python 2.x interpreter, which is the default for almost every Linux machine found on any conceivable network.
The target need not possess a writeable filesystem, code is loaded dynamically on demand, and execution occurs entirely from RAM.
How Import Works
To implement dynamic loading, child Python processes (“contexts”) have a PEP-302 import hook installed that causes attempts to import modules unavailable locally to automatically be served over the network connection to the parent process. For example, in a script like:
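(A reconstruction for illustration only: requests and the host k3 come from the description below, while the get_url helper, the URL, and the mitogen.main()/router.ssh() entry points are my recollection of the library's documented API.)

```python
import requests
import mitogen

def get_url(url):
    # Runs inside the child; requests is imported there on demand,
    # served from the parent's copy over the connection.
    return requests.get(url).text

@mitogen.main()
def main(router):
    k3 = router.ssh(hostname='k3')
    print(k3.call(get_url, 'https://example.org/'))
```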
If the requests package is missing on the host k3, it will automatically be copied and imported in RAM, without requiring upfront configuration, or causing or requiring writes to the remote filesystem.
So far, so good. Just one hitch
While the loader has served well over the library’s prototypical life (which in real time, is approaching 12 years!), it has always placed severe limits on the structure of the loaded code, as each additional source file introduced one network round-trip to serve it.
Given a relatively small dependency such as Kenneth Reitz' popular Requests package, comprising 17 submodules, this means 17 additional network round-trips. While that may not mean much over a typical local area network segment where roundtrips are measured in microseconds, it quickly multiplies over even modest wide-area networks, where infrastructure tooling is commonly deployed.
For a library like Requests, 17 round-trips amount to 340ms of latency over a reasonably local 20ms link, which is comfortably within the realms of acceptable. Over common radio and international links of 200ms or more, however, this already adds at least 3.4 seconds to the startup cost of any Mitogen program: time wasted doing nothing but waiting on the network.
Sadly, Requests is hardly even the biggest dependency Mitogen can expect to encounter. For testing I chose django.db.models as a representative baseline: heavily integrated with all of Django, it transitively imports over 160 modules across numerous subpackages. That means on an international link, over 30 seconds of startup latency spent on one dependency.
It is worth noting that Django is not something I'd expect to see in a typical Mitogen program; it's simply an extraordinary worst-case target worth hitting. If Mitogen can handle django.db.models, it should cope with pretty much anything.
Combining evils, over an admittedly better-than-average Nepali mobile data network, and an international link to my IRC box and mail server in Paris, django.db.models takes almost 60 seconds to load with the old design.
In the real world, this one-file-per-roundtrip characteristic means the current approach sucks almost as much as Ansible does, which calls into doubt my goal of implementing an Ansible-trumping Ansible connection plug-in. Clearly something must give!
Over the years I discarded many approaches for handling this latency nightmare:
1. Having the user explicitly configure a module list to deliver upfront to new contexts, which sucks and is plainly unmaintainable.
2. Installing a PEP-302 hook in the master in order to observe the import graph, which would be technically exciting, but likely to suck horribly due to fragility and inevitable interference with real PEP-302 hooks, such as py2exe.
3. Observing the import graph caused by a function call in a single context, then using it to preload modules in additional contexts. This seems workable, except the benefit would only be felt by multiple-child Mitogen programs. Single-child programs would continue to pay the latency tax.
4. Variants of 2 and 3, except caching the result as intermediate state in the master's filesystem. Ignoring the fact that persistent intermediate state is always evil (a topic for later!), that would require weird and imperfect invalidation rules, which means performance would suck during development and prototyping, and bugs are possible where state gets silently wedged and previously working programs inexplicably slow down.
Finally last year I settled on using static analysis, and restricting preloading at package boundaries. When a dependency is detected in a package external to the one being requested, it is not preloaded until the child has demonstrated, by requesting the top-level package module from its parent, that the child lacks all of the submodules contained by it.
This seems like a good rule: preloading can occur aggressively within a package, but must otherwise wait for a child to signal a package as missing before preemptively wasting time and bandwidth delivering code the child never needed.
As a final safeguard, preloading is restricted to only modules the master itself loaded. It is not sufficient for an import statement to exist: surrounding conditional logic must have caused the module to be loaded by the master. In this manner the semantics of platform, version-specific and lazy imports are roughly preserved.
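Stated as code, the rule looks roughly like the sketch below; this is a statement of the policy rather than Mitogen's actual implementation, and may_preload is an invented name:

```python
import sys

def may_preload(requested, dependency):
    """Decide whether `dependency` may be sent ahead of the module the
    child actually asked for (`requested`). Both are fully-qualified names."""
    # Only modules the master itself imported are candidates, which roughly
    # preserves the semantics of platform-specific and conditional imports.
    if dependency not in sys.modules:
        return False
    # Preload aggressively within the same top-level package; anything
    # external must wait until the child requests that package explicitly.
    return dependency.split('.')[0] == requested.split('.')[0]

# may_preload('django.db.models', 'django.db.utils')  -> True (same package)
# may_preload('django.db.models', 'pytz')             -> False (external)
```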
Syntax tree hell
Quite predictably, after attempting to approach the problem with regexes, I threw my hands up on realizing a single regex may not handle every possible import statement:
import a as b
from a import b
from a import b as c
from a import (b, c, d)
I gleefully thought I'd finally found a use for the compiler and ast modules, the obvious alternative for avoiding a rat's nest of multiple regexes. Not quite. You see, across Python releases the grammar has changed, and in lock-step so have the representations exported by the compiler and ast modules.
Adding insult to injury: neither module is supported through every interesting Python version. I have seen Python 2.4 deployed commercially as recently as summer 2016, and therefore consider it mandatory for the kind of library I want on my toolbelt. To support antique and chic Python alike, it was necessary to implement both approaches and select one at runtime. Many might see this as an opportunity to drop 2.4, but “just upgrade lol” is never a good answer while maintaining long shelf-life systems, and should never be a barrier to applying a trusted Swiss Army Knife.
After some busy days last September, I had a working scanner built around syntax trees, except for a tiny problem: it was ridiculously slow. Parsing the 8KiB mitogen.core module took 12ms on my laptop, which multiplied up is over a second of CPU burnt scanning dependencies for a package like Django. If memory serves, reality was closer to 3 seconds: far exceeding the latency saved while talking to a machine on a LAN.
Sometimes hacking bytecode makes perfect sense
I couldn’t stop groaning the day I abandoned ASTs. As is often true when following software industry best practice, we are left holding a decomposing trout that, while technically fulfilling its role, stinks horribly, costs all involved a fortune to support and causes pains worse than those it was intended to relieve. Still hoping to avoid regexes, I went digging for precedent elsewhere in tools dealing with the same problem.
That’s when I discovered the strange and unloved modulefinder buried in the standard library, a forgotten relic from a bygone era, seductively deposited there as a belated Christmas gift to all, on a gloomy New Year’s Eve 2002 by Guido’s own brother. Diving in, I was shocked and mesmerized to find dependencies synthesized by recompiling each module and extracting IMPORT_FROM opcodes from the compiled bytecode. Reimplementing a variant, I was overjoyed to discover django.db.models transitive dependencies enumerated in under 350ms on my laptop. A workable solution!
The solution has some further crazy results: IMPORT_FROM has barely changed since the Python 2.4 days, right through to Python 3.x. The same approach works everywhere, including PyPy, which uses the same bytecode format, making this more portable than the ast and compiler modules!
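As a rough sketch of the idea (not modulefinder's or Mitogen's code): compile a module's source and walk the resulting bytecode for import opcodes, where IMPORT_NAME carries the name of the module being imported. The snippet uses the Python 3 dis API for brevity; antique interpreters need a manual walk over co_code, and imports nested inside functions live in co_consts.

```python
import dis

def scan_imports(path):
    # Compile the module's source and collect the names it imports at
    # module level, in the spirit of the stdlib modulefinder.
    with open(path) as fp:
        code = compile(fp.read(), path, 'exec')
    names = []
    for inst in dis.get_instructions(code):
        if inst.opname == 'IMPORT_NAME':
            names.append(inst.argval)
    return names

# scan_imports('mitogen/core.py') might return ['sys', 'os', 'socket', ...]
```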
Coping with concurrency
Now that a mechanism exists to enumerate dependencies, we need a mode of delivery. The approach used is simplistic and, as seen later, will likely require future improvement.
On receiving a GET_MODULE message from a child, a parent (don’t forget, Mitogen operates recursively!) first tries to satisfy the request from its own cache, before forwarding it upwards towards the master. The master sends LOAD_MODULE messages for all dependencies known to be missing from the child before sending a final message containing the module that was actually requested. Since contexts always cache unsolicited LOAD_MODULE messages from upstream, by the time the message arrives for the requested module, many dependencies should be in RAM and no further network roundtrips requesting them are required.
Meanwhile, for each stream connected to a parent, the set of module names ever delivered on that stream is recorded. Each parent is allowed to ignore any GET_MODULE for which a corresponding LOAD_MODULE has already been sent, preventing a race between in-flight requests from causing the same module to be sent twice.
This places the onus on downstream contexts to ensure the single LOAD_MODULE message received for each distinct module always reaches every interested party. In short, GET_MODULE messages must be deduplicated and synchronized not only for any arriving from a context’s children, but also from its own threads.
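In code, the bookkeeping amounts to something like this toy sketch; the real classes in Mitogen are rather more involved, and the names here are invented:

```python
class Stream(object):
    """One connection to a direct child."""
    def __init__(self, name):
        self.name = name
        self.sent = set()   # module names a LOAD_MODULE was ever sent for

def send_load_module(stream, fullname, source):
    # A parent may ignore any GET_MODULE whose LOAD_MODULE already went out
    # on this stream, so no module is ever delivered twice.
    if fullname not in stream.sent:
        stream.sent.add(fullname)
        print('LOAD_MODULE %s -> %s (%d bytes)' % (fullname, stream.name, len(source)))

def on_get_module(stream, fullname, cache, deps):
    # Known-missing dependencies go first, the requested module last, so by
    # the time the final message lands most of its imports are already cached.
    for dep in deps.get(fullname, ()):
        send_load_module(stream, dep, cache[dep])
    send_load_module(stream, fullname, cache[fullname])
```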
And finally the result. For my test script, the total number of roundtrips dropped from 166 to 13: one of those is for the script itself, and 3 are negative requests for extension modules that cannot be transferred. That leaves, bugs aside, 9 roundtrips to transfer the most obscene dependency I could think of.
One more look at the library’s network profile. Over the same connection as previously, the situation has improved immensely:
Not only is performance up, but the number of frames transmitted has dropped by 42%. That's 42% fewer chances of a connection hang due to crappy WiFi!
One final detail is visible: around the 10 second mark, a tall column of frames is sent with progressively increasing size, almost in the same instant. This is not some bug, it is Path MTU Discovery (PMTUD) in action. PMTUD is a mechanism by which IP subprotocols can learn the maximum frame size tolerated by the path between communicating peers, which in turn maximizes link efficiency by minimizing bandwidth wasted on headers. The size is ramped up until either loss occurs or an intermediary signals error via ICMP.
Just like the network path, PMTUD is dynamic and must restart on any signal indicating network conditions have changed. Comparing this graph with the previous, we see one final improvement as a result of giving the network layer enough data to do its job: PMTUD appears to restart much less frequently, and the stream is pegged at the true path MTU for much longer.
Aside from simple fixes to reduce wasted roundtrips for extension modules that can’t be imported, and optional imports of top-level packages that don’t exist on the master, there are two major niggles remaining in how import works today.
The first is an irritating source of latency present in deep trees: currently it is impossible for intermediary nodes satisfying GET_MODULE requests for children to begin streaming preloaded modules towards a child until the final LOAD_MODULE arrives at the intermediary for the module actually requested by the child. That means preloading is artificially serialized at each layer in the tree, when a better design would allow it to progress concurrently with the LOAD_MODULE messages still in flight from the master.
This will present itself when doing multi-machine hops where links between the machines are slow or suffer high latency. It will also be important to fix before handling hundreds to thousands of children, such as should become practical once asynchronous connect() is implemented.
There are various approaches to tweaking the design so that concurrency is restored, but I would like to let the paint dry a little on the new implementation before destabilizing it again.
The second major issue is almost certainly a bug waiting to be discovered, but I’m out of energy to attack it right now. It relates to complex situations where many children have different functions invoked in them, from a complex set of overlapping packages. In such cases, it is possible that a LOAD_MODULE for an unrelated GET_MODULE prematurely delivers the final module from another import, before it has had all requisite modules preloaded into the child.
To fix that, the library must ensure the tree of dependencies for every module request is sent downstream depth-first, i.e. it is never possible for any module to appear in a LOAD_MODULE before all of its dependencies have appeared first.
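One way to state that guarantee, assuming an acyclic map from each module to the names it imports (a sketch of the invariant, not the eventual fix):

```python
def depth_first(fullname, deps, seen=None):
    # Yield modules so that every dependency appears before anything that
    # imports it; sending LOAD_MODULE messages in this order satisfies the
    # invariant described above.
    if seen is None:
        seen = set()
    for dep in deps.get(fullname, ()):
        if dep not in seen:
            for name in depth_first(dep, deps, seen):
                yield name
    if fullname not in seen:
        seen.add(fullname)
        yield fullname

# deps = {'pkg.mod': ['pkg', 'pkg.util'], 'pkg.util': ['pkg']}
# list(depth_first('pkg.mod', deps)) == ['pkg', 'pkg.util', 'pkg.mod']
```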
Finally there are latency sources buried elsewhere in the library, including at least 2 needless roundtrips during connection setup. Fighting latency is an endless war, but with module loading working efficiently, the most important battle is over.
First and foremost, a glaring error: the linked CCC talk had absolutely nothing to do with the group that originated the KAISER patches. Please accept my sincerest apologies for that, I have absolutely no idea how the topics became intermingled -- possibly due to spending only a few hours reading, and around an hour at 4AM writing.
I cannot emphasize enough the article is (and can only be) supposition in a domain I lack expertise in, as repeatedly highlighted throughout. Written in the style of a paranoid and conspiratorial murder mystery, it was as fun to write as I hope it was to read, and besides, who doesn't enjoy a planetary scale whodunnit to pick over during the holidays? (Well, apparently a lot of super serious infosec types)
As with most stuff I post, the aim is not to be right on the Internet (after all, I lack a career in infosec!), but to generate the kind of frothy noise that leads to learning more about whatever it is I'm playing with, often precisely by being horribly wrong. In this case, anonymous feedback alongside some Hacker News commentary cleared up much of my conflation, indicated the issue may not affect AMD CPUs, and the embargo may end on January 4, now-common knowledge tidbits I consider resoundingly positive successes of writing the article at all.
The article was largely sourced from the marvellous work of the wonderful editor over at LWN, and intentionally did not make use of LWN's member link function to bypass the paywall. As is evident in various places on the Internet, the original LWN article has been shared repeatedly, but that decades old, consistently trustworthy yet dry format was insufficiently clickbaity for communicating what I believe promises to be one of the most interesting events in 2018 relevant to my profession.
I hope you would agree that however inaccurate, it contained sufficient additional value as to not be considered a clone, and meanwhile has thus far delivered 1,100 clicks to a page requesting the viewer subscribe to LWN to continue reading, creating awareness for what I worry is a high quality yet continuously surviving niche news outlet. This I also consider a positive result.
An original source branded this blog a "regurgibloid" for daring to link some already-public tweets that, to my knowledge, have never previously appeared on the same page anywhere on the Internet, suggesting instead that the common uninformed techie shall have but two options for acquiring what was allegedly already public knowledge: consume thousands of tweets comprised primarily of egotistical teen angst stretching back aeons, often filled with even less accurate supposition than this article, or remain coldly in the dark until such times as CPUBLEED.COM (or whatever this bug is branded), powered by WordPress and hastily drawn cartoons, gets peppered all over the BBC news.
It's sufficient to say this attitude is beyond repugnant, frankly embarrassing, and there could be nothing closer to why I'm thankful almost daily that I never pursued a career in infosec. Please remind yourself at least once a year, us clueless lowly plebeian developers are the very reason you earn an income at all! Without us there'd be nothing to break, no precious knowledge to hoard, and nothing over which to self-aggrandize.
And finally: since I signed up for Tumblr, its editor UI, once a bastion of modern web app design, has regressed to the point of almost total unusability. Correcting the numerous typos in the post required reformatting it from start to end each time I clicked save. I have left the typos verbatim, partly as a sign of the haste in which it was written, but mainly as a kind of negative reinforcement loop to eventually push me off Tumblr. Lord knows it would avert some of the cheapest feedback the article received.
I wish there were some moral to finish with, but really the holidays are over, the mystery continues, and all that remains is a bad taste from all the flak I have received for daring to intrude upon the sacred WordPress-powered tapestry of a global security embargo. Trust me, it will never happen again -- life is simply too short.
tl;dr: there is presently an embargoed security bug impacting apparently all contemporary CPU architectures that implement virtual memory, requiring hardware changes to fully resolve. Urgent development of a software mitigation is being done in the open and recently landed in the Linux kernel, and a similar mitigation began appearing in NT kernels in November. In the worst case the software fix causes huge slowdowns in typical workloads. There are hints the attack impacts common virtualization environments including Amazon EC2 and Google Compute Engine, and additional hints the exact attack may involve a new variant of Rowhammer.
I don't really care much for security issues normally, but I adore a little intrigue, and it seems anyone who would normally write about these topics is either somehow very busy, or already knows the details and isn't talking, which leaves me with a few hours on New Year's Day to go digging for as much information about this mystery as I could piece together.
Beware this is very much a connecting-the-invisible-dots type affair, so it mostly represents guesswork until such times as the embargo is lifted. From everything I’ve seen, including the vendors involved, many fireworks and much drama is likely when that day arrives.
The purpose of the series is conceptually simple: to prevent a variety of attacks by unmapping as much of the Linux kernel as possible from the process page table while the process is running in user space, greatly hindering attempts to identify kernel virtual address ranges from unprivileged userspace code.
The group’s paper describing KAISER, KASLR is Dead: Long Live KASLR, makes specific reference in its abstract to removing all knowledge of kernel address space from the memory management hardware while user code is active on the CPU.
Of particular interest with this patch set is that it touches a core, wholly fundamental pillar of the kernel (and its interface to userspace), and that it is obviously being rushed through with the greatest priority. When reading about memory management changes in Linux, usually the first reference to a change happens long before the change is ever merged, and usually after numerous rounds of review, rejection and flame war spanning many seasons and moon phases.
The KAISER (now KPTI) series was merged in some time less than 3 months.
On the surface, the patches appear designed to ensure Address Space Layout Randomization remains effective: this is a security feature of modern operating systems that attempts to introduce as many random bits as possible into the address ranges for commonly mapped objects.
For example, on invoking /usr/bin/python, the dynamic linker will arrange for the system C library, heap, thread stack and main executable to all receive randomly assigned address ranges:
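(A stand-in illustration rather than the original listing: it prints the running interpreter's own heap mapping, which lands at a different address range on every execution while ASLR is active.)

```python
# Run this a few times; the [heap] start and end addresses change each run.
with open('/proc/self/maps') as fp:
    for line in fp:
        if '[heap]' in line:
            print(line.rstrip())
```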
Notice how the start and end offsets of the heap change across runs.
The effect of this feature is that, should a buffer management bug lead to an attacker being able to overwrite some memory address pointing at program code, and that address should later be used in program control flow, such that the attacker can divert control flow to a buffer containing contents of their choosing, it becomes much more difficult for the attacker to populate the buffer with machine code that would lead to, for example, the system() C library function being invoked, as the address of that function varies across runs.
This is a simple example, ASLR is designed to protect many similar such scenarios, including preventing the attacker from learning the addresses of program data that may be useful for modifying control flow or implementing an attack.
KASLR is “simply” ASLR applied to the kernel itself: on each reboot of the system, address ranges belonging to the kernel are randomized such that an attacker who manages to divert control flow while running in kernel mode cannot guess addresses for functions and structures necessary for implementing their attack, such as locating the current process data, and flipping the active UID from an unprivileged user to root, etc.
Bad news: the software mitigation is expensive
The primary reason for the old Linux behaviour of mapping kernel memory in the same page tables as user memory is so that when the user’s code triggers a system call, fault, or an interrupt fires, it is not necessary to change the virtual memory layout of the running process.
Since it is unnecessary to change the virtual memory layout, it is further unnecessary to flush highly performance-sensitive CPU caches that are dependant on that layout, primarily the Translation Lookaside Buffer.
With the page table splitting patches merged, it becomes necessary for the kernel to flush these caches every time the kernel begins executing, and every time user code resumes executing. For some workloads, the effective total loss of the TLB around every system call leads to highly visible slowdowns: @grsecurity measured a simple case where Linux “du -s” suffered a 50% slowdown on a recent AMD CPU.
Recap: Virtual Memory
In the usual case, when some machine code attempts to load, store, or jump to a memory address, modern CPUs must first translate this virtual address to a physical address, by way of walking a series of OS-managed arrays (called page tables) that describe a mapping between virtual memory and physical RAM installed in the machine.
Virtual memory is possibly the single most important robustness feature in modern operating systems: it is what prevents, for example, a dying process from crashing the operating system, a web browser bug crashing your desktop environment, or one virtual machine running in Amazon EC2 from effecting changes to another virtual machine on the same host.
The attack works by exploiting the fact that the CPU maintains numerous caches, and by carefully manipulating the contents of these caches, it is possible to infer which addresses the memory management unit is accessing behind the scenes as it walks the various levels of page tables, since an uncached access will take longer (in real time) than a cached access. By detecting which elements of the page table are accessed, it is possible to recover the majority of the bits in the virtual address the MMU was busy resolving.
Evidence for motivation, but not panic
We have found motivation, but so far we have not seen anything to justify the sheer panic behind this work. ASLR in general is an imperfect mitigation and very much a last line of defence: barely six months go by without even a non-security-minded person being able to read about some new method for unmasking ASLR'd pointers, and reality has been this way for as long as ASLR has existed.
Fixing ASLR alone is not sufficient to describe the high priority motivation behind the work.
Evidence: it’s a hardware security bug
From reading through the patch series, a number of things are obvious.
First of all, as @grsecurity points out, some comments in the code have been redacted, and additionally the main documentation file describing the work is presently missing entirely from the Linux source tree.
Examining the code, it is structured in the form of a runtime patch applied at boot only when the kernel detects the system is impacted, using exactly the same mechanism that, for example, applies mitigations for the infamous Pentium F00F bug:
More clues: Microsoft have also implemented page table splitting
From a little digging through the FreeBSD source tree, it seems that so far other free operating systems are not implementing page table splitting, however as noted by Alex Ionescu on Twitter, the work is already not limited to Linux: public NT kernels from as early as November have begun to implement the same technique.
In this paper, we present novel Rowhammer attack and exploitation primitives, showing that even a combination of all defenses is ineffective. Our new attack technique, one-location hammering, breaks previous assumptions on requirements for triggering the Rowhammer bug
As a quick recap, Rowhammer is a class of problem fundamental to most (all?) kinds of commodity DRAMs, such as the memory in the average computer. Through precise manipulation of one area of memory, it is possible to cause degradation of storage in a related (but otherwise logically distinct) area of memory. The effect is that Rowhammer can be used to flip bits of memory that unprivileged user code should have no access to, such as bits describing how much access that code should have to the rest of the system.
I found this work on Rowhammer particularly interesting, not least for its release being in such close proximity to the page table splitting patches, but because Rowhammer attacks require a target: you must know the physical address of the memory you are attempting to mutate, and a first step to learning a physical address may be learning a virtual address, such as in the KASLR unmasking work.
Guesswork: it affects major cloud providers
On the kernel mailing list we can see, in addition to the names of subsystem maintainers, e-mail addresses belonging to employees of Intel, Amazon and Google. The presence of the two largest cloud providers is particularly interesting, as this provides us with a strong clue that the work may be motivated in large part by virtualization security.
Which leads to even more guessing: virtual machine RAM, and the virtual memory addresses used by those virtual machines are ultimately represented as large contiguous arrays on the host machine, arrays that, especially in the case of only 2 tenants on a host machine, are assigned by memory allocators in the Xen and Linux kernels that likely have very predictable behaviour.
Favourite guess: it is a privilege escalation attack against hypervisors
Putting it all together, I would not be surprised if we start 2018 with the release of the mother of all hypervisor privilege escalation bugs, or something similarly systematic as to drive so much urgency, and the presence of so many interesting names on the patch set’s CC list.
One final tidbit: while I've lost my place reading through the patches, there is some code that specifically marks either paravirtual or HVM Xen as unaffected.
Invest in popcorn, 2018 is going to be fun
It's totally possible this guess is miles off reality, but one thing is for sure: it's going to be an exciting few weeks when whatever this is finally gets published.
After many years of occasional commitment, I'm finally getting close to a solid implementation of a module I've been wishing existed for over a decade: given a remote machine and an SSH connection, just magically make Python code run on that machine, with no hacks involving error-prone shell snippets, temporary files, or hugely restrictive single use request-response shell pipelines, and suchlike.
I'm borrowing some biology terminology and calling it Mitogen, as that's pretty much what the library does. Apply some to your program, and it magically becomes able to recursively split into self-replicating parts, with bidirectional communication and message routing between all the pieces, without any external assistance beyond an SSH client and/or sudo installation.
Mitogen's goal is straightforward: make it child's play to run Python code on remote machines, eventually regardless of connection method, without being forced to leave the rich and error-resistant joy that is a pure-Python environment. My target users would be applications like Ansible, Salt, Fabric and similar, who (through no fault of their own) are universally forced to resort to obscene hacks in their implementations to effect a similar result. Mitogen may also be of interest to would-be authors of pure Python Internet worms, although support for autonomous child contexts is currently (and intentionally) absent.
Because I want this tool to be useful to infrastructure folk, Mitogen does not require free disk space on the remote machines, or even a writeable filesystem -- everything is done entirely in RAM, making it possible to run your infrastructure code against a damaged machine, for example to implement a repair process. Newly spawned Python interpreters have import hooks and logging handlers configured so that everything is fetched or forwarded over the network, and the only disk accesses necessary are those required to start a remote interpreter.
Mitogen can be used recursively: newly started child contexts can in turn be used to run portions of itself to start children-of-children, with message routing between all contexts handled automatically. Recursion is used to allow first SSHing to a machine before sudoing to a new account, all with the user's Python code retaining full control of each new context, and executing code in them transparently, as easily as if no SSH or sudo connection were involved at all. The master context is able to control and manipulate children created in this way as easily as if they were directly connected; the API remains the same.
Currently there exist just two connection methods, ssh and sudo, with the sudo support able to cope with typing passwords interactively, and with crap configurations that have requiretty enabled.
I am explicitly planning to support Windows, either via WMI, psexec, or Powershell Remoting. As for other more exotic connection methods, I might eventually implement bootstrap over an IPMI serial console connection if for nothing else then as a demonstrator of how far this approach can be taken, but the ability to use the same code to manage a machine with or without a functional networking configuration would be in itself a very powerful feature.
This looks a bit like X. Isn't this just X?
Mitogen is far from the first Python library to support remote bootstrapping, but it may be the first to specifically target infrastructure code, minimal networking footprint, read-only filesystems, stdio and logging redirection, cross-child communication, and recursive operation. Notable similar packages include Pyro and py.execnet.
This looks a bit like Fabric. Isn't this just Fabric?
Fabric's API feels kinda similar to what Mitogen offers, but it fundamentally operates in terms of chunks of shell snippets to implement all its functionality. You can't easily (at least, as far as I know) trick Fabric into running your Python code remotely, or for that matter recursively across subsequent sudo and SSH connections, and arrange for that code to communicate bidirectionally with code running in the local process and autonomously between any spawned children.
Mitogen internally reuses this support for bidirectional communication to implement some pretty exciting functionality:
SSH Client Emulation
So your program has an elaborate series of tunnels setup, and it's running code all over the place. You hit a problem, and suddenly feel the temptation to drop back to raw shell and SSH again: "I just need to sync some files!", you tell yourself, before loudly groaning on realizing the spaghetti of duplicated tunnel configurations that would be required to get rsync running the same way as your program. What's more, you realize that you can't even use rsync, because you're relying on Mitogen's ability to run code over sudo with requiretty enabled, and you can't even directly log into that target account.
Not a problem: Mitogen supports running local commands with a modified environment that causes their attempt to use SSH to run remote command lines to be redirected into Mitogen, and tunnelled over your program's existing tunnels. No duplicate configuration, no wasted SSH connections, no 3-way handshake latency.
The primary goal of the SSH emulator is to simplify porting existing infrastructure scripts away from shell, including those already written in Python. As a first concrete target for Mitogen, I aim to retrofit it to Ansible as a connection plug-in, where this functionality becomes necessary to support e.g. Ansible's synchronize module.
Compared To Ansible
To understand the value of Mitogen, a short comparison against Ansible may be useful. I created an Ansible playbook talking to a VMWare Fusion Ubuntu machine, with SSH pipelining enabled (the current best performance mode in Ansible). The playbook simply executes /bin/true with become: true and discards the result 100 times.
[Side note: this is comparing performance characteristics only, in particular I am not advocating writing code against Mitogen directly! It's possible, but you get none of the ease of use that a tool like Ansible provides. On saying that, though, a Mitogen-enabled tool composed of tens of modules would have similar performance to the numbers below, just a slightly increased base cost due to initial module upload]
Mitogen local loop
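Roughly what the local loop variant does, going from memory of the API; the hostname is invented and details may differ from the exact benchmark script:

```python
import subprocess
import mitogen

@mitogen.main()
def main(router):
    # One SSH child, then a root child reached via it, mirroring become: true.
    host = router.ssh(hostname='u1604')
    root = router.sudo(via=host)
    # The loop stays on the master, so every call is one network round-trip.
    for _ in range(100):
        root.call(subprocess.check_call, ['/bin/true'])
```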
Mitogen remote loop
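And the remote loop variant, again a sketch rather than the exact benchmark script: the loop itself is shipped to the child, so the hundred invocations involve no per-call round-trips at all.

```python
import subprocess

def run_true_many(count=100):
    # Executes entirely inside the child; only the final return value
    # travels back over the network.
    for _ in range(count):
        subprocess.check_call(['/bin/true'])

# In main() from the previous sketch, replacing the local loop:
#     root.call(run_true_many, 100)
```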
The first and most obvious property of Ansible is that it uses a metric crap-ton of bandwidth, averaging 45kb of data for each run of /bin/true. In comparison, the raw command line "ssh host /bin/true" generates only 4.7kb and 311ms, including SSH connection setup and teardown.
Bandwidth aside, CPU alone cannot account for runtime duration, clearly significant roundtrips are involved, generating sufficient latency to become visible on an in-memory connection to a local VM. Why is that? Things are about to get real ugly, and I'm already starting to feel myself getting depressed. Remember those obscene hacks I mentioned earlier? Well, buckle your seatbelt Dorothy, because Kansas is going bye-bye..
[Side note: the name Ansible is borrowed from Ender's Game, where it refers to a faster-than-light communication technology. Giggles]
When you write some code in Ansible, like shell: /bin/true, you are telling Ansible (in most cases) that you want to execute a module named shell.py on the target machine, passing /bin/true as its argument.
So far, so logical. But how is Ansible actually running shell.py? "Simple", by default (no pipelining) it looks like this:
First it scans shell.py for every module dependency,
then it adds the module and all dependents into an in-memory ZIP file, alongside a file containing the module's serialized arguments,
then it base64-encodes this ZIP file and mixes it into a templatized self-extracting Python script (module_common.py),
then it writes the templatized script to the local filesystem, where it can be accessed by sftp,
first it creates yet another temporary directory on the target machine, this time using the tempfile module,
then it writes a base64-decoded copy of the embedded ZIP file as ansible_modlib.zip into that directory,
then it opens the newly written ZIP file using the zipfile module and extracts the module to be executed into the same temporary directory, named like ansible_mod_<modname>.py,
then it opens the newly written ZIP file in append mode and writes a custom sitecustomize.py module into it, causing the ZIP file to be written to disk for a second time on this machine, and a third time in total,
then it uses the subprocess module to execute the extracted script, with PYTHONPATH set to cause Python's ZIP importer to search for additional dependent modules inside the extracted-and-modified ZIP file,
then it uses the shutil module to delete the second temporary directory,
then the shell snippet that executed the templatized script is used to run rm -rf over the first temporary directory.
When pipelining is disabled, which is the default, and required for cases where sudo has requiretty enabled, these steps (and their associated network roundtrips) recur for every single playbook step. And now you know why Ansible makes execution over a local 1Gbit LAN feel like it's communicating with a host on Mars.
Need a breath? Don't worry, things are about to get better. Here are some pretty graphs to look at while you're recovering..
The Ugly (from your network's perspective)
This shows Ansible's pipelining mode, constantly reuploading the same huge data part and awaiting a response for each run. Be sure to note the sequence numbers (transmit byte count) and the scale of the time axis:
Now for Mitogen, demonstrating vastly more conservative use of the network:
The SSH connection setup is clearly visible in this graph, accounting for about the first 300ms on the time axis. Additional excessive roundtrips are visible as Mitogen waits for its command-line to signal successful first stage bootstrap before uploading the main implementation, followed by 2 subsequent roundtrips, first to fetch the mitogen.sudo module and then the mitogen.master module. Eliminating module import roundtrips like these will probably be an ongoing battle, but there is a clean 80% solution that would apply in this specific case; I just haven't gotten around to implementing it yet.
The fine curve representing repeated executions of /bin/true is also visible: each bump in the curve corresponds to what was a huge data upload in Ansible's trace earlier, but since Mitogen caches code in RAM on the target, it doesn't need to re-upload everything for each call, or start a new Python process, or rewrite a ZIP file on disk, or .. etc.
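The "caches code in RAM" part deserves a concrete picture. The snippet below is a hypothetical illustration, not Mitogen's code: a meta_path importer serving modules from an in-memory source cache, which is roughly how a child context can keep satisfying imports it has already fetched from its parent without ever touching the disk.

import importlib.abc
import importlib.util
import sys

# Module source held purely in memory; in the real thing this would be
# filled over the connection to the parent context and kept between calls.
SOURCE_CACHE = {
    'fakemod': "def greet():\n    return 'loaded from RAM'\n",
}

class InMemoryFinder(importlib.abc.MetaPathFinder, importlib.abc.SourceLoader):
    def find_spec(self, fullname, path=None, target=None):
        if fullname in SOURCE_CACHE:
            return importlib.util.spec_from_loader(fullname, self)
        return None

    def get_filename(self, fullname):
        return fullname                    # reused as the cache key below

    def get_data(self, path):
        return SOURCE_CACHE[path].encode()

sys.meta_path.insert(0, InMemoryFinder())

import fakemod                             # satisfied from RAM, no disk involved
print(fakemod.greet())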
Finally one last graph, showing Mitogen with the execution loop moved to the remote machine. All the latency induced by repeatedly invoking /bin/true from the local machine has disappeared.
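To make that difference concrete in code, here is a speculative sketch. It assumes an API along the lines of what Mitogen's documentation describes (router.ssh(), Context.call(), mitogen.utils.run_with_router); the host name and the run_many() helper are invented for illustration, and none of this is taken from the prototype extension.

import subprocess
import mitogen.utils

def run_many(cmd, count):
    # Runs entirely inside the target context: no per-iteration roundtrip.
    for _ in range(count):
        subprocess.check_call(cmd)
    return count

def main(router):
    host = router.ssh(hostname='testvm.example.com')

    # Loop on the master: one CALL_FUNCTION roundtrip per /bin/true run.
    for _ in range(100):
        host.call(subprocess.check_call, ['/bin/true'])

    # Loop moved to the target: a single roundtrip covers all 100 runs.
    host.call(run_many, ['/bin/true'], 100)

if __name__ == '__main__':
    mitogen.utils.run_with_router(main)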
The Less Ugly
Ansible's pipelining mode is much better, and somewhat resembles Mitogen's own bootstrap process. Here the templatized initial script is fed directly into the target Python interpreter; however, the two immediately deviate, since Ansible picks up at step 8 above, extracting the embedded ZIP file to disk and discarding all the code it uploaded once the playbook step completes, with no effort made to preserve either the spawned Python processes or the significant amount of module code uploaded for each step.
Pipelining mode is a huge improvement, however it still uses the SSH stdio pipeline only once (which was expensive to set up, even with multiplexing enabled), uses the destination Python interpreter only once (usually 100 ms+ per invocation), and, as mentioned repeatedly, caches no code on the target, not even on disk.
When Mitogen is executing your Python function:
1. it executes SSH with a single Python command-line,
2. then it waits for that command-line to report "EC0" on stdout,
3. then it writes a copy of itself over the SSH pipe,
3.1. meanwhile the remote Python interpreter forks into two processes,
3.2. the first re-execs itself to clear the huge Python command-line passed over SSH, and resets argv to something descriptive,
3.3. the second signals "EC0" and waits for the parent context to send 7KiB worth of Mitogen source, which it decompresses and feeds to the first before exiting,
3.4. the Mitogen source reconfigures the Python module importer, stdio, and logging framework to point back into itself, then starts a private multiplexer thread,
3.5. the main thread writes "EC1" then sleeps waiting for CALL_FUNCTION messages,
3.6. meanwhile the multiplexer routes messages between this context's main thread, the parent, and any child contexts, and waits for something to trigger shutdown,
4. then it waits for the remote process to report "EC1",
5. then it writes a CALL_FUNCTION message which includes the target module, class, and function name and parameters,
5.1. the slave receives the CALL_FUNCTION message and begins execution, satisfying in-RAM module imports using the connection to the parent context as necessary.
On subsequent invocations of your Python function, or other functions from the same module, only steps 3.6, 5, and 5.1 are necessary.
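The handshake itself is easy to play with in isolation. The toy below is not Mitogen's real bootstrap: there is no fork or re-exec, no router, a local Python 3 child process stands in for ssh, and the length-prefix framing is invented for the toy. It only demonstrates the shape of steps 1-5: a tiny first stage announces "EC0", receives compressed source over the pipe, and that source announces "EC1" before serving requests.

import subprocess
import sys
import zlib

# Stage 1: small enough to pass on a command line (cf. step 1).
STAGE1 = (
    "import sys,zlib;"
    "sys.stdout.write('EC0\\n');sys.stdout.flush();"
    "b=sys.stdin.buffer;n=int(b.readline());"
    "exec(zlib.decompress(b.read(n)))"
)

# Stage 2: the "real" program, sent compressed over the pipe (cf. step 3).
STAGE2 = b"""
import sys
sys.stdout.write('EC1\\n')
sys.stdout.flush()
for line in sys.stdin:                 # a stand-in for CALL_FUNCTION messages
    sys.stdout.write('ran: %s' % line)
    sys.stdout.flush()
"""

proc = subprocess.Popen([sys.executable, '-c', STAGE1],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
assert proc.stdout.readline().strip() == b'EC0'     # step 2: stage 1 is up
payload = zlib.compress(STAGE2)
proc.stdin.write(b'%d\n' % len(payload))            # toy length prefix
proc.stdin.write(payload)                           # step 3: upload the source
proc.stdin.flush()
assert proc.stdout.readline().strip() == b'EC1'     # step 4: stage 2 is running
proc.stdin.write(b'/bin/true\n')                    # step 5: a pretend call
proc.stdin.flush()
print(proc.stdout.readline().decode().strip())      # prints: ran: /bin/true
proc.stdin.close()
proc.wait()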
This all sounds fine and dandy, but how can I use it?
I'm working on it! For now my goal is to implement enough functionality so that Mitogen can be made to work with Ansible's process model. The first problem is that Ansible runs playbooks using multiple local processes, and has no subprocess<->host affinity, so it is not immediately possible to cache Mitogen's state for a host. I have a solid plan for solving that, but it's not yet implemented.
There are a huge variety of things I haven't started yet, but will eventually be needed for more complex setups:
Asynchronous connect(): so large numbers of contexts can be spawned in reasonable time. For, say, 3 tiers targeting a 1,500-node network and connecting in 30 seconds or so: a per-rack tier connecting to 38-42 end nodes, a per-quadrant tier connecting to 10 or so racks, a single box in the datacentre tier providing access to a management LAN while reducing latency and caching uploaded modules within the datacentre's network, and the top-level tier, which is the master program itself.
Better Bootstrap, Module Caching And Prefetching: currently Mitogen is wasting network roundtrips in various places. This makes me lose sleep.
General Robustness: no doubt with real-world use, many edge cases, crashes, hangs, races and suchlike will be discovered. Of those, I'm most concerned with ensuring the master process never hangs on CTRL+C or SIGTERM, and that in the case of master disconnect, orphaned contexts shut down completely 100% of the time, even if their main thread has hung.
Better Connection Types: it should at least support SSH connection setup over a transparently forwarded TCP connection (e.g. via a bastion host), so that key material never leaves the master machine. Additionally I haven't even started on Windows support yet.
Security Audit: currently the package is using cPickle with a highly restrictive class whitelist. I still think it should be possible to use this safely, but I'm not yet satisfied that it is; a sketch of the whitelisting idea appears after this list. I'd also like it to optionally use JSON if the target Python version is modern enough. Additionally, some design tweaks are needed to ensure a compromised slave cannot use Mitogen to cross-infect neighbouring nodes.
Richer Primitives: I've spent so much effort keeping the core of Mitogen compact that the overall design has suffered, and while almost anything is possible using the base code, it often involves scrabbling around in the internal plumbing to get things working. Specifically I'd like to make it possible to pass Context handles as RPC parameters, and to generalise the fakessh code so that it can handle other kinds of forwarding (e.g. TCP connections, or additional UNIX pipe scenarios).
Tests: the big one. I've only started to think about tests recently as the design has settled, but so much system-level trickery is employed, always spread across at least 2 processes, that an effective test strategy remains elusive. Logical tests don't capture any of the complex OS/IO ordering behaviour, and while typical integration tests would capture that, they are too coarse to rely on for catching new bugs quickly and with strong specificity.
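On the Security Audit point above, the whitelist pattern is the standard find_class() override documented for the pickle module. The sketch below is purely illustrative and is not Mitogen's implementation; the class list and function names are made up.

import io
import pickle

ALLOWED = {
    ('builtins', 'list'),
    ('builtins', 'dict'),
    ('builtins', 'tuple'),
}

class WhitelistUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Only consulted for global/class references; plain containers,
        # strings and numbers never reach this hook.
        if (module, name) not in ALLOWED:
            raise pickle.UnpicklingError(
                'refusing to load %s.%s' % (module, name))
        return super().find_class(module, name)

def safe_loads(data):
    return WhitelistUnpickler(io.BytesIO(data)).load()

print(safe_loads(pickle.dumps({'path': '/bin/true', 'args': []})))  # fine
try:
    safe_loads(pickle.dumps(WhitelistUnpickler))    # a class reference: rejected
except pickle.UnpicklingError as exc:
    print(exc)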
Why are you writing about this now?
If you read this far, there's a good chance you either work in infrastructure tooling, or were so badly burned by your experience there that you moved into management. Either way, you might be the person who could help me spend more time on this project. Perhaps you are on a 10-person team with a budget, where 30% of the man-hours are being wasted on Ansible's connection latency? If so, you should definitely drop me an e-mail.
The problem with a project like this is that it is almost impossible to justify commercially: it is much closer to research than product, and nobody ever wants to pay for that. However, that phase is over. The base implementation looks clean and feels increasingly solid, my development tasks are becoming increasingly target-driven, and I'd love the privilege of polishing up what I have, to make contemporary devops tooling a significantly less depressing experience for everyone involved.
If you merely made it to the bottom of the article because you're interested or have related ideas, please drop me an e-mail. It's not quite ready for prime time, but things work well enough that early experimentation is probably welcome at this point.
Meanwhile I will continue aiming to make it suitable for use with Ansible, or perhaps a gentle fork of Ansible, since its internal layering isn't the greatest.