This is the second update on the status of developing the Mitogen extension for Ansible, only 2 weeks late!
Too long, didn’t read
Gearing up to remove the scary warning labels and release a beta! Running a little behind, but not terribly. Every major risk is solved except file transfer, which should be addressed this week.
23 days, 257 commits, 186 files changed, 7292 insertions(+), 1503 deletions(-)
Just tuning in?
Started: Python 3 Support
A very rough branch exists for this, and I’m landing volleys of fixes when I have downtime between bigger pieces of work. Ideally this should have been ready by the end of April, but it may take a few weeks more.
I originally hoped to have a clear board before starting this; instead it is being interwoven as busywork whenever I need a break from whatever else I’m working on.
Done: multiplexer throughput
The situation has improved massively. Hybrid TTY/socketpair mode is a thing and, as promised, it helps significantly, just not quite as much as I hoped.
Today on a 2011-era MacBook Pro Mitogen can pump an SSH client/daemon at around 13MB/sec, whereas scp in the same configuration hits closer to 19MB/sec. In the case of SSH, moving beyond this is not possible without a patched SSH installation, since SSH hard-wires its buffer sizes around 16KB, with no ability to override them at runtime.
With multiple SSH connections that 13MB/sec should cleanly multiply up, since every connection can be served in a single IO loop iteration.
A bunch of related performance fixes were landed, including removal of yet another special case for handling deferred function calls, only taking locks when necessary, and reducing the frequency of the stream implementations modifying the status of their descriptors' readability/writeability.
As we’re in the ballpark of existing tools, I’m no longer considering this as much of a priority as before. There is definitely more low-hanging fruit, but out-of-the-box behaviour should no longer raise eyebrows.
Done: task isolation
As before, by default each script is compiled once; however, it is now re-executed in a spotless namespace prior to each invocation, working around any globals/class variable sharing issues that may be present. The cost of this is negligible, on the order of 100 usec.
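To make that concrete, here’s a minimal Python sketch of the compile-once, fresh-namespace approach; it’s my approximation for illustration, not the extension’s actual code:

```python
# Sketch only: compile the module source a single time, then execute it in
# a spotless namespace on every invocation so no state leaks between tasks.
source = 'print("hello from a module")'              # stand-in module body
code = compile(source, 'ansible_module.py', 'exec')  # compiled once

def invoke_module():
    namespace = {'__name__': '__main__'}             # fresh globals per run
    exec(code, namespace)

invoke_module()   # each call sees a clean namespace
invoke_module()
```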
When this is insufficient, a mitogen_task_isolation=fork per-task variable exists to allow explicitly forcing a particular module to run in a new process. Enabling this by default causes something on the order of a 33% slowdown, which is much better than expected, but still not good enough to enable forking by default.
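In principle, fork-based isolation looks something like the toy sketch below; the helper is hypothetical and is not the extension’s real API:

```python
import json
import os

def run_forked(invoke, *args):
    # Hypothetical helper: run invoke(*args) in a child process so its
    # side effects (mutated globals, leaked state) die with the child.
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:                      # child: run the module, report back
        os.close(read_fd)
        os.write(write_fd, json.dumps(invoke(*args)).encode())
        os._exit(0)
    os.close(write_fd)                # parent: collect result, reap child
    data = os.read(read_fd, 1 << 20)
    os.waitpid(pid, 0)
    return json.loads(data)

print(run_forked(lambda a, b: {'sum': a + b}, 1, 2))
```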
Aside from building up a blacklist of modules that should always be forked, task isolation is pretty much all done, with just a few performance regressions remaining to fix in the forking case.
Done: exotic module support
Every style of Ansible module is supported aside from the prehistoric “module replacer” type. That means all of these work today and are covered by automated tests:
- Built-in new-style Python scripts
- User-supplied new-style Python scripts
- Ancient key=value style input scripts
- Statically linked Go programs
- Perl scripts
Python module support was updated to remove the monkey-patching in use before. Instead, sys.stdin, sys.stdout and sys.stderr are redirected to StringIO objects, allowing a much larger variety of custom user scripts to be run in-process even when they don’t use the new-style Ansible module APIs.
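A minimal sketch of that redirection technique (an illustration, not Mitogen’s actual implementation):

```python
import sys
from io import StringIO

def run_captured(code, namespace):
    # Swap the process-wide stdio objects for StringIO before running the
    # module, then restore them and hand back whatever was written.
    saved = sys.stdin, sys.stdout, sys.stderr
    sys.stdin, sys.stdout, sys.stderr = StringIO(), StringIO(), StringIO()
    try:
        exec(code, namespace)
    finally:
        out, err = sys.stdout.getvalue(), sys.stderr.getvalue()
        sys.stdin, sys.stdout, sys.stderr = saved
    return out, err

code = compile('print("captured!")', 'module.py', 'exec')
print(run_captured(code, {'__name__': '__main__'}))   # ('captured!\n', '')
```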
Done: free strategy support
The "free" strategy can now be used by specifying ANSIBLE_STRATEGY=mitogen_free
. The mitogen
strategy is now an alias of mitogen_linear
.
Done: temporary file handling
This should be identical to Ansible’s handling in all cases.
Done: interpreter recycling
An upper bound exists to prevent a remote machine from being spammed with thousands of Python interpreters, which was previously possible when e.g. using a with_items loop that templatized become_user.
Once 20 interpreters exist, the extension shuts down the most recently created interpreter before starting a new one. This strategy isn’t perfect, but it should suffice to avoid raised eyebrows in most common cases for the time being.
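As a toy model of that policy (the names below are mine, not the extension’s API):

```python
MAX_INTERPRETERS = 20
interpreters = []                       # ordered oldest -> newest

def get_interpreter(spawn, spec):
    # spawn() stands in for whatever actually starts a remote interpreter.
    if len(interpreters) >= MAX_INTERPRETERS:
        victim = interpreters.pop()     # most recently created, per the text
        victim.shutdown()
    interp = spawn(spec)
    interpreters.append(interp)
    return interp
```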
Done: precise standard IO emulation
Ansible’s complex semantics for when it does/does not merge stdout and stderr during module runs are respected in every case, including emulation of extraneous \r characters. This may seem like a tiny and pointless nit, but it is almost certainly the difference between a tested real-world playbook succeeding under the extension and breaking horribly.
Done: async tasks
We’re on the third iteration of asynchronous tasks, and I really don’t want to waste any more time on it. The new implementation works a lot more like Ansible’s existing implementation, for as much as that implementation can be said to “work” at all.
Done: better error messages
Connection errors no longer crash with an inscrutable stack trace, but trigger Ansible’s internal error handling by raising the right exception types.
Mitogen’s logging integration with the Ansible display framework is much improved, and errors and warnings correctly show up on the console in red without having to specify -vvv.
Still more work to do on this when internal RPCs fail, but that’s less likely to be triggered than a connection error.
New debugging mode
An “emergency” debugging mode has been added, in the form of MITOGEN_DUMP_THREAD_STACKS=1. When this is present, every interpreter will dump the stack of every thread into the logging framework every 5 seconds, allowing hangs to be more easily diagnosed directly from the controller machine’s logs.
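A rough approximation of such a dumper in plain Python (not Mitogen’s actual implementation):

```python
import logging
import sys
import threading
import time
import traceback

def dump_thread_stacks(interval=5):
    # Every `interval` seconds, log a traceback for every live thread.
    while True:
        for thread_id, frame in sys._current_frames().items():
            stack = ''.join(traceback.format_stack(frame))
            logging.info('stack for thread %x:\n%s', thread_id, stack)
        time.sleep(interval)

threading.Thread(target=dump_thread_stacks, daemon=True).start()
```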
While adding this, it struck me that there is a really sweet piece of functionality missing here that would be easy to add – an interactive debugger. This might turn up in the form of an in-process web server allowing viewing the full context hierarchy, and running code snippets against remotely executing stacks, much like Werkzeug’s interactive debugger.
Performance regressions
In addition to performance simply not being my focus recently, a lot of the new functionality has introduced import statements that impact code running in the target, and so performance has likely slipped a little from the originally posted benchmarks, most likely during run startup in the presence of a high-latency network.
I will be back to investigate these problems (and fix those for which no investigation is required – the module loader!) once all remaining functionality is stable.
File transfer
This seemingly simple function has required the greatest deal of thought of any issue I’ve encountered so far. The initial problem relates to flow control, and the absence of any natural mechanism to block a producer (the file server) while intermediary pipe buffers (i.e. the SSH connection) are filled.
Even when flow control exists, an additional problem arises, since with Mitogen there is no guarantee that one SSH connection serves only one target machine, especially once connection delegation is implemented. Some kind of bandwidth-sharing mechanism must also exist, without poorly reimplementing the entirety of TCP/IP in a Python script.
For the initial release I have settled on a basic design that should ensure the available bandwidth is fully utilized, with each upload target having its file data served on a first-come-first-served basis.
When any file transfer is active, one of the service threads in the associated connection multiplexer process (the same ones used for setting up connections) will be dedicated to a long-running loop that monitors every connected stream’s transmit queue size, enqueuing additional file chunks as the queue drains.
Files are served one-at-a-time to make it more likely that if a run is interrupted, rather than having every partial file transfer thrown away, at least a few targets will have received the full file, allowing that copy to be skipped when the play is restarted.
The initial implementation will almost certainly be replaced eventually, but this basic design should be sufficient for what is needed today, and should continue to suffice when connection delegation is implemented.
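To make the design concrete, here is a heavily simplified sketch of such a serving loop; every stream method below is an assumed stand-in rather than Mitogen’s real API:

```python
import time

CHUNK = 131072           # assumed chunk size: 128KiB
HIGH_WATER = 1048576     # stop refilling a stream above 1MiB queued

def serve(streams):
    # First-come-first-served: top up each connected stream's transmit
    # queue with chunks of its current file as the queue drains.
    while True:
        busy = False
        for stream in streams:
            f = stream.current_file()            # next pending file, or None
            while f is not None and stream.tx_queue_size() < HIGH_WATER:
                chunk = f.read(CHUNK)
                if not chunk:                    # file finished: move along
                    stream.finish_file()
                    f = stream.current_file()
                    continue
                stream.enqueue(chunk)
            busy = busy or f is not None
        if not busy:
            return                               # all transfers complete
        time.sleep(0.05)   # crude pacing; a real loop would wake on drain
```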
Testing / CI
The smattering of unit and integration tests that exist are running and passing under Travis CI. In preparation for a release, master is considered always-healthy and my development has moved to a new dmw branch.
I’m taking a “mostly top down” approach to testing, written in the form of Ansible playbooks, as this gives the widest degree of coverage, ensuring that high level Ansible behaviour is matched with/without the extension installed. For each new test written, the result must pass under regular Ansible in addition to Ansible with the extension.
“Bottom up” type tests are written as needs arise, usually when Ansible’s user interface doesn’t sufficiently expose whatever is being tested.
Also visible in Travis is a debops_common target: this is running all 255 tasks from DebOps common.yml against a Docker instance. It’s the first of what should be 4-5 similar DebOps jobs, deploying real software with the final extension.
I have begun exploring integrating the extension with Ansible’s own integration tests, but it looks likely this is too large a job for Travis. Work here is ongoing.
Security
A few items have been chipped off the list.
- Message source verification was audited everywhere, and is covered by automated tests.
- All internal message handlers specify a policy indicating what kind of participants are allowed to deliver messages to them.
- As above, but for mitogen.service. A service cannot be exposed without attaching an access policy to it.
Notably absent is unidirectional routing mode. I will make time to finish that shortly.
User bug fixes
- Poor refactoring broke select EINTR handling
- SSH password was being supplied as the sudo password
- Acquiring a controlling TTY was fixed on FreeBSD
Summary
Super busy, slightly behind! Until next time…