A very rough branch exists for this, and I’m landing volleys of fixes when I have downtime between bigger pieces of work. Ideally this should have been ready for the end of April, but it may take a few weeks more.
I originally hoped to have a clear board before starting this, instead it is being interwoven as busywork when I need a break from whatever else I’m working on.
Done: multiplexer throughput
The situation has improved massively. Hybrid TTY/socketpair mode is a thing and as promised it significantly helps, but just not quite as much as I hoped.
Today on a 2011-era Macbook Pro Mitogen can pump an SSH client/daemon at around 13MB/sec, whereas scp in the same configuration hits closer to 19MB/sec. In the case of SSH, moving beyond this is not possible without a patched SSH installation, since SSH hard-wires its buffer sizes around 16KB, with no ability to override them at runtime.
With multiple SSH connections that 13MB should cleanly multiply up, since every connection can be served in a single IO loop iteration.
A bunch of related performance fixes were landed, including removal of yet another special case for handling deferred function calls, only taking locks when necessary, and reducing the frequency of the stream implementations modifying the status of their descriptors' readability/writeability.
As we’re in the ballpark of existing tools, I’m no longer considering this as much of a priority as before. There is definitely more low-hanging fruit, but out-of-the-box behaviour should no longer raise eyebrows.
Done: task isolation
As before, by default each script is compiled once, however it is now re-executed in a spotless namespace prior to each invocation, working around any globals/class variable sharing issues that may be present. The cost of this is negligible, on the order of 100 usec.
When this is insufficient, a mitogen_task_isolation=fork per-task variable exists to allow explicitly forcing a particular module to run in a new process. Enabling this by default causes something on the order of a 33% slowdown, which is much better than expected, but still not good enough to enable forking by default.
Aside from building up a blacklist of modules that should always be forked, task isolation is pretty much all done, with just a few performance regressions remaining to fix in the forking case.
Done: exotic module support
Every style of Ansible module is supported aside from the prehistorical “module replacer” type. That means today all of these work and are covered by automated tests:
Built-in new-style Python scripts
User-supplied new-style Python scripts
Ancient key=value style input scripts
Statically linked Go programs
Python module support was updated to remove the monkey-patching in use before. Instead, sys.stdin, sys.stdout and sys.stderr are redirected to StringIO objects, allowing a much larger variety of custom user scripts to be run in-process even when they don’t use the new-style Ansible module APIs.
Done: free strategy support
The "free" strategy can now be used by specifying ANSIBLE_STRATEGY=mitogen_free. The mitogen strategy is now an alias of mitogen_linear.
Done: temporary file handling
This should be identical to Ansible’s handling in all cases.
Done: interpreter recycling
An upper bound exists to prevent a remote machine from being spammed with thousands of Python interpreters, which was previously possible when e.g. using a with_items loop that templatized become_user.
Once 20 interpreters exist, the extension shuts down the most recently created interpreter before starting a new one. This strategy isn’t perfect, but it should suffice to avoid raised eyebrows in most common cases for the time being.
Done: precise standard IO emulation
Ansible’s complex semantics for when it does/does not merge stdout and stderr during module runs are respected in every case, including emulation of extraneous \r characters. This may seem like a tiny and pointless nit, however it is almost certainly the difference between a tested real-world playbook succeeding under the extension or breaking horribly.
Done: async tasks
We’re on the third iteration of asynchronous tasks, and I really don’t want to waste any more time on it. The new implementation works a lot more like Ansible’s existing implementaion, for as much as that implementation can be said to “work” at all.
Done: better error messages
Connection errors no longer crash with an inscrutible stack trace, but trigger Ansible’s internal error handling by raising the right exception types.
Mitogen’s logging integration with the Ansible display framework is much improved, and errors and warnings correctly show up on the console in red without having to specify -vvv.
Still more work to do on this when internal RPCs fail, but that’s less likely to be triggered than a connection error.
New debugging mode
An “emergency” debugging mode has been added, in the form of MITOGEN_DUMP_THREAD_STACKS=1. When this is present, every interpreter will dump the stack of every thread into the logging framework every 5 seconds, allowing hangs to be more easily diagnosed directly from the controller machine’s logs.
While adding this, it struck me that there is a really sweet piece of functionality missing here that would be easy to add – an interactive debugger. This might turn up in the form of an in-process web server allowing viewing the full context hierarchy, and running code snippets against remotely executing stacks, much like Werkzeug’s interactive debugger.
In addition to simply not being my focus recently, a lot of the new functionality has introduced import statements that impact code running in the target, and so performance has likely slipped a little from the original posted benchmarks, most likely during run startup in the presence of a high latency network.
I will be back to investigate these problems (and fix those for which no investigation is required – the module loader!) once all remaining functionality is stable.
This seemingly simple function has required the greatest deal of thought out of every issues I’ve encountered so far. The initial problem relates to flow control, and the absense of any natural mechanism to block a producer (file server) while intermediary pipe buffers (i.e. the SSH connection) are filled.
Even when flow control exists, an additional problem arises since with Mitogen there is no guarantee that one SSH connection = one target machine, especially once connection delegation is implemented. Some kind of bandwidth sharing mechanism must also exist, without poorly reimplementing the entirety of TCP/IP in a Python script.
For the initial release I have settled on basic design that should ensure the available bandwidth is fully utilized, with each upload target having its file data served on a first-come-first-served basis.
When any file transfer is active, one of the service threads in the associated connection multiplexer process (the same ones used for setting up connections) will be dedicated to a long-running loop that monitors every connected stream’s transmit queue size, enqueuing additional file chunks as the queue drains.
Files are served one-at-a-time to make it more likely that if a run is interrupted, rather than having every partial file transfer thrown away, at least a few targets will have received the full file, allowing that copy to be skipped when the play is restarted.
The initial implementation will almost certainly be replaced eventually, but this basic design should be sufficient for what is needed today, and should continue to suffice when connection delegation is implemented.
Testing / CI
The smattering of unit and integration tests that exist are running and passing under Travis CI. In preparation for a release, master is considered always-healthy and my development has moved to a new dmw branch.
I’m taking a “mostly top down” approach to testing, written in the form of Ansible playbooks, as this gives the widest degree of coverage, ensuring that high level Ansible behaviour is matched with/without the extension installed. For each new test written, the result must pass under regular Ansible in addition to Ansible with the extension.
“Bottom up” type tests are written as needs arise, usually when Ansible’s user interface doesn’t sufficiently expose whatever is being tested.
Also visible in Travis is a debops_common target: this is running all 255 tasks from DebOpscommon.yml against a Docker instance. It’s the first in what should be 4-5 similar DebOps jobs, deploying real software with the final extension.
I have begun exploring integrating the extension with Ansible’s own integration tests, but it looks likely this is too large a job for Travis. Work here is ongoing.
This is the first in what I hope will be at least a bi-weekly series to keep backers up to date on the current state of delivering the Mitogen extension for Ansible. I’m trying to use every second I have wisely until every major time risk is taken care of, so please forgive the knowledge-dump style of this post :)
Well ahead of time. Some exciting new stuff popped up, none of it intractably scary.
I have some fabulous news on funding: in addition to what was already public on Kickstarter, significant additional funding has become available, enough that I should be able to dedicate full time to the project for at least another 10 weeks!
Naturally this has some fantastic implications, including making it significantly likely that I’ll be able to implement Topology-aware File Synchronization.
I could not commit to this due to worrying Python 3 would become a huge and destablizing time sink, ruining any chance of delivering more immediately useful functionality.
The missing piece (exception syntax) to support from Python 2.4 all the way to 3.x has been found - it came via an extraordinarily fruitful IRC chat with the Ansible guys, and was originally implemented in Ansible itself by Marius Gedminas. With this last piece of the puzzle, the only bugs left to worry about are renamed imports and the usual bytes/str battles. Both are trivial to address with strong tests - something already due for the coming weeks. It now seems almost guaranteed Python 3 will be completed as part of this work, although I am still holding off on a 100% commitment until more pressing concerns are addressed.
New Risk: multiplexer throughput
Some truly insane performance bugs have been found and fixed already, particularly around the stress caused by delivering huge single messages, however during that work a new issue was found: IO multiplexer throughput truly sucks for many small messages.
This doesn’t impact things much except in one area: file transfer. While I haven’t implemented a final solution for file transfer yet, as part of that I will need to address what (for now) seems a hard single-thread performance limit: Mitogen’s current IO loop cannot push more than ~300MiB/sec in 128KiB-sized chunks, or to put it another way, best case 3MiB/sec given 100 targets.
Single thread performance: the obvious solution is sharding the multiplexer across multiple processes, and already that was likely required for completing the multithreaded connect work. This is a straightforward change that promises to comfortably saturate a Gigabit Ethernet port using a 2011 era Macbook while leaving plenty of room for components further up (Ansible) and down (ssh) the stack.
TTY layer: I’ve already implemented some fixes for this (increase buffer sizes, reduce loop iterations), but found some ugly new problems as a result: the TTY layer in every major UNIX has, at best, around a 4KiB buffer, forcing many syscalls and loop iterations, and it seems on no OS is this buffer tunable. Fear not, there is already a kick-ass solution for this too.
This problem should disappear entirely by the time real file transfer support is implemented - today the extension is still delivering files as a single large message. The blocker to fixing that is a missing flow control mechanism to prevent saturation of the message queue, which requires a little research. This hopefully isn’t going to be a huge amount of work, and I’ve already got a bunch of no-brainer yet hacky ways to fix it.
New risk: task isolation
It was only a matter of time, but the first isolation-related bug was found, due to a class variable in a built-in Ansible module that persists some state across invocations of the module’s main() function. I’d been expecting something of this sort, so already had ideas for solving it when it came up, and really it was quite a surprise that only one such bug was reported out of all those reports from initial testers.
The obvious solution is forking a child for each task by default, however as always the devil is in the details, and in many intractable ways forking actually introduces state sharing problems far deadlier than those it promises to solve, in addition to introducing a huge (3ms on Xeon) penalty that is needless in most cases. Basically forking is absolute hell to get right - even for a tiny 2 kLOC library written almost entirely by one author who wrote his first fork() call somewhere in the region of 20 years ago, and I’m certain this is liable to become a support nightmare.
The most valuable de facto protection afforded by fork - memory safety, is pretty redundant in an almost perfectly memory safe language like Python, that’s why the language is so popular at all.
Meanwhile forking is needed anyway for robust implementation of asynchronous tasks, so while implementing it would never have been wasted work, it is not obvious to me that forking could or should ever become the default mode. It amounts to a very ripe field for impossible to spot bugs of much harder classes than the simple solution of running everything in a single process, where we only need to care about version conflicts, crap monkey patches, needlessly global variables and memory/resource leaks.
I’m still exploring the solution space for this one, current thinking is maybe (maybe! this is totally greenfield) something like:
Built-in list of fixups for ridiculously easy to repair bugs, like the yum_repository example above.
Whitelist for in-process execution any module known (and manually audited) to be perfectly safe. Common with_items modules like lineinfile easily fit in this class.
Whitelist for in-process safe but nonetheless leaky modules, such as the buggy yum_repository module above that simply needs its bytecode re-executed (100usec) to paper over the bug. Can’t decide whether to keep this mode or not - or simply merge it with the above mode.
Default to forking (3ms - max 333 with_items/sec) for all unknown bespoke (user) modules and built-in modules of dubious quality, with a mitogen_task_isolation variable permitting the mode to be overridden by the user on a per-task basis. “Oh that one loop is eating 45 minutes? Try it with mitogen_task_isolation=none”
All the Mitogen-side forking bits are implemented already, and I’m deferring the Ansible-side bits to be done simultaneous to supporting exotic module types, since that whole chunk of code needs a rewrite and no point in rewriting it twice.
Meanwhile whatever the outcome of this work, be assured you will always have your cake and eat it - this project is all about fixing performance, not regressing it. I hope this entire topic becomes a tiny implementation detail in the coming weeks.
On the testing front I was absolutely overjoyed to discover DebOps by way of a Mitogen bug report. This deserves a whole article on its own, meanwhile it represents what is likely to be a huge piece of the testing puzzle.
A big chunk is already implemented in order to fix an unrelated bug! The default pool size has 16 threads in one process, so there will only be a minor performance penalty for the first task to run when the number of targets exceeds 16. Meanwhile, the queue size is adjustable via an environment variable. I’ll tidy this up later.
Even though it basically already exists, I’m not yet focused on making multithreaded connect work - including analysing the various performance weirdness that appears when running Mitogen against multiple targets. These definitely exist, I just haven’t made time yet to determine whether it’s an Ansible-side scaling issue or a Mitogen-side issue. Stay tuned and don’t worry! Multi-target runs are already zippy, and I’m certain any issues found can be addressed.
At least a full day will be dedicated to nothing but coming up with new attack scenarios, meanwhile I’m feeling pretty good about security already. The fabulous Alex Willmer has been busily inventing new cPickle attack scenarios, and some of them are absolutely fantastically scary! He’s sitting on at least one exciting new attack that represents a no-brainer decider on the viability of keeping cPickle or replacing it.
Serialization aside, I’ve been busy comparing Ansible’s existing security model to what the extension provides today, and have at least identified unidirectional routing mode as a must-have for delivering the extension. Regarding that, it is possible to have a single playbook safely target 2 otherwise completely partitioned networks. Today with Mitogen, one network could route messages towards workers in the other network using the controller as a bridge. While this should be harmless (given existing security mitigations), it still introduces a scary capability for an attacker that shouldn’t exist.
Really screwed up on planning here - turns out Ansible on Windows does not use Python whatsoever, and so implementing the support in Mitogen would mean increasing the installation requirements for Windows targets. That’s stupid, it violates Ansible’s zero-install design and was explicitly a non-goal from the get go.
Meanwhile WinRM has extremely poor options for bidirectional IO, and likely viable Mitogen support for Windows will include introducing a, say, SSL-encrypted reversion connection from the target machine in order to get efficient IO.
I will shortly be polling everyone who has pledged towards the project, and if nobody speaks up to save Windows, it’s being pushed to the back of the queue.
A big, big thanks, once again!
It goes without saying but none of this work has been a lone effort, starting from planning, article review, funding, testing, and an endless series of suggestions, questions and recommendations coming from so many people. Thanks to everyone, whether you contributed a single $1 or a single typo bug report.
Super busy, but also super on target! Until next time..
It’s been an incredibly intense first week crowdfunding the Mitogen extension for Ansible involving far more effort than anticipated, where I have worked almost flat out from waking until the early hours just to ensure any queries are answered thoroughly. I cannot complain, because it has been so much fun that I’d change almost nothing of the experience, and already the campaign has reached 46% from the exposure it received.
As a recap Mitogen is a library for writing distributed programs that require zero deployment, with the prototype extension implementing an architectural change that vastly improves Ansible’s performance in common scenarios, laying a framework to extend this advantage far beyond simple overhead reduction.
A great deal of work has simply been staying on top of bug reports and ensuring experiences with the prototype are solid – for each report from one tester, we can assume 10 more hit the same bug but did not or could not report it.
Of the many reports received, I have addressed almost all of them promptly. Some fabulous bugs have been found and fixed along with one report via Reddit of a performance improvement so fantastical that it exceeds even my most contrived overhead-heavy example:
"With mitogen my playbook runtime went from 45 minutes to just under 3 minutes. Awesome work!"
This is a common theme – anywhere with_items appears, Mitogen has the most profound impact. The obvious reason is that during loops the same module is executed repeatedly, and after one iteration is guaranteed to be compiled and ready on the target.
So many lessons!
Developing the campaign from a thought exercise one idle Sunday evening into an actually practical project has taken a lot of work – far more than I anticipated, and at almost every step I have learned something novel. This is all reuseable knowledge for anyone attempting a similar project in future, and I will write it up as time permits.
Regardless of outcomes the campaign has already proven one very exciting result: real users will stake real money towards something as seemingly mundane as free infrastructure, and I think that’s beyond amazing. In a world content to throw millions of dollars at junk ICOs almost weekly, crowdfunding free software seems to me a practice that should happen far more often.
I wish to thank everyone for the support shown thus far, and I’d encourage you to consider tapping that Ansible user you know on the shoulder to let them know about the project. For those working close to infrastructure consulting, please consider using the final week to corner your boss regarding associating your company logo with a sexy project that promises to receive many eyeballs over the coming years.
Allegedly on site as a developer, two summers ago I found myself in a situation you are no doubt familiar with, where despite preferences unrelated problems inevitably gravitate towards whoever can deal with them. Following an exhausting day spent watching a dog-slow Ansible job fail repeatedly, one evening I dusted off a personal aid to help me relax: an ancient, perpetually unfinished hobby project whose sole function until then had simply been to remind me things can always improve.
Something of a miracle had struck by the early hours of next morning, as almost every outstanding issue had been solved, and to my disbelief the code ran reliably. 18 months later and for the first time in living memory, I am excited to report delivery of that project, one of sufficient complexity as to have warranted extreme persistence - in this case from concept to implementation, over more than a decade.
The miracle? It comes in the form of Mitogen - a tiny Python library you won’t have heard of, but I hope as an Ansible user you will soon eternally be glad for, on discovering ansible-playbook now completes in very reasonable time even in the face of deeply unreasonable operating conditions.
Mitogen is a library for writing distributed programs that require zero deployment, specifically designed to fit the needs of infrastructure software like Ansible. Without upfront configuration it supports any UNIX machine featuring an installed Python interpreter, which is to say almost all of them. While the concept is hard to explain - even to fellow engineers, its value is easy to grasp:
This trace shows two Ansible runs of a basic 100-step playbook over a 1 ms latency network against a single target host. The first run employs SSH pipelining, Ansible’s current most optimal configuration, where it consumes almost 4.5 Mbytes network bandwidth in a running time of 59 secs.
The second uses the prototype Mitogen extension for Ansible, with a far more reasonable 90 Kbytes consumed in 8.1 secs. An unmodified playbook executes over 7 times faster while consuming 50x less bandwidth.
Less than half the CPU time was consumed on the host machine, meaning that by one metric it should handle at least twice as many targets. Crucially no changes were required to the target machine, including new software or nasty on-disk caches to contend with.
While only pure overhead is measured above, the benefits very much extend to real-world scenarios. See the documentation (1.75x time) and issue #85 (4.2x time, 3.1x CPU) for examples.
How is this possible?
Mitogen is perhaps most easily described as a kind of network-capable fork() on steroids. It allows programs to establish lazily-loaded duplicates on remote hosts, without requiring any upfront remote disk writes, and to communicate with those copies once they exist. The copies can in turn recursively split to produce further children - with bidirectional message routing between every copy handled automatically.
In the context of Ansible, unlike with SSH pipelining where up to one SSH invocation, sudo invocation and script compilation are required for every playbook step, and with all scripts re-uploaded for each step, with Mitogen only one of each exists per target for the duration of the playbook run, with all code cached in RAM between steps. Absolutely everything is reused, saving 300-800 ms on every step.
The extension represents around a week’s work, replaces hundreds of lines of horrid shell-related code in Ansible, and is already at the point where on one real-world playbook, Ansible is only 2% slower than equivalent SSH commands. Presently connection establishment is single-threaded, so the prototype is only good for a few hosts, but rest assured this limitation’s days are numbered.
Not just a speed up, a paradigm shift you’ll adore
If this seems impressive and couldn’t be improved upon, prepare for some deep shocks. You can think of the extension not just as a performance improvement, but something of a surreptitious beachhead from which I intend to thoroughly assault your sense of reality.
This performance is a side effect of a far more interesting property: Ansible is no longer running on just the host machine, but temporarily distributed throughout the target network for the duration of the run, with bidirectional communication between all pieces, and you won’t believe the crazy functionality this enables.
What if I told you it were possible not only to eliminate that final 2%, but turn it sharply negative, while simultaneously reducing resource consumption? “Surely Ansible can’t execute faster than equivalent raw SSH commands?” You bet it can! And if you care about such things, this could be yours by Autumn. Read on..
Pushing brains into the ether, no evil agents required
As I teased last year, Ansible takes its name from a faster-than-light communication device from science fiction, yet despite these improvements it is still fundamentally bound by the speed with which information physically propagates. Pull and agent-based tooling is strongly advantageous here: control flow occurs at the same point as the measurements necessary to inform that flow, and no penalty is incurred for traversing the network.
Today, reducing latency in Ansible means running it within the target network, or in pull mode, where the playbook is stored on the target alongside for example, secrets for decrypting any vaults, and the hairy mechanics required to keep that in sync and executing when appropriate. This is a far cry from the simplicity of tapping ansible-playbook live.yml on your laptop, and so it is an option of last resort.
What would be amazing is some hybrid where we could have the performance and scaleability benefits of pull, combined with the stateless simplicity of push, without introducing dedicated hosts or permanent caches and agents running on the target machines, that amount to persistent intermediate state and introduce huge headaches of their own, all without sacrificing the fabulous ability to shut everything down with a simple CTRL+C.
The opening volley: connection delegation
As a first step to exploiting previously impossible functionality, I will enhance the extension to support delegating connection establishment to a machine on the target network, avoiding the cost of establishing hundreds of SSH connections over a low throughput, high latency network link.
Unlike with SSH proxying, this has the huge benefit of caching and serving Ansible code from RAM on the intermediary, avoiding uploading approximatey 50KiB of code for every playbook step, and ensuring those cached responses are delivered over the low latency LAN fabric on the target network. For 100 target machines, this replaces the transmission of 5 Mbytes of data for every playbook step with on the order of kilobytes worth of tiny remote procedure calls.
All the Mitogen-side infrastructure for this exists today, and is already used to implement become support. It could be flipped on with a few lines of code in the Ansible extension, but there are a few more importer bugs to fix before it’ll work perfectly.
Finally as a reminder, since Mitogen operates recursively delegation also operates recursively, with code caching and connection establishment happening at each hop. Not only is this useful for navigating slow links and complicated firewall setups, as we’ll see, it enables some exciting new scenarios.
Ansible is intended to manage many machines simultaneously, and while the extension’s improvements presently work well for single-machine playbooks, that is all but a niche application for many users.
Having the newfound ability to delegate connection establishment to an intermediary on the target network, far away from our laptop’s high latency 3G connection, and with the ability to further sub-delegate from that intermediary, we can implement a divide and conquer strategy, forming a large tree comprising the final network of target machines for the playbook run, with responsibility for caching and connection multiplexing evenly divided across the tree, neatly avoiding single resource bottlenecks.
I will rewrite Mitogen’s connection establishment to be asynchronous: creation of many downstream connections can be scheduled in parallel, with the ability to enqueue commands prior to completion, including recursive commands that would cause those connections to in turn be used as intermediaries.
The cost of establishing connections should become only the cost of code upload (~50KiB) and the latency of a single SSH connection per tree layer, as connections at each layer occur in parallel. For an imaginary 1,700 node cluster split into quarters of 17 racks and 25 nodes per rack, connection via a 300 ms 3G network should complete in well under 15 seconds.
Topology-aware file synchronization
So you have a playbook on your laptop deploying a Django application via the synchronize module, to 100 Ubuntu machines running in a datacentre 300 ms away. Each run of the playbook entails a groan followed by a long walk, as a 3.8 second rsync run is invoked 100 times via your 3G connection, just to synchronize a 3 Mbyte asset the design team won’t stop tweaking. Not only are there 6 minutes of roundtrips buried in those invocations, but that puny 3G connection is forced to send a total of 300 Mbytes toward the target network.
What is the point of continually re-sending that file to the same set of machines in some far-off network? What if it could be uploaded exactly once, then automatically cached and redistributed within the target network, producing exactly one upload per layer in the hierarchy:
Why stop at delegating connection establishment and module caching? Now we have a partial copy of Ansible within the network, nothing prevents implementing all kinds of smarts. Here is another feature that is a cinch to build once bidirectional communication exists between topology-aware code, which the prototype extension already provides today.
After a brutal 4 hour meeting involving 10 executives our hero Bob, Senior Disaster Architect III, emerges bloodstained yet victorious against the tyrannical security team, as his backends can talk with impunity to the entire Internet just so apt-get can reach packages.debian.org for the 15 seconds Bob’s daily Ansible CI job requires.
That evening, having regaled his giddy betrothed (HR Coordinator II) with heroic story of war, Bob catches a brief yet chilling glimmer of doubt for all that transpired. “Was there another way?” he sleepily ponders, before succumbing to a cosier battle waged by those fatigued and heavy eyelids. Suddenly aware again, Bob emerges bathed in a mysterious utopian dreamscape where CI jobs executed infinitely quickly, war and poverty did not exist, and the impossible had always been possible.
Building on Mitogen’s message routing, forwarding all kinds of pipes and network sockets becomes trivial, including schemes that would allow exposing a transient, locked down HTTP proxy to Bob’s apt-get invocation only for as long as necessary, all with a few lines of YAML in a playbook.
While this is already possible with SSH forwarding, the hand-configuration involved is messy, and becomes extremely hairy when the target of the forward is not the host machine. My initial goal is to support forwarding of UNIX and TCP sockets, as they cover all use cases I have in mind. Speaking of which..
Topology-aware Git pull
Another common security fail seen in Ansible playbooks is to call Git directly from target machines, including granting those machines access to a Git server. This is a horrid violation: even read-only access implies the machine needs permanent firewall rules that shouldn’t exist, just for the scant moments a pull is in progress. Granting backends access to a site as complex as GitHub.com, you may as well abandon all outbound firewalling, as this is enough for even the puniest script kiddy to exfiltrate a production database.
What if Git could run with the permissions of the local Ansible user, on the user’s own machine, and be served efficiently to the target machines only for the duration of the push, faster than 100 machines talking to GitHub.com, and only to the single read-only repository intended?
Building on generalized forwarding, topology-aware Git repeats all the caching and single-upload tricks of file synchronization, but this time implementing the Git protocol between each node.
In the scheme I will implement, a single round-trip is necessary for git-fetch-pack to pull just the changed objects from the laptop over the high latency 3G link, before propagating at LAN speeds throughout the target network, with git-ls-remote output delivered as part of the message that initiates the pull. Not only is the result more efficient than a normal git-pull, but backends no longer require network access to Git.
The final word: Inversion of control
Remember we talked about making Ansible run faster than equivalent SSH commands? Well, today Ansible requires one network round-trip per playbook step, so just like SSH, it must pay the penalty for every round-trip unless something gives, and that something is the partial delegation of control to the target machine itself.
With inversion of control, the role of ansible-playbook simply becomes that of shipping code and selective chunks of data to target machines, where those machines can execute and make control decisions without necessitating a conversation with the master after each step, just to figure out what to execute next.
Ansible has all the framework to enable implementing this today, by significantly extending the prototype extension’s existing strategy plug-in, and teaching it how to automatically send and wait on batches of tasks, rather than on single tasks at a time.
Aside from improved performance, the semantics of the existing linear strategy will be preserved, and playbooks need not be changed to cope: on the target machine tasks will not suddenly begin running concurrently, or in any order different to previously.
App-level connection persistence
As a final battle against latency during playbook development and debugging, I will support detaching the connection tree from ansible-playbook on exit, and teach the extension to reuse it at startup. This will reduce the overhead of repeat runs, especially against many targets, to the order of hundreds of milliseconds, as no new SSH connections, module compilations or code uploads are required.
Connection persistence opens the floodgates for adding sweet new tooling, although I’m not sure how desirable it is to expose an implementation detail like this forever, while also extending the interface provided by Ansible itself. As a simple example, we could provide an ansible-ssh tool that reuses the connection tree along with Ansible’s tunnelling, delegation, dynamic inventory and authentication configuration to forward a pipe to a remote shell.
The cost of slow tooling
Ansible has over 28,500 stars on GitHub, representing just those users who have a GitHub account and ever thought to star it, and appears to grow by 150 stars per week. Around London the going rate to hire one user is $100/hour, and conservatively, we could expect that user is trotting out a 15 minute run of ansible-playbook
live.yml at least once per week.
We can expect that if Ansible is running merely twice as slowly as necessary, 7.5 minutes of that run is lost productivity, and across those 28,500 users, the economic cost is in the region of $356,250 per invocation or $17,100,000 per year. In reality the average user is running Ansible far more often, including thousands of times per minute under various CI systems worldwide, and those runs often last far longer than 15 minutes, but I’d recommend that mental guesstimation is left as an exercise to readers who are already blind drunk.
The future is beautiful if you want it to be
My name is David, and nothing jinxes my day quite like slow tooling. I have poured easily 500 hours in some form into this project over a decade and on my own time. The project has now reached an inflection point where the fun part is over, the science is done and the effect is real, and only a small, highly predictable set of milestones remain to deliver what I hope you agree is a much brighter future.
Before reading I doubt you would have believed it possible to provide the features described without a complex infrastructure running in the target network, now I hope you’ll join me in disproving one final impossibility.
While everything here will exist in time, it cannot exist in 2018 without your support, and that’s why I’d like to try something crazy, that would allow me to devote myself to delivering a vastly improved daily routine for thousands of people just like you and me.
You may have guessed already: I want you to crowdfund awesome tooling.
What value would you place on an extra productive hour every working week? In the UK that’s an easy question: it’s around $4,800 per year. And what risk is there to contributing $100 to an already proven component? I hope you’ll agree this too is a no-brainer, both for you and your employer.
To encourage success I’m offering a unique permanent placement of your brand on the GitHub repository and documentation. Funds will be returned if the minimum goal cannot be reached, however just 3 weeks are sufficient to ensure a well tested extension, with my full attention given to every bug, ready to save many hours right on time to enjoy the early sunlight of Spring.
Totalling much less than the economic damage caused by a single run of today’s Ansible, the grand plan is divided into incrementally related stretch goals. I cannot imagine this will achieve full funding, but if it does, as a finale I’ll deliver a feature built on Ansible that you never dreamed possible.
As a modern area deployment tooling is exposed to the ebb and flow of the software industry far more than typical, and unexpected disruption happens continuously. Without ongoing evolution, exposure to buggy and unfamiliar new tooling is all but guaranteed, with benefits barely justifying the cost of their integration. As we know all too well, rational ideas like cost/benefit rarely win the hearts of buzzword-hungry and youthful infrastructure teams, so counterarguments must be presented another way.
As a recent example there is growing love for mgmt, which is designed from the outset as an agent-based reactive distributed system, much as Mitogen nudges Ansible towards. However unlike mgmt, Ansible preserves its zero-install and agentless nature, while laying a sound framework for significantly more exciting features. If that alone does not win loyalty, we’re at least guaranteed that every migration-triggering new feature implemented in such systems can be headed off with minimal effort, long into the foreseeable future.
After a long winter break from recreational programming, over the past days I finally built up steam and broke a chunk of new ground on Mitogen, this time growing its puny module forwarder into a bona fide beast, ready to handle almost any network condition and user code thrown at it.
Mitogen is a library for executing parts of a Python program in a remote context, primarily over sudo and SSH connections, and establishing bidirectional communication with those parts. Targeting infrastructure applications, it requires no upfront configuration of target machines, aside from an SSH daemon and Python 2.x interpreter, which is the default for almost every Linux machine found on any conceivable network.
The target need not possess a writeable filesystem, code is loaded dynamically on demand, and execution occurs entirely from RAM.
How Import Works
To implement dynamic loading, child Python processes (“contexts”) have a PEP-302 import hook installed that causes attempts to import modules unavailable locally to automatically be served over the network connection to the parent process. For example, in a script like:
If the requests package is missing on the host k3, it will automatically be copied and imported in RAM, without requiring upfront configuration, or causing or requiring writes to the remote filesystem.
So far, so good. Just one hitch
While the loader has served well over the library’s prototypical life (which in real time, is approaching 12 years!), it has always placed severe limits on the structure of the loaded code, as each additional source file introduced one network round-trip to serve it.
Given a relatively small dependency such as Kenneth Reitz' popular Requests package, comprising 17 submodules, this means 17 additional network round-trips. While that may not mean much over a typical local area network segment where roundtrips are measured in microseconds, it quickly multiplies over even modest wide-area networks, where infrastructure tooling is commonly deployed.
For a library like Requests, 17 round-trips amounts to 340ms latency over a reasonably local 20ms link, which is comfortably within the realms of acceptable, however over common radio and international links of 200ms or more, already this adds at least 3.4 seconds to the startup cost of any Mitogen program, time wasted doing nothing but waiting on the network.
Sadly, Requests is hardly even the biggest dependency Mitogen can expect to encounter. For testing I chose django.db.models as a representative baseline: heavily integrated with all of Django, it transitively imports over 160 modules across numerous subpackages. That means on an international link, over 30 seconds of startup latency spent on one dependency.
It is worth note that Django is not something I’d expect to see in a typical Mitogen program, it’s simply an extraordinarily worst-case target worth hitting. If Mitogen can handle django.db.models, it should cope with pretty much anything.
Combining evils, over an admittedly better-than-average Nepali mobile data network, and an international link to my IRC box mail server in Paris, django.db.models takes almost 60 seconds to load with the old design.
In the real world, this one-file-per-roundtrip characteristic means the current approach sucks almost as much as Ansible does, which calls into doubt my goal of implementing an Ansible-trumping Ansible connection plug-in. Clearly something must give!
Over the years I discarded many approaches for handling this latency nightmare:
Having the user explicitly configure a module list to deliver upfront to new contexts, which sucks and is plainly unmaintainable.
Installing a PEP-302 hook in the master in order to observe the import graph, which would be technically exciting, but likely to suck horribly due to fragility and inevitable interference with real PEP-302 hooks, such as py2exe.
Observing the import graph caused by a function call in a single context, then using it to preload modules in additional contexts. This seems workable, except the benefit would only be felt by multiple-child Mitogen programs. Single child programs would continue to pay the latency tax.
Variants of 2 and 3, except caching the result as intermediate state in the master’s filesystem. Ignoring the fact persistent intermediate state is always evil (a topic for later!), that would require weird and imperfect invalidation rules, which means performance would suck during development and prototyping, and bugs are possible where state gets silently wedged and previously working programs inexplicably slow down.
Finally last year I settled on using static analysis, and restricting preloading at package boundaries. When a dependency is detected in a package external to the one being requested, it is not preloaded until the child has demonstrated, by requesting the top-level package module from its parent, that the child lacks all of the submodules contained by it.
This seems like a good rule: preloading can occur aggressively within a package, but must otherwise wait for a child to signal a package as missing before preemptively wasting time and bandwidth delivering code the child never needed.
As a final safeguard, preloading is restricted to only modules the master itself loaded. It is not sufficient for an import statement to exist: surrounding conditional logic must have caused the module to be loaded by the master. In this manner the semantics of platform, version-specific and lazy imports are roughly preserved.
Syntax tree hell
Quite predictably, after attempting to approach the problem with regexes, I threw my hands up on realizing a single regex may not handle every possible import statement:
import a as b
from a import b
from a import b as c
from a import (b, c, d)
I gleefully thought I’d finally found a use for the compiler and ast modules, and these were the obvious alternative to avoiding the rats nest of multiple regexes. Not quite. You see, across Python releases the grammar has changed, and in lock-step so have the representations exported by the compiler and ast modules.
Adding insult to injury: neither module is supported through every interesting Python version. I have seen Python 2.4 deployed commercially as recently as summer 2016, and therefore consider it mandatory for the kind of library I want on my toolbelt. To support antique and chic Python alike, it was necessary to implement both approaches and select one at runtime. Many might see this is an opportunity to drop 2.4, but “just upgrade lol” is never a good answer while maintaining long shelf-life systems, and should never be a a barrier to applying a trusted Swiss Army Knife.
After some busy days last September, I had a working scanner built around syntax trees, except for a tiny problem: it was ridiculously slow. Parsing the 8KiB mitogen.core module took 12ms on my laptop, which multiplied up is over a second of CPU burnt scanning dependencies for a package like Django. If memory serves, reality was closer to 3 seconds: far exceeding the latency saved while talking to a machine on a LAN.
Sometimes hacking bytecode make perfect sense
I couldn’t stop groaning the day I abandoned ASTs. As is often true when following software industry best practice, we are left holding a decomposing trout that, while technically fulfilling its role, stinks horribly, costs all involved a fortune to support and causes pains worse than those it was intended to relieve. Still hoping to avoid regexes, I went digging for precedent elsewhere in tools dealing with the same problem.
That’s when I discovered the strange and unloved modulefinder buried in the standard library, a forgotten relic from a bygone era, seductively deposited there as a belated Christmas gift to all, on a gloomy New Year’s Eve 2002 by Guido’s own brother. Diving in, I was shocked and mesmerized to find dependencies synthesized by recompiling each module and extracting IMPORT_FROM opcodes from the compiled bytecode. Reimplementing a variant, I was overjoyed to discover django.db.models transitive dependencies enumerated in under 350ms on my laptop. A workable solution!
The solution has some further crazy results: IMPORT_FROM has barely changed since the Python 2.4 days, right through to Python 3.x. The same approach works everywhere, including PyPy, which uses the same format, which makes this more portable than the ast and compiler modules!
Coping with concurrency
Now a mechanism exists to enumerate dependencies, we need a mode of delivery. The approach used is simplistic, and (as seen later), will likely require future improvement.
On receiving a GET_MODULE message from a child, a parent (don’t forget, Mitogen operates recursively!) first tries to satisfy the request from its own cache, before forwarding it upwards towards the master. The master sends LOAD_MODULE messages for all dependencies known to be missing from the child before sending a final message containing the module that was actually requested. Since contexts always cache unsolicited LOAD_MODULE messages from upstream, by the time the message arrives for the requested module, many dependencies should be in RAM and no further network roundtrips requesting them are required.
Meanwhile for each stream connected to any parent, a set of module names ever delivered on that stream are recorded. Each parent is allowed to ignore any GET_MODULE for which a corresponding LOAD_MODULE has already been sent, preventing a race between in-flight requests causing the same module to ever be sent twice.
This places the onus on downstream contexts to ensure the single LOAD_MODULE message received for each distinct module always reaches every interested party. In short, GET_MODULE messages must be deduplicated and synchronized not only for any arriving from a context’s children, but also from its own threads.
And finally the result. For my test script, the total number of roundtrips dropped from 166 to 13, one of which is for the script itself, and 3 negative requests for extension modules that cannot be transferred. That leaves, bugs aside, 9 roundtrips to transfer the most obscene dependency I could think of.
One more look at the library’s network profile. Over the same connection as previously, the situation has improved immensely:
Not only is performance up, but the number of frames transmitted has dropped by 42%. That’s a 42% fewer changes of connection hang due to crappy WiFi!
One final detail is visible: around the 10 second mark, a tall column of frames is sent with progressively increasing size, almost in the same instant. This is not some bug, it is Path MTU Discovery (PMTUD) in action. PMTUD is a mechanism by which IP subprotocols can learn the maximum frame size tolerated by the path between communicating peers, which in turn maximizes link efficiency by minimizing bandwidth wasted on headers. The size is ramped up until either loss occurs or an intermediary signals error via ICMP.
Just like the network path, PMTUD is dynamic and must restart on any signal indicating network conditions have changed. Comparing this graph with the previous, we see one final improvement as a result of providing the network layer enough data to do its job: PMTUD appears restart much less frequently, and the stream is pegged at the true path MTU for much longer.
Aside from simple fixes to reduce wasted roundtrips for extension modules that can’t be imported, and optional imports of top-level packages that don’t exist on the master, there are two major niggles remaining in how import works today.
The first is an irritating source of latency present in deep trees: currently it is impossible for intermediary nodes satisfying GET_MODULE requests for children to streamily send preloaded modules towards a child until the final LOAD_MODULE arrives at the intermediary for the module actually requested by the child. That means preloading is artificially serialized at each layer in the tree, when a better design would allow it to progress concurrent to the LOAD_MODULE messages still in-flight from the master.
This will present itself when doing multi-machine hops where links between the machines are slow or suffer high latency. It will also be important to fix before handling hundreds to thousands of children, such as should become practical once asynchronous connect() is implemented.
There are various approaches to tweaking the design so that concurrency is restored, but I would like to let the paint dry a little on the new implementation before destablizing it again.
The second major issue is almost certainly a bug waiting to be discovered, but I’m out of energy to attack it right now. It relates to complex situations where many children have different functions invoked in them, from a complex set of overlapping packages. In such cases, it is possible that a LOAD_MODULE for an unrelated GET_MODULE prematurely delivers the final module from another import, before it has had all requisite modules preloaded into the child.
To fix that, the library must ensure the tree of dependencies for all module requests are sent downstream depth-first, i.e. it is never possible for any module to appear in a LOAD_MODULE before all of its dependencies have first.
Finally there are latency sources buried elsewhere in the library, including at least 2 needless roundtrips during connection setup. Fighting latency is an endless war, but with module loading working efficiently, the most important battle is over.