Allegedly on site as a developer, two summers ago I found myself in a situation you are no doubt familiar with: regardless of preference, unrelated problems inevitably gravitate towards whoever can deal with them. Following an exhausting day spent watching a dog-slow Ansible job fail repeatedly, one evening I dusted off a personal aid to help me relax: an ancient, perpetually unfinished hobby project whose sole function until then had simply been to remind me that things can always improve.
Something of a miracle had struck by the early hours of the next morning: almost every outstanding issue had been solved, and to my disbelief the code ran reliably. Eighteen months later, and for the first time in living memory, I am excited to report delivery of that project, one of sufficient complexity as to have warranted extreme persistence: in this case more than a decade from concept to implementation.
The miracle? It comes in the form of Mitogen, a tiny Python library you won't have heard of, but one I hope you, as an Ansible user, will soon be eternally glad for, on discovering ansible-playbook now completes in very reasonable time even in the face of deeply unreasonable operating conditions.
Mitogen is a library for writing distributed programs that require zero deployment, specifically designed to fit the needs of infrastructure software like Ansible. Without upfront configuration it supports any UNIX machine with an installed Python interpreter, which is to say almost all of them. While the concept is hard to explain, even to fellow engineers, its value is easy to grasp:
This trace shows two Ansible runs of a basic 100-step playbook over a 1 ms latency network against a single target host. The first run employs SSH pipelining, currently Ansible's fastest configuration, where it consumes almost 4.5 Mbytes of network bandwidth in a running time of 59 seconds.
The second uses the prototype Mitogen extension for Ansible, with a far more reasonable 90 Kbytes consumed in 8.1 seconds. An unmodified playbook executes over 7 times faster while consuming 50x less bandwidth.
Less than half the CPU time was consumed on the host machine, meaning that by one metric it should handle at least twice as many targets. Crucially, no changes were required to the target machine: no new software, and no nasty on-disk caches to contend with.
While only pure overhead is measured above, the benefits very much extend to real-world scenarios. See the documentation (1.75x time) and issue #85 (4.2x time, 3.1x CPU) for examples.
How is this possible?
Mitogen is perhaps most easily described as a kind of network-capable fork() on steroids. It allows programs to establish lazily-loaded duplicates on remote hosts, without requiring any upfront remote disk writes, and to communicate with those copies once they exist. The copies can in turn recursively split to produce further children - with bidirectional message routing between every copy handled automatically.
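A minimal sketch of what that looks like from the caller's side, going from memory of Mitogen's documented router and context API; the hostname and the helper function are invented for illustration:

```python
import socket
import mitogen

def get_hostname():
    # Runs inside whichever child it is sent to; this module travels over
    # the wire on demand, and nothing is written to the remote disk.
    return socket.gethostname()

@mitogen.main()
def main(router):
    # "Fork" onto a remote machine over SSH.
    web1 = router.ssh(hostname='web1.example.com')
    # Recursively split again: a root child reached via the SSH child.
    web1_root = router.sudo(via=web1)
    # Talk to both copies; message routing between them is automatic.
    print(web1.call(get_hostname))
    print(web1_root.call(get_hostname))
```

Each child can itself act as a parent for further children, which is the recursion the rest of this piece leans on.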
In the context of Ansible: SSH pipelining requires up to one SSH invocation, one sudo invocation and one script compilation for every playbook step, with all scripts re-uploaded at each step. With Mitogen, only one of each exists per target for the duration of the playbook run, and all code is cached in RAM between steps. Absolutely everything is reused, saving 300-800 ms on every step.
The extension represents around a week’s work, replaces hundreds of lines of horrid shell-related code in Ansible, and is already at the point where on one real-world playbook, Ansible is only 2% slower than equivalent SSH commands. Presently connection establishment is single-threaded, so the prototype is only good for a few hosts, but rest assured this limitation’s days are numbered.
Not just a speed up, a paradigm shift you’ll adore
If this already seems impressive and beyond improvement, prepare for some deep shocks. You can think of the extension not just as a performance improvement, but as something of a surreptitious beachhead from which I intend to thoroughly assault your sense of reality.
This performance is a side effect of a far more interesting property: Ansible is no longer running on just the host machine, but temporarily distributed throughout the target network for the duration of the run, with bidirectional communication between all pieces, and you won’t believe the crazy functionality this enables.
What if I told you it were possible not only to eliminate that final 2%, but turn it sharply negative, while simultaneously reducing resource consumption? “Surely Ansible can’t execute faster than equivalent raw SSH commands?” You bet it can! And if you care about such things, this could be yours by Autumn. Read on..
Pushing brains into the ether, no evil agents required
As I teased last year, Ansible takes its name from a faster-than-light communication device from science fiction, yet despite these improvements it is still fundamentally bound by the speed with which information physically propagates. Pull and agent-based tooling is strongly advantageous here: control flow occurs at the same point as the measurements necessary to inform that flow, and no penalty is incurred for traversing the network.
Today, reducing latency in Ansible means running it within the target network, or in pull mode, where the playbook is stored on the target alongside, for example, secrets for decrypting any vaults, plus the hairy mechanics required to keep all of that in sync and executing when appropriate. This is a far cry from the simplicity of tapping ansible-playbook live.yml on your laptop, and so it is an option of last resort.
What would be amazing is some hybrid: the performance and scalability benefits of pull, combined with the stateless simplicity of push, without introducing dedicated hosts or permanent caches and agents running on the target machines, which amount to persistent intermediate state and introduce huge headaches of their own, and all without sacrificing the fabulous ability to shut everything down with a simple CTRL+C.
The opening volley: connection delegation
As a first step to exploiting previously impossible functionality, I will enhance the extension to support delegating connection establishment to a machine on the target network, avoiding the cost of establishing hundreds of SSH connections over a low throughput, high latency network link.
Unlike with SSH proxying, this has the huge benefit of caching and serving Ansible code from RAM on the intermediary, avoiding uploading approximately 50KiB of code for every playbook step, and ensuring those cached responses are delivered over the low latency LAN fabric on the target network. For 100 target machines, this replaces the transmission of 5 Mbytes of data for every playbook step with on the order of kilobytes' worth of tiny remote procedure calls.
All the Mitogen-side infrastructure for this exists today, and is already used to implement become support. It could be flipped on with a few lines of code in the Ansible extension, but there are a few more importer bugs to fix before it’ll work perfectly.
Finally, as a reminder: since Mitogen operates recursively, delegation also operates recursively, with code caching and connection establishment happening at each hop. Not only is this useful for navigating slow links and complicated firewall setups, as we'll see, it enables some exciting new scenarios.
Ansible is intended to manage many machines simultaneously, and while the extension's improvements presently work well for single-machine playbooks, for many users that is little more than a niche application.
With the newfound ability to delegate connection establishment to an intermediary on the target network, far away from our laptop's high latency 3G connection, and to further sub-delegate from that intermediary, we can implement a divide and conquer strategy: form a large tree spanning the final network of target machines for the playbook run, with responsibility for caching and connection multiplexing divided evenly across the tree, neatly avoiding any single resource bottleneck.
I will rewrite Mitogen’s connection establishment to be asynchronous: creation of many downstream connections can be scheduled in parallel, with the ability to enqueue commands prior to completion, including recursive commands that would cause those connections to in turn be used as intermediaries.
The cost of establishing connections should become only the cost of code upload (~50KiB) and the latency of a single SSH connection per tree layer, as connections at each layer occur in parallel. For an imaginary 1,700 node cluster split into quarters of 17 racks and 25 nodes per rack, connection via a 300 ms 3G network should complete in well under 15 seconds.
Topology-aware file synchronization
So you have a playbook on your laptop deploying a Django application via the synchronize module, to 100 Ubuntu machines running in a datacentre 300 ms away. Each run of the playbook entails a groan followed by a long walk, as a 3.8 second rsync run is invoked 100 times via your 3G connection, just to synchronize a 3 Mbyte asset the design team won’t stop tweaking. Not only are there 6 minutes of roundtrips buried in those invocations, but that puny 3G connection is forced to send a total of 300 Mbytes toward the target network.
What is the point of continually re-sending that file to the same set of machines in some far-off network? What if it could be uploaded exactly once, then automatically cached and redistributed within the target network, producing exactly one upload per layer in the hierarchy?
Why stop at delegating connection establishment and module caching? Now that we have a partial copy of Ansible within the network, nothing prevents implementing all kinds of smarts. Here is another feature that is a cinch to build once bidirectional communication exists between topology-aware code, which the prototype extension already provides today.
After a brutal 4 hour meeting involving 10 executives our hero Bob, Senior Disaster Architect III, emerges bloodstained yet victorious against the tyrannical security team, as his backends can talk with impunity to the entire Internet just so apt-get can reach packages.debian.org for the 15 seconds Bob’s daily Ansible CI job requires.
That evening, having regaled his giddy betrothed (HR Coordinator II) with his heroic story of war, Bob catches a brief yet chilling glimmer of doubt for all that transpired. “Was there another way?” he sleepily ponders, before succumbing to a cosier battle waged by those fatigued and heavy eyelids. Suddenly aware again, Bob emerges bathed in a mysterious utopian dreamscape where CI jobs executed infinitely quickly, war and poverty did not exist, and the impossible had always been possible.
Building on Mitogen’s message routing, forwarding all kinds of pipes and network sockets becomes trivial, including schemes that would allow exposing a transient, locked down HTTP proxy to Bob’s apt-get invocation only for as long as necessary, all with a few lines of YAML in a playbook.
While this is already possible with SSH forwarding, the hand-configuration involved is messy, and becomes extremely hairy when the target of the forward is not the host machine. My initial goal is to support forwarding of UNIX and TCP sockets, as they cover all use cases I have in mind. Speaking of which..
Topology-aware Git pull
Another common security fail seen in Ansible playbooks is calling Git directly from target machines, which entails granting those machines access to a Git server. This is a horrid violation: even read-only access implies the machine needs permanent firewall rules that shouldn't exist, just for the scant moments a pull is in progress. Grant backends access to a site as complex as GitHub.com and you may as well abandon all outbound firewalling, as this is enough for even the puniest script kiddy to exfiltrate a production database.
What if Git could run with the permissions of the local Ansible user, on the user’s own machine, and be served efficiently to the target machines only for the duration of the push, faster than 100 machines talking to GitHub.com, and only to the single read-only repository intended?
Building on generalized forwarding, topology-aware Git repeats all the caching and single-upload tricks of file synchronization, but this time implementing the Git protocol between each node.
In the scheme I will implement, a single round-trip is necessary for git-fetch-pack to pull just the changed objects from the laptop over the high latency 3G link, before propagating at LAN speeds throughout the target network, with git-ls-remote output delivered as part of the message that initiates the pull. Not only is the result more efficient than a normal git-pull, but backends no longer require network access to Git.
The final word: Inversion of control
Remember we talked about making Ansible run faster than equivalent SSH commands? Well, today Ansible requires one network round-trip per playbook step, so just like SSH, it must pay the penalty for every round-trip unless something gives, and that something is the partial delegation of control to the target machine itself.
With inversion of control, the role of ansible-playbook simply becomes that of shipping code and selective chunks of data to target machines, where those machines can execute and make control decisions without necessitating a conversation with the master after each step, just to figure out what to execute next.
Ansible has all the framework to enable implementing this today, by significantly extending the prototype extension’s existing strategy plug-in, and teaching it how to automatically send and wait on batches of tasks, rather than on single tasks at a time.
Aside from improved performance, the semantics of the existing linear strategy will be preserved, and playbooks need not be changed to cope: on the target machine tasks will not suddenly begin running concurrently, or in any order different to previously.
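To make the difference concrete, here is a toy sketch, nothing like Ansible's real strategy API: run_task stands in for whatever executes a single step on the target, and the only point is that the batch crosses the network once rather than once per step.

```python
import subprocess

def run_task(task):
    # Stand-in for executing one playbook step on the target.
    return subprocess.call(task, shell=True)

def run_batch(tasks):
    # Ships once to the target, which then walks the whole batch locally;
    # only the list of results crosses the network on the way back.
    return [run_task(task) for task in tasks]

# Today's linear strategy, one round-trip per step:
#     for task in tasks:
#         results.append(target.call(run_task, task))
# With partial inversion of control, a single round-trip:
#     results = target.call(run_batch, tasks)
```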
App-level connection persistence
As a final battle against latency during playbook development and debugging, I will support detaching the connection tree from ansible-playbook on exit, and teach the extension to reuse it at startup. This will reduce the overhead of repeat runs, especially against many targets, to the order of hundreds of milliseconds, as no new SSH connections, module compilations or code uploads are required.
Connection persistence opens the floodgates for adding sweet new tooling, although I’m not sure how desirable it is to expose an implementation detail like this forever, while also extending the interface provided by Ansible itself. As a simple example, we could provide an ansible-ssh tool that reuses the connection tree along with Ansible’s tunnelling, delegation, dynamic inventory and authentication configuration to forward a pipe to a remote shell.
The cost of slow tooling
Ansible has over 28,500 stars on GitHub, representing just those users who have a GitHub account and ever thought to star it, and appears to grow by 150 stars per week. Around London the going rate to hire one user is $100/hour, and conservatively, we could expect that user is trotting out a 15 minute run of ansible-playbook live.yml at least once per week.
We can expect that if Ansible is running merely twice as slowly as necessary, 7.5 minutes of that run is lost productivity, and across those 28,500 users, the economic cost is in the region of $356,250 per invocation or $17,100,000 per year. In reality the average user is running Ansible far more often, including thousands of times per minute under various CI systems worldwide, and those runs often last far longer than 15 minutes, but I’d recommend that mental guesstimation is left as an exercise to readers who are already blind drunk.
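For anyone sober enough to check the arithmetic, the guesstimate above boils down to a few lines (the 48 working weeks per year are my assumption):

```python
users = 28500            # GitHub stargazers standing in for the user base
rate = 100.0             # $/hour
minutes_lost = 7.5       # half of a 15 minute weekly run
weeks_per_year = 48      # assumed working weeks

per_week = users * (minutes_lost / 60) * rate
print(per_week)                    # 356250.0
print(per_week * weeks_per_year)   # 17100000.0
```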
The future is beautiful if you want it to be
My name is David, and nothing jinxes my day quite like slow tooling. I have poured easily 500 hours in some form into this project over a decade and on my own time. The project has now reached an inflection point where the fun part is over, the science is done and the effect is real, and only a small, highly predictable set of milestones remain to deliver what I hope you agree is a much brighter future.
Before reading this, I doubt you would have believed it possible to provide the features described without complex infrastructure running in the target network; now I hope you'll join me in disproving one final impossibility.
While everything here will exist in time, it cannot exist in 2018 without your support, and that’s why I’d like to try something crazy, that would allow me to devote myself to delivering a vastly improved daily routine for thousands of people just like you and me.
You may have guessed already: I want you to crowdfund awesome tooling.
What value would you place on an extra productive hour every working week? In the UK that’s an easy question: it’s around $4,800 per year. And what risk is there to contributing $100 to an already proven component? I hope you’ll agree this too is a no-brainer, both for you and your employer.
To encourage success I’m offering a unique permanent placement of your brand on the GitHub repository and documentation. Funds will be returned if the minimum goal cannot be reached, however just 3 weeks are sufficient to ensure a well tested extension, with my full attention given to every bug, ready to save many hours right on time to enjoy the early sunlight of Spring.
Totalling much less than the economic damage caused by a single run of today’s Ansible, the grand plan is divided into incrementally related stretch goals. I cannot imagine this will achieve full funding, but if it does, as a finale I’ll deliver a feature built on Ansible that you never dreamed possible.
Deployment tooling is a young area, exposed to the ebb and flow of the software industry far more than most, and unexpected disruption happens continuously. Without ongoing evolution, exposure to buggy and unfamiliar new tooling is all but guaranteed, with benefits barely justifying the cost of their integration. As we know all too well, rational ideas like cost/benefit rarely win the hearts of buzzword-hungry and youthful infrastructure teams, so counterarguments must be presented another way.
As a recent example there is growing love for mgmt, which is designed from the outset as an agent-based reactive distributed system, much as Mitogen nudges Ansible towards. However unlike mgmt, Ansible preserves its zero-install and agentless nature, while laying a sound framework for significantly more exciting features. If that alone does not win loyalty, we’re at least guaranteed that every migration-triggering new feature implemented in such systems can be headed off with minimal effort, long into the foreseeable future.
After a long winter break from recreational programming, over the past days I finally built up steam and broke a chunk of new ground on Mitogen, this time growing its puny module forwarder into a bona fide beast, ready to handle almost any network condition and user code thrown at it.
Mitogen is a library for executing parts of a Python program in a remote context, primarily over sudo and SSH connections, and establishing bidirectional communication with those parts. Targeting infrastructure applications, it requires no upfront configuration of target machines, aside from an SSH daemon and Python 2.x interpreter, which is the default for almost every Linux machine found on any conceivable network.
The target need not possess a writeable filesystem, code is loaded dynamically on demand, and execution occurs entirely from RAM.
How Import Works
To implement dynamic loading, child Python processes (“contexts”) have a PEP-302 import hook installed that causes attempts to import modules unavailable locally to automatically be served over the network connection to the parent process. For example, in a script like:
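(A reconstruction for illustration only: requests and the host k3 come from the description below, while the get_url helper, the URL, and the mitogen.main()/router.ssh() entry points are my recollection of the library's documented API.)

```python
import requests
import mitogen

def get_url(url):
    # Runs inside the child; requests is imported there on demand,
    # served from the parent's copy over the connection.
    return requests.get(url).text

@mitogen.main()
def main(router):
    k3 = router.ssh(hostname='k3')
    print(k3.call(get_url, 'https://example.org/'))
```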
If the requests package is missing on the host k3, it will automatically be copied and imported in RAM, without requiring upfront configuration, or causing or requiring writes to the remote filesystem.
So far, so good. Just one hitch
While the loader has served well over the library’s prototypical life (which in real time, is approaching 12 years!), it has always placed severe limits on the structure of the loaded code, as each additional source file introduced one network round-trip to serve it.
Given a relatively small dependency such as Kenneth Reitz' popular Requests package, comprising 17 submodules, this means 17 additional network round-trips. While that may not mean much over a typical local area network segment where roundtrips are measured in microseconds, it quickly multiplies over even modest wide-area networks, where infrastructure tooling is commonly deployed.
For a library like Requests, 17 round-trips amount to 340ms of latency over a reasonably local 20ms link, which is comfortably within the realms of acceptable. Over common radio and international links of 200ms or more, however, this already adds at least 3.4 seconds to the startup cost of any Mitogen program: time wasted doing nothing but waiting on the network.
Sadly, Requests is hardly even the biggest dependency Mitogen can expect to encounter. For testing I chose django.db.models as a representative baseline: heavily integrated with all of Django, it transitively imports over 160 modules across numerous subpackages. That means on an international link, over 30 seconds of startup latency spent on one dependency.
It is worth noting that Django is not something I'd expect to see in a typical Mitogen program; it's simply an extraordinary worst-case target worth hitting. If Mitogen can handle django.db.models, it should cope with pretty much anything.
Combining evils, over an admittedly better-than-average Nepali mobile data network, and an international link to my IRC box and mail server in Paris, django.db.models takes almost 60 seconds to load with the old design.
In the real world, this one-file-per-roundtrip characteristic means the current approach sucks almost as much as Ansible does, which calls into doubt my goal of implementing an Ansible-trumping Ansible connection plug-in. Clearly something must give!
Over the years I discarded many approaches for handling this latency nightmare:
1. Having the user explicitly configure a module list to deliver upfront to new contexts, which sucks and is plainly unmaintainable.
2. Installing a PEP-302 hook in the master in order to observe the import graph, which would be technically exciting, but likely to suck horribly due to fragility and inevitable interference with real PEP-302 hooks, such as py2exe.
3. Observing the import graph caused by a function call in a single context, then using it to preload modules in additional contexts. This seems workable, except the benefit would only be felt by multiple-child Mitogen programs. Single-child programs would continue to pay the latency tax.
4. Variants of 2 and 3, except caching the result as intermediate state in the master's filesystem. Ignoring the fact that persistent intermediate state is always evil (a topic for later!), that would require weird and imperfect invalidation rules, which means performance would suck during development and prototyping, and bugs are possible where state gets silently wedged and previously working programs inexplicably slow down.
Finally last year I settled on using static analysis, and restricting preloading at package boundaries. When a dependency is detected in a package external to the one being requested, it is not preloaded until the child has demonstrated, by requesting the top-level package module from its parent, that the child lacks all of the submodules contained by it.
This seems like a good rule: preloading can occur aggressively within a package, but must otherwise wait for a child to signal a package as missing before preemptively wasting time and bandwidth delivering code the child never needed.
As a final safeguard, preloading is restricted to only modules the master itself loaded. It is not sufficient for an import statement to exist: surrounding conditional logic must have caused the module to be loaded by the master. In this manner the semantics of platform, version-specific and lazy imports are roughly preserved.
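Stated as code, the rule looks roughly like the sketch below; this is a statement of the policy rather than Mitogen's actual implementation, and may_preload is an invented name:

```python
import sys

def may_preload(requested, dependency):
    """Decide whether `dependency` may be sent ahead of the module the
    child actually asked for (`requested`). Both are fully-qualified names."""
    # Only modules the master itself imported are candidates, which roughly
    # preserves the semantics of platform-specific and conditional imports.
    if dependency not in sys.modules:
        return False
    # Preload aggressively within the same top-level package; anything
    # external must wait until the child requests that package explicitly.
    return dependency.split('.')[0] == requested.split('.')[0]

# may_preload('django.db.models', 'django.db.utils')  -> True (same package)
# may_preload('django.db.models', 'pytz')             -> False (external)
```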
Syntax tree hell
Quite predictably, after attempting to approach the problem with regexes, I threw my hands up on realizing a single regex may not handle every possible import statement:
import a as b
from a import b
from a import b as c
from a import (b, c, d)
I gleefully thought I'd finally found a use for the compiler and ast modules, the obvious alternative for avoiding a rat's nest of multiple regexes. Not quite. You see, across Python releases the grammar has changed, and in lock-step so have the representations exported by the compiler and ast modules.
Adding insult to injury: neither module is supported through every interesting Python version. I have seen Python 2.4 deployed commercially as recently as summer 2016, and therefore consider it mandatory for the kind of library I want on my toolbelt. To support antique and chic Python alike, it was necessary to implement both approaches and select one at runtime. Many might see this as an opportunity to drop 2.4, but “just upgrade lol” is never a good answer while maintaining long shelf-life systems, and should never be a barrier to applying a trusted Swiss Army Knife.
After some busy days last September, I had a working scanner built around syntax trees, except for a tiny problem: it was ridiculously slow. Parsing the 8KiB mitogen.core module took 12ms on my laptop, which multiplied up is over a second of CPU burnt scanning dependencies for a package like Django. If memory serves, reality was closer to 3 seconds: far exceeding the latency saved while talking to a machine on a LAN.
Sometimes hacking bytecode makes perfect sense
I couldn’t stop groaning the day I abandoned ASTs. As is often true when following software industry best practice, we are left holding a decomposing trout that, while technically fulfilling its role, stinks horribly, costs all involved a fortune to support and causes pains worse than those it was intended to relieve. Still hoping to avoid regexes, I went digging for precedent elsewhere in tools dealing with the same problem.
That’s when I discovered the strange and unloved modulefinder buried in the standard library, a forgotten relic from a bygone era, seductively deposited there as a belated Christmas gift to all, on a gloomy New Year’s Eve 2002 by Guido’s own brother. Diving in, I was shocked and mesmerized to find dependencies synthesized by recompiling each module and extracting IMPORT_FROM opcodes from the compiled bytecode. Reimplementing a variant, I was overjoyed to discover django.db.models transitive dependencies enumerated in under 350ms on my laptop. A workable solution!
The solution has some further crazy results: IMPORT_FROM has barely changed since the Python 2.4 days, right through to Python 3.x. The same approach works everywhere, including PyPy, which uses the same bytecode format, making this more portable than the ast and compiler modules!
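As a rough sketch of the idea (not modulefinder's or Mitogen's code): compile a module's source and walk the resulting bytecode for import opcodes, where IMPORT_NAME carries the name of the module being imported. The snippet uses the Python 3 dis API for brevity; antique interpreters need a manual walk over co_code, and imports nested inside functions live in co_consts.

```python
import dis

def scan_imports(path):
    # Compile the module's source and collect the names it imports at
    # module level, in the spirit of the stdlib modulefinder.
    with open(path) as fp:
        code = compile(fp.read(), path, 'exec')
    names = []
    for inst in dis.get_instructions(code):
        if inst.opname == 'IMPORT_NAME':
            names.append(inst.argval)
    return names

# scan_imports('mitogen/core.py') might return ['sys', 'os', 'socket', ...]
```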
Coping with concurrency
Now that a mechanism exists to enumerate dependencies, we need a mode of delivery. The approach used is simplistic and, as seen later, will likely require future improvement.
On receiving a GET_MODULE message from a child, a parent (don’t forget, Mitogen operates recursively!) first tries to satisfy the request from its own cache, before forwarding it upwards towards the master. The master sends LOAD_MODULE messages for all dependencies known to be missing from the child before sending a final message containing the module that was actually requested. Since contexts always cache unsolicited LOAD_MODULE messages from upstream, by the time the message arrives for the requested module, many dependencies should be in RAM and no further network roundtrips requesting them are required.
Meanwhile, for each stream connected to a parent, the set of module names ever delivered on that stream is recorded. Each parent is allowed to ignore any GET_MODULE for which a corresponding LOAD_MODULE has already been sent, preventing a race between in-flight requests from causing the same module to be sent twice.
This places the onus on downstream contexts to ensure the single LOAD_MODULE message received for each distinct module always reaches every interested party. In short, GET_MODULE messages must be deduplicated and synchronized not only for any arriving from a context’s children, but also from its own threads.
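In code, the bookkeeping amounts to something like this toy sketch; the real classes in Mitogen are rather more involved, and the names here are invented:

```python
class Stream(object):
    """One connection to a direct child."""
    def __init__(self, name):
        self.name = name
        self.sent = set()   # module names a LOAD_MODULE was ever sent for

def send_load_module(stream, fullname, source):
    # A parent may ignore any GET_MODULE whose LOAD_MODULE already went out
    # on this stream, so no module is ever delivered twice.
    if fullname not in stream.sent:
        stream.sent.add(fullname)
        print('LOAD_MODULE %s -> %s (%d bytes)' % (fullname, stream.name, len(source)))

def on_get_module(stream, fullname, cache, deps):
    # Known-missing dependencies go first, the requested module last, so by
    # the time the final message lands most of its imports are already cached.
    for dep in deps.get(fullname, ()):
        send_load_module(stream, dep, cache[dep])
    send_load_module(stream, fullname, cache[fullname])
```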
And finally the result. For my test script, the total number of roundtrips dropped from 166 to 13: one of those is for the script itself, and 3 are negative requests for extension modules that cannot be transferred. That leaves, bugs aside, 9 roundtrips to transfer the most obscene dependency I could think of.
One more look at the library’s network profile. Over the same connection as previously, the situation has improved immensely:
Not only is performance up, but the number of frames transmitted has dropped by 42%. That's 42% fewer chances of a connection hang due to crappy WiFi!
One final detail is visible: around the 10 second mark, a tall column of frames is sent with progressively increasing size, almost in the same instant. This is not some bug, it is Path MTU Discovery (PMTUD) in action. PMTUD is a mechanism by which IP subprotocols can learn the maximum frame size tolerated by the path between communicating peers, which in turn maximizes link efficiency by minimizing bandwidth wasted on headers. The size is ramped up until either loss occurs or an intermediary signals error via ICMP.
Just like the network path, PMTUD is dynamic and must restart on any signal indicating network conditions have changed. Comparing this graph with the previous, we see one final improvement as a result of giving the network layer enough data to do its job: PMTUD appears to restart much less frequently, and the stream is pegged at the true path MTU for much longer.
Aside from simple fixes to reduce wasted roundtrips for extension modules that can’t be imported, and optional imports of top-level packages that don’t exist on the master, there are two major niggles remaining in how import works today.
The first is an irritating source of latency present in deep trees: currently it is impossible for intermediary nodes satisfying GET_MODULE requests for children to begin streaming preloaded modules towards a child until the final LOAD_MODULE arrives at the intermediary for the module actually requested by the child. That means preloading is artificially serialized at each layer in the tree, when a better design would allow it to progress concurrently with the LOAD_MODULE messages still in flight from the master.
This will present itself when doing multi-machine hops where links between the machines are slow or suffer high latency. It will also be important to fix before handling hundreds to thousands of children, such as should become practical once asynchronous connect() is implemented.
There are various approaches to tweaking the design so that concurrency is restored, but I would like to let the paint dry a little on the new implementation before destabilizing it again.
The second major issue is almost certainly a bug waiting to be discovered, but I’m out of energy to attack it right now. It relates to complex situations where many children have different functions invoked in them, from a complex set of overlapping packages. In such cases, it is possible that a LOAD_MODULE for an unrelated GET_MODULE prematurely delivers the final module from another import, before it has had all requisite modules preloaded into the child.
To fix that, the library must ensure the tree of dependencies for every module request is sent downstream depth-first, i.e. it is never possible for any module to appear in a LOAD_MODULE before all of its dependencies have appeared first.
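One way to state that guarantee, assuming an acyclic map from each module to the names it imports (a sketch of the invariant, not the eventual fix):

```python
def depth_first(fullname, deps, seen=None):
    # Yield modules so that every dependency appears before anything that
    # imports it; sending LOAD_MODULE messages in this order satisfies the
    # invariant described above.
    if seen is None:
        seen = set()
    for dep in deps.get(fullname, ()):
        if dep not in seen:
            for name in depth_first(dep, deps, seen):
                yield name
    if fullname not in seen:
        seen.add(fullname)
        yield fullname

# deps = {'pkg.mod': ['pkg', 'pkg.util'], 'pkg.util': ['pkg']}
# list(depth_first('pkg.mod', deps)) == ['pkg', 'pkg.util', 'pkg.mod']
```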
Finally there are latency sources buried elsewhere in the library, including at least 2 needless roundtrips during connection setup. Fighting latency is an endless war, but with module loading working efficiently, the most important battle is over.
First and foremost, a glaring error: the linked CCC talk had absolutely nothing to do with the group that originated the KAISER patches. Please accept my sincerest apologies for that, I have absolutely no idea how the topics became intermingled -- possibly due to spending only a few hours reading, and around an hour at 4AM writing.
I cannot emphasize enough the article is (and can only be) supposition in a domain I lack expertise in, as repeatedly highlighted throughout. Written in the style of a paranoid and conspiratorial murder mystery, it was as fun to write as I hope it was to read, and besides, who doesn't enjoy a planetary scale whodunnit to pick over during the holidays? (Well, apparently a lot of super serious infosec types)
As with most stuff I post, the aim is not to be right on the Internet (after all, I lack a career in infosec!), but to generate the kind of frothy noise that leads to learning more about whatever it is I'm playing with, often precisely by being horribly wrong. In this case, anonymous feedback alongside some Hacker News commentary cleared up much of my conflation, indicated the issue may not affect AMD CPUs, and the embargo may end on January 4, now-common knowledge tidbits I consider resoundingly positive successes of writing the article at all.
The article was largely sourced from the marvellous work of the wonderful editor over at LWN, and intentionally did not make use of LWN's member link function to bypass the paywall. As is evident in various places on the Internet, the original LWN article has been shared repeatedly, but that decades old, consistently trustworthy yet dry format was insufficiently clickbaity for communicating what I believe promises to be one of the most interesting events in 2018 relevant to my profession.
I hope you would agree that however inaccurate, it contained sufficient additional value as to not be considered a clone, and meanwhile has thus far delivered 1,100 clicks to a page requesting the viewer subscribe to LWN to continue reading, creating awareness for what I worry is a high quality yet continuously surviving niche news outlet. This I also consider a positive result.
An original source branded this blog a "regurgibloid" for daring to link some already-public tweets that, to my knowledge, have never previously appeared on the same page anywhere on the Internet, suggesting instead that the common uninformed techie shall have but two options for acquiring what was allegedly already public knowledge: consume thousands of tweets comprised primarily of egotistical teen angst stretching back aeons, often filled with even less accurate supposition than this article, or remain coldly in the dark until such times as CPUBLEED.COM (or whatever this bug is branded), powered by WordPress and hastily drawn cartoons, gets peppered all over the BBC news.
It's sufficient to say this attitude is beyond repugnant, frankly embarrassing, and there could be nothing closer to why I'm thankful almost daily that I never pursued a career in infosec. Please remind yourself at least once a year, us clueless lowly plebeian developers are the very reason you earn an income at all! Without us there'd be nothing to break, no precious knowledge to hoard, and nothing over which to self-aggrandize.
And finally: since I signed up for Tumblr, its editor UI, once a bastion of modern web app design, has regressed to the point of almost total unusability. Correcting the numerous typos in the post required reformatting it from start to end each time I clicked save. I have left the typos verbatim, partly as a sign of the haste in which it was written, but mainly as a kind of negative reinforcement loop to eventually push me off Tumblr. Lord knows it would avert some of the cheapest feedback the article received.
I wish there were some moral to finish with, but really the holidays are over, the mystery continues, and all that remains is a bad taste from all the flak I have received for daring to intrude upon the sacred WordPress-powered tapestry of a global security embargo. Trust me, it will never happen again -- life is simply too short.
tl;dr: there is presently an embargoed security bug impacting apparently all contemporary CPU architectures that implement virtual memory, requiring hardware changes to fully resolve. Urgent development of a software mitigation is being done in the open and recently landed in the Linux kernel, and a similar mitigation began appearing in NT kernels in November. In the worst case the software fix causes huge slowdowns in typical workloads. There are hints the attack impacts common virtualization environments including Amazon EC2 and Google Compute Engine, and additional hints the exact attack may involve a new variant of Rowhammer.
I don't really care much for security issues normally, but I adore a little intrigue, and it seems anyone who would normally write about these topics is either somehow very busy, or already knows the details and isn't talking, which leaves me with a few hours on New Year's Day to go digging for as much information about this mystery as I could piece together.
Beware this is very much a connecting-the-invisible-dots type affair, so it mostly represents guesswork until such times as the embargo is lifted. From everything I’ve seen, including the vendors involved, many fireworks and much drama is likely when that day arrives.
The purpose of the series is conceptually simple: to prevent a variety of attacks by unmapping as much of the Linux kernel as possible from the process page table while the process is running in user space, greatly hindering attempts to identify kernel virtual address ranges from unprivileged userspace code.
The group’s paper describing KAISER, KASLR is Dead: Long Live KASLR, makes specific reference in its abstract to removing all knowledge of kernel address space from the memory management hardware while user code is active on the CPU.
Of particular interest with this patch set is that it touches a core, wholly fundamental pillar of the kernel (and its interface to userspace), and that it is obviously being rushed through with the greatest priority. When reading about memory management changes in Linux, usually the first reference to a change happens long before the change is ever merged, and usually after numerous rounds of review, rejection and flame war spanning many seasons and moon phases.
The KAISER (now KPTI) series was merged in some time less than 3 months.
On the surface, the patches appear designed to ensure Address Space Layout Randomization remains effective: this is a security feature of modern operating systems that attempts to introduce as many random bits as possible into the address ranges for commonly mapped objects.
For example, on invoking /usr/bin/python, the dynamic linker will arrange for the system C library, heap, thread stack and main executable to all receive randomly assigned address ranges:
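(A stand-in illustration rather than the original listing: it prints the running interpreter's own heap mapping, which lands at a different address range on every execution while ASLR is active.)

```python
# Run this a few times; the [heap] start and end addresses change each run.
with open('/proc/self/maps') as fp:
    for line in fp:
        if '[heap]' in line:
            print(line.rstrip())
```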
Notice how the start and end offsets of the heap change across runs.
The effect of this feature is that, should a buffer management bug lead to an attacker being able to overwrite some memory address pointing at program code, and that address should later be used in program control flow, such that the attacker can divert control flow to a buffer containing contents of their choosing, it becomes much more difficult for the attacker to populate the buffer with machine code that would lead to, for example, the system() C library function being invoked, as the address of that function varies across runs.
This is a simple example, ASLR is designed to protect many similar such scenarios, including preventing the attacker from learning the addresses of program data that may be useful for modifying control flow or implementing an attack.
KASLR is “simply” ASLR applied to the kernel itself: on each reboot of the system, address ranges belonging to the kernel are randomized such that an attacker who manages to divert control flow while running in kernel mode cannot guess addresses for functions and structures necessary for implementing their attack, such as locating the current process data, and flipping the active UID from an unprivileged user to root, etc.
Bad news: the software mitigation is expensive
The primary reason for the old Linux behaviour of mapping kernel memory in the same page tables as user memory is so that when the user’s code triggers a system call, fault, or an interrupt fires, it is not necessary to change the virtual memory layout of the running process.
Since it is unnecessary to change the virtual memory layout, it is further unnecessary to flush highly performance-sensitive CPU caches that are dependant on that layout, primarily the Translation Lookaside Buffer.
With the page table splitting patches merged, it becomes necessary for the kernel to flush these caches every time the kernel begins executing, and every time user code resumes executing. For some workloads, the effective total loss of the TLB around every system call leads to highly visible slowdowns: @grsecurity measured a simple case where Linux “du -s” suffered a 50% slowdown on a recent AMD CPU.
Recap: Virtual Memory
In the usual case, when some machine code attempts to load, store, or jump to a memory address, modern CPUs must first translate this virtual address to a physical address, by way of walking a series of OS-managed arrays (called page tables) that describe a mapping between virtual memory and physical RAM installed in the machine.
Virtual memory is possibly the single most important robustness feature in modern operating systems: it is what prevents, for example, a dying process from crashing the operating system, a web browser bug crashing your desktop environment, or one virtual machine running in Amazon EC2 from effecting changes to another virtual machine on the same host.
The attack works by exploiting the fact that the CPU maintains numerous caches, and by carefully manipulating the contents of these caches, it is possible to infer which addresses the memory management unit is accessing behind the scenes as it walks the various levels of page tables, since an uncached access will take longer (in real time) than a cached access. By detecting which elements of the page table are accessed, it is possible to recover the majority of the bits in the virtual address the MMU was busy resolving.
Evidence for motivation, but not panic
We have found motivation, but so far we have not seen anything to justify the sheer panic behind this work. ASLR in general is an imperfect mitigation and very much a last line of defence: barely six months go by without even a non-security-minded person being able to read about some new method for unmasking ASLR'd pointers, and reality has been this way for as long as ASLR has existed.
Fixing ASLR alone is not sufficient to describe the high priority motivation behind the work.
Evidence: it’s a hardware security bug
From reading through the patch series, a number of things are obvious.
First of all, as @grsecurity points out, some comments in the code have been redacted, and additionally the main documentation file describing the work is presently missing entirely from the Linux source tree.
Examining the code, it is structured in the form of a runtime patch applied at boot only when the kernel detects the system is impacted, using exactly the same mechanism that, for example, applies mitigations for the infamous Pentium F00F bug:
More clues: Microsoft have also implemented page table splitting
From a little digging through the FreeBSD source tree, it seems that so far other free operating systems are not implementing page table splitting, however as noted by Alex Ionescu on Twitter, the work is already not limited to Linux: public NT kernels from as early as November have begun to implement the same technique.
In this paper, we present novel Rowhammer attack and exploitation primitives, showing that even a combination of all defenses is ineffective. Our new attack technique, one-location hammering, breaks previous assumptions on requirements for triggering the Rowhammer bug
As a quick recap, Rowhammer is a class of problem fundamental to most (all?) kinds of commodity DRAMs, such as the memory in the average computer. Through precise manipulation of one area of memory, it is possible to cause degradation of storage in a related (but otherwise logically distinct) area of memory. The effect is that Rowhammer can be used to flip bits of memory that unprivileged user code should have no access to, such as bits describing how much access that code should have to the rest of the system.
I found this work on Rowhammer particularly interesting, not least for its release being in such close proximity to the page table splitting patches, but because Rowhammer attacks require a target: you must know the physical address of the memory you are attempting to mutate, and a first step to learning a physical address may be learning a virtual address, such as in the KASLR unmasking work.
Guesswork: it affects major cloud providers
On the kernel mailing list we can see, in addition to the names of subsystem maintainers, e-mail addresses belonging to employees of Intel, Amazon and Google. The presence of the two largest cloud providers is particularly interesting, as this provides us with a strong clue that the work may be motivated in large part by virtualization security.
Which leads to even more guessing: virtual machine RAM, and the virtual memory addresses used by those virtual machines are ultimately represented as large contiguous arrays on the host machine, arrays that, especially in the case of only 2 tenants on a host machine, are assigned by memory allocators in the Xen and Linux kernels that likely have very predictable behaviour.
Favourite guess: it is a privilege escalation attack against hypervisors
Putting it all together, I would not be surprised if we start 2018 with the release of the mother of all hypervisor privilege escalation bugs, or something similarly systematic as to drive so much urgency, and the presence of so many interesting names on the patch set’s CC list.
One final tidbit: while I've lost my place reading through the patches, there is some code that specifically marks either paravirtual or HVM Xen as unaffected.
Invest in popcorn, 2018 is going to be fun
It's totally possible this guess is miles off reality, but one thing is for sure: it's going to be an exciting few weeks when whatever this is finally gets published.
After many years of occasional commitment, I'm finally getting close to a solid implementation of a module I've been wishing existed for over a decade: given a remote machine and an SSH connection, just magically make Python code run on that machine, with no hacks involving error-prone shell snippets, temporary files, or hugely restrictive single use request-response shell pipelines, and suchlike.
I'm borrowing some biology terminology and calling it Mitogen, as that's pretty much what the library does. Apply some to your program, and it magically becomes able to recursively split into self-replicating parts, with bidirectional communication and message routing between all the pieces, without any external assistance beyond an SSH client and/or sudo installation.
Mitogen's goal is straightforward: make it child's play to run Python code on remote machines, eventually regardless of connection method, without being forced to leave the rich and error-resistant joy that is a pure-Python environment. My target users would be applications like Ansible, Salt, Fabric and similar, who (through no fault of their own) are universally forced to resort to obscene hacks in their implementations to effect a similar result. Mitogen may also be of interest to would-be authors of pure Python Internet worms, although support for autonomous child contexts is currently (and intentionally) absent.
Because I want this tool to be useful to infrastructure folk, Mitogen does not require free disk space on the remote machines, or even a writeable filesystem -- everything is done entirely in RAM, making it possible to run your infrastructure code against a damaged machine, for example to implement a repair process. Newly spawned Python interpreters have import hooks and logging handlers configured so that everything is fetched or forwarded over the network, and the only disk accesses necessary are those required to start a remote interpreter.
Mitogen can be used recursively: newly started child contexts can in turn be used to run portions of itself to start children-of-children, with message routing between all contexts handled automatically. Recursion is used to allow first SSHing to a machine before sudoing to a new account, all with the user's Python code retaining full control of each new context, and executing code in them transparently, as easily as if no SSH or sudo connection were involved at all. The master context is able to control and manipulate children created in this way as easily as if they were directly connected; the API remains the same.
Currently there exist just two connection methods, ssh and sudo, with the sudo support able to cope with typing passwords interactively, and with crap configurations that have requiretty enabled.
I am explicitly planning to support Windows, either via WMI, psexec, or Powershell Remoting. As for other more exotic connection methods, I might eventually implement bootstrap over an IPMI serial console connection if for nothing else then as a demonstrator of how far this approach can be taken, but the ability to use the same code to manage a machine with or without a functional networking configuration would be in itself a very powerful feature.
This looks a bit like X. Isn't this just X?
Mitogen is far from the first Python library to support remote bootstrapping, but it may be the first to specifically target infrastructure code, minimal networking footprint, read-only filesystems, stdio and logging redirection, cross-child communication, and recursive operation. Notable similar packages include Pyro and py.execnet.
This looks a bit like Fabric. Isn't this just Fabric?
Fabric's API feels kinda similar to what Mitogen offers, but it fundamentally operates in terms of chunks of shell snippets to implement all its functionality. You can't easily (at least, as far as I know) trick Fabric into running your Python code remotely, or for that matter recursively across subsequent sudo and SSH connections, and arrange for that code to communicate bidirectionally with code running in the local process and autonomously between any spawned children.
Mitogen internally reuses this support for bidirectional communication to implement some pretty exciting functionality:
SSH Client Emulation
So your program has an elaborate series of tunnels setup, and it's running code all over the place. You hit a problem, and suddenly feel the temptation to drop back to raw shell and SSH again: "I just need to sync some files!", you tell yourself, before loudly groaning on realizing the spaghetti of duplicated tunnel configurations that would be required to get rsync running the same way as your program. What's more, you realize that you can't even use rsync, because you're relying on Mitogen's ability to run code over sudo with requiretty enabled, and you can't even directly log into that target account.
Not a problem: Mitogen supports running local commands with a modified environment that causes their attempt to use SSH to run remote command lines to be redirected into Mitogen, and tunnelled over your program's existing tunnels. No duplicate configuration, no wasted SSH connections, no 3-way handshake latency.
The primary goal of the SSH emulator is to simplify porting existing infrastructure scripts away from shell, including those already written in Python. As a first concrete target for Mitogen, I aim to retrofit it to Ansible as a connection plug-in, where this functionality becomes necessary to support e.g. Ansible's synchronize module.
Compared To Ansible
To understand the value of Mitogen, a short comparison against Ansible may be useful. I created an Ansible playbook talking to a VMWare Fusion Ubuntu machine, with SSH pipelining enabled (the current best performance mode in Ansible). The playbook simply executes /bin/true with become: true and discards the result 100 times.
[Side note: this is comparing performance characteristics only, in particular I am not advocating writing code against Mitogen directly! It's possible, but you get none of the ease of use that a tool like Ansible provides. On saying that, though, a Mitogen-enabled tool composed of tens of modules would have similar performance to the numbers below, just a slightly increased base cost due to initial module upload]
Mitogen local loop
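Roughly what the local loop variant does, going from memory of the API; the hostname is invented and details may differ from the exact benchmark script:

```python
import subprocess
import mitogen

@mitogen.main()
def main(router):
    # One SSH child, then a root child reached via it, mirroring become: true.
    host = router.ssh(hostname='u1604')
    root = router.sudo(via=host)
    # The loop stays on the master, so every call is one network round-trip.
    for _ in range(100):
        root.call(subprocess.check_call, ['/bin/true'])
```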
Mitogen remote loop
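And the remote loop variant, again a sketch rather than the exact benchmark script: the loop itself is shipped to the child, so the hundred invocations involve no per-call round-trips at all.

```python
import subprocess

def run_true_many(count=100):
    # Executes entirely inside the child; only the final return value
    # travels back over the network.
    for _ in range(count):
        subprocess.check_call(['/bin/true'])

# In main() from the previous sketch, replacing the local loop:
#     root.call(run_true_many, 100)
```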
The first and most obvious property of Ansible is that it uses a metric crap-ton of bandwidth, averaging 45kb of data for each run of /bin/true. In comparison, the raw command line "ssh host /bin/true" generates only 4.7kb and 311ms, including SSH connection setup and teardown.
Bandwidth aside, CPU alone cannot account for runtime duration, clearly significant roundtrips are involved, generating sufficient latency to become visible on an in-memory connection to a local VM. Why is that? Things are about to get real ugly, and I'm already starting to feel myself getting depressed. Remember those obscene hacks I mentioned earlier? Well, buckle your seatbelt Dorothy, because Kansas is going bye-bye..
[Side note: the name Ansible is borrowed from Ender's Game, where it refers to a faster-than-light communication technology. Giggles]
When you write some code in Ansible, like shell: /bin/true, you are telling Ansible (in most cases) that you want to execute a module named shell.py on the target machine, passing /bin/true as its argument.
So far, so logical. But how is Ansible actually running shell.py? "Simple", by default (no pipelining) it looks like this:
First it scans shell.py for every module dependency,
then it adds the module and all dependents into an in-memory ZIP file, alongside a file containing the module's serialized arguments,
then it base64-encodes this ZIP file and mixes it into a templatized self-extracting Python script (module_common.py),
then it writes the templatized script to the local filesystem, where it can be accessed by sftp,
first it creates yet another temporary directory on the target machine, this time using the tempfile module,
then it writes a base64-decoded copy of the embedded ZIP file as ansible_modlib.zip into that directory,
then it opens the newly written ZIP file using the zipfile module and extracts the module to be executed into the same temporary directory, named like ansible_mod_<modname>.py,
then it opens the newly written ZIP file in append mode and writes a custom sitecustomize.py module into it, causing the ZIP file to be written to disk for a second time on this machine, and a third time in total,
then it uses the subprocess module to execute the extracted script, with PYTHONPATH set to cause Python's ZIP importer to search for additional dependent modules inside the extracted-and-modified ZIP file,
then it uses the shutil module to delete the second temporary directory,
then the shell snippet that executed the templatized script is used to run rm -rf over the first temporary directory.
When pipelining is disabled, which is the default, and required for cases where sudo has requiretty enabled, these steps (and their associated network roundtrips) recur for every single playbook step. And now you know why Ansible makes execution over a local 1Gbit LAN feel like it's communicating with a host on Mars.
Need a breath? Don't worry, things are about to get better. Here are some pretty graphs to look at while you're recovering..
The Ugly (from your network's perspective)
This shows Ansible's pipelining mode, constantly reuploading the same huge data part and awaiting a response for each run. Be sure to note the sequence numbers (transmit byte count) and the scale of the time axis:
Now for Mitogen, demonstrating vastly more conservative use of the network:
The SSH connection setup is clearly visible in this graph, accounting for about the first 300ms on the time axis. Additional excessive roundtrips are visible as Mitogen waits for its command-line to signal successful first stage bootstrap before uploading the main implementation, followed by 2 subsequent roundtrips, first to fetch the mitogen.sudo module and then the mitogen.master module. Eliminating module import roundtrips like these will probably be an ongoing battle, but there is a clean 80% solution that would apply in this specific case; I just haven't gotten around to implementing it yet.
The fine curve representing repeated executions of /bin/true is also visible: each bump in the curve corresponds to what was a huge data upload in Ansible's trace earlier, but since Mitogen caches code in RAM on the target, it doesn't need to re-upload everything for each call, or start a new Python process, or rewrite a ZIP file on disk, or .. etc.
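The "caches code in RAM" part deserves a concrete picture. The snippet below is a hypothetical illustration, not Mitogen's code: a meta_path importer serving modules from an in-memory source cache, which is roughly how a child context can keep satisfying imports it has already fetched from its parent without ever touching the disk.

import importlib.abc
import importlib.util
import sys

# Module source held purely in memory; in the real thing this would be
# filled over the connection to the parent context and kept between calls.
SOURCE_CACHE = {
    'fakemod': "def greet():\n    return 'loaded from RAM'\n",
}

class InMemoryFinder(importlib.abc.MetaPathFinder, importlib.abc.SourceLoader):
    def find_spec(self, fullname, path=None, target=None):
        if fullname in SOURCE_CACHE:
            return importlib.util.spec_from_loader(fullname, self)
        return None

    def get_filename(self, fullname):
        return fullname                    # reused as the cache key below

    def get_data(self, path):
        return SOURCE_CACHE[path].encode()

sys.meta_path.insert(0, InMemoryFinder())

import fakemod                             # satisfied from RAM, no disk involved
print(fakemod.greet())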
Finally one last graph, showing Mitogen with the execution loop moved to the remote machine. All the latency induced by repeatedly invoking /bin/true from the local machine has disappeared.
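To make that difference concrete in code, here is a speculative sketch. It assumes an API along the lines of what Mitogen's documentation describes (router.ssh(), Context.call(), mitogen.utils.run_with_router); the host name and the run_many() helper are invented for illustration, and none of this is taken from the prototype extension.

import subprocess
import mitogen.utils

def run_many(cmd, count):
    # Runs entirely inside the target context: no per-iteration roundtrip.
    for _ in range(count):
        subprocess.check_call(cmd)
    return count

def main(router):
    host = router.ssh(hostname='testvm.example.com')

    # Loop on the master: one CALL_FUNCTION roundtrip per /bin/true run.
    for _ in range(100):
        host.call(subprocess.check_call, ['/bin/true'])

    # Loop moved to the target: a single roundtrip covers all 100 runs.
    host.call(run_many, ['/bin/true'], 100)

if __name__ == '__main__':
    mitogen.utils.run_with_router(main)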
The Less Ugly
Ansible's pipelining mode is much better, and somewhat resembles Mitogen's own bootstrap process. Here the templatized initial script is fed directly into the target Python interpreter; however, the two immediately deviate, since Ansible picks up at step 8 above, extracting the embedded ZIP file to disk and discarding all the code it uploaded once the playbook step completes, with no effort made to preserve either the spawned Python processes or the significant amount of module code uploaded for each step.
Pipelining mode is a huge improvement, however it still uses the SSH stdio pipeline only once (which was expensive to set up, even with multiplexing enabled), uses the destination Python interpreter only once (usually 100 ms+ per invocation), and, as mentioned repeatedly, caches no code on the target, not even on disk.
When Mitogen is executing your Python function:
1. it executes SSH with a single Python command-line,
2. then it waits for that command-line to report "EC0" on stdout,
3. then it writes a copy of itself over the SSH pipe,
3.1. meanwhile the remote Python interpreter forks into two processes,
3.2. the first re-execs itself to clear the huge Python command-line passed over SSH, and resets argv to something descriptive,
3.3. the second signals "EC0" and waits for the parent context to send 7KiB worth of Mitogen source, which it decompresses and feeds to the first before exiting,
3.4. the Mitogen source reconfigures the Python module importer, stdio, and logging framework to point back into itself, then starts a private multiplexer thread,
3.5. the main thread writes "EC1" then sleeps waiting for CALL_FUNCTION messages,
3.6. meanwhile the multiplexer routes messages between this context's main thread, the parent, and any child contexts, and waits for something to trigger shutdown,
4. then it waits for the remote process to report "EC1",
5. then it writes a CALL_FUNCTION message which includes the target module, class, and function name and parameters,
5.1. the slave receives the CALL_FUNCTION message and begins execution, satisfying in-RAM module imports using the connection to the parent context as necessary.
On subsequent invocations of your Python function, or other functions from the same module, only steps 3.6, 5, and 5.1 are necessary.
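The handshake itself is easy to play with in isolation. The toy below is not Mitogen's real bootstrap: there is no fork or re-exec, no router, a local Python 3 child process stands in for ssh, and the length-prefix framing is invented for the toy. It only demonstrates the shape of steps 1-5: a tiny first stage announces "EC0", receives compressed source over the pipe, and that source announces "EC1" before serving requests.

import subprocess
import sys
import zlib

# Stage 1: small enough to pass on a command line (cf. step 1).
STAGE1 = (
    "import sys,zlib;"
    "sys.stdout.write('EC0\\n');sys.stdout.flush();"
    "b=sys.stdin.buffer;n=int(b.readline());"
    "exec(zlib.decompress(b.read(n)))"
)

# Stage 2: the "real" program, sent compressed over the pipe (cf. step 3).
STAGE2 = b"""
import sys
sys.stdout.write('EC1\\n')
sys.stdout.flush()
for line in sys.stdin:                 # a stand-in for CALL_FUNCTION messages
    sys.stdout.write('ran: %s' % line)
    sys.stdout.flush()
"""

proc = subprocess.Popen([sys.executable, '-c', STAGE1],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
assert proc.stdout.readline().strip() == b'EC0'     # step 2: stage 1 is up
payload = zlib.compress(STAGE2)
proc.stdin.write(b'%d\n' % len(payload))            # toy length prefix
proc.stdin.write(payload)                           # step 3: upload the source
proc.stdin.flush()
assert proc.stdout.readline().strip() == b'EC1'     # step 4: stage 2 is running
proc.stdin.write(b'/bin/true\n')                    # step 5: a pretend call
proc.stdin.flush()
print(proc.stdout.readline().decode().strip())      # prints: ran: /bin/true
proc.stdin.close()
proc.wait()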
This all sounds fine and dandy, but how can I use it?
I'm working on it! For now my goal is to implement enough functionality so that Mitogen can be made to work with Ansible's process model. The first problem is that Ansible runs playbooks using multiple local processes, and has no subprocess<->host affinity, so it is not immediately possible to cache Mitogen's state for a host. I have a solid plan for solving that, but it's not yet implemented.
There are a huge variety of things I haven't started yet, but will eventually be needed for more complex setups:
Asynchronous connect(): so large numbers of contexts can be spawned in reasonable time. For, say, 3 tiers targeting a 1,500-node network and connecting in 30 seconds or so: a per-rack tier connecting to 38-42 end nodes, a per-quadrant tier connecting to 10 or so racks, a single box in the datacentre tier providing access to a management LAN while reducing latency and caching uploaded modules within the datacentre's network, and the top-level tier, which is the master program itself.
Better Bootstrap, Module Caching And Prefetching: currently Mitogen is wasting network roundtrips in various places. This makes me lose sleep.
General Robustness: no doubt with real-world use, many edge cases, crashes, hangs, races and suchlike will be discovered. Of those, I'm most concerned with ensuring the master process never hangs on CTRL+C or SIGTERM, and that in the case of master disconnect, orphaned contexts shut down completely 100% of the time, even if their main thread has hung.
Better Connection Types: it should at least support SSH connection setup over a transparently forwarded TCP connection (e.g. via a bastion host), so that key material never leaves the master machine. Additionally I haven't even started on Windows support yet.
Security Audit: currently the package is using cPickle with a highly restrictive class whitelist. I still think it should be possible to use this safely, but I'm not yet satisfied that it is; a sketch of the whitelisting idea appears after this list. I'd also like it to optionally use JSON if the target Python version is modern enough. Additionally, some design tweaks are needed to ensure a compromised slave cannot use Mitogen to cross-infect neighbouring nodes.
Richer Primitives: I've spent so much effort keeping the core of Mitogen compact that the overall design has suffered, and while almost anything is possible using the base code, it often involves scrabbling around in the internal plumbing to get things working. Specifically I'd like to make it possible to pass Context handles as RPC parameters, and to generalise the fakessh code so that it can handle other kinds of forwarding (e.g. TCP connections, or additional UNIX pipe scenarios).
Tests: the big one. I've only started to think about tests recently as the design has settled, but so much system-level trickery is employed, always spread across at least 2 processes, that an effective test strategy remains elusive. Logical tests don't capture any of the complex OS/IO ordering behaviour, and while typical integration tests would capture that, they are too coarse to rely on for catching new bugs quickly and with strong specificity.
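On the Security Audit point above, the whitelist pattern is the standard find_class() override documented for the pickle module. The sketch below is purely illustrative and is not Mitogen's implementation; the class list and function names are made up.

import io
import pickle

ALLOWED = {
    ('builtins', 'list'),
    ('builtins', 'dict'),
    ('builtins', 'tuple'),
}

class WhitelistUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Only consulted for global/class references; plain containers,
        # strings and numbers never reach this hook.
        if (module, name) not in ALLOWED:
            raise pickle.UnpicklingError(
                'refusing to load %s.%s' % (module, name))
        return super().find_class(module, name)

def safe_loads(data):
    return WhitelistUnpickler(io.BytesIO(data)).load()

print(safe_loads(pickle.dumps({'path': '/bin/true', 'args': []})))  # fine
try:
    safe_loads(pickle.dumps(WhitelistUnpickler))    # a class reference: rejected
except pickle.UnpicklingError as exc:
    print(exc)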
Why are you writing about this now?
If you read this far, there's a good chance you either work in infrastructure tooling, or were so badly burned by your experience there that you moved into management. Either way, you might be the person who could help me spend more time on this project. Perhaps you are on a 10-person team with a budget, where 30% of the man-hours are being wasted on Ansible's connection latency? If so, you should definitely drop me an e-mail.
The problem with a project like this is that it is almost impossible to justify commercially: it is much closer to research than product, and nobody ever wants to pay for that. However, that phase is over. The base implementation looks clean and feels increasingly solid, my development tasks are becoming increasingly target-driven, and I'd love the privilege of polishing up what I have, to make contemporary devops tooling a significantly less depressing experience for everyone involved.
If you merely made it to the bottom of the article because you're interested or have related ideas, please drop me an e-mail. It's not quite ready for prime time, but things work well enough that early experimentation is probably welcome at this point.
Meanwhile I will continue aiming to make it suitable for use with Ansible, or perhaps a gentle fork of Ansible, since its internal layering isn't the greatest.