This is the first in what I hope will be at least a bi-weekly series to keep backers up to date on the current state of delivering the Mitogen extension for Ansible. I’m trying to use every second I have wisely until every major time risk is taken care of, so please forgive the knowledge-dump style of this post :)
Too long, didn’t read
Well ahead of schedule. Some exciting new stuff popped up, none of it intractably scary.
Funding Update
I have some fabulous news on funding: in addition to what was already public on Kickstarter, significant additional funding has become available, enough that I should be able to dedicate full time to the project for at least another 10 weeks!
Naturally this has some fantastic implications, including making it significantly more likely that I’ll be able to implement Topology-aware File Synchronization.
Python 3 Support
I could not commit to this previously for fear that Python 3 would become a huge and destabilizing time sink, ruining any chance of delivering more immediately useful functionality.
The missing piece (exception syntax) needed to support everything from Python 2.4 all the way to 3.x has been found - it came via an extraordinarily fruitful IRC chat with the Ansible guys, and was originally implemented in Ansible itself by Marius Gedminas. With this last piece of the puzzle, the only bugs left to worry about are renamed imports and the usual bytes/str battles, both trivial to address with strong tests - something already due in the coming weeks. It now seems almost guaranteed Python 3 support will be completed as part of this work, although I am still holding off on a 100% commitment until more pressing concerns are addressed.
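For the curious, here is a minimal sketch of that idiom as I understand it: "except Exception, e" is a syntax error on Python 3, while "except Exception as e" is a syntax error on Python 2.4/2.5, so the active exception is fetched from sys.exc_info() instead. The risky_operation() name below is just a placeholder.

```python
import sys

def get_exception():
    # Portable substitute for "except X, e" / "except X as e": return the
    # exception currently being handled.
    return sys.exc_info()[1]

def risky_operation():
    # Placeholder standing in for real work that may fail.
    raise IOError('disk on fire')

try:
    risky_operation()
except IOError:
    e = get_exception()
    print('failed: %s' % (e,))
```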
New risk: multiplexer throughput
Some truly insane performance bugs have already been found and fixed, particularly around the stress caused by delivering huge single messages. However, during that work a new issue surfaced: IO multiplexer throughput truly sucks for many small messages.
This doesn’t impact things much except in one area: file transfer. While I haven’t implemented a final solution for file transfer yet, as part of that work I will need to address what (for now) seems a hard single-thread performance limit: Mitogen’s current IO loop cannot push more than ~300MiB/sec in 128KiB-sized chunks, or, to put it another way, a best case of 3MiB/sec per target given 100 targets.
Single thread performance: the obvious solution is sharding the multiplexer across multiple processes, and that was already likely to be needed to complete the multithreaded connect work anyway. This is a straightforward change that promises to comfortably saturate a gigabit Ethernet port from a 2011-era MacBook while leaving plenty of room for components further up (Ansible) and down (ssh) the stack.
TTY layer: I’ve already implemented some fixes here (increased buffer sizes, fewer loop iterations), but they revealed some ugly new problems: the TTY layer in every major UNIX has, at best, around a 4KiB buffer, forcing many syscalls and loop iterations, and on no OS does this buffer appear to be tunable. Fear not, there is already a kick-ass solution for this too.
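To see the effect for yourself, here is a throwaway probe (my own, POSIX-only, not part of Mitogen) that measures how many bytes a pseudo-terminal will accept before the writer feels backpressure; the exact figure it prints varies by kernel and TTY settings.

```python
import errno
import fcntl
import os
import tty

master, slave = os.openpty()
tty.setraw(slave)                       # raw mode: no echo or line buffering

# Make writes non-blocking so a full kernel-side buffer shows up as EAGAIN
# rather than hanging this probe forever.
flags = fcntl.fcntl(master, fcntl.F_GETFL)
fcntl.fcntl(master, fcntl.F_SETFL, flags | os.O_NONBLOCK)

total = 0
chunk = b'x' * 1024
try:
    while True:
        total += os.write(master, chunk)
except OSError as e:
    if e.errno not in (errno.EAGAIN, errno.EWOULDBLOCK):
        raise

print('pty accepted %d bytes before the kernel pushed back' % total)
os.close(master)
os.close(slave)
```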
This problem should disappear entirely by the time real file transfer support is implemented - today the extension still delivers each file as a single large message. The blocker to fixing that is a missing flow control mechanism to prevent saturation of the message queue, which requires a little research. Hopefully this isn’t going to be a huge amount of work, and I’ve already got a bunch of no-brainer yet hacky ways to fix it - one possible shape is sketched below.
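Purely as an illustration of what "no-brainer yet hacky" could mean (this is my own sketch, not the mechanism that will ship), a simple credit-based window bounds how much of a file can pile up in the receive queue at once. The send_chunk and wait_for_ack callables are placeholders for whatever the real transport provides.

```python
WINDOW = 16                  # max unacknowledged chunks in flight per transfer
CHUNK_SIZE = 128 * 1024      # matches the 128KiB chunk size mentioned above

def send_file(fp, send_chunk, wait_for_ack):
    # Sender-side loop: never allow more than WINDOW * CHUNK_SIZE bytes to
    # sit unacknowledged in the receiver's queue.
    in_flight = 0
    while True:
        while in_flight >= WINDOW:
            wait_for_ack()           # block until the receiver drains a chunk
            in_flight -= 1
        chunk = fp.read(CHUNK_SIZE)
        if not chunk:
            break                    # EOF
        send_chunk(chunk)
        in_flight += 1
```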
New risk: task isolation
It was only a matter of time, but the first isolation-related bug has been found: a class variable in a built-in Ansible module persists some state across invocations of the module’s main() function. I’d been expecting something of this sort, so I already had ideas for solving it when it came up, and really it was quite a surprise that only one such bug turned up among all the reports from initial testers.
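For anyone unfamiliar with this bug class, here is a contrived distillation (not the actual module’s code): when main() is re-invoked inside one long-lived interpreter rather than a fresh process per task, anything stashed on a class or module attribute silently survives into the next task.

```python
class Repo(object):
    seen = []                     # class attribute: shared by every invocation

def main(name):
    # Stand-in for an Ansible module's entry point, re-run once per task.
    Repo.seen.append(name)
    return list(Repo.seen)

print(main('epel'))               # ['epel']
print(main('updates'))            # ['epel', 'updates']  <- leaked from task 1
```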
The obvious solution is forking a child for each task by default, however as always the devil is in the details, and in many intractable ways forking actually introduces state-sharing problems far deadlier than those it promises to solve, in addition to introducing a huge (3ms on a Xeon) penalty that is needless in most cases. Basically forking is absolute hell to get right - even for a tiny 2 kLOC library written almost entirely by one author who wrote his first fork() call somewhere in the region of 20 years ago - and I’m certain it is liable to become a support nightmare.
The most valuable de facto protection afforded by fork - memory safety - is pretty redundant in an almost perfectly memory-safe language like Python; that safety is a big part of why the language is so popular in the first place.
Meanwhile forking is needed anyway for a robust implementation of asynchronous tasks, so implementing it would never have been wasted work; it is just not obvious to me that forking could or should ever become the default mode. It amounts to a very ripe field for impossible-to-spot bugs of much harder classes than those of the simple approach of running everything in a single process, where we only need to care about version conflicts, crap monkey patches, needlessly global variables and memory/resource leaks.
I’m still exploring the solution space for this one. Current thinking is maybe (maybe! this is totally greenfield) something like the list below, with a rough sketch of the resulting dispatch after it:
A built-in list of fixups for ridiculously easy to repair bugs, like the yum_repository example above.
A whitelist of modules known (and manually audited) to be perfectly safe for in-process execution. Common with_items modules like lineinfile easily fit in this class.
A whitelist of in-process safe but nonetheless leaky modules, such as the buggy yum_repository module above, which simply needs its bytecode re-executed (100usec) to paper over the bug. I can’t decide whether to keep this mode or simply merge it with the one above.
Defaulting to forking (3ms, so at most ~333 with_items iterations/sec) for all unknown bespoke (user) modules and built-in modules of dubious quality, with a mitogen_task_isolation variable permitting the mode to be overridden by the user on a per-task basis: “Oh, that one loop is eating 45 minutes? Try it with mitogen_task_isolation=none.”
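Pulling those modes together, a hypothetical sketch of how the per-task dispatch could look - none of the names, defaults or whitelist contents below are final, and the whitelists are invented purely for illustration:

```python
# Modules with a known trivial fixup, run in-process after a bytecode reload.
NEEDS_RELOAD = set(['yum_repository'])
# Manually audited modules that are perfectly safe to run in-process as-is.
SAFE_IN_PROCESS = set(['lineinfile'])

def choose_isolation(module_name, task_vars):
    override = task_vars.get('mitogen_task_isolation')
    if override is not None:
        return override                 # per-task user override always wins
    if module_name in SAFE_IN_PROCESS:
        return 'none'                   # shared process, zero extra cost
    if module_name in NEEDS_RELOAD:
        return 'reload'                 # re-execute bytecode (~100usec)
    return 'fork'                       # unknown/bespoke modules (~3ms)

print(choose_isolation('lineinfile', {}))                          # none
print(choose_isolation('my_custom_module', {}))                    # fork
print(choose_isolation('my_custom_module',
                       {'mitogen_task_isolation': 'none'}))        # none
```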
All the Mitogen-side forking bits are implemented already, and I’m deferring the Ansible-side bits so they can be done at the same time as support for exotic module types, since that whole chunk of code needs a rewrite and there is no point rewriting it twice.
Meanwhile whatever the outcome of this work, be assured you will always have your cake and eat it - this project is all about fixing performance, not regressing it. I hope this entire topic becomes a tiny implementation detail in the coming weeks.
CI
On the testing front I was absolutely overjoyed to discover DebOps by way of a Mitogen bug report. This deserves a whole article on its own, meanwhile it represents what is likely to be a huge piece of the testing puzzle.
Multithreaded connect
A big chunk of this is already implemented, since it was needed to fix an unrelated bug! The default pool has 16 threads in one process, so there will only be a minor performance penalty on the first task to run when the number of targets exceeds 16. Meanwhile, the queue size is adjustable via an environment variable. I’ll tidy this up later.
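To illustrate the shape of this (a generic sketch, not Mitogen’s actual code - EXAMPLE_POOL_SIZE is an invented name, not the real environment variable): a fixed pool of 16 worker threads services connection setups, so the 17th and later targets simply queue briefly behind the others instead of failing.

```python
import os
import threading
import time
try:
    import Queue as queue              # Python 2
except ImportError:
    import queue                       # Python 3

POOL_SIZE = int(os.environ.get('EXAMPLE_POOL_SIZE', '16'))

def connect(target):
    time.sleep(0.01)                   # stand-in for real SSH setup cost

def worker(q):
    while True:
        target = q.get()
        if target is None:             # sentinel: shut the worker down
            return
        connect(target)
        q.task_done()

q = queue.Queue()
workers = [threading.Thread(target=worker, args=(q,)) for _ in range(POOL_SIZE)]
for t in workers:
    t.start()

for host in ['host%d' % i for i in range(20)]:   # 20 targets, 16 workers
    q.put(host)
q.join()                               # targets 17..20 waited for a free worker

for t in workers:
    q.put(None)
for t in workers:
    t.join()
```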
Even though the pieces basically already exist, I’m not yet focused on making multithreaded connect work - including analysing the various bits of performance weirdness that appear when running Mitogen against multiple targets. These definitely exist; I just haven’t made time yet to determine whether they are an Ansible-side scaling issue or a Mitogen-side issue. Stay tuned and don’t worry: multi-target runs are already zippy, and I’m certain any issues found can be addressed.
Security
At least a full day will be dedicated to nothing but coming up with new attack scenarios; meanwhile I’m feeling pretty good about security already. The fabulous Alex Willmer has been busily inventing new cPickle attack scenarios, and some of them are absolutely, fantastically scary! He’s sitting on at least one exciting new attack that represents a no-brainer decider on whether cPickle can be kept or must be replaced.
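As background for anyone wondering why cPickle is scary in the first place, the textbook issue (not one of Alex’s new scenarios) is that a pickle payload can nominate any importable callable to be invoked while it is being loaded:

```python
import os
import pickle

class Payload(object):
    def __reduce__(self):
        # Tells the unpickler: "reconstruct me by calling os.system('id')".
        return (os.system, ('id',))

evil = pickle.dumps(Payload())
# pickle.loads(evil)   # uncommenting this runs 'id' on whoever unpickles it
```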
Serialization aside, I’ve been busy comparing Ansible’s existing security model to what the extension provides today, and have at least identified a unidirectional routing mode as a must-have for delivering the extension. The concern: with Ansible it is possible to have a single playbook safely target two otherwise completely partitioned networks, but today with Mitogen, one network could route messages towards workers in the other network using the controller as a bridge. While this should be harmless (given existing security mitigations), it still introduces a scary capability for an attacker that shouldn’t exist.
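In other words, the rule a unidirectional mode would enforce looks roughly like the toy check below - my own paraphrase of the idea, not Mitogen’s routing code: the controller forwards traffic to and from itself, but refuses to act as a bridge between two of its children.

```python
CONTROLLER_ID = 0                      # illustrative context ID for the controller

def may_forward(src_id, dst_id):
    # Forward only when the controller itself is one endpoint; never relay
    # sideways between two otherwise partitioned networks.
    return src_id == CONTROLLER_ID or dst_id == CONTROLLER_ID

print(may_forward(CONTROLLER_ID, 7))   # True: controller -> target
print(may_forward(7, CONTROLLER_ID))   # True: target -> controller
print(may_forward(7, 12))              # False: target trying to reach a sibling
```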
There are some more security bugs I’m fixing along the way, too.
Deferring Windows support
I really screwed up on planning here - it turns out Ansible’s Windows support does not use Python on the target whatsoever, and so implementing it in Mitogen would mean increasing the installation requirements for Windows targets. That’s stupid: it violates Ansible’s zero-install design and was explicitly a non-goal from the get-go.
Meanwhile WinRM has extremely poor options for bidirectional IO, so any viable Mitogen support for Windows will likely entail introducing a, say, SSL-encrypted reverse connection from the target machine in order to get efficient IO.
I will shortly be polling everyone who has pledged towards the project, and if nobody speaks up to save Windows, it’s being pushed to the back of the queue.
A big, big thanks, once again!
It goes without saying, but none of this work has been a lone effort: planning, article review, funding, testing, and an endless stream of suggestions, questions and recommendations have come from so many people. Thanks to everyone, whether you contributed a single $1 or a single typo bug report.
Summary
Super busy, but also super on target! Until next time...