Mitogen for Ansible's original plan described facets of a scheme centered on features made possible by a rigorous, single cohesive distributed program model, but of those facets, it quickly became clear that most users are really only interested in the big one: a much faster Ansible.
While I'd prefer feature work, this priority is fine: better performance usually entails enhancements that benefit the overall scheme, and improving people's lives in this manner is highly rewarding, so incentives remain aligned. It is impossible not to find renewed energy when faced with comments like this:
Enabling the mitogen plugin in ansible feels like switching from floppy to SSD
https://t.co/nCshkioX9h
Although feedback on the project has been very positive, the existing solution is sometimes not enough. Limitations in the extension and Ansible really bite, most often manifesting when running against many targets. In these scenarios, it is heartbreaking to see the work fail to help those who could benefit from it most, and that's what I'd like to talk about.
Controller-side Performance
Some time ago I began refactoring Ansible's linear strategy, aiming to get it to a point where controller-side enhancements could exist without adding more spaghetti, while becoming familiar with the requirements for later features. To recap, the strategy plugin is responsible for almost every post-parsing task, including worker management. It is in many ways the beating heart of every Ansible run.
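For readers who have never looked inside Ansible, a strategy plugin is roughly the shape sketched below. The names follow the Ansible 2.x internals (StrategyBase, _queue_task, _wait_on_pending_results), but the interface shifts between releases and the _next_batch helper is a stand-in of mine, so treat this as an illustration rather than working upstream code:

```python
# Rough shape of an Ansible strategy plugin (Ansible 2.x-era names; the
# interface changes between releases, and _next_batch below is a stand-in).
from ansible.plugins.strategy import StrategyBase


class StrategyModule(StrategyBase):
    """Drives the post-parsing run: schedules tasks on hosts, hands work to
    the worker process pool, and feeds results back into run state."""

    def run(self, iterator, play_context):
        while not self._tqm._terminated:
            work = self._next_batch(iterator)   # e.g. linear's lock-step walk
            if not work:
                break
            for host, task in work:
                task_vars = self._variable_manager.get_vars(
                    play=iterator._play, host=host, task=task)
                # Dispatch to a WorkerProcess; results arrive asynchronously.
                self._queue_task(host, task, task_vars, play_context)
            # Block until outstanding results have been processed.
            self._wait_on_pending_results(iterator)
        # Let the base class compute the final run result.
        return super(StrategyModule, self).run(iterator, play_context)

    def _next_batch(self, iterator):
        # Placeholder for the real scheduling logic (see linear.py upstream).
        return []
```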
After some months and one particularly enlightening conversation, that work was resumed, eventually subsuming all of the remaining strategy support and result processing code, and forming one huge refactor of a big chunk of upstream that has been gathering dust for almost a month.
The result exists today and is truly wonderful. It integrates Mitogen into the heart of Ansible without baking it in, introduces a carefully designed process model with strong persistence properties that eliminates most bottlenecks endured by both the extension and vanilla Ansible, and provides an architectural basis for the next planned iteration of scalability work, Windows compatibility, some features already mentioned, and quite a few that have been kept quiet.
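To give a flavour of what "without baking it in" means in practice, the stable extension already follows a pattern along these lines: subclass the built-in strategy and redirect the connection layer per task, so no upstream file is modified. The sketch below illustrates that general technique only; the class and attribute usage are simplified stand-ins, not the actual extension code:

```python
# Hedged sketch of the "wrap, don't fork" integration style: reuse upstream's
# linear strategy and swap the connection type before work reaches a worker.
from ansible.plugins.strategy.linear import StrategyModule as LinearStrategy


class StrategyModule(LinearStrategy):
    def _queue_task(self, host, task, task_vars, play_context):
        # Redirect plain SSH connections to a Mitogen-backed equivalent;
        # everything else proceeds exactly as upstream wrote it.
        if play_context.connection == 'ssh':
            play_context.connection = 'mitogen_ssh'
        return super(StrategyModule, self)._queue_task(
            host, task, task_vars, play_context)
```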
With the new strategy it is possible to almost perfectly saturate an 8 vCPU machine given 100 targets, with little per-target efficiency lost compared to running against a single target. As for the single-target case, simple loops against localhost are up to 4x faster than with the current stable extension.
While there are at least 2 obvious additional enhancements possible with this work, development has reached a natural break in order to allow stabilizing one piece of the puzzle at a time. Once that is done, it is clear exactly where to pick things up next.
Deep Cuts
There's just a small hitch: this work goes deep, entailing changes that, while so far possible to ship as monkey-patches, are highly version-specific and unlikely to remain monkey-patchable as the branch receives real-world usage. There must be a mechanism to ship unknown future patches to upstream code.
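For context, "monkey-patchable" here means something like the sketch below: intercepting an upstream method at import time, guarded by a version check. The patched method and version number are hypothetical examples of mine, not the actual patches shipped; the point is how brittle this becomes as the patched surface grows.

```python
# Illustrative only: the flavour of version-specific monkey-patch involved.
# The patched method and version check are hypothetical, not real patches.
import ansible
from ansible.plugins.strategy import StrategyBase

if ansible.__version__.startswith('2.6.'):
    _orig_queue_task = StrategyBase._queue_task

    def _queue_task(self, host, task, task_vars, play_context):
        # Adjust behaviour here, then delegate to the original implementation.
        return _orig_queue_task(self, host, task, task_vars, play_context)

    StrategyBase._queue_task = _queue_task
```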
It was hoped this could land after Ansible 2.7, benefitting from related changes planned upstream, but those appear to have been delayed or abandoned, leaving a situation where the improvements cannot ship for at least another 4-6 months, and then only assuming the related changes finally arrive in Ansible 2.8.
To the right is a rough approximation of the components involved in executing a playbook. Those modified or replaced by the stable extension are shown in green, those replaced by the branch-in-waiting in yellow, and finally, in orange, the components affected by planned features and optimizations.
Although there are tens of thousands of lines of surrounding code, as should hopefully be clear, the number of untouched major components involved in a run has been dwindling fast. Put simply, the existing mechanism for delivering improvements is reaching its limit.
The F Word
Any seasoned developer, especially one familiar with the size of the Ansible code base, will hopefully understand the predicament. There would be no problem delivering improvements today if an unsupported one-off code dump were all anyone wanted, but that is never the case.
The problem lies in entering an unsustainable permanent marriage with a large project, not forgetting that this outcome was an explicit non-goal from the start. At the same time, significant trust has been garnered over the months to deliver these kinds of improvements, and abandoning one of the best yet would seem foolish.
Something of a many-variable optimization process has recently come to an end, and I've found a solution that I am comfortable with. While a release needs more time and is still not definite, it seemed worth documenting at least some of the reasoning behind it before it arrives.
Even though this outcome was undesirable, and although the solution in mind is not without constraints, it is still a cloud with many silver linings. For instance, new-user configuration steps can be reduced to almost zero, core features can be added with minimal friction, and many creative limitations are lifted.
What About The Extension?
The planned structure keeps the extension front-and-centre, so regardless of outcome it will continue to receive the majority of feature work and maintenance. It is definitely not going away.
With a third stable release looming, it's probably high time for a quick update. Many bugs have been squashed since July, with stable work recently centred around problems with Ansible 2.6. This involved some changes to temporary file handling, and in the process, the discovery of a huge missed optimization.
v0.2.3 will need only 2 round trips for each copy or template action: in terms of a 250 ms transcontinental link, that is 10 seconds to copy 20 files, versus 30 seconds previously, or around 2 minutes with vanilla Ansible in its best configuration. This work is delayed somewhat while a new RPC chaining mechanism is added, to better support similar future changes and the identical situations likely to appear in similar tools.
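For anyone wanting to check those figures, the arithmetic is simply files × round trips × link latency. The 2 round trips per file for v0.2.3 is stated above; the 6 and 24 round trips per file below are back-calculated from the quoted 30 second and roughly 2 minute figures, not measured values:

```python
# Back-of-the-envelope restatement of the timings above (not benchmark output).
RTT = 0.25   # seconds: one full round trip on a 250 ms transcontinental link
FILES = 20

for label, round_trips_per_file in [
    ('v0.2.3', 2),             # 2 round trips per copy/template
    ('previous release', 6),   # implied by the 30 second figure
    ('vanilla Ansible', 24),   # implied by the ~2 minute figure
]:
    total = FILES * round_trips_per_file * RTT
    print('%-18s %6.1f seconds' % (label, total))
```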
Until next time!