After a long winter break from recreational programming, over the past days I
finally built up steam and broke a chunk of new ground on Mitogen, this time
growing its puny module forwarder into a bona fide beast, ready to handle
almost any network condition and user code thrown at it.
Mitogen is a library for executing parts of a Python program in a remote
context, primarily over sudo and
SSH connections, and establishing bidirectional communication with those
parts. Targeting infrastructure applications, it requires no upfront
configuration of target machines beyond an SSH daemon and a Python 2.x
interpreter, which ship by default on almost every Linux machine found on any
network. The target need not possess a writeable filesystem: code is loaded
dynamically on demand, and execution occurs entirely from RAM.
How Import Works
To implement dynamic loading, child Python processes (“contexts”) have a
PEP-302 import hook installed
that causes attempts to import modules unavailable locally to automatically be
served over the network connection to the parent process. For example, if the
requests package is missing on the host k3, it will automatically be copied
and imported in RAM, without requiring upfront configuration or causing writes
to the remote filesystem.
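In modern Python terms, such a hook can be sketched with importlib (Mitogen itself targets the older PEP-302 API for Python 2.x compatibility; the NetworkFinder class and REMOTE_SOURCES dict below are hypothetical stand-ins, with a plain dict playing the role of the network connection):

```python
import importlib.abc
import importlib.util
import sys

# Stand-in for the network: in Mitogen, module source is served over
# the connection to the parent process.
REMOTE_SOURCES = {'fakepkg': 'ANSWER = 42\n'}

class NetworkFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Serve modules that are unavailable locally from a (simulated) parent."""

    def find_spec(self, fullname, path=None, target=None):
        if fullname in REMOTE_SOURCES:
            return importlib.util.spec_from_loader(fullname, self)
        return None  # not ours: defer to the normal import machinery

    def exec_module(self, module):
        # Compile and execute entirely from RAM: no filesystem writes.
        code = compile(REMOTE_SOURCES[module.__name__],
                       '<remote:%s>' % module.__name__, 'exec')
        exec(code, module.__dict__)

sys.meta_path.insert(0, NetworkFinder())

import fakepkg  # found by NetworkFinder, never touches disk
print(fakepkg.ANSWER)  # 42
```

Returning None from find_spec for unknown names keeps the hook transparent: every ordinary import falls through to the regular machinery untouched.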
So far, so good. Just one hitch
While the loader has served well over the library’s prototypical life (which,
in real time, is approaching 12 years!), it has always placed severe limits on
the structure of the loaded code, as each additional source file introduced
one network round-trip to serve it.
Given a relatively small dependency such as Kenneth Reitz' popular
Requests package, comprising 17
submodules, this means 17 additional network round-trips. While that may not
mean much over a typical local area network segment where roundtrips are
measured in microseconds, it quickly multiplies over even modest wide-area
networks, where infrastructure tooling is commonly deployed.
For a library like Requests, 17 round-trips amounts to 340ms of latency over a
reasonably local 20ms link, which is comfortably within the realm of the
acceptable. Over common radio and international links of 200ms or more,
however, this already adds at least 3.4 seconds to the startup cost of any
Mitogen program: time wasted doing nothing but waiting on the network.
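The arithmetic is easy to check; a throwaway helper (hypothetical, not part of Mitogen) makes the scaling obvious:

```python
def startup_cost(round_trips, rtt_ms):
    """Seconds spent waiting if each module costs one serial round-trip."""
    return round_trips * rtt_ms / 1000.0

print(startup_cost(17, 20))   # 0.34 -- Requests over a reasonably local 20ms link
print(startup_cost(17, 200))  # 3.4  -- the same import over a 200ms link
```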
Sadly, Requests is hardly even the biggest dependency Mitogen can expect to
encounter. For testing I chose django.db.models as a representative baseline:
heavily integrated with all of Django, it transitively imports over 160
modules across numerous subpackages. That means on an international link, over
30 seconds of startup latency spent on one import alone.
It is worth noting that Django is not something I’d expect to see in a typical
Mitogen program; it’s simply an extraordinary worst-case target worth
hitting. If Mitogen can handle django.db.models, it should cope with pretty
much anything.
Combining evils, over an admittedly better-than-average Nepali mobile data
network and an international link to my IRC box and mail server in Paris,
django.db.models takes almost 60 seconds to load with the old design.
In the real world, this one-file-per-round-trip characteristic means the
current approach sucks almost as much as the tools Mitogen hopes to displace,
which calls into doubt my goal of implementing an Ansible-trumping Ansible
connection plug-in. Clearly something must give!
Over the years I discarded many approaches for handling this latency problem:

1. Having the user explicitly configure a module list to deliver upfront
   to new contexts, which sucks and is plainly unmaintainable.

2. Installing a PEP-302 hook in the master in order to observe the import
   graph, which would be technically exciting, but likely to suck horribly due
   to fragility and inevitable interference with real PEP-302 hooks.

3. Observing the import graph caused by a function call in a single context,
   then using it to preload modules in additional contexts. This seems
   workable, except the benefit would only be felt by multiple-child Mitogen
   programs. Single-child programs would continue to pay the latency tax.

4. Variants of 2 and 3, except caching the result as intermediate state in the
   master’s filesystem. Ignoring the fact that persistent intermediate state is
   always evil (a topic for later!), that would require weird and imperfect
   invalidation rules, which means performance would suck during development
   and prototyping, and bugs are possible where state gets silently wedged
   and previously working programs inexplicably slow down.
Finally last year I settled on using static analysis, restricting preloading
at package boundaries. When a dependency is detected in a package external to
the one being requested, it is not preloaded until the child has demonstrated,
by requesting the top-level package module from its parent, that it lacks the
package and all of the submodules it contains.
This seems like a good rule: preloading can occur aggressively within a
package, but must otherwise wait for a child to signal a package as missing,
rather than preemptively wasting time and bandwidth delivering code the child
may never need.
As a final safeguard, preloading is restricted to only modules the master
itself loaded. It is not sufficient for an import statement to exist:
surrounding conditional logic must have caused the module to be loaded by the
master. In this manner the semantics of platform-specific, version-specific
and lazy imports are roughly preserved.
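A minimal sketch of that safeguard, assuming a hypothetical preloadable() helper: the master approximates "the import actually happened" by consulting sys.modules, which only ever contains modules whose import really executed:

```python
import sys

def preloadable(names):
    """Keep only dependencies the master itself actually imported.

    An import statement alone is not enough: surrounding conditional
    logic must have caused the module to load, which is approximated
    here by checking sys.modules. (Hypothetical helper, not Mitogen's
    real API.)
    """
    return [name for name in names if name in sys.modules]

# 'json' is really loaded below; 'no_such_mod' never was, so it is skipped.
import json
print(preloadable(['json', 'no_such_mod']))  # ['json']
```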
Syntax tree hell
Quite predictably, after attempting to approach the problem with regexes, I
threw my hands up on realizing a single regex could not handle every possible
form of import statement:

    import a as b
    from a import b
    from a import b as c
    from a import (b, c, d)
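All of those forms are handled uniformly by walking the syntax tree rather than pattern-matching text. A sketch using the modern ast module (the scan_imports helper is illustrative, not Mitogen's actual scanner):

```python
import ast

def scan_imports(source):
    """Return the sorted top-level module names imported by `source`."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # Covers "import a" and "import a as b".
            names.update(alias.name.split('.')[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Covers "from a import b", "from a import b as c" and
            # "from a import (b, c, d)"; level > 0 means a relative import.
            if node.module and node.level == 0:
                names.add(node.module.split('.')[0])
    return sorted(names)

print(scan_imports(
    'import a as b\n'
    'from a import b\n'
    'from x import (b, c, d)\n'
))  # ['a', 'x']
```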
I gleefully thought I’d finally found a use for the ast and compiler modules:
the obvious alternative to a rat’s nest of multiple regexes. Not quite.
You see, across Python releases the grammar has changed, and in lock-step so
have the representations exported by the compiler and ast modules.
Adding insult to injury: neither module is supported through every interesting
Python version. I have seen Python 2.4 deployed commercially as recently as
summer 2016, and therefore consider it mandatory for the kind of library I
want on my toolbelt. To support antique and chic Python alike, it was
necessary to implement both approaches and select one at runtime. Many might
see this as an opportunity to drop 2.4, but “just upgrade lol” is never a
good answer while maintaining long shelf-life systems, and should never be a
barrier to applying a trusted Swiss Army knife.
After some busy days last September, I had a working scanner built around
syntax trees, except for a tiny problem: it was ridiculously slow. Parsing
the 8KiB mitogen.core module took 12ms on my laptop, which multiplied up is
over a second of CPU burnt scanning dependencies for a package like Django. If
memory serves, reality was closer to 3 seconds: far exceeding the latency
saved while talking to a machine on a LAN.
Sometimes hacking bytecode makes perfect sense
I couldn’t stop groaning the day I abandoned ASTs. As is often true when
following software industry best practice, we are left holding a decomposing
trout that, while technically fulfilling its role, stinks horribly, costs all
involved a fortune to support and causes pains worse than those it was
intended to relieve. Still hoping to avoid regexes, I went digging for
precedent elsewhere in tools dealing with the same problem.
That’s when I discovered the strange and unloved
modulefinder buried in
the standard library, a forgotten relic from a bygone era, seductively
deposited there as a belated Christmas gift to all, on a gloomy New Year’s
Eve 2002, by Guido’s own hand.
Diving in, I was shocked and mesmerized to find dependencies synthesized by
recompiling each module and extracting
opcodes from the compiled bytecode. Reimplementing a variant, I was overjoyed
to discover django.db.models’ transitive dependencies enumerated in under
350ms on my laptop. A workable solution!
The solution has a further crazy result: IMPORT_FROM has barely changed
since the Python 2.4 days, right through to Python 3.x. The same approach
works everywhere, including PyPy, which uses the same bytecode format, making
this more portable than the ast and compiler modules!
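A condensed version of the idea, using dis to pull IMPORT_NAME opcodes out of freshly compiled bytecode. Note that modulefinder's real scanner also recurses into nested code objects to catch imports inside functions; this sketch (with a hypothetical scan_imports_bytecode helper) only scans the top level:

```python
import dis

def scan_imports_bytecode(source):
    """Enumerate imported top-level names by compiling the source and
    scanning its bytecode, in the spirit of stdlib modulefinder."""
    code = compile(source, '<scan>', 'exec')
    names = set()
    for instr in dis.get_instructions(code):
        # IMPORT_NAME carries the dotted module name as its argument.
        if instr.opname == 'IMPORT_NAME':
            names.add(instr.argval.split('.')[0])
    return sorted(names)

print(scan_imports_bytecode('import os, sys\nfrom json import dumps\n'))
# ['json', 'os', 'sys']
```

Because only the compiler runs, no imported module's code is ever executed during the scan, and compilation is far cheaper than building a full syntax tree.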
Coping with concurrency
Now that a mechanism exists to enumerate dependencies, we need a mode of
delivery. The approach used is simplistic and, as seen later, will likely
require revision.
On receiving a GET_MODULE message from a child, a parent (don’t forget,
Mitogen operates recursively!) first tries to satisfy the request from its own
cache before forwarding it upwards towards the master. The master sends
LOAD_MODULE messages for all dependencies known to be missing from the child
before sending a final LOAD_MODULE containing the module that was actually
requested.
Since contexts always cache unsolicited LOAD_MODULE messages from upstream,
by the time the message arrives for the requested module, many dependencies
should be in RAM and no further network roundtrips requesting them are required.
Meanwhile for each stream connected to any parent, the set of module names
ever delivered on that stream is recorded. Each parent is allowed to ignore
any GET_MODULE for which a corresponding LOAD_MODULE has already been sent,
preventing races between in-flight requests from causing the same module to
be sent twice.
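The bookkeeping involved is tiny; a toy sketch (Stream here is a hypothetical stand-in for Mitogen's real stream class, with a list playing the part of the wire):

```python
class Stream:
    """Minimal sketch of per-stream LOAD_MODULE deduplication."""

    def __init__(self):
        self.sent_modules = set()  # names ever delivered on this stream
        self.transmitted = []      # stand-in for the network link

    def send_load_module(self, fullname, source):
        # A parent may ignore any GET_MODULE for which a LOAD_MODULE has
        # already been sent, so no module ever crosses the link twice.
        if fullname in self.sent_modules:
            return False
        self.sent_modules.add(fullname)
        self.transmitted.append((fullname, source))
        return True

stream = Stream()
stream.send_load_module('requests', '...')
stream.send_load_module('requests', '...')  # duplicate: ignored
print(len(stream.transmitted))  # 1
```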
This places the onus on downstream contexts to ensure the single LOAD_MODULE
message received for each distinct module always reaches every interested
party. In short, GET_MODULE messages must be deduplicated and synchronized
not only for any arriving from a context’s children, but also for those
generated by the context’s own import machinery.
And finally the result. For my test script, the total number of round-trips
dropped from 166 to 13, one of which is for the script itself, and 3 of which
are negative requests for extension modules that cannot be transferred. That
leaves, bugs aside, 9 round-trips to transfer the most obscene dependency I
could think of.
One more look at the library’s network profile. Over the same connection as
previously, the situation has improved immensely:
Not only is performance up, but the number of frames transmitted has dropped
by 42%. That’s 42% fewer chances of a connection hang due to crappy WiFi!
One final detail is visible: around the 10 second mark, a tall column
of frames is sent with progressively increasing size, almost in the same
instant. This is not some bug: it is Path MTU
Discovery (PMTUD) in
action. PMTUD is a mechanism by which IP subprotocols can learn the maximum
frame size tolerated by the path between communicating peers, which in turn
maximizes link efficiency by minimizing bandwidth wasted on headers. The size
is ramped up until either loss occurs or an intermediary signals an error via
ICMP.
Just like the network path, PMTUD is dynamic and must restart on any signal
indicating network conditions have changed. Comparing this graph with the
previous, we see one final improvement as a result of providing the network
layer enough data to do its job: PMTUD appears to restart much less
frequently, and the stream is pegged at the true path MTU for much longer.
Aside from simple fixes to reduce wasted round-trips for extension modules
that can’t be imported, and optional imports of top-level packages that don’t
exist on the master, there are two major niggles remaining in how import
works.
The first is an irritating source of latency present in deep trees: currently
it is impossible for intermediary nodes satisfying GET_MODULE requests for
children to streamily send preloaded modules towards a child until the final
LOAD_MODULE arrives at the intermediary for the module actually requested by
the child. That means preloading is artificially serialized at each layer in
the tree, when a better design would allow it to progress concurrently with
the LOAD_MODULE messages still in-flight from the master.
This will present itself when doing multi-machine hops where links between the
machines are slow or suffer high latency. It will also be important to fix
before handling hundreds to thousands of children, such as should become
practical once asynchronous connect() is implemented.
There are various approaches to tweaking the design so that concurrency is
restored, but I would like to let the paint dry a little on the new
implementation before destabilizing it again.
The second major issue is almost certainly a bug waiting to be discovered, but
I’m out of energy to attack it right now. It relates to situations where many
children have different functions invoked in them, drawn from a complex set
of overlapping packages. In such cases, it is possible for a LOAD_MODULE
triggered by an unrelated GET_MODULE to prematurely deliver the final module
of another import before all its requisite modules have been preloaded into
the child.
To fix that, the library must ensure the tree of dependencies for every module
request is sent downstream depth-first, i.e. it is never possible for any
module to appear in a LOAD_MODULE before all of its dependencies have appeared
first.
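That invariant is exactly a topological ordering, obtainable with a depth-first post-order walk over the dependency graph. A sketch under the assumption the graph is acyclic (the load_order helper and the deps dict are illustrative, not Mitogen's code):

```python
def load_order(module, deps):
    """Order modules so every dependency precedes its dependents."""
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in deps.get(name, ()):  # dependencies first: depth-first
            visit(dep)
        order.append(name)              # then the module itself

    visit(module)
    return order

deps = {
    'django.db.models': ['django.db', 'django.core.exceptions'],
    'django.db': ['django.conf'],
}
print(load_order('django.db.models', deps))
# ['django.conf', 'django.db', 'django.core.exceptions', 'django.db.models']
```

Emitting LOAD_MODULE messages in this order guarantees that by the time the final module arrives, everything it imports is already cached in the child.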
Finally there are latency sources buried elsewhere in the library, including
at least 2 needless roundtrips during connection setup. Fighting latency is an
endless war, but with module loading working efficiently, the most important
battle is over.