Command Line Tool As An Integration Point
Here is a software design pattern that you won’t find in Martin Fowler’s design pattern list. I don’t think this pattern is new or unique, but I have not seen it described or used anywhere else. At the right time and place it can be very handy, albeit a little bit triggering to software purists. I have needed this pattern three times in the past 10 years, so I thought I’d share a couple of examples. For context, all my projects were Python projects.
Imagine yourself in any of these situations:
- You need to write some code, but the language and library support for what you want to do is lacking, e.g. the most popular library is outdated.
- You need to make a tool that is written in a different programming language work with your infrastructure, but one feature the tool relies on is missing.
- You just cannot be bothered to shave a yak, i.e. write an efficient solution from scratch using the libraries available.
The approach I’ve seen others take is to try and find a compiled binary library
that has the required features, and integrate via ctypes or similar. Or dig
into a foreign codebase and implement the feature. Or spend days understanding
the tooling to implement the solution. These are good approaches, but sometimes
I felt like I was solving a problem that had already been solved, or spending
time doing what others could probably do much better than I could.
Luckily for me, in each of the three situations above, a command line tool already existed that did what I wanted. So without much glamour, I wrote Python wrappers around those tools. One of them is open-source (not useful anymore, though you can admire the gory code); the other two were built in a private setting and cannot be shared, but I hope to give a flavour of how they worked.
poor-smime-sign
This one hails from 2015, when Apple released their Apple Wallet and I
worked at a company that was selling event tickets. Perfect use-case for Wallet
Passes. Apple designed the pass security very well - it required cryptographic
signing. Unfortunately for us, Python 3 cryptography support was lacking at
that time. As we all know, implementing your own crypto is a bad idea. Thus, I
found myself cornered. Luckily for me, there was a way to sign the passes using
the openssl command line tool. Lo and behold, I had a prototype, open-sourced
as poor-smime-sign, that did the signing, and we shipped the feature soon
after. The name was inspired by another open-source tool whose name I sadly
cannot recall.
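For a flavour of what such a wrapper can look like, here is a minimal sketch. It is not the original poor-smime-sign code. It assumes the usual Wallet pass layout, where a manifest file is signed with a pass certificate, its private key and an intermediate certificate; the function names and file names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_sign_command(manifest, signature, cert, key, intermediate_cert):
    """Return the openssl argv that signs `manifest` into a DER signature."""
    return [
        "openssl", "smime", "-sign", "-binary",
        "-signer", str(cert),                  # signing certificate (PEM)
        "-inkey", str(key),                    # matching private key (PEM)
        "-certfile", str(intermediate_cert),   # extra certificate to bundle
        "-in", str(manifest),
        "-out", str(signature),
        "-outform", "DER",
    ]


def sign_manifest(manifest, signature, cert, key, intermediate_cert):
    """Run openssl; raises subprocess.CalledProcessError if signing fails."""
    cmd = build_sign_command(manifest, signature, cert, key, intermediate_cert)
    subprocess.run(cmd, check=True, capture_output=True)
    return Path(signature)
```

Keeping the command construction separate from the subprocess call makes the wrapper testable without openssl installed, and `check=True` turns a non-zero exit code into a Python exception instead of a silently missing signature.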
The tool served its purpose well, but eventually we moved to a different library that had the required functionality. I always thought this was an amusing hack that I’d never have to do again. Until…
H2oCluster
Fast forward to 2019: I was working at a company that was using
H2O.ai for automated machine learning. The software was
written in Java and had a thin Python wrapper around its REST API. You could
launch a machine, start the software, throw data at it, tune some knobs, drink a
cup of coffee, and get multiple different models to evaluate. The fun part was
that our datasets were large, like 800GB uncompressed large, and the software
was keen on making a copy of the data after each operation on the dataframe. Add
a new feature column? Copy. Drop a column? Copy. Split the data? Copy. Maybe it
did try to make it efficient, but it wasn’t doing great. Incidentally, I learned
how to read and analyse Java garbage collection logs and I ran the largest EC2
instances available at the time (x1e.32xlarge if I recall), and we had to get
approved for that. But I digress.
H2O.ai had a nice feature, called cluster mode. In theory, you would spin a few
machines up, launch the software, and say “you’re now part of the same cluster
named foo.” The machines would then use multicast networking to discover each
other, and connect. From a user’s point of view, it doesn’t matter whether it’s
a single instance or 10, it looks the same. But from the resource point of view
it could scale to a lot. Sounds great, until I learned that multicast networking
is not supported in AWS VPCs. Luckily for me, the software had a command line
option to pass a file containing IPs of all the other machines in the cluster at
start-up time. You can see where this is going.
I wrote a Python script that did the following: when launched, it would use the EC2 metadata service to get its own instance id, query the EC2 API to determine which autoscaling group it was in, get the list of IPs of all the other instances in the group once they were all up and running, write it to a file and pass it to the H2O.ai executable on the command line.
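The steps above can be sketched roughly as follows. This is a reconstruction under assumptions, not the original script: it takes boto3 clients as arguments, assumes the instance id has already been fetched from the metadata service, skips the waiting-for-all-instances part, and assumes H2O is started with its `-name` and `-flatfile` options on the default port 54321. All function names are hypothetical.

```python
import subprocess
from pathlib import Path

H2O_PORT = 54321  # assumed: H2O's default port


def discover_peer_ips(instance_id, autoscaling, ec2):
    """Return sorted private IPs of all instances in our autoscaling group.

    `autoscaling` and `ec2` are boto3 clients, e.g. boto3.client("ec2").
    """
    me = autoscaling.describe_auto_scaling_instances(InstanceIds=[instance_id])
    group_name = me["AutoScalingInstances"][0]["AutoScalingGroupName"]
    groups = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    ids = [i["InstanceId"] for i in groups["AutoScalingGroups"][0]["Instances"]]
    reservations = ec2.describe_instances(InstanceIds=ids)["Reservations"]
    return sorted(
        inst["PrivateIpAddress"]
        for res in reservations
        for inst in res["Instances"]
    )


def write_flatfile(ips, path, port=H2O_PORT):
    """Write one ip:port entry per line for H2O to read at start-up."""
    path = Path(path)
    path.write_text("".join(f"{ip}:{port}\n" for ip in ips))
    return path


def build_h2o_command(jar, cluster_name, flatfile):
    """Return the argv that starts H2O with a static list of peers."""
    return [
        "java", "-jar", str(jar),
        "-name", cluster_name,
        "-flatfile", str(flatfile),
    ]


def launch(jar, cluster_name, flatfile):
    """Start H2O in the foreground; raises if the process exits non-zero."""
    subprocess.run(build_h2o_command(jar, cluster_name, flatfile), check=True)
```

Passing the clients in rather than creating them inside the function keeps the discovery logic easy to exercise with fakes, which matters when the real thing needs a running autoscaling group.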
It. Worked. Like. A. Charm. I wrote an internal Python library that allowed a data scientist to launch these clusters ad-hoc with a single Python function. It was beautiful. Eventually we retired it in favour of our own tooling, but I was very proud, and again amused that this approach had worked.
Data Warehousing
More recently, I was building a data warehouse on Postgres. The data size is
manageable; however, there are certain constraints about what can and cannot be
moved to the warehouse. We depend heavily on Django, and avoid dealing with
low-level SQL. Hence, I just did not feel like digging into psycopg2 internals
to write my own “replication” tool. It also seemed like a solved problem. There
are command line tools that can do the job sufficiently well for our scale and
needs.
I prototyped data egress and ingress using psql to test the speed. Then I wrote
a Python script that takes a Django queryset and generates SELECT and CREATE
TABLE SQL queries. The script then executes a
psql -c "COPY (SELECT...) TO STDOUT" command and streams the output into a
file. The CREATE TABLE is executed inside the warehouse, followed by
psql -c "COPY ... FROM STDIN". The whole thing is smart enough to load into a
temporary table first, and then swap it with the real table. It takes minutes
to move tens of gigabytes of data. I’m sure it won’t scale infinitely, but at
the same time - it works and I am sure it would’ve taken me longer to write
something comparable from scratch.
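A stripped-down sketch of that flow, under assumptions: the SELECT and CREATE TABLE statements are already generated (the Django queryset part is left out), the "temporary" table is an ordinary staging table so it survives across separate psql invocations, and table names come from trusted code. The function names and the staging-table convention are hypothetical.

```python
import subprocess


def psql_command(dsn, sql):
    """Return the argv for running one SQL string through psql."""
    # -X skips ~/.psqlrc, ON_ERROR_STOP makes psql exit non-zero on SQL errors.
    return ["psql", "-X", "-v", "ON_ERROR_STOP=1", "-d", dsn, "-c", sql]


def copy_out_sql(select_sql):
    """Wrap a SELECT in COPY ... TO STDOUT (trailing semicolon removed)."""
    return f"COPY ({select_sql.strip().rstrip(';')}) TO STDOUT"


def swap_sql(table, staging):
    """Replace the real table with the freshly loaded staging table."""
    return (
        f"BEGIN; DROP TABLE IF EXISTS {table}; "
        f"ALTER TABLE {staging} RENAME TO {table}; COMMIT;"
    )


def transfer(source_dsn, warehouse_dsn, select_sql, create_staging_sql,
             table, staging, dump_path):
    """Move one query result from the source database into the warehouse."""
    # 1. Egress: stream the query result into a local file.
    with open(dump_path, "wb") as out:
        subprocess.run(psql_command(source_dsn, copy_out_sql(select_sql)),
                       stdout=out, check=True)
    # 2. Create the staging table inside the warehouse.
    subprocess.run(psql_command(warehouse_dsn, create_staging_sql), check=True)
    # 3. Ingress: feed the file to COPY ... FROM STDIN.
    with open(dump_path, "rb") as dump:
        subprocess.run(psql_command(warehouse_dsn, f"COPY {staging} FROM STDIN"),
                       stdin=dump, check=True)
    # 4. Swap the staging table in for the real one.
    subprocess.run(psql_command(warehouse_dsn, swap_sql(table, staging)),
                   check=True)
```

Because every step is a separate psql process with `check=True`, a failure at any point stops the run before the swap, so the real table is only replaced after a complete load.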
Ironically, as solved problems go, I found django-postgres-copy a week later. :shrug:
Final Words
Not sure what to say here, except don’t let anyone tell you that “it’s not elegant” or “not the right way to do it.” If it works, and is manageable - then it works, and that’s all that matters.