Low-code automation tradeoffs

Low-code tools and the tradeoffs of automation approaches

It is no secret that I am not a fan of YAML and its use and misuse in everything from Kubernetes to Salt to Ansible and beyond. Simply having a sour attitude about something is no way to make progress, so I sat down with a good friend from Disney Streaming to discuss what the real root of the problem is. Is YAML a bad thing? No! Certainly not. It is a tool like every other tool in the technologist's bag of tricks. If you use a screwdriver as a hammer, "you will have a bad time" ™️.

I struggle with rabbit holes and failing to see the forest for the trees at times. There are so many interrelated topics here that I could spend an entire blog post on each of them. Bear with me; I will do my best to stay focused.

TL;DR: Russ White frequently repeats, "If you haven't found the tradeoffs, you haven't looked hard enough." This is my attempt at evaluating tradeoffs in automation tools and approaches.

Software hasn't eaten the world, long live the tools

Basic logic skills are essential for every technologist, but some people gravitate to certain jobs more than others. There has been plenty of hype and buzz, particularly in the network domain, about the "fact" that all the engineers need to become programmers. This, of course, has turned out to be false. Why is that? Numerous amazing tools have become widely available over the last several years like Ansible, Salt, and others. These tools have allowed engineers with few programming resources (or time!) to scale their roles in ways that were unthinkable ten years ago. Kubernetes is a widespread application deployment platform that has given software developers similar super-powers. One common factor between several of these tools is that they abstract complex operations behind a simple declaration format. Here is a snippet straight from the Ansible documentation:

- name: Install apache httpd  (state=present is optional)
  ansible.builtin.apt:
    name: apache2
    state: present

If you need to perform a series of tasks on a long list of devices, the elegance and simplicity of this approach are difficult to argue with. Engineers can add logic to those blocks to perform nested iteration or to skip the task based on complex boolean expressions. I really can't overstate the power of this approach. In the previous block, Ansible will perform the following actions on each endpoint:

  • SSH to the remote
  • Check the list of installed packages
    • If apache2 is present, do nothing and signal to the user that no change was made.
    • If it is not present, install the package and signal to the user success or failure.
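The check-then-act flow in the list above is what makes the task idempotent. Here is a minimal sketch of that logic in Python, with an in-memory set standing in for the remote package database. The function and variable names are hypothetical illustrations, not Ansible's implementation:

```python
def ensure_present(installed: set[str], package: str) -> bool:
    """Make sure `package` is in `installed`; return True only if a change was made."""
    if package in installed:
        # Already in the desired state: do nothing and report "no change".
        return False
    # Not present: "install" it and report the change.
    installed.add(package)
    return True


# Stand-in for the package list on one remote host (assumed data).
host_packages = {"openssh-server", "curl"}

first_run = ensure_present(host_packages, "apache2")   # installs the package
second_run = ensure_present(host_packages, "apache2")  # nothing left to do
```

Running the same task twice changes the host only once, which is the property that makes it safe to re-run a playbook against a long list of devices.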

I trust that most folks reading this have worked in this space or similar and understand how those previous steps are much easier said than done. What if you are running a Fedora-based system? Ansible has you covered there as well, and yes, you can abstract the package list over a list of package managers. It is automagic! Ansible, Salt, and friends work really well in the vast majority of use cases. They are mature tools with very strong community involvement.

Superpowers

I am painting a broad generalization here; ultimately, I am trying to make a more nuanced point.

Let's start with a list of strengths:

  • Low-code solutions. You really don't need to be a programmer to build scalable infrastructure.
  • Highly abstracted complex operations.
    • Automation frameworks handle a myriad of low-level details. Most expose these as needed.
  • Great community support.
  • "It just works." In most cases, yes.
  • You don't spend developer hours reinventing the wheel.
    • This can be very expensive and error-prone.
  • Extensible (more on this later).

It gets complicated

If the environment you manage is sufficiently large and/or complex, you will find yourself drawing outside the lines. Third-party tools like Helm, Dhall, CDK8S, and CDKTF all exist simply to address the shortcomings of the configuration format. I am not picking on YAML here; the problem exists for any format lacking sufficient expressiveness (Some would say Turing-complete, but I'm fresh out of unbounded tape). This includes:

All of these automation tools implement some kind of bespoke domain-specific language inside a markup format that was largely designed for files small enough to fit in a single terminal pane.

Here is a less trivial example found in the wild:

- name: Set iptables rules for SQL clients
  set_fact:
    sql_client_rules: >
      {{ groups.backend | map('extract', hostvars)
         | list | json_query(get_sql_clients) }}
  vars:
    get_sql_clients: >
      [?backend_is_sql_client].{
        comment: join(' ', ['Allow SQL traffic from', inventory_hostname]),
        protocol: 'tcp',
        source: ansible_host,
        destination_port: '3306',
        jump: 'ACCEPT',
        chain: 'INPUT'
      }

- include_role:
    name: example-iptables
    tasks_from: rules
  vars:
    iptables_rules: "{{ sql_client_rules }}"

YAML is similar to JSON. It is a set of key/value pairs, lists, and scalars. The previous example contains text templates, function calls, and an example of potential string footguns in YAML. What could go wrong?!

The joy of mixing and matching these markups and templates is exemplified in the following:

cv_versions: "{{ (versions.resources | rejectattr('environments') | rejectattr('composite_content_view_ids') |
  rejectattr('published_in_composite_content_view_ids') | map(attribute='version') | map('float') | sort |
  map('string') | reverse | list )[foreman_content_view_version_cleanup_keep:] }}"

Imagine how many attempts it took to throw filter spaghetti at the wall until it stuck, instead of simply writing that in Python.

I have threatened in more than one forum that I will someday write a book called "Object-Oriented Programming in YAML and Jinja". Practically all text-templating languages have been used and abused just as badly. If anyone thinks that complex YAML is a problem, SaltStack pours salt in the wound (bad pun intended) by adding Jinja to the mix in ways that make Ansible blush. Imagine thousands of lines of this stuff:

base:
   '*':
     - firewalls
     - rtpengine
rtpengine:
{% if grains['nodename'] ==  'voip2-rtpeng' %}
     config:
        MEDIA_IPv4: 10.3.1.138
{% elif grains['nodename'] == 'voip2-rtpeng2' %}
     config:
        MEDIA_IPv4: 10.3.1.139
{% endif %}

Not only can you embed Jinja in statements, you can define entire YAML files in Jinja that are rendered at the worst time imaginable.

Runtime.

    Data failed to compile:
----------
    ID comment in SLS foobar is not a dictionary

You can't even validate this as YAML without firing the renderer (I'll spare you a rant on side effects)! There is nothing quite as fun as teasing apart arcane error messages in the middle of an incident. Testing dynamically-typed, dynamically-interpreted code and config requires its own set of specialized tooling and headaches, and at some point, you start to reach for an expressive programming language.
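As a contrast to the templated pillar above, here is a minimal sketch of the same per-node data kept as plain data with the lookup done in code. The function name and the decision to fail on an unknown node are my assumptions for illustration; the node names and addresses come from the snippet above:

```python
import ipaddress

# Per-node data is plain data: no template branches to render before it can be read.
MEDIA_IPV4_BY_NODE = {
    "voip2-rtpeng": "10.3.1.138",
    "voip2-rtpeng2": "10.3.1.139",
}


def rtpengine_config(nodename: str) -> dict[str, str]:
    """Return the rtpengine config for a node, failing loudly on bad input."""
    try:
        address = MEDIA_IPV4_BY_NODE[nodename]
    except KeyError:
        raise ValueError(f"no rtpengine media address defined for {nodename!r}") from None
    # Raises ValueError if the address is malformed.
    ipaddress.IPv4Address(address)
    return {"MEDIA_IPv4": address}
```

Both failure modes (unknown node, malformed address) can be exercised in a unit test long before a deployment, instead of surfacing as a render error at runtime.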

Weaknesses

  • Most automation tools use a schema-less config format.
  • Dynamic everything.
    • Dynamic types
    • Dynamic values
    • Everything is evaluated at runtime.
    • Murphy's law says you will hit the error path halfway through a critical deployment.
  • "Programming" in a markup language.
    • There is no standard, few formatting best-practices.
    • It is nearly impossible to lint anything containing text templates.

All your engineers need to become programmers

So you have decided to embrace programming, awesome! The grass is greener over here, I promise!

...

I lied. Of course, it isn't that simple. I often say, "if there was one best way to do it, we would all be doing it." This statement applies to practically every aspect of the "real world". It is squishy yet unforgiving.

Let's tease apart one of those awful Jinja templates:

# cv_versions: "{{ (versions.resources | rejectattr('environments') | rejectattr('composite_content_view_ids') |
#   rejectattr('published_in_composite_content_view_ids') | map(attribute='version') | map('float') | sort |
#   map('string') | reverse | list )[foreman_content_view_version_cleanup_keep:] }}"

versions = {
    'resources': [
        {
            'environments': ['a', 'b', 'c'],
            'composite_content_view_ids': [1, 2, 3],
            'published_in_composite_content_view_ids': [4, 5, 6],
            'version': '3.14159',
        },
        {
            'environments': ['g', 'h'],
            'composite_content_view_ids': [1, 2, 3],
            'published_in_composite_content_view_ids': [4, 5, 6],
            'version': '2.71828',
        }
    ]
}

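from typing import Any

# Placeholder alias; a real project would use a TypedDict or a schema library here.
SomeSchema = dict[str, list[dict[str, Any]]]

def filter_cleanup_versions(version_data: SomeSchema) -> list[str]:
    """
    Extract and sort version strings from version resources.
    # cv_versions: "{{ versions | filter_cleanup_versions }}"

    >>> filter_cleanup_versions(versions)
    ['2.71828', '3.14159']
    """
    # Pull the version string out of each resource.
    version_strings = [
        resource['version'] for resource in version_data['resources']
    ]
    # Sort numerically rather than lexically.
    return sorted(version_strings, key=float)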

I found that snippet via a random search of GitHub, so I am guessing at the shape of the data. There are a dozen different ways to write that function, but that is beside the point.
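For comparison, here is a sketch that follows the original filter chain more literally, step by step. It rests on the same guessed data shape, and the function name, the `keep` parameter, and the sample resources are hypothetical:

```python
from typing import Any


def versions_to_clean_up(resources: list[dict[str, Any]], keep: int) -> list[str]:
    """Mirror the Jinja pipeline: reject in-use versions, sort newest first, skip `keep`."""
    # rejectattr(...): drop any resource where the attribute is truthy.
    unused = [
        r for r in resources
        if not r.get("environments")
        and not r.get("composite_content_view_ids")
        and not r.get("published_in_composite_content_view_ids")
    ]
    # map(attribute='version') | map('float') | sort | ... | reverse
    newest_first = sorted((float(r["version"]) for r in unused), reverse=True)
    # [keep:] skips the newest `keep` versions; map('string') turns floats back into text.
    return [str(v) for v in newest_first[keep:]]


# Assumed sample data: three unused versions and one still published to an environment.
example_resources = [
    {"version": "1.0"},
    {"version": "2.0"},
    {"version": "3.0"},
    {"version": "4.0", "environments": ["Library"]},
]

stale = versions_to_clean_up(example_resources, keep=2)  # ['1.0']
```

Each line of the one-liner now has a name and a comment, and the intent (keep the two newest unused versions, clean up the rest) is visible at a glance.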

Superpowers

Pick your language, pick your style, the bottom line is:

  • Embedded documentation.
    • YAML supports comments, but tools like Sphinx can render docstrings to beautiful HTML docs.
  • Typed (depending on your team norms and language of choice).
  • Testable (outside of the runtime framework, too!).
  • Fully expressive.
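To make the "testable outside of the runtime framework" point concrete, here is a minimal sketch of a custom Ansible filter with plain unit tests. Ansible discovers filter plugins through a class named `FilterModule` with a `filters()` method; the `newest_first` filter itself is a hypothetical example:

```python
import unittest


def newest_first(versions: list[str]) -> list[str]:
    """Sort version strings numerically, newest first (hypothetical example filter)."""
    return sorted(versions, key=float, reverse=True)


class FilterModule:
    """Entry point Ansible looks for in a filter plugin file."""

    def filters(self) -> dict:
        # Maps the name used in templates to the plain Python function.
        return {"newest_first": newest_first}


class NewestFirstTest(unittest.TestCase):
    def test_sorts_numerically_not_lexically(self):
        self.assertEqual(newest_first(["9.0", "10.0", "2.5"]), ["10.0", "9.0", "2.5"])

    def test_plugin_exposes_filter(self):
        self.assertIs(FilterModule().filters()["newest_first"], newest_first)
```

The tests run with `python -m unittest` and never touch an inventory, a playbook, or a remote host.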

It is still complicated

Full-bore software development is a complicated monster. Testing, linting, CI, and deployment don't happen without significant effort. Which language do you choose, and which libraries? How do you establish norms around style and testing? I have seen plenty of shops that have a covey of network engineers and one or two developers on a team. This is a great setup in the spirit of embedded SREs, and it is my preferred approach to an automation team. If the work volume outpaces the developers' capacity to test and code, you may end up with too many cooks in the kitchen. This can lead to lots of duct tape and baling wire to get tickets out the door. At best, you may end up with thousands of lines of poor-quality code in your repos that turn into a maintenance nightmare. Testing matters. Code coverage matters. Quality matters. There is an inflection point where your "infrastructure as code" becomes technical debt and a maintenance burden. You have traded a team of CLI manipulators for Jinja specialists. It is a zero-sum game that many organizations have fallen prey to. Call your favorite consultant and hope they can help sort out the mess.

I was recently given a task to perform some pre- and post-upgrade checks for network devices. I could have done most of it with off-the-shelf tools, but I needed enough specialization to justify writing code. JSNAPy is a terrific tool that I looked to for inspiration. Its shortcoming is the same as every other YAML-based automation framework: YAML is simply not expressive enough to do what I need to do, and there is no meaningful way to abstract the tasks and their dependencies. Think about the problem you are trying to solve before diving straight into code. Just because YAML (or another format) isn't expressive enough to meet your needs doesn't mean that no abstraction will do. A former coworker encouraged me to always be thinking about modularity. Even when writing code, don't lose the framework mindset; where else can you use this approach?
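As an illustration of abstracting tasks and their dependencies in code, here is a minimal sketch using `graphlib.TopologicalSorter` from the Python standard library (3.9+). The check names and their dependencies are hypothetical, not the ones from my actual project:

```python
from graphlib import TopologicalSorter

# Hypothetical checks; each maps to the set of checks that must run before it.
CHECK_DEPENDENCIES = {
    "reachability": set(),
    "software_version": {"reachability"},
    "bgp_neighbors": {"reachability"},
    "route_counts": {"bgp_neighbors"},
}


def check_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return an order in which every check runs after the checks it depends on."""
    # static_order() raises graphlib.CycleError if the dependencies are circular.
    return list(TopologicalSorter(dependencies).static_order())


order = check_order(CHECK_DEPENDENCIES)
```

The dependency table is still declarative data, so the framework mindset survives; the difference is that ordering and cycle detection come from a tested library instead of the layout of a YAML file.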

Weaknesses

  • Consistency/Scalability/Maintainability
  • Environmental complexity
  • Team dynamics
  • Loss of generality

Finding balance in the force

Just because your automation tool of choice fails to meet 100% of your needs doesn't mean that you need to rewrite all of your automation in a custom framework! All of these tools, including Terraform and Kubernetes, can be extended in their native programming language. You don't have to choose one approach or another. These days, you can generally have your cake and eat it, too!

You will encounter complexity when trying to automate infrastructure. Embrace the challenge, but do not limit yourself to a single approach. Write code where it makes sense, and leverage existing tools to do the boring stuff!

Guidelines, not rules

  • Strict separation of config and code (this needs to be another blog post).
    • This leads to massive headaches in Salt.
      • Don't render SLS files in Jinja. You will have a bad time™️.
    • Think twice about that to_yaml filter in Ansible.
    • Embrace schemas whenever possible.
  • Are you "coloring outside the lines"?
    • Consider custom plugins and code for your framework of choice.
    • Instead of performing complex actions in Jinja, consider writing a filter (applies to any templating "language").
      • You can take this too far as well! There is such a thing as filter hell in Jinja, akin to decorator hell in Python.
      • Do you need to rethink the problem? Is it more appropriate to translate your data before text processing?
  • If you are using Terraform or Kubernetes, check out these awesome projects and ditch the YAML!
  • When the JSON and YAML sprawl hit critical mass, consider alternatives like the myriad Python schema libraries or even TypeScript.
  • If software correctness is on your radar, opt for a statically-typed language.
    • Don't underestimate the amount of work it takes to build sufficient guardrails around Python and Ruby.
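To illustrate the "embrace schemas" guideline, here is a minimal sketch using only standard-library dataclasses. The field names echo the iptables example earlier in the post, but the class, the defaults, and the validation rules are my assumptions; a dedicated schema library would do this with less code:

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class FirewallRule:
    """Hypothetical schema for one rule; field names are illustrative."""
    comment: str
    source: str
    destination_port: int
    protocol: str = "tcp"
    jump: str = "ACCEPT"
    chain: str = "INPUT"

    def __post_init__(self) -> None:
        # Raises ValueError if the source is not a valid IP address.
        ipaddress.ip_address(self.source)
        port = self.destination_port
        if isinstance(port, bool) or not isinstance(port, int):
            raise TypeError("destination_port must be an integer")
        if not 1 <= port <= 65535:
            raise ValueError("destination_port must be between 1 and 65535")
        if self.protocol not in {"tcp", "udp"}:
            raise ValueError(f"unsupported protocol: {self.protocol!r}")


def load_rule(raw: dict) -> FirewallRule:
    """Build a rule from parsed config; unknown or missing keys raise TypeError."""
    return FirewallRule(**raw)


rule = load_rule({"comment": "Allow SQL traffic", "source": "10.0.0.5", "destination_port": 3306})
```

A quoted port such as `'3306'` or a misspelled key is rejected when the config is loaded, not halfway through a deployment.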