Functional Parsing
  |   Source

Parsing with OCaml and Angstrom

All opinions are my own and do not represent those of any employer, past or present.

Network automation is dominated by the Python and Go programming languages. Python is very attractive since it has one of the lowest learning curves of any programming language, and most network automation developers and scripters started out like myself, in traditional network engineering roles. Python has an incredible ecosystem of tooling focused on this domain. Python has some serious drawbacks, too. It is a fat language in many respects. The standard library is large, it is near impossible to build small containers, and it is not terribly performant. I don't want to wade into the dynamic-versus-typed debate, but my opinion is that it is difficult to write "correct" programs in a dynamic language. The typing system in Python is a dumpster fire. Like the continuous stream of new language features, the typing committee is trying to please all masters simultaneously.

Golang is an advantage in each of these situations with the added benefits of an elegant concurrency model and strongly-typed compiler. My main gripe with Go is that it is a strongly-opinionated language, and it is very difficult to color outside the lines. It lacks generics, and the current proposals leave a lot to be desired despite being huge improvements over the current situation.

Functional programming is going through a bit of a renaissance, and it has become popular in fin-tech and crypto. Many other companies large and small are testing the waters as well. I would like to explore functional programming as an alternative for network automation tasks. I will start with a common example then deal with the language details in a future post. I hope showing the "why" inspires interest in the "how".

Parsing

Most network automation developers are intimately familiar regular expressions and libraries like TextFSM. I never did look deeply into subject, and I recently discovered the large rabbit hole that is the world of parsing. At a high level, parsing is simply the act of turning a string of characters into some sort of structured data. Parsers are used in everything from programming languages to the code that translates this markdown into HTML. Several tools have been created to simplify this work. The three broad categories are:

  • regular expressions
  • parser generators
  • parser combinators

If you aren't familiar with regular expressions, you probably aren't interested in reading further. 😂 Wikipedia has great articles on each subject.

Parser generators

Parser generators perform a task called lexicographical analysis that converts a stream of characters into a stream of "tokens". These tokens are assembled in the parsing step to build an abstract syntax tree (AST). Parser generators work best for well-defined grammars. Think programming languages, JSON, YANG, YAML, etc. These are streams of characters that throw errors when we hit unexpected input.

OCaml's Menhir library is a parser generator. David Beazely has spoken and written about the subject using Python.

Parser combinators

Lastly, we consider parser combinators (and the real focus of this post). From a network automation slant, we would consider parser combinators anywhere we have written bespoke regex logic. Textfsm works great in this space, but it has a few drawbacks.

  • It introduces yet another DSL.
  • The DSL isn't as expressive as a programming language.
  • It is untyped (as a result of the Python backend).
  • It is not very extensible (the returned data structures are not configurable).
  • It's run by Google, and they're not exactly engaged with the community.

To be clear, TextFSM is my go-to library most days. Parser-combinators typically support look-ahead, and it is much easier to "throw away" data we're not interested in. A good example of the limitations of the DSL is in this template. It is clunky to deal with repeated data in certain contexts, and what if there is no upper bound on the list length?

Side rant

It is no secret that I love to hate on YAML. Software developers love abstractions; we live in abstractions. We also tend to be a bit too lazy for our own good. "Arrrgggh, I'm sick of writing regex parsers, we need to abstract this into a library!" "This library is too complicated, let's write a DSL!" The next thing you know... standardsSource: Randall Munroe

I am not saying YAML isn't useful. I am saying that if you work with Ansible or Kubernetes that you have probably tried to build Turing-complete logic in YAML. Be cognizant of the limitations of your tools, and go back to your roots when the abstractions start to fall apart. We try to make tools accessible and easy to use for those around us, but it is difficult to replicate the expressiveness of a programming language.

</rant>

Angstrom

Below, I am using OCaml's Angstrom library. The complete source is available in this gist. Let's consider this raw string from Cisco IOS from the show vlan command:

Input

let vlan_raw =
  "VLAN Name                             Status    Ports
---- -------------------------------- --------- -------------------------------
1    default                          active    Gi0/1
10   Management                       active
50   VLan50                           active    Fa0/1, Fa0/2, Fa0/3, Fa0/4, Fa0/5, Fa0/6, Fa0/7, Fa0/8, Fa0/9
                                                Fa0/10, Fa0/11, Fa0/12
60   VLan60                           active    Fa0/13, Fa0/14, Fa0/15, Fa0/16, Fa0/17, Fa0/18, Fa0/19, Fa0/20
                                                Fa0/21, Fa0/22, Fa0/23, Fa0/24
1002 fddi-default                     act/unsup
1003 token-ring-default               act/unsup
1004 fddinet-default                  act/unsup
1005 trnet-default                    act/unsup
VLAN Type  SAID       MTU   Parent RingNo BridgeNo Stp  BrdgMode Trans1 Trans2
---- ----- ---------- ----- ------ ------ -------- ---- -------- ------ ------
1    enet  100001     1500  -      -      -        -    -        0      0
10   enet  100010     1500  -      -      -        -    -        0      0
50   enet  100050     1500  -      -      -        -    -        0      0
60   enet  100060     1500  -      -      -        -    -        0      0
1002 fddi  101002     1500  -      -      -        -    -        0      0
1003 tr    101003     1500  -      -      -        -    srb      0      0
1004 fdnet 101004     1500  -      -      -        ieee -        0      0
1005 trnet 101005     1500  -      -      -        ibm  -        0      0
Remote SPAN VLANs
------------------------------------------------------------------------------
Primary Secondary Type              Ports
------- --------- ----------------- ------------------------------------------"

Data structures

We wish to capture the two tables and transform them to structured data. OCaml has record types. These look like annotated Python dict types, but they are closer conceptually to named tuples. The primary difference is that these types are not keyed mappings.

type vlan = {
  vlan_id : int;
  name : string;
  status : string;
  ports : string list;
}

type vlan_stat = {
  vlan_id : int;
  vl_type : string;
  said : string;
  mtu : int;
  parent : string;
  ring_no : string;
  bridge_no : string;
  stp : string;
  bridge_mode : string;
  trans1 : int;
  trans2 : int;
}

Rules

Next, we need some parsing rules. Your first thought might be, "Why not use regex?". The reason is that this parser-combinator lexes a character at a time. We can match on strings, but this is simply a recursive implementation of the character processor. I followed the naming convention in Angstrom of using a 1 to indicate that a rule requires at least one match.

Speaking of empty matches... this is the primary "gotcha" that I ran into using Angstrom. If your parser hangs and starts chewing through vast amounts of memory and CPU, double check your rules. Angstrom doesn't handle arbitrary conditions well, and happily recurses itself into a corner if you mess up. Be mindful of cases where a string must exist or may exist. Try not to get hung up on the syntax; I will explain the language in much more detail in another post.

let is_whitespace = function '\x20' | '\x09' -> true | _ -> false

let whitespace = take_while is_whitespace

let whitespace1 = take_while1 is_whitespace

let is_newline = function '\x0a' | '\x0d' -> true | _ -> false

let newline = take_while is_newline

let newline1 = take_while1 is_newline

let is_digit = function '0' .. '9' -> true | _ -> false

let digit = take_while is_digit

let digit1 = take_while is_digit

let is_alpha = function
  | '_' | '-' | 'a' .. 'z' | 'A' .. 'Z' -> true
  | _ -> false

let is_iface_char = function '/' | '.' -> true | _ -> false

let is_ident c = is_alpha c || is_digit c || is_iface_char c

let ident = take_while is_ident

let ident1 = take_while1 is_ident

Handwaving and magic

If you are familiar with ML-family languages, this will look a lot less magical. If you come from an imperative language, the following looks like pure chicken scratch. These operators are the "combinator" part of "parser-combinator". We can write a bunch of rules and glue them together in interesting ways. Programmers know about modularity; this is all about modularity!

  • *> is an operator that throws away the preceding output and returns the next expression
  • <* does just the opposite
  • >>= binds two functions together, chaining the results together
  • variable |> int_of_string is a way of applying a type cast to a result
  • return binds the results together given our data structures at the beginning

Hopefully, the flow will make some sense. These build more complex parsers from the simple rules above.

let vl_header =
  Angstrom.string "VLAN" *> take_while1 is_whitespace *> string_ci "name"
  *> take_while1 is_whitespace *> string_ci "status"
  *> take_while1 is_whitespace *> string_ci "ports" *> take_while is_whitespace
  *> newline

let sep_row = Angstrom.string "--" *> take_till is_newline *> skip_many newline1

let vl_stat_header =
  Angstrom.string "VLAN" *> whitespace *> string_ci "type" *> whitespace
  *> string_ci "said" *> whitespace *> string_ci "mtu" *> take_till is_newline
  *> newline

let iface_list = sep_by (string ", ") ident1 <* newline1

(* VLAN Name Status Ports *)
let vl_data_row =
  digit1 <* whitespace1 >>= fun vid ->
  ident1 <* whitespace1 >>= fun name ->
  ident1 <* whitespace >>= fun status ->
  iface_list >>= fun ifaces ->
  many (whitespace1 *> iface_list) >>= fun add_ifaces ->
  return
    {
      vlan_id = vid |> int_of_string;
      name;
      status;
      ports = List.append ifaces (List.flatten add_ifaces);
    }

(* VLAN Type SAID MTU Parent RingNo BridgeNo Stp BrdgMode Trans1 Trans2 *)
let vl_stats_row =
  digit1 <* whitespace1 >>= fun vid ->
  ident1 <* whitespace1 >>= fun vtype ->
  ident1 <* whitespace1 >>= fun said ->
  ident1 <* whitespace1 >>= fun mtu ->
  ident1 <* whitespace1 >>= fun parent ->
  ident1 <* whitespace1 >>= fun ringno ->
  ident1 <* whitespace1 >>= fun brno ->
  ident1 <* whitespace1 >>= fun stp ->
  ident1 <* whitespace1 >>= fun brmode ->
  ident1 <* whitespace1 >>= fun t1 ->
  ident1 <* whitespace1 <* skip_many newline1 >>= fun t2 ->
  return
    {
      vlan_id = vid |> int_of_string;
      vl_type = vtype;
      said;
      mtu = mtu |> int_of_string;
      parent;
      ring_no = ringno;
      bridge_no = brno;
      stp;
      bridge_mode = brmode;
      trans1 = t1 |> int_of_string;
      trans2 = t2 |> int_of_string;
    }

let vlan_info = vl_header *> sep_row *> many vl_data_row <* skip_many newline1

let vlan_stats =
  vl_stat_header *> sep_row *> many vl_stats_row <* skip_many any_char

Tie it all together

The last expression is the final piece of glue to wrap up all of the above work. It returns a tuple of lists of vlan and vlan_stat defined in the beginning.

let p =
  match
    Angstrom.parse_string ~consume:All
      ( vlan_info >>= fun vli ->
        vlan_stats >>= fun vls -> return (vli, vls) )
      vlan_raw
  with
  | Ok v -> v
  | Error msg -> failwith msg

And produce a result

At long last, Angstrom "magic" produces a useful result!

utop # Lib.Vlan.p;;
- : Lib.Vlan.vlan list * Lib.Vlan.vlan_stat list =
([{Lib.Vlan.vlan_id = 1; name = "default"; status = "active";
   ports = ["Gi0/1"]};
  {Lib.Vlan.vlan_id = 10; name = "Management"; status = "active"; ports = []};
  {Lib.Vlan.vlan_id = 50; name = "VLan50"; status = "active";
   ports =
    ["Fa0/1"; "Fa0/2"; "Fa0/3"; "Fa0/4"; "Fa0/5"; "Fa0/6"; "Fa0/7"; "Fa0/8";
     "Fa0/9"; "Fa0/10"; "Fa0/11"; "Fa0/12"]};
  {Lib.Vlan.vlan_id = 60; name = "VLan60"; status = "active";
   ports =
    ["Fa0/13"; "Fa0/14"; "Fa0/15"; "Fa0/16"; "Fa0/17"; "Fa0/18"; "Fa0/19";
     "Fa0/20"; "Fa0/21"; "Fa0/22"; "Fa0/23"; "Fa0/24"]};
  {Lib.Vlan.vlan_id = 1002; name = "fddi-default"; status = "act/unsup";
   ports = []};
  {Lib.Vlan.vlan_id = 1003; name = "token-ring-default"; status = "act/unsup";
   ports = []};
  {Lib.Vlan.vlan_id = 1004; name = "fddinet-default"; status = "act/unsup";
   ports = []};
  {Lib.Vlan.vlan_id = 1005; name = "trnet-default"; status = "act/unsup";
   ports = []}],
 [{Lib.Vlan.vlan_id = 1; vl_type = "enet"; said = "100001"; mtu = 1500;
   parent = "-"; ring_no = "-"; bridge_no = "-"; stp = "-"; bridge_mode = "-";
   trans1 = 0; trans2 = 0};
  {Lib.Vlan.vlan_id = 10; vl_type = "enet"; said = "100010"; mtu = 1500;
   parent = "-"; ring_no = "-"; bridge_no = "-"; stp = "-"; bridge_mode = "-";
   trans1 = 0; trans2 = 0};
  {Lib.Vlan.vlan_id = 50; vl_type = "enet"; said = "100050"; mtu = 1500;
   parent = "-"; ring_no = "-"; bridge_no = "-"; stp = "-"; bridge_mode = "-";
   trans1 = 0; trans2 = 0};
  {Lib.Vlan.vlan_id = 60; vl_type = "enet"; said = "100060"; mtu = 1500;
   parent = "-"; ring_no = "-"; bridge_no = "-"; stp = "-"; bridge_mode = "-";
   trans1 = 0; trans2 = 0};
  {Lib.Vlan.vlan_id = 1002; vl_type = "fddi"; said = "101002"; mtu = 1500;
   parent = "-"; ring_no = "-"; bridge_no = "-"; stp = "-"; bridge_mode = "-";
   trans1 = 0; trans2 = 0};
  {Lib.Vlan.vlan_id = 1003; vl_type = "tr"; said = "101003"; mtu = 1500;
   parent = "-"; ring_no = "-"; bridge_no = "-"; stp = "-";
   bridge_mode = "srb"; trans1 = 0; trans2 = 0};
  {Lib.Vlan.vlan_id = 1004; vl_type = "fdnet"; said = "101004"; mtu = 1500;
   parent = "-"; ring_no = "-"; bridge_no = "-"; stp = "ieee";
   bridge_mode = "-"; trans1 = 0; trans2 = 0};
  {Lib.Vlan.vlan_id = 1005; vl_type = "trnet"; said = "101005"; mtu = 1500;
   parent = "-"; ring_no = "-"; bridge_no = "-"; stp = "ibm";
   bridge_mode = "-"; trans1 = 0; trans2 = 0}])

There is a lot more that could be done with this, especially around data validation. So far, the only thing I check for is that certain values are integers. We could wrap those "-" in Option, STP in a sum type, etc. This is where we start pulling ahead of Python in terms of type safety and data integrity. There are a few corner cases that will break this parser as well as the ntc-templates version. I will leave that as an exercise for the reader.

Going forward

Do I plan to rewrite the ntc-templates library in OCaml? No! It does a splendid job, and has fantastic support.

I would love to make a case for using functional languages in automation tasks, but it's not practical yet. Ideas don't pay my bills, nor do rants about software quality and testing. First and foremost, software must be usable. Haskell is a wonderful language in many aspects, but it is so hard to do "real" work with it. We don't live in a pure data utopia, and I'm no postdoc getting paid by a university. I will go into more details as to "why OCaml?" in another post, but I wanted to whet your appetite something practical and useful.