Parsing with OCaml and Angstrom¶
All opinions are my own and do not represent those of any employer, past or present.
Network automation is dominated by the Python and Go programming languages. Python is very attractive since it has one of the lowest learning curves of any programming language, and most network automation developers and scripters started out like myself, in traditional network engineering roles. Python has an incredible ecosystem of tooling focused on this domain. Python has some serious drawbacks, too. It is a fat language in many respects. The standard library is large, it is near impossible to build small containers, and it is not terribly performant. I don't want to wade into the dynamic-versus-typed debate, but my opinion is that it is difficult to write "correct" programs in a dynamic language. The typing system in Python is a dumpster fire. Like the continuous stream of new language features, the typing committee is trying to please all masters simultaneously.
Golang is an advantage in each of these situations with the added benefits of an elegant concurrency model and strongly-typed compiler. My main gripe with Go is that it is a strongly-opinionated language, and it is very difficult to color outside the lines. It lacks generics, and the current proposals leave a lot to be desired despite being huge improvements over the current situation.
Functional programming is going through a bit of a renaissance, and it has become popular in fin-tech and crypto. Many other companies large and small are testing the waters as well. I would like to explore functional programming as an alternative for network automation tasks. I will start with a common example then deal with the language details in a future post. I hope showing the "why" inspires interest in the "how".
Parsing¶
Most network automation developers are intimately familiar regular expressions and libraries like TextFSM. I never did look deeply into subject, and I recently discovered the large rabbit hole that is the world of parsing. At a high level, parsing is simply the act of turning a string of characters into some sort of structured data. Parsers are used in everything from programming languages to the code that translates this markdown into HTML. Several tools have been created to simplify this work. The three broad categories are:
- regular expressions
- parser generators
- parser combinators
If you aren't familiar with regular expressions, you probably aren't interested in reading further. 😂 Wikipedia has great articles on each subject.
Parser generators¶
Parser generators perform a task called lexicographical analysis that converts a stream of characters into a stream of "tokens". These tokens are assembled in the parsing step to build an abstract syntax tree (AST). Parser generators work best for well-defined grammars. Think programming languages, JSON, YANG, YAML, etc. These are streams of characters that throw errors when we hit unexpected input.
OCaml's Menhir library is a parser generator. David Beazely has spoken and written about the subject using Python.
Parser combinators¶
Lastly, we consider parser combinators (and the real focus of this post).
From a network automation slant, we would consider parser combinators anywhere we have written bespoke regex logic.
Textfsm
works great in this space, but it has a few drawbacks.
- It introduces yet another DSL.
- The DSL isn't as expressive as a programming language.
- It is untyped (as a result of the Python backend).
- It is not very extensible (the returned data structures are not configurable).
- It's run by Google, and they're not exactly engaged with the community.
To be clear, TextFSM is my go-to library most days. Parser-combinators typically support look-ahead, and it is much easier to "throw away" data we're not interested in. A good example of the limitations of the DSL is in this template. It is clunky to deal with repeated data in certain contexts, and what if there is no upper bound on the list length?
Side rant¶
It is no secret that I love to hate on YAML. Software developers love abstractions; we live in abstractions. We also tend to be a bit too lazy for our own good. "Arrrgggh, I'm sick of writing regex parsers, we need to abstract this into a library!" "This library is too complicated, let's write a DSL!" The next thing you know... Source: Randall Munroe
I am not saying YAML isn't useful. I am saying that if you work with Ansible or Kubernetes that you have probably tried to build Turing-complete logic in YAML. Be cognizant of the limitations of your tools, and go back to your roots when the abstractions start to fall apart. We try to make tools accessible and easy to use for those around us, but it is difficult to replicate the expressiveness of a programming language.
</rant>
Angstrom¶
Below, I am using OCaml's Angstrom library.
The complete source is available in this gist.
Let's consider this raw string from Cisco IOS from the show vlan
command:
Input¶
let vlan_raw =
"VLAN Name Status Ports
---- -------------------------------- --------- -------------------------------
1 default active Gi0/1
10 Management active
50 VLan50 active Fa0/1, Fa0/2, Fa0/3, Fa0/4, Fa0/5, Fa0/6, Fa0/7, Fa0/8, Fa0/9
Fa0/10, Fa0/11, Fa0/12
60 VLan60 active Fa0/13, Fa0/14, Fa0/15, Fa0/16, Fa0/17, Fa0/18, Fa0/19, Fa0/20
Fa0/21, Fa0/22, Fa0/23, Fa0/24
1002 fddi-default act/unsup
1003 token-ring-default act/unsup
1004 fddinet-default act/unsup
1005 trnet-default act/unsup
VLAN Type SAID MTU Parent RingNo BridgeNo Stp BrdgMode Trans1 Trans2
---- ----- ---------- ----- ------ ------ -------- ---- -------- ------ ------
1 enet 100001 1500 - - - - - 0 0
10 enet 100010 1500 - - - - - 0 0
50 enet 100050 1500 - - - - - 0 0
60 enet 100060 1500 - - - - - 0 0
1002 fddi 101002 1500 - - - - - 0 0
1003 tr 101003 1500 - - - - srb 0 0
1004 fdnet 101004 1500 - - - ieee - 0 0
1005 trnet 101005 1500 - - - ibm - 0 0
Remote SPAN VLANs
------------------------------------------------------------------------------
Primary Secondary Type Ports
------- --------- ----------------- ------------------------------------------"
Data structures¶
We wish to capture the two tables and transform them to structured data.
OCaml has record types.
These look like annotated Python dict
types, but they are closer conceptually to named tuples.
The primary difference is that these types are not keyed mappings.
type vlan = {
vlan_id : int;
name : string;
status : string;
ports : string list;
}
type vlan_stat = {
vlan_id : int;
vl_type : string;
said : string;
mtu : int;
parent : string;
ring_no : string;
bridge_no : string;
stp : string;
bridge_mode : string;
trans1 : int;
trans2 : int;
}
Rules¶
Next, we need some parsing rules.
Your first thought might be, "Why not use regex?".
The reason is that this parser-combinator lexes a character at a time.
We can match on strings, but this is simply a recursive implementation of the character processor.
I followed the naming convention in Angstrom of using a 1
to indicate that a rule requires at least one match.
Speaking of empty matches... this is the primary "gotcha" that I ran into using Angstrom. If your parser hangs and starts chewing through vast amounts of memory and CPU, double check your rules. Angstrom doesn't handle arbitrary conditions well, and happily recurses itself into a corner if you mess up. Be mindful of cases where a string must exist or may exist. Try not to get hung up on the syntax; I will explain the language in much more detail in another post.
let is_whitespace = function '\x20' | '\x09' -> true | _ -> false
let whitespace = take_while is_whitespace
let whitespace1 = take_while1 is_whitespace
let is_newline = function '\x0a' | '\x0d' -> true | _ -> false
let newline = take_while is_newline
let newline1 = take_while1 is_newline
let is_digit = function '0' .. '9' -> true | _ -> false
let digit = take_while is_digit
let digit1 = take_while is_digit
let is_alpha = function
| '_' | '-' | 'a' .. 'z' | 'A' .. 'Z' -> true
| _ -> false
let is_iface_char = function '/' | '.' -> true | _ -> false
let is_ident c = is_alpha c || is_digit c || is_iface_char c
let ident = take_while is_ident
let ident1 = take_while1 is_ident
Handwaving and magic¶
If you are familiar with ML-family languages, this will look a lot less magical. If you come from an imperative language, the following looks like pure chicken scratch. These operators are the "combinator" part of "parser-combinator". We can write a bunch of rules and glue them together in interesting ways. Programmers know about modularity; this is all about modularity!
-
*>
is an operator that throws away the preceding output and returns the next expression -
<*
does just the opposite -
>>=
binds two functions together, chaining the results together -
variable |> int_of_string
is a way of applying a type cast to a result -
return
binds the results together given our data structures at the beginning
Hopefully, the flow will make some sense. These build more complex parsers from the simple rules above.
let vl_header =
Angstrom.string "VLAN" *> take_while1 is_whitespace *> string_ci "name"
*> take_while1 is_whitespace *> string_ci "status"
*> take_while1 is_whitespace *> string_ci "ports" *> take_while is_whitespace
*> newline
let sep_row = Angstrom.string "--" *> take_till is_newline *> skip_many newline1
let vl_stat_header =
Angstrom.string "VLAN" *> whitespace *> string_ci "type" *> whitespace
*> string_ci "said" *> whitespace *> string_ci "mtu" *> take_till is_newline
*> newline
let iface_list = sep_by (string ", ") ident1 <* newline1
(* VLAN Name Status Ports *)
let vl_data_row =
digit1 <* whitespace1 >>= fun vid ->
ident1 <* whitespace1 >>= fun name ->
ident1 <* whitespace >>= fun status ->
iface_list >>= fun ifaces ->
many (whitespace1 *> iface_list) >>= fun add_ifaces ->
return
{
vlan_id = vid |> int_of_string;
name;
status;
ports = List.append ifaces (List.flatten add_ifaces);
}
(* VLAN Type SAID MTU Parent RingNo BridgeNo Stp BrdgMode Trans1 Trans2 *)
let vl_stats_row =
digit1 <* whitespace1 >>= fun vid ->
ident1 <* whitespace1 >>= fun vtype ->
ident1 <* whitespace1 >>= fun said ->
ident1 <* whitespace1 >>= fun mtu ->
ident1 <* whitespace1 >>= fun parent ->
ident1 <* whitespace1 >>= fun ringno ->
ident1 <* whitespace1 >>= fun brno ->
ident1 <* whitespace1 >>= fun stp ->
ident1 <* whitespace1 >>= fun brmode ->
ident1 <* whitespace1 >>= fun t1 ->
ident1 <* whitespace1 <* skip_many newline1 >>= fun t2 ->
return
{
vlan_id = vid |> int_of_string;
vl_type = vtype;
said;
mtu = mtu |> int_of_string;
parent;
ring_no = ringno;
bridge_no = brno;
stp;
bridge_mode = brmode;
trans1 = t1 |> int_of_string;
trans2 = t2 |> int_of_string;
}
let vlan_info = vl_header *> sep_row *> many vl_data_row <* skip_many newline1
let vlan_stats =
vl_stat_header *> sep_row *> many vl_stats_row <* skip_many any_char
Tie it all together¶
The last expression is the final piece of glue to wrap up all of the above work.
It returns a tuple of lists of vlan
and vlan_stat
defined in the beginning.
let p =
match
Angstrom.parse_string ~consume:All
( vlan_info >>= fun vli ->
vlan_stats >>= fun vls -> return (vli, vls) )
vlan_raw
with
| Ok v -> v
| Error msg -> failwith msg
And produce a result¶
At long last, Angstrom "magic" produces a useful result!
utop # Lib.Vlan.p;;
- : Lib.Vlan.vlan list * Lib.Vlan.vlan_stat list =
([{Lib.Vlan.vlan_id = 1; name = "default"; status = "active";
ports = ["Gi0/1"]};
{Lib.Vlan.vlan_id = 10; name = "Management"; status = "active"; ports = []};
{Lib.Vlan.vlan_id = 50; name = "VLan50"; status = "active";
ports =
["Fa0/1"; "Fa0/2"; "Fa0/3"; "Fa0/4"; "Fa0/5"; "Fa0/6"; "Fa0/7"; "Fa0/8";
"Fa0/9"; "Fa0/10"; "Fa0/11"; "Fa0/12"]};
{Lib.Vlan.vlan_id = 60; name = "VLan60"; status = "active";
ports =
["Fa0/13"; "Fa0/14"; "Fa0/15"; "Fa0/16"; "Fa0/17"; "Fa0/18"; "Fa0/19";
"Fa0/20"; "Fa0/21"; "Fa0/22"; "Fa0/23"; "Fa0/24"]};
{Lib.Vlan.vlan_id = 1002; name = "fddi-default"; status = "act/unsup";
ports = []};
{Lib.Vlan.vlan_id = 1003; name = "token-ring-default"; status = "act/unsup";
ports = []};
{Lib.Vlan.vlan_id = 1004; name = "fddinet-default"; status = "act/unsup";
ports = []};
{Lib.Vlan.vlan_id = 1005; name = "trnet-default"; status = "act/unsup";
ports = []}],
[{Lib.Vlan.vlan_id = 1; vl_type = "enet"; said = "100001"; mtu = 1500;
parent = "-"; ring_no = "-"; bridge_no = "-"; stp = "-"; bridge_mode = "-";
trans1 = 0; trans2 = 0};
{Lib.Vlan.vlan_id = 10; vl_type = "enet"; said = "100010"; mtu = 1500;
parent = "-"; ring_no = "-"; bridge_no = "-"; stp = "-"; bridge_mode = "-";
trans1 = 0; trans2 = 0};
{Lib.Vlan.vlan_id = 50; vl_type = "enet"; said = "100050"; mtu = 1500;
parent = "-"; ring_no = "-"; bridge_no = "-"; stp = "-"; bridge_mode = "-";
trans1 = 0; trans2 = 0};
{Lib.Vlan.vlan_id = 60; vl_type = "enet"; said = "100060"; mtu = 1500;
parent = "-"; ring_no = "-"; bridge_no = "-"; stp = "-"; bridge_mode = "-";
trans1 = 0; trans2 = 0};
{Lib.Vlan.vlan_id = 1002; vl_type = "fddi"; said = "101002"; mtu = 1500;
parent = "-"; ring_no = "-"; bridge_no = "-"; stp = "-"; bridge_mode = "-";
trans1 = 0; trans2 = 0};
{Lib.Vlan.vlan_id = 1003; vl_type = "tr"; said = "101003"; mtu = 1500;
parent = "-"; ring_no = "-"; bridge_no = "-"; stp = "-";
bridge_mode = "srb"; trans1 = 0; trans2 = 0};
{Lib.Vlan.vlan_id = 1004; vl_type = "fdnet"; said = "101004"; mtu = 1500;
parent = "-"; ring_no = "-"; bridge_no = "-"; stp = "ieee";
bridge_mode = "-"; trans1 = 0; trans2 = 0};
{Lib.Vlan.vlan_id = 1005; vl_type = "trnet"; said = "101005"; mtu = 1500;
parent = "-"; ring_no = "-"; bridge_no = "-"; stp = "ibm";
bridge_mode = "-"; trans1 = 0; trans2 = 0}])
There is a lot more that could be done with this, especially around data validation.
So far, the only thing I check for is that certain values are integers.
We could wrap those "-"
in Option
, STP
in a sum type, etc.
This is where we start pulling ahead of Python in terms of type safety and data integrity.
There are a few corner cases that will break this parser as well as the ntc-templates
version.
I will leave that as an exercise for the reader.
Going forward¶
Do I plan to rewrite the ntc-templates
library in OCaml?
No!
It does a splendid job, and has fantastic support.
I would love to make a case for using functional languages in automation tasks, but it's not practical yet. Ideas don't pay my bills, nor do rants about software quality and testing. First and foremost, software must be usable. Haskell is a wonderful language in many aspects, but it is so hard to do "real" work with it. We don't live in a pure data utopia, and I'm no postdoc getting paid by a university. I will go into more details as to "why OCaml?" in another post, but I wanted to whet your appetite something practical and useful.