Monoids for Pythonistas
  |   Source

Monoids for Pythonistas

Bartosz Milewsky has a great series of articles on category theory as it applies to programmers. I'm not going to lie, the material is pretty dense which is par for the course for a lot of mathematics subjects (maybe it's worth mentioning that Bartosz has a PhD in Theoretical Physics 😬). However, many of the core concepts are reasonably simple abstractions, and programmers know all about abstractions! The language can be as big a hurdle as reading mathematics symbols, but these are the tools mathematicians use to convey ideas in an elegant and concise way.

I promise that you know what monoids are - you use them every day as a programmer! I am going to flip the script and try to work my way to the definition instead of the other way around. Hopefully, with a bit of patience, we can uncover the power of these ideas and see how they apply to our day-to-day jobs!

Disclaimer This is not meant to be a mathematically rigorous discussion. Many small details are overlooked in favor of a high-level understanding. However, please make me aware of any egregious offenses!

Sets

As a programmer, you should know what a set is - an unordered collection of unique items of the same type (and if not, maybe a blog post for another day!).

>>> my_pets = {"Lamby", "Dozer", "Onyx", "BlackHen", "RedHen", "NoisyRooster"}
>>> my_pets.add("BuffHen")
>>> "Frank" in my_pets
False
>>> my_pets.add("BuffHen")  # `add` is idempotent and won't create a duplicate
>>> my_pets  # Notice that the order isn't consistent
{'NoisyRooster', 'RedHen', 'BuffHen', 'BlackHen', 'Lamby', 'Onyx', 'Dozer'}
>>>
>>> pets_son_feeds = {"Shayna", "Lamby"}
>>> pets_son_feeds.intersection(my_pets)
{'Lamby'}
>>>

The integers form a set, $\mathbb{I} = \{-1,0,1,2,3,4...\}$. Rational numbers form a set, $\mathbb{Q} = \{-99.42,0,1.1,1.2,1.3...\}$. Character strings even form a set, $\{\text{"a", "b", "c", "foo", "cow", "Bob", ...}\}$. For the purposes of this article, a "set" is a mathematical set, not constrained by physical memory limits of computers.

Symbols

Symbols are the syntax of mathematics.

$\forall$

Means "for all". Given a set, "for every element in the set." In the definition of a monoid, we will use this to assert a property of the items in the set. It is somewhat analogous to a for loop over all the items.

>>> odds = {1, 3, 5, 7, 9}
>>> for odd in odds:  # ∀ in the set...
...   assert odd % 2 == 1  # Assert that an odd divided by 2 has a remainder of 1
...
>>> # Success!

$\in$

means "in" or "an element of". 1 is an element of the set of integers.

>>> some_primes = {2, 3, 5, 7, 11}
>>> 7 in some_primes  # 7 ∈ some_primes == True
True
>>>

More terminology

Binary operation

An operation that works on two (hence binary) values; addition, multiplication, division, etc. Most of the Python numeric dunder methods are binary operations.

Associative

If an operation is associative, the order of composition doesn't matter. This property is very important as we'll see a bit later.

  • Addition is associative. 1 + (2 + 3) is the same as (1 + 2) + 3.
  • String concatenation is associative. "A" + ("B" + "C") == ("A" + "B") + "C"
    • Unlike numeric addition where 1 + 2 == 2 + 1, string concatenation is not commutative! Monoids only depend on the associative property. (See Andrey Mokhov's article for a fascinating deep-dive into this subject)
  • Subtraction is not associative as order of composition matters. 1 - (2 - 3) is not equal to (1 - 2) - 3.

Closed

For a given operation, the input and output types must match for the operation to be closed.

  • The minimum of two integers maps back into the set of integers.
def min_(a: int, b: int) -> int:
    return min(a, b)
  • Multiplication of two integers stays within the set of integers.
>>> type(42 * 7)
<class 'int'>
>>> a = 42
>>> b = 7
>>> type(a)
<class 'int'>
>>> type(b)
<class 'int'>
>>> type(a * b)
<class 'int'>
>>>
  • Floating point division of two integers returns a float, which is outside the set of integers. The input and output types don't match. This is not a closed operation.
>>> type(42 / 33)
<class 'float'>
>>>
  • The chr function in Python takes an integer and returns a str. This operation is not closed for the same reason; the output is of a different type than the input.
>>> type(chr(51))
<class 'str'>
>>>

Axioms

We're almost there! We need to mention identity. The importance of identity is a bit of a rabbit hole, so I'll leave it as an exercise for the reader to dig deeper (abstract algebra rings and rngs).

An identity element is an element that does not change the output when paired with any other input.

Identity

  • Numeric addition: 0 is the identity (works with reals, integers, naturals, etc).
>>> 42 + 0
42
>>>
  • Numeric multiplication: 1 is the identity (gotcha! The identity is dependent on the operation and the type).
>>> 42 * 1
42
>>>
  • List concatenation: the empty list is the identity
>>> [1, 4, 3] + []
[1, 4, 3]
>>>

The reason we need this is a little subtle. We are using a binary operation to make a computation, but what do we do when we only have one value to work with? Many of the use cases of monoids are recursive in nature, so what do you do when you reduce a list to one element? This is fairly representative of the problem (or imagine if functools.reduce didn't take a default argument).

>>> list(map(lambda x, y: x + y, [42]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: <lambda>() missing 1 required positional argument: 'y'
>>>

We can use the identity! This allows the developer to safely perform a binary operation on a single element without changing the output.

Definition

Hopefully, we now have the tools to unpack this definition...

According to Eric Weisstein on Wolfram, a monoid is "a set that is closed under an associative binary operation and has an identity element $I \in S$ such that $ \forall a \in S, Ia=aI=a$".

I don't like that definition. What the heck does $Ia=aI=a$ mean? It looks like multiplication, but that is one of many possible binary operations. Here's my rephrasing:
A monoid consists of a set, $S$, that is closed under an associative binary operation, $f$, and has an identity element $I \in S$ such that $ \forall a \in S, f(I, a)=f(a, I)=a$.

A monoid has three components.

  • The set $S$, often called the "carrier set". Programmers know this as a type (int, float, double, etc).
  • $I$ is the identity element, which we identified in the previous section
  • A binary operation. Addition, multiplication, list concatenation, etc.

I said in the intro that you already know what a monoid is. Here are some examples:

  • Integers, addition, 0
  • Reals, multiplication, 1
  • Lists, concatenation, [ ]

Why it matters

At this point, you might be thinking about whipping me with a large stick for describing simple arithmetic in such obtuse and abstract terms. It matters, I promise! These definitions allow us to precisely identify objects with certain properties that support certain operations. Enough math, why does this matter to programmers?!

Divide

Remember how I said that associative property would be important later on? Consider the following items... they're all equivalent.

1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10
(1 + 2) + (3 + 4 + 5 + 6 + 7 + 8 + 9 + 10)
(1 + (2 + (3 + (4 + (5 + (6 + (7 + (8 + (9 + 10)))))))))
(((((((((1 + 2) + 3) + 4) + 5) + 6) + 7) + 8) + 9) + 10)

Here's another example.

"a" + "b" + "c" + "d" + "e" + "f" + "g" + "h" + "i" + "j"
"ab" + "cd" + "ef" + "gh" + "ij"
"abcd" + "efgh" + "ij"
"abcdefgh" + "ij"
"abcdefghij"

I can write those down on a set of index cards and pass the pairs to two people at a time (think two CPU cores). As long as I track the order of the results, I can recursively continue this process in any order of composition until the task is done. This process succeeds even if the workers have different computational capacities. The answer is the same as if one poor individual toiled away at the tasks independently.

More cores, more computers

If the list is really long or the binary operation is non-trivial, we can safely and easily divide this tasks amongst different cores - or even different computers. Notice that we are talking about CPU-bound tasks, not IO-bound tasks (multiple cores do little to alleviate the latter). Python doesn't have the ability to do this natively, but other languages do.

But be careful
from multiprocessing import Pool, cpu_count
import timeit


def read_data(fname: str):
    """Read Wiki tables"""
    with open(fname) as fp:
        items = fp.readlines()
    return map(lambda item: int(item.split(" ")[3]), items)


def do_it(fname):
    """Read data and perform calculation"""
    data = read_data(fname)
    return sum(data)


def multi_proc_demo() -> None:
    N_PROC = cpu_count()
    # Start the processes
    with Pool(processes=N_PROC) as p:
        results = []
        fnames = [
            "data/pagecounts-20090601-000000",
            "data/pagecounts-20090601-010000",
            "data/pagecounts-20090601-020000"
        ]
        for fname in fnames:
            results.append(p.apply_async(do_it, kwds={'fname': fname}))
        sum([result.get(timeout=5) for result in results])


def single_proc_demo() -> None:
    # cat data/pagecounts* >> all.txt
    sum(read_data("data/all.txt"))


if __name__ == "__main__":
    print("Multi-proc: ", timeit.Timer("multi_proc_demo()", globals=globals()).repeat(5, 10))
    print("Single-proc: ", timeit.Timer("single_proc_demo()", globals=globals()).repeat(5, 10))
python wiki.py
Multi-proc:  [28.17182195998612, 28.688616012994316, 28.539513622003142, 28.077732990990626, 28.294970447997912]
Single-proc:  [75.17859069499536, 73.89702447700256, 74.0759316909971, 74.918352623994, 74.7120841670112]

The example above uses a publicly-available dataset from Wikipedia (linked in the References). I spent way too much time fighting multiprocessing, but I learned a lot along the way. Always RTFM! There are several caveats to the library, and it is essential to read the Programming Guidelines. One of the issues is how Python copies memory to child processes. This serialization overhead is huge, so it is much better for your child processes to retrieve their own data. When in doubt, measure (the timeit library is great for this)! What does this have to do with monoids?

Since addition is a moniod, we can safely split the work up and aggregate the results. It is also commutative which means that we don't even have to track the order of the results. Not all operations are commutative, though! If you are multiplying a bunch of matrices, you can still batch the computation, but you must track the order of the results.

The right tool for the right job

Know your use case. I am an avid Pythonista, but sometimes alternatives might be the better choice. Gloang doesn't read all.txt any faster, but it performs the summation in milliseconds instead of seconds! Julia as native support for multi-core processing, and there are other languages that do as well.

And conquer

This divide-and-conquer concept is critical to well-known large-scale algorithms like MapReduce.

It doesn't mean we can't do it ourselves, though! Libraries like multiprocessing allow Python developers to delegate tasks among different cores or processors. Keep in mind that just like delegating tasks to friends, there is a computational overhead to splitting up tasks. It is important to understand that libraries like Numpy don't get you around the single-core limit even though it is implemented in C. Do your research! Know your data and the problem domain before you start concatenating small strings across processes! 😜

BLS data

Let's look at some data freely available from the Bureau of Labor Statistics. I downloaded the flat files from the link in the references and created some dataclasses to support our operations. This is a quick-and-dirty module. I'm not doing any type checking or validation in an effort to minimize boilerplate and get to the essence of the post.

In [1]:
from functools import partial, reduce
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class BLSRecord:
    """BLS table entry"""
    series_id: str
    year: str
    period: str
    value: str
    footnote_codes: str

@dataclass(frozen=True)
class BLSSeries:
    """BLS series entry"""
    series_id: str
    area_code: str
    item_code: str
    series_title: str
    footnote_codes: str
    begin_year: str
    begin_period: str
    end_year: str
    end_period: str


def read_with_header(fname: str, rec_type):
    """Read BLS tables with header row"""
    with open(fname) as fp:
        items = fp.readlines()
    # Skip the header row, split, strip, and instantiate the dataclass
    return map(lambda item: rec_type(*map(str.strip, item.split("\t"))), items[1::])

Composition

When dealing with business logic, we are often faced with brittle code when requirements change and that massive if/else block just keeps growing. Even worse is when you realize that similar logic is needed in multiple parts of the code and it's unclear how to "glue" the pieces together.

Here is a much less trivial example. One of the key axioms is that we are dealing with binary operations that map some type to the same type. Does this type have to be a number or a string? Not at all! Let's reconsider sum on integers; it takes two integers and returns an integer. Replace integer with "numeric type", and we can still play ball. Can we go further? Here is the formula for standard deviation:

$$ \sigma = \sqrt{\frac{\sum_{i=1}^N (x_i - \bar{x})^2}{N-1}} $$

Look at the summation $\sum_{i=1}^N (x_i - \bar{x})^2$ and break it down a little. Couldn't we also say $\sum_{i=1}^N func()$? In other words, instead of passing some numeric type to add, I want to pass a function that returns a numeric type to add. This is a perfectly valid construction! The cool thing is that we can compose functions that return numeric types with the add binary operation.

We can use this idea with boolean operations to compose data queries instead of using an endless string of if X and Y and Z where X, Y, and Z tend to get lengthy in my experience.

In [24]:
def is_coffee_lookup(rec: BLSRecord, lookup) -> bool:
    return series_data[rec.series_id].item_code == "717311"

def is_january(rec: BLSRecord) -> bool:
    return rec.period == "M01"

def is_not_1995(rec: BLSRecord) -> bool:
    return rec.year != "1995"

def intersect_identity(rec: BLSRecord) -> bool:
    # The "natural element" for intersections is True
    return True

def intersect(fst_fn, snd_fn):
    return lambda rec: fst_fn(rec) and snd_fn(rec)

# Create a lookup table for series entries
series_data = {row.series_id: row for row in read_with_header("data/ap.series", BLSSeries)}

# Inject our state
is_coffee = partial(is_coffee_lookup, lookup=series_data)

print(f"All records: {len(price_data):,}")
# Filter for entries that are coffee
print(f"Coffee: {len(list(filter(is_coffee, price_data))):,}")
# Filter for entries that are in January
print(f"January: {len(list(filter(is_january, price_data))):,}")

# Compose and profit!
coffee_and_january = reduce(intersect, [is_coffee, is_january], intersect_identity)
print(f"Coffee and January: {len(list(filter(coffee_and_january, price_data))):,}")

# We can compose any number of functions that are valid Moniods!
coffee_and_january_not_1995 = reduce(intersect, [is_not_1995, is_coffee, is_january], intersect_identity)
print(f"Coffee and January and not 1995: {len(list(filter(coffee_and_january_not_1995, price_data))):,}")
All records: 177,894
Coffee: 786
January: 15,042
Coffee and January: 66
Coffee and January and not 1995: 62

Conclusion

I hope I didn't chase too many rabbits; there are plenty to chase on this topic! I also hope that this has sparked some ideas and will help you identify common patterns in software that will make your coding a bit more efficient and clean.

References

Data