Data exploration with Rust and Polars

Something different, something new!

I suffer from distractions at times, and I chased a squirrel so hard this time that I am surprised I didn't snap my neck! I am pretty good about going through open tabs and closing things that I know I simply won't get to in a reasonable time. For things I might want to reference later, I use Pocket and tags. I had a week of no coursework last week, so I spent some time poking through my backlog. This whole adventure started off with me thinking about a way to come up with some realistic mock data. Yes, from faker import Faker is trivially easy, but my brain was dragging me down a rabbit hole. Plus Rust. Rust is cool.

Gray squirrel, a little bit of data

I found a list of international airports on Wikipedia. Hmmm, that isn't formatted too horribly. Copy-and-paste and you have yourself a tab-separated file (start with "Africa" and copy to the end of the tables). Not bad!

$ head IntAirports.txt 
Africa
Northern Africa
Algeria
Location        Airport IATA Code
Adrar   Touat-Cheikh Sidi Mohamed Belkebir Airport      AZR
Algiers Houari Boumediene Airport       ALG

Well, not too bad. This isn't exactly structured data, but we can probably work with it. The country is always listed before a header row, so let's ignore the continent and region.

import csv


with open("IntAirports.txt") as fp, open("IntAirports.csv", "w", newline="") as csvfile:
    fieldnames = ["country", "city", "IATA"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    old = ""
    country = ""
    for raw in fp:
        line = raw.strip()
        # The country name is always the line immediately before a header row.
        if line == "Location\tAirport\tIATA Code":
            country = old
            continue
        old = line
        parts = line.split("\t")
        # Skip continent/region names and anything else that isn't a data row.
        if len(parts) != 3:
            continue
        writer.writerow({
            "country": country,
            "city": parts[0],
            "IATA": parts[2],
        })

What? Python?! Heck, yeah! Use the appropriate tool for the job. I am poking at some poorly-formatted data, and it is simply easier and faster to do this in Python.

$ head IntAirports.csv 
country,city,IATA
Algeria,Adrar,AZR
Algeria,Algiers,ALG
Algeria,Annaba,AAE
Algeria,Batna,BLJ

Cool, that looks usable. What can we do with this? What if I want to know the population and latitude/longitude for the cities on this list? Kaggle has a "World Cities" CSV available for download.

$ head worldcities.csv 
"city","city_ascii","lat","lng","country","iso2","iso3","admin_name","capital","population","id"
"Tokyo","Tokyo","35.6897","139.6922","Japan","JP","JPN","Tōkyō","primary","37732000","1392685764"
"Jakarta","Jakarta","-6.1750","106.8275","Indonesia","ID","IDN","Jakarta","primary","33756000","1360771077"
"Delhi","Delhi","28.6100","77.2300","India","IN","IND","Delhi","admin","32226000","1356872604"

Fox squirrel, data exploration

The first step of data analysis is data exploration. Pandas and Jupyter notebooks are industry-standard tools, but I am digging into Polars today. There is an excellent series of blog posts about using Rust for data analysis that you should read if you are interested in digging deeper. The Polars user guide is invaluable as well.

Flying squirrel, a note on notebooks

I started off using the Rust kernel in JupyterLab. My first impression is that I don't like it much. Rust is a compiled language, and using it in a notebook feels clunky and slow because it has to recompile all the time. When it comes to data exploration, I would use Python and Polars in a Jupyter notebook, as sketched below. Figure out what your data looks like and what you want your pipeline to do. Once you have that, it is pretty trivial to write the production implementation in plain Rust.
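
If you want a feel for that flow, here is a minimal exploration sketch in Python Polars. It only assumes the two CSV files built above sit in the working directory; keyword names can drift a bit between Polars releases.

import polars as pl

cities = pl.read_csv("worldcities.csv")
airports = pl.read_csv("IntAirports.csv")

print(cities.head())      # eyeball the columns and inferred dtypes
print(cities.describe())  # quick summary statistics
print(airports.head())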

Arctic ground squirrel, Polars

The complete code listing is posted at the end.

Rust is a strongly typed language, and even in Python you have to declare data types when your CSV is full of quoted cells. I start out by loading the data and declaring the schema.

fn get_cities() -> DataFrame {
    let fields: Vec<Field> = vec![
        Field::new("city", DataType::Utf8),
        Field::new("city_ascii", DataType::Utf8),
        Field::new("lat", DataType::Float32),
        Field::new("lng", DataType::Float32),
        Field::new("country", DataType::Utf8),
        Field::new("iso2", DataType::Utf8),
        Field::new("iso3", DataType::Utf8),
        Field::new("admin_name", DataType::Utf8),
        Field::new("capital", DataType::Utf8),
        Field::new("population", DataType::Float32),
        Field::new("id", DataType::Int64),
    ];
    let schema: Schema = Schema::from_iter(fields);
    let cities_file_path: &Path = Path::new("../worldcities.csv");
    read_data_frame_from_csv_with_schema(cities_file_path, schema)
}
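
For comparison, this is roughly what pinning the schema looks like on the Python side; a hedged sketch, using the dtypes keyword that Python Polars has carried for a while (newer releases also accept schema_overrides):

import polars as pl

cities = pl.read_csv(
    "worldcities.csv",
    # Pin the numeric columns instead of trusting inference on quoted cells.
    dtypes={
        "lat": pl.Float32,
        "lng": pl.Float32,
        "population": pl.Float32,
        "id": pl.Int64,
    },
)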

This whole project is a software spike, so please forgive the gratuitous use of unwrap and expect. Polars has a very similar interface to Pandas, so we can load up two different dataframes and compare them just like you would query SQL tables.

let df_cities = get_cities().sort(["city", "country"], true).unwrap();
let df_airports = get_airports().sort(["city", "country"], true).unwrap();
let joined = df_cities
    .join(
        &df_airports,
        ["city", "country"],
        ["city", "country"],
        JoinType::Left,
        None,
    )
    .unwrap();
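
The notebook prototype of that join is a one-liner; a sketch continuing the Python session above, where how="left" keeps every city and leaves IATA null when no airport matched:

joined = cities.join(airports, on=["city", "country"], how="left")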

I know there are many more cities in the Kaggle data set than the Wikipedia set, so I need to filter null values. After that, I sort the remaining table by population.

let mask = joined.column("IATA").expect("IATA column exists").is_not_null();
let mut filtered = joined
    .filter(&mask)
    .unwrap()
    .sort(["population"], true)
    .unwrap();
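
The Python mirror of the filter-and-sort step, again continuing the notebook sketch (descending is the current keyword; very old Polars releases spelled it reverse):

filtered = joined.filter(pl.col("IATA").is_not_null()).sort(
    "population", descending=True
)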

And now we can take a peek at the data.

println!(
    "{}",
    filtered
        .select(["city", "country", "population", "IATA"])
        .unwrap()
        .head(Some(5))
);
┌───────────┬─────────────┬────────────┬──────┐
│ city      ┆ country     ┆ population ┆ IATA │
│ ---       ┆ ---         ┆ ---        ┆ ---  │
│ str       ┆ str         ┆ f32        ┆ str  │
╞═══════════╪═════════════╪════════════╪══════╡
│ Tokyo     ┆ Japan       ┆ 3.7732e7   ┆ HND  │
│ Jakarta   ┆ Indonesia   ┆ 3.3756e7   ┆ HLP  │
│ Guangzhou ┆ China       ┆ 2.694e7    ┆ CAN  │
│ Mumbai    ┆ India       ┆ 2.4973e7   ┆ BOM  │
│ Manila    ┆ Philippines ┆ 2.4922e7   ┆ MNL  │
└───────────┴─────────────┴────────────┴──────┘

This looks reasonable, and today I learned that Jakarta is a huge city! Lastly, I write out the data to a new CSV.

let mut file = File::create("cities-with-airports.csv").expect("could not create file");
CsvWriter::new(&mut file)
    .has_header(true)
    .with_delimiter(b',')
    .finish(&mut filtered)
    .expect("failed to write output");
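
For completeness, the notebook prototype ends the same way; writing the frame back out in Python Polars is a single call:

filtered.write_csv("cities-with-airports.csv")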

Abert's squirrel, takeaways

Where did this all start? Honestly, I already forgot, but it was an excuse to dig into a new library and write some code. Rust is wicked fast, but the learning curve is infamous. The Polars library has a great API, and I found it intuitive and easy to use.

TL;DR

  • Do data exploration in Jupyter+Python
  • Write your production pipeline in Rust
  • Profit!

Full source

use polars::prelude::{
    CsvReader, CsvWriter, DataFrame, DataFrameJoinOps, DataType, Field, JoinType, Schema,
    SerReader, SerWriter,
};
use std::fs::File;
use std::path::Path;
use std::sync::Arc;

fn read_data_frame_from_csv_with_schema(csv_file_path: &Path, schema: Schema) -> DataFrame {
    CsvReader::from_path(csv_file_path)
        .expect("Cannot open file.")
        .has_header(true)
        .with_schema(Arc::new(schema))
        .finish()
        .unwrap()
}

fn get_cities() -> DataFrame {
    let fields: Vec<Field> = vec![
        Field::new("city", DataType::Utf8),
        Field::new("city_ascii", DataType::Utf8),
        Field::new("lat", DataType::Float32),
        Field::new("lng", DataType::Float32),
        Field::new("country", DataType::Utf8),
        Field::new("iso2", DataType::Utf8),
        Field::new("iso3", DataType::Utf8),
        Field::new("admin_name", DataType::Utf8),
        Field::new("capital", DataType::Utf8),
        Field::new("population", DataType::Float32),
        Field::new("id", DataType::Int64),
    ];
    let schema: Schema = Schema::from_iter(fields);
    let cities_file_path: &Path = Path::new("../worldcities.csv");
    read_data_frame_from_csv_with_schema(cities_file_path, schema)
}

fn get_airports() -> DataFrame {
    let fields: Vec<Field> = vec![
        Field::new("country", DataType::Utf8),
        Field::new("city", DataType::Utf8),
        Field::new("IATA", DataType::Utf8),
    ];
    let schema: Schema = Schema::from_iter(fields);
    let airports_file_path: &Path = Path::new("../IntAirports.csv");
    read_data_frame_from_csv_with_schema(airports_file_path, schema)
}

fn main() {
    // Load both data sets, sorted (descending) by the join keys.
    let df_cities = get_cities().sort(["city", "country"], true).unwrap();
    let df_airports = get_airports().sort(["city", "country"], true).unwrap();
    // Left join: keep every city, with a null IATA where no airport matched.
    let joined = df_cities
        .join(
            &df_airports,
            ["city", "country"],
            ["city", "country"],
            JoinType::Left,
            None,
        )
        .unwrap();
    // Drop the cities without an airport, then sort by population (descending).
    let mask = joined.column("IATA").expect("IATA column exists").is_not_null();
    let mut filtered = joined
        .filter(&mask)
        .unwrap()
        .sort(["population"], true)
        .unwrap();
    // Peek at the five most populous cities that made the cut.
    println!(
        "{}",
        filtered
            .select(["city", "country", "population", "IATA"])
            .unwrap()
            .head(Some(5))
    );
    // Persist the joined table to a new CSV.
    let mut file = File::create("cities-with-airports.csv").expect("could not create file");
    CsvWriter::new(&mut file)
        .has_header(true)
        .with_delimiter(b',')
        .finish(&mut filtered)
        .expect("failed to write output");
}