Exploring Large Datasets with Dash and Parquet

In addition to compressing datasets, sometimes to a tenth of the original size or less, the parquet format is also designed for analytical processing. You can load only the columns you want into memory, and optionally set conditions on the rows before loading them (load the columns url, status, and size where status >= 300, for example).

You end up loading a tiny subset of the original dataset.
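A minimal sketch of that kind of selective read with pandas (the file name is just a placeholder, and the filter is passed through to the pyarrow engine):

```python
import pandas as pd

# Load only three columns, and only rows where status >= 300,
# instead of reading the full crawl dataset into memory.
df = pd.read_parquet(
    "crawl_output.parquet",           # assumed file name
    columns=["url", "status", "size"],
    filters=[("status", ">=", 300)],  # row filtering handled by pyarrow
)
print(df.shape)
```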

This is a simple analytics UI for exploring website / SEO crawl datasets.

  1. It currently only works locally and assumes your file is already in parquet format (a server file explorer is in the works in parallel).
  2. Since parquet allows reading only the file metadata, we use that as well: in this case, to get the numeric and non-numeric columns and populate the options of two separate dcc.Dropdown components (see the sketch after this list).
  3. The user then has two dropdowns to select from, and each produces a different set of charts based on the data types.
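
Here is a rough sketch of how that metadata-based column split could look, using pyarrow to read only the schema. The file name and component ids are assumptions for illustration, not the actual repo code:

```python
import pyarrow.parquet as pq
import pyarrow.types as pa_types
from dash import dcc

# Read only the schema (metadata), not the data itself.
schema = pq.read_schema("crawl_output.parquet")  # assumed file name

# Split column names by data type.
numeric_cols = [
    f.name for f in schema
    if pa_types.is_integer(f.type) or pa_types.is_floating(f.type)
]
text_cols = [f.name for f in schema if f.name not in numeric_cols]

# Two dropdowns, one per column group.
numeric_dropdown = dcc.Dropdown(id="numeric_col", options=numeric_cols)
text_dropdown = dcc.Dropdown(id="text_col", options=text_cols)
```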

Numeric columns overview:

Text columns overview:


Converting to parquet is generally straightforward with pandas.DataFrame.to_parquet, which has various optimization options.
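A minimal example, assuming a DataFrame already loaded from a CSV export (file names are placeholders):

```python
import pandas as pd

df = pd.read_csv("crawl_output.csv")  # assumed existing CSV export

# Write to parquet; compression and engine are the main knobs to tune.
df.to_parquet(
    "crawl_output.parquet",
    engine="pyarrow",
    compression="snappy",  # or "zstd" / "gzip" for smaller files
    index=False,
)
```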

Feel free to share any improvements, bugs, suggestions.
Code repo: https://github.com/eliasdabbas/advertools_crawler_ui


Amazing. I saw this on LinkedIn as well. How long did it take you to build this, @eliasdabbas, and how did you come up with the idea?


Thanks @adamschroeder!

I’ve already been using parquet for a while, so that helped.
The idea was mainly to create an interface for analyzing website crawl datasets, which typically contain 100-200 columns. You usually want only two or three at a time, so loading the whole thing makes it extremely slow. Parquet seemed like a great fit.
