Exploring Large Datasets with Dash and Parquet

In addition to compressing datasets, sometimes to a tenth of the original size or less, the parquet format is also designed for analytical processing. You can load only the columns you want into memory, and optionally set conditions on the rows before loading them (load the columns url, status, and size where status >= 300, for example).

You end up loading a tiny subset of the original dataset.
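A minimal sketch of that kind of selective read with pandas (the file name is just a placeholder, and the filter is passed through to the pyarrow engine):

```python
import pandas as pd

# Load only three columns, and only rows where status >= 300,
# instead of reading the full crawl dataset into memory.
df = pd.read_parquet(
    "crawl_output.parquet",           # assumed file name
    columns=["url", "status", "size"],
    filters=[("status", ">=", 300)],  # row filtering handled by pyarrow
)
print(df.shape)
```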

This is a simple analytics UI for exploring website / SEO crawl datasets.

  1. It currently only works locally and assumes your file is already in parquet format (a server file explorer is in the works in parallel).
  2. Since parquet allows reading only the file metadata, we use that as well: in this case, to get the numeric and non-numeric columns and populate the options of two separate dcc.Dropdown components (see the sketch after this list).
  3. The user then has two dropdowns to select from, and each produces a different set of charts based on the data types.
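
Here is a rough sketch of how that metadata-based column split could look, using pyarrow to read only the schema. The file name and component ids are assumptions for illustration, not the actual repo code:

```python
import pyarrow.parquet as pq
import pyarrow.types as pa_types
from dash import dcc

# Read only the schema (metadata), not the data itself.
schema = pq.read_schema("crawl_output.parquet")  # assumed file name

# Split column names by data type.
numeric_cols = [
    f.name for f in schema
    if pa_types.is_integer(f.type) or pa_types.is_floating(f.type)
]
text_cols = [f.name for f in schema if f.name not in numeric_cols]

# Two dropdowns, one per column group.
numeric_dropdown = dcc.Dropdown(id="numeric_col", options=numeric_cols)
text_dropdown = dcc.Dropdown(id="text_col", options=text_cols)
```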

Numeric columns overview:

Text columns overview:


Converting to parquet is generally straightforward with pandas.DataFrame.to_parquet, which has various optimization options.
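A minimal example, assuming a DataFrame already loaded from a CSV export (file names are placeholders):

```python
import pandas as pd

df = pd.read_csv("crawl_output.csv")  # assumed existing CSV export

# Write to parquet; compression and engine are the main knobs to tune.
df.to_parquet(
    "crawl_output.parquet",
    engine="pyarrow",
    compression="snappy",  # or "zstd" / "gzip" for smaller files
    index=False,
)
```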

Feel free to share any improvements, bugs, suggestions.
Code repo: https://github.com/eliasdabbas/advertools_crawler_ui


Amazing. I saw this on LinkedIn as well. How long did it take you to build this, @eliasdabbas, and how did you come up with the idea?


Thanks @adamschroeder!

I’ve already been using parquet for a while, so that helped.
The idea was mainly to create an interface for analyzing website crawl datasets, which typically contain 100-200 columns. You usually want only two or three at a time, so loading the whole thing makes it extremely slow. Parquet seemed like a great fit.
