In addition to compressing datasets, sometimes to a tenth of the original size or less, the parquet format is designed for analytical processing. You can load only the columns you want into memory, and optionally set conditions before loading them (load columns `url`, `status`, and `size` where `status >= 300`, for example). You end up loading a tiny subset of the original dataset.
This is a simple analytics UI for exploring website / SEO crawl datasets.
- Currently it only works locally, and assumes your file is already in the parquet format (a server file explorer is in the works in parallel).
- Since parquet allows reading only a file's metadata, we use that too: in this case to get the numeric and non-numeric columns, which populate the `options` of two different `dcc.Dropdown` components.
- The user now has two dropdowns to select from, and each produces a different set of charts (based on the data types).
Numeric columns overview:
Text columns overview:
Converting to parquet is generally straightforward with `pandas.DataFrame.to_parquet`, which has various optimization options.
Feel free to share any improvements, bugs, or suggestions.
Code repo: GitHub - eliasdabbas/advertools_crawler_ui: advertools crawler UI