Is classical State-less architecture suitable for Data Science products?

Vlad · July 10, 2017, 9:38pm

Use-case #1: crossfilter-like dashboard (inspired by dc.js: http://dc-js.github.io/dc.js/examples/filtering.html)
At the moment I am building a crossfilter-like dashboard with the ability to slice data by multiple dimensions (I have tried up to 9, but I have many more in the DWH). It reads data from the DWH (the query and transfer take a couple of minutes) and lets you explore the data interactively. The queried dataframe can have 100k+ rows, but I’d like to push it to a few million (which is why I don’t want to use browser-based dc.js).
As a first iteration, I cached the data to a pickle on disk. I have 9 charts, and each UI change triggers an update of all 9.
This is, unsurprisingly, already quite slow: it takes 120+ ms to read the pickle and 10 ms to do the filtering/grouping in pandas.
For fun, I tried the same thing with a single flask worker, keeping the pandas dataframe in memory instead of reading it from the pickle in each callback. Of course it is much faster (each chart takes 10 ms to update instead of 130). Of course, it will break as soon as there are multiple users.
As for the “proper way” - I got the best results with a memory-based mysql (I can also do aggregations there), but this complicated the code significantly - I don’t expect my data scientists to do that. It also takes about 70 ms per chart, so 7 times slower than keeping the data in memory.
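
To make the in-memory variant concrete, here is a minimal sketch of a per-process cache that loads a dataset once and serves it to every chart function. It is not Dash-specific and uses plain lists of dicts instead of a pandas dataframe so that it runs with the standard library only. `DatasetCache`, `slow_load` and `chart_total_by_region` are hypothetical names, and the sample rows are made up for illustration. The sketch assumes the cached data is only read, never mutated, and that each user's filters arrive as function arguments rather than being stored on the server.

```python
import threading
import time


class DatasetCache:
    """Keep expensive query results in worker memory, keyed by query parameters."""

    def __init__(self, loader, ttl_seconds=3600.0):
        self._loader = loader        # function that runs the slow query (hypothetical)
        self._ttl = ttl_seconds      # entries older than this are reloaded
        self._lock = threading.Lock()
        self._entries = {}           # key -> (loaded_at, data)

    def get(self, key):
        # Holding the lock while loading means concurrent requests wait for
        # one load instead of each starting their own slow query.
        with self._lock:
            now = time.monotonic()
            entry = self._entries.get(key)
            if entry is not None and now - entry[0] < self._ttl:
                return entry[1]
            data = self._loader(key)
            self._entries[key] = (now, data)
            return data


load_calls = []  # records how often the slow loader actually runs


def slow_load(key):
    # Stand-in for the DWH query; returns made-up rows.
    load_calls.append(key)
    return [{"region": "EU", "sales": 10}, {"region": "US", "sales": 7}]


cache = DatasetCache(slow_load)


def chart_total_by_region(key, region):
    # Each chart reads the shared rows and applies its own filter.
    rows = cache.get(key)
    return sum(r["sales"] for r in rows if r["region"] == region)
```

With several worker processes, each process would hold its own copy of the data, so memory use grows with the number of workers; whether that trade-off is acceptable depends on the dataset size.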

Use-case #2: AB testing tool:
I want to query the relevant data to analyse an experiment (AB test) - the inputs are: time range, experiment ID, relevant user segment and platform. The query takes a few minutes.
Then I want to run “stats” on it and output a bunch of charts. All dash examples I’ve seen treat charts as independent of each other and dependent only on Inputs. So unless I do something clever, each chart will run its own query and its own stats, even though the query should happen only once and one stats model might feed multiple charts.
This is, of course, super easy to express in a notebook, but quite hard to implement in dash in a way that re-uses the query and the stats. (That said, jupyter dashboards are very immature and I find them hard to recommend to everyone.)
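
One possible way to share the query and the stats between charts is to memoize both steps on their inputs, so that every chart calls the same cached function. The sketch below uses `functools.lru_cache` from the standard library; `run_query`, `compute_stats`, `chart_means` and `chart_lift` are hypothetical names, and the rows are made-up sample data rather than real experiment results. It assumes all inputs are hashable (the time range is passed as a tuple).

```python
from functools import lru_cache
from statistics import mean

query_calls = []  # records how often the slow query actually runs


@lru_cache(maxsize=32)
def run_query(timerange, experiment_id, segment, platform):
    # Stand-in for the slow DWH query; returns made-up (group, value) rows.
    query_calls.append(experiment_id)
    return (("control", 1.0), ("control", 3.0),
            ("variant", 2.0), ("variant", 6.0))


@lru_cache(maxsize=32)
def compute_stats(timerange, experiment_id, segment, platform):
    # One stats pass whose result feeds several charts.
    rows = run_query(timerange, experiment_id, segment, platform)
    groups = {}
    for group, value in rows:
        groups.setdefault(group, []).append(value)
    means = {group: mean(values) for group, values in groups.items()}
    lift = means["variant"] / means["control"] - 1
    return {"means": means, "lift": lift}


def chart_means(*inputs):
    # First chart: per-group means.
    return compute_stats(*inputs)["means"]


def chart_lift(*inputs):
    # Second chart: relative lift, reusing the same cached stats.
    return compute_stats(*inputs)["lift"]


inputs = (("2017-06-01", "2017-06-30"), "exp-42", "new_users", "ios")
means = chart_means(*inputs)
lift = chart_lift(*inputs)
```

Both chart functions trigger a single query here. Note that `lru_cache` lives inside one process, returns the same (mutable) result object to every caller, and does not stop two simultaneous first calls from both running the query, so a multi-worker deployment would still need a shared cache of some kind.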

Don’t get me wrong - dash is a breakthrough. I believe it may become the go-to tool for python devs (non-web ones) who want to build a simple website quickly. But is it really shiny for python, in the sense that I can tell my die-hard R fans to stop using shiny, since we are 100% python now?
