
Is a classical stateless architecture suitable for data science products?

All data science products I’ve built were for internal use, so my perspective is biased.

Our data-apps typically have:

  • few concurrent users (a popular data product will get 10 users per day, with 2-3 concurrent)
  • slow initialisation or slow computations:
    • ui elements (forms), calculate/run button
    • loading the data from a data lake
    • doing stuff to data (could be as heavy as fitting MCMC)
    • data exploration, once data is computed
  • written by data scientists, who are not web developers

Therefore other Shiny-inspired Python solutions work quite well: Jupyter dashboards and bokeh-server. There is a server session for each client, and writing a server is very easy for the data scientists on my team: you can just copy-paste code from a notebook, then add boilerplate and UI callbacks.

After the initial excitement with Dash (it is super cool to build a web app 100% in Python!), I am concerned that the session-less architecture in Dash is not well suited for data science products that have a heavy computation layer (fitting a model, running simulations, accessing data lakes with Spark/MR jobs) and several outputs (charts).

If the goal of Dash is to be Shiny for Python, I am not sure the underlying architecture is suitable. While caching in Redis/RDS/on disk is possible, I am concerned that it would be too much for a regular data scientist who just wants to copy-paste code from a Jupyter notebook.
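To be concrete about the kind of caching I mean: the simplest per-process version is a one-line memoization (a pure-Python sketch with hypothetical function names, not Dash API), so that all callbacks triggered by the same inputs share one expensive computation:

```python
from functools import lru_cache

@lru_cache(maxsize=32)
def load_and_prepare(experiment_id: str, segment: str):
    # Hypothetical expensive step: in a real app this would query the
    # data lake / DWH and do heavy preprocessing (minutes, not ms).
    return {"experiment": experiment_id, "segment": segment, "rows": 100_000}

# Every callback that needs the data calls the same memoized function;
# only the first call per (experiment_id, segment) pair pays the cost.
first = load_and_prepare("exp-42", "mobile")
second = load_and_prepare("exp-42", "mobile")
assert first is second  # the second call is served from the in-process cache
```

The obvious caveat is that this cache lives inside one worker process; with several web workers you are back to Redis or disk.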

Are there any plans to add server sessions, to mimic the workflow that is well understood by data scientists and to speed up interactions? What might be a workaround for adding a “server session” in the current architecture?


Could you expand more about the requirements of these applications? What is it about “server sessions” that makes writing these applications easier? (Not arguing against you here, just trying to better understand the issues at hand!).

Use-case #1: a cross-filter dashboard (inspired by dc.js: http://dc-js.github.io/dc.js/examples/filtering.html)
At the moment I am building a crossfilter-like dashboard with the ability to slice data by multiple dimensions (I have tried up to 9, but I have many more in the DWH). It reads data from the DWH (the query and transfer take a couple of minutes) and lets you explore the data interactively. The queried dataframe can have 100k+ rows, but I’d like to push it to a few million (the reason I don’t want to use the browser-based dc.js).
As a first iteration, I cached the data to a pickle on disk. I have 9 charts, and each UI change triggers an update to all 9 of them.
This is, unsurprisingly, already quite slow: it takes about 120+ ms to read the pickle and 10 ms to do the filtering/grouping in pandas.
For fun, I tried the same thing with a single Flask worker, keeping the pandas dataframe in memory instead of reading it from pickle in every callback. It is, of course, much faster (each chart takes 10 ms to update instead of 130), but it will break as soon as there are multiple users.
As for the “proper way”: I got the best results with an in-memory MySQL table (where I can also do the aggregations), but this complicated the code significantly, and I don’t expect my data scientists to do that. It also takes about 70 ms per chart, so 7 times slower than keeping the data in process memory.
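The single-worker experiment can be sketched in plain Python, with no Dash specifics (the key and the `pickle.loads` call stand in for my real cache key and deserialization):

```python
import pickle

_CACHE = {}  # module-level store, shared by all callbacks in this worker

def get_dataframe(key: str, blob: bytes):
    """Deserialize once, then serve from process memory (the ~10 ms path).

    Re-running pickle.loads on every callback is the ~130 ms path
    measured above; keeping the object resident skips it entirely.
    """
    if key not in _CACHE:
        _CACHE[key] = pickle.loads(blob)
    return _CACHE[key]

blob = pickle.dumps(list(range(1000)))  # stand-in for the queried dataframe
a = get_dataframe("dwh-query-1", blob)
b = get_dataframe("dwh-query-1", blob)
assert a is b  # the second access never touched pickle
```

This is exactly the state that breaks once a second worker process (or machine) is added, which is why I keep coming back to server sessions.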

Use-case #2: an A/B testing tool:
I want to query the relevant data to analyse an experiment (A/B test); the inputs are: time range, experiment ID, relevant user segment and platform. Querying takes a few minutes.
Then I want to run the stats on it and output a bunch of charts. All Dash examples I’ve seen treat the charts as independent of each other and dependent only on the Inputs. So unless I do something clever, each chart will do its own querying and its own stats, even though the query should happen only once and one stats model might feed multiple charts.
It is, of course, super easy to express in a notebook, but quite hard to implement in Dash so that the querying and stats are reused. (Jupyter dashboards would handle it, but they are very immature and I find them hard to recommend to everyone.)
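For contrast, the notebook version of use-case #2 is one linear pipeline: query once, fit once, render many charts from the shared result (all names and numbers here are hypothetical stand-ins):

```python
def query_experiment(experiment_id):
    # stand-in for the minutes-long DWH query
    return [("A", 0.12), ("B", 0.15)]

def fit_stats(rows):
    # stand-in for the single stats model (could be as heavy as an MCMC fit)
    return {variant: rate for variant, rate in rows}

def conversion_chart(stats):
    return f"bar chart of {sorted(stats)}"

def uplift_chart(stats):
    return f"uplift of B over A: {stats['B'] - stats['A']:.2f}"

rows = query_experiment("exp-42")   # runs exactly once
stats = fit_stats(rows)             # runs exactly once
charts = [conversion_chart(stats), uplift_chart(stats)]
```

In a naive Dash translation, each chart callback would re-run `query_experiment` and `fit_stats` on every input change, which is the duplication I am trying to avoid.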

Don’t get me wrong: Dash is a breakthrough. I believe it may become the go-to tool for Python developers (non-web ones) who want to build a simple website quickly. But is it really Shiny for Python, in the sense that I can tell my die-hard R fans to stop using Shiny because we are 100% Python now?


This was really helpful, thank you! Let me think through these a little bit more and try to come up with some examples. I think that we can get Dash to work well for Use Case 1. Use case 2 (sharing output results) is not well suited for Dash without some type of caching (which is essentially server state).

Thanks Chris.
At the moment, I am trying to control the execution graph by placing dummy, hidden “gate” UI controls: a change to one of my input elements triggers an update of the hidden gate, and the change to the hidden gate (combined with the states of the input elements) then triggers updates for the other charts. This way I ensure the expensive query happens only once and is cached. But I will still need to solve the problem of a fast cache.
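A minimal sketch of that gate pattern, with no Dash imports (in the real app the JSON string lives in a hidden element’s children and each function below is a callback; names and payloads are hypothetical):

```python
import json

def update_gate(experiment_id):
    # Fires once per input change: the expensive query happens here,
    # and its result is serialized into the hidden gate element.
    rows = [{"variant": "A", "n": 1000}, {"variant": "B", "n": 1024}]
    return json.dumps(rows)

def update_chart_1(gate_json):
    rows = json.loads(gate_json)  # cheap deserialization, no re-query
    return sum(r["n"] for r in rows)

def update_chart_2(gate_json):
    rows = json.loads(gate_json)
    return [r["variant"] for r in rows]

gate = update_gate("exp-42")          # expensive step runs exactly once
total = update_chart_1(gate)          # 1000 + 1024 = 2024
variants = update_chart_2(gate)       # ["A", "B"]
```

Serializing the intermediate result through JSON works for small payloads, but for 100k+ row dataframes it brings back exactly the fast-cache problem from use-case #1.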