
Is a classical stateless architecture suitable for data science products?

All data science products I’ve built were for internal use, so my perspective is biased.

Our data-apps typically have:

  • few concurrent users (a popular data product will get 10 users per day, with 2-3 concurrent)
  • slow initialisation or slow computations:
    • ui elements (forms), calculate/run button
    • loading the data from a data lake
    • doing stuff to data (could be as heavy as fitting MCMC)
    • data exploration, once data is computed
  • written by data scientists, who are not web developers

Therefore other Shiny-inspired Python solutions work quite well: Jupyter dashboards and bokeh-server. There is a server session for each client, and writing a server is very easy for the data scientists on my team: you can just copy-paste code from a notebook, then add boilerplate and UI callbacks.

After the initial excitement with Dash (it is super cool to build a web app 100% in Python!), I am concerned that the session-less architecture in Dash is not well suited for data science products that have a heavy computation layer (fitting a model, running simulations, accessing data lakes with Spark/MR jobs) and several outputs (charts).

If the goal of Dash is to be Shiny for Python, I am not sure the underlying architecture is suitable. While caching in Redis/RDS/on disk is possible, I am concerned that it would be too much for a regular data scientist who just wants to copy-paste code from a Jupyter notebook.
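To be concrete about the kind of caching I mean: the simplest per-process version is a one-line memoization (a pure-Python sketch with hypothetical function names, not Dash API), so that all callbacks triggered by the same inputs share one expensive computation:

```python
from functools import lru_cache

@lru_cache(maxsize=32)
def load_and_prepare(experiment_id: str, segment: str):
    # Hypothetical expensive step: in a real app this would query the
    # data lake / DWH and do heavy preprocessing (minutes, not ms).
    return {"experiment": experiment_id, "segment": segment, "rows": 100_000}

# Every callback that needs the data calls the same memoized function;
# only the first call per (experiment_id, segment) pair pays the cost.
first = load_and_prepare("exp-42", "mobile")
second = load_and_prepare("exp-42", "mobile")
assert first is second  # the second call is served from the in-process cache
```

The obvious caveat is that this cache lives inside one worker process; with several web workers you are back to Redis or disk.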

Are there any plans to add server sessions, to mimic the workflow that is well understood by data scientists and to speed up interactions? What might be a workaround for adding a “server session” in the current architecture?


Could you expand more about the requirements of these applications? What is it about “server sessions” that makes writing these applications easier? (Not arguing against you here, just trying to better understand the issues at hand!).

Use-case #1: a cross-filter dashboard (inspired by dc.js: http://dc-js.github.io/dc.js/examples/filtering.html)
At the moment I am building a crossfilter-like dashboard with the ability to slice data by multiple dimensions (I have tried up to 9, but I have many more in the DWH). It reads data from the DWH (the query and transfer take a couple of minutes) and lets you explore the data interactively. The queried dataframe can have 100k+ rows, but I’d like to push it to a few million (the reason I don’t want to use the browser-based dc.js).
As a first iteration, I cached the data to a pickle on disk. I have 9 charts, and each UI change triggers an update to all 9 of them.
This is, unsurprisingly, already quite slow: it takes about 120+ ms to read the pickle and 10 ms to do the filtering/grouping in pandas.
For fun, I tried the same thing with a single Flask worker, keeping the pandas dataframe in memory instead of reading it from pickle in every callback. It is, of course, much faster (each chart takes 10 ms to update instead of 130), but it will break as soon as there are multiple users.
As for the “proper way”: I got the best results with an in-memory MySQL table (where I can also do the aggregations), but this complicated the code significantly, and I don’t expect my data scientists to do that. It also takes about 70 ms per chart, so 7 times slower than keeping the data in process memory.
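The single-worker experiment can be sketched in plain Python, with no Dash specifics (the key and the `pickle.loads` call stand in for my real cache key and deserialization):

```python
import pickle

_CACHE = {}  # module-level store, shared by all callbacks in this worker

def get_dataframe(key: str, blob: bytes):
    """Deserialize once, then serve from process memory (the ~10 ms path).

    Re-running pickle.loads on every callback is the ~130 ms path
    measured above; keeping the object resident skips it entirely.
    """
    if key not in _CACHE:
        _CACHE[key] = pickle.loads(blob)
    return _CACHE[key]

blob = pickle.dumps(list(range(1000)))  # stand-in for the queried dataframe
a = get_dataframe("dwh-query-1", blob)
b = get_dataframe("dwh-query-1", blob)
assert a is b  # the second access never touched pickle
```

This is exactly the state that breaks once a second worker process (or machine) is added, which is why I keep coming back to server sessions.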

Use-case #2: an A/B testing tool:
I want to query the relevant data to analyse an experiment (A/B test); the inputs are: time range, experiment ID, relevant user segment and platform. Querying takes a few minutes.
Then I want to run the stats on it and output a bunch of charts. All Dash examples I’ve seen treat the charts as independent of each other and dependent only on the Inputs. So unless I do something clever, each chart will do its own querying and its own stats, even though the query should happen only once and one stats model might feed multiple charts.
It is, of course, super easy to express in a notebook, but quite hard to implement in Dash so that the querying and stats are reused. (Jupyter dashboards would handle it, but they are very immature and I find them hard to recommend to everyone.)
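For contrast, the notebook version of use-case #2 is one linear pipeline: query once, fit once, render many charts from the shared result (all names and numbers here are hypothetical stand-ins):

```python
def query_experiment(experiment_id):
    # stand-in for the minutes-long DWH query
    return [("A", 0.12), ("B", 0.15)]

def fit_stats(rows):
    # stand-in for the single stats model (could be as heavy as an MCMC fit)
    return {variant: rate for variant, rate in rows}

def conversion_chart(stats):
    return f"bar chart of {sorted(stats)}"

def uplift_chart(stats):
    return f"uplift of B over A: {stats['B'] - stats['A']:.2f}"

rows = query_experiment("exp-42")   # runs exactly once
stats = fit_stats(rows)             # runs exactly once
charts = [conversion_chart(stats), uplift_chart(stats)]
```

In a naive Dash translation, each chart callback would re-run `query_experiment` and `fit_stats` on every input change, which is the duplication I am trying to avoid.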

Don’t get me wrong: Dash is a breakthrough. I believe it may become the go-to tool for Python developers (non-web ones) who want to build a simple website quickly. But is it really Shiny for Python, in the sense that I can tell my die-hard R fans to stop using Shiny because we are 100% Python now?


This was really helpful, thank you! Let me think through these a little bit more and try to come up with some examples. I think that we can get Dash to work well for Use Case 1. Use case 2 (sharing output results) is not well suited for Dash without some type of caching (which is essentially server state).

Thanks Chris.
At the moment, I am trying to control the execution graph by placing dummy, hidden “gate” UI controls: a change to one of my input elements triggers an update of the hidden gate, and the change to the hidden gate (combined with the states of the input elements) then triggers updates for the other charts. This way I ensure the expensive query happens only once and is cached. But I will still need to solve the problem of a fast cache.
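A minimal sketch of that gate pattern, with no Dash imports (in the real app the JSON string lives in a hidden element’s children and each function below is a callback; names and payloads are hypothetical):

```python
import json

def update_gate(experiment_id):
    # Fires once per input change: the expensive query happens here,
    # and its result is serialized into the hidden gate element.
    rows = [{"variant": "A", "n": 1000}, {"variant": "B", "n": 1024}]
    return json.dumps(rows)

def update_chart_1(gate_json):
    rows = json.loads(gate_json)  # cheap deserialization, no re-query
    return sum(r["n"] for r in rows)

def update_chart_2(gate_json):
    rows = json.loads(gate_json)
    return [r["variant"] for r in rows]

gate = update_gate("exp-42")          # expensive step runs exactly once
total = update_chart_1(gate)          # 1000 + 1024 = 2024
variants = update_chart_2(gate)       # ["A", "B"]
```

Serializing the intermediate result through JSON works for small payloads, but for 100k+ row dataframes it brings back exactly the fast-cache problem from use-case #1.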