Working on large datasets -- comparison with Shiny

Some background: in order for this to work across multiple Python processes, we need to store the data somewhere that is accessible to each of the processes. There are three places to store this data:
1 - On the disk (e.g. on a file or on a new database)
2 - In a shared memory space like with Redis
3 - In the user’s browser session

For 1 and 2:

  • Easiest to implement with tools like Flask-Caching; see Performance | Dash for Python Documentation | Plotly
  • Data has to be serialized out of Python data structures into simpler formats like strings and numbers (or just JSON) for storage
  • Data that is cached in this way will be available for every future session.
    • If you open up the app in a new browser window (or if a different viewer opens up the app), the app’s callbacks may retrieve the data from the cache instead of computing it fresh.
  • Since data is stored for all sessions, you could run into memory issues (if using e.g. Redis rather than the file system) if you're storing the output for every possible set of parameters in your cache.
    • As such, you need to balance what you cache and what you compute. If querying the raw data is slow (e.g. from SQL), then you could cache the result of the query once and perform fast computations (e.g. pandas aggregations) on top of that; a minimal sketch of this pattern follows this list.
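As a rough sketch of options 1 and 2, the setup below uses Flask-Caching with a filesystem backend (point it at Redis instead for option 2). This assumes Dash 2.x imports; run_expensive_query, value, and the category column are hypothetical placeholders, not anything from the example above:

from dash import Dash
from flask_caching import Cache

app = Dash(__name__)

# option 1: cache on disk; for option 2, configure a Redis backend instead, e.g.
# {'CACHE_TYPE': 'RedisCache', 'CACHE_REDIS_URL': 'redis://localhost:6379'}
cache = Cache(app.server, config={
    'CACHE_TYPE': 'FileSystemCache',   # 'filesystem' on older Flask-Caching releases
    'CACHE_DIR': 'cache-directory',
})

@cache.memoize(timeout=600)  # memoized per value, shared across sessions and processes
def global_store(value):
    # run_expensive_query is a hypothetical slow step (e.g. a SQL query);
    # Flask-Caching serializes the returned dataframe into the configured store
    return run_expensive_query(value)

def fast_view(value):
    # callbacks call global_store and do cheap pandas work on the cached result
    df = global_store(value)
    return df.groupby('category', as_index=False).sum(numeric_only=True)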

For 3:

  • Implemented by saving the data as part of Dash’s front-end store through methods explained in Sharing a dataframe between plots
  • Data has to be converted to a string like JSON for storage and transport
  • Data that is cached in this way will only be available in the user’s current session.
    • If you open up a new browser, the app’s callbacks will always compute the data. The data is only cached and transported between callbacks within the session.
    • As such, unlike 1 and 2, this method doesn't increase the memory footprint of the app.
    • There could be a cost in network transport: if you're sharing 10MB of data between callbacks, that data will be transported over the network between each callback.
    • If the network cost is too high, then compute the aggregations upfront and transport those. Your app likely won’t be displaying 10MB of data, it will just be displaying a subset or an aggregation of it.

Reading and writing JSON isn't that expensive, but sending it over the network could be. To get around this, the last point in the outline above is the option to reach for: compute the aggregations upfront and transport only those, since your app likely won't be displaying all 10MB of data, just a subset or an aggregation of it.

For example:

import json

import pandas as pd
from dash.dependencies import Input, Output


@app.callback(Output('intermediate-value', 'children'), [Input('dropdown', 'value')])
def clean_data(value):
    # an expensive query or compute step
    cleaned_df = your_expensive_clean_or_compute_step(value)

    # a few filter steps that compute the data
    # as it's needed in the future callbacks
    # ('fruit' is an illustrative column name)
    df_1 = cleaned_df[cleaned_df['fruit'] == 'apples']
    df_2 = cleaned_df[cleaned_df['fruit'] == 'oranges']
    df_3 = cleaned_df[cleaned_df['fruit'] == 'figs']

    # keys must be strings, and the whole payload must be a single
    # JSON string because the 'children' prop stores text
    return json.dumps({
        'df_1': df_1.to_json(orient='split'),
        'df_2': df_2.to_json(orient='split'),
        'df_3': df_3.to_json(orient='split'),
    })

@app.callback(
    Output('graph-1', 'figure'),   # each callback must target its own graph id
    [Input('intermediate-value', 'children')])
def update_graph_1(jsonified_cleaned_data):
    # parse the shared JSON payload, then load just the slice this graph needs
    datasets = json.loads(jsonified_cleaned_data)
    dff = pd.read_json(datasets['df_1'], orient='split')
    figure = create_figure_1(dff)
    return figure

@app.callback(
    Output('graph-2', 'figure'),
    [Input('intermediate-value', 'children')])
def update_graph_2(jsonified_cleaned_data):
    datasets = json.loads(jsonified_cleaned_data)
    dff = pd.read_json(datasets['df_2'], orient='split')
    figure = create_figure_2(dff)
    return figure

@app.callback(
    Output('graph-3', 'figure'),
    [Input('intermediate-value', 'children')])
def update_graph_3(jsonified_cleaned_data):
    datasets = json.loads(jsonified_cleaned_data)
    dff = pd.read_json(datasets['df_3'], orient='split')
    figure = create_figure_3(dff)
    return figure
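
For this to work, the layout needs components whose ids match the callbacks above. A minimal sketch, assuming Dash 2.x imports and the same hypothetical fruit values (the hidden div is the front-end store that holds the jsonified data):

from dash import dcc, html

app.layout = html.Div([
    dcc.Dropdown(
        id='dropdown',
        options=[{'label': f, 'value': f} for f in ['apples', 'oranges', 'figs']],
        value='apples',
    ),
    dcc.Graph(id='graph-1'),
    dcc.Graph(id='graph-2'),
    dcc.Graph(id='graph-3'),
    # hidden div holding the intermediate JSON; it never renders in the UI
    html.Div(id='intermediate-value', style={'display': 'none'}),
])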

Your mileage will vary depending on your aggregations and your UI. You could end up reducing a 5M-row dataframe into three bar graphs with 100 points each, in which case the transport cost will be really low.
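
If the filtered frames are still big, one option is to aggregate inside clean_data before serializing, so only the aggregated rows cross the network. A sketch, with hypothetical fruit, region, and sales columns:

# aggregate before to_json so only ~100 rows per graph are transported
df_1 = (cleaned_df[cleaned_df['fruit'] == 'apples']
        .groupby('region', as_index=False)['sales'].sum())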
