Working on large datasets -- comparison with shiny

chriddyp · October 11, 2017, 2:08am

Some background: In order for this to work across multiple Python processes, we need to store the data somewhere that is accessible to each of the processes. There are 3 places to store this data:
1 - On the disk (e.g. in a file or in a new database)
2 - In a shared memory space like with Redis
3 - In the user’s browser session

For 1 and 2:

- Easiest to implement with tools like Flask-Caching, see the "Performance" chapter of the Dash for Python documentation.
- Data has to be serialized out of Python data structures into simpler data structures like strings and numbers (or just JSON) for storage.
- Data that is cached in this way will be available for every future session.
  - If you open up the app in a new browser window (or if a different viewer opens up the app), the app’s callbacks may retrieve the data from the cache instead of computing it fresh.
- Since data is stored for all sessions, you could run into memory issues (if using e.g. Redis and not the file system) if you’re storing the output for every single set of parameters that you cache.
  - As such, you need to balance what you cache and what you compute. If querying the raw data is slow (e.g. from SQL), then you could cache the results of the query once and then perform fast computations (e.g. pandas aggregations) on top of that.
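
The "cache the slow query, compute the rest" idea can be sketched with the standard library alone. This is not Flask-Caching's API, just an illustration of what a disk cache (option 1) does for you: the result is serialized to JSON, keyed by the query parameters, and any process that can see the directory can read it back. `CACHE_DIR`, `slow_query`, `cached_query` and `sales_by_fruit` are hypothetical names made up for this sketch.

```python
import hashlib
import json
import os
import tempfile
import time

# Hypothetical cache location: any directory visible to every worker process.
CACHE_DIR = os.path.join(tempfile.gettempdir(), "dash_query_cache_sketch")


def slow_query(region):
    """Stand-in for an expensive SQL query (hypothetical data)."""
    time.sleep(0.1)
    return [
        {"region": region, "fruit": "apples", "sales": 10},
        {"region": region, "fruit": "oranges", "sales": 7},
        {"region": region, "fruit": "apples", "sales": 5},
    ]


def cached_query(region):
    """Return the query result, reading it from disk if it was cached before."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    # The cache key is derived from the query parameters.
    key = hashlib.sha256(json.dumps({"region": region}).encode()).hexdigest()
    path = os.path.join(CACHE_DIR, key + ".json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    rows = slow_query(region)
    # Write to a temporary file and rename it, so another process
    # never reads a half-written cache entry.
    tmp_path = path + ".tmp." + str(os.getpid())
    with open(tmp_path, "w") as f:
        json.dump(rows, f)
    os.replace(tmp_path, path)
    return rows


def sales_by_fruit(region):
    """Cheap aggregation, computed fresh on top of the cached rows."""
    totals = {}
    for row in cached_query(region):
        totals[row["fruit"]] = totals.get(row["fruit"], 0) + row["sales"]
    return totals
```

Only the raw query result is stored, so the cache grows with the number of distinct queries rather than with every combination of aggregation parameters. A real app would also want expiry, which Flask-Caching handles through its timeout setting.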

For 3:

- Implemented by saving the data as part of Dash’s front-end store through the methods explained in the topic "Sharing a dataframe between plots".
- Data has to be converted to a string like JSON for storage and transport.
- Data that is cached in this way will only be available in the user’s current session.
  - If you open up a new browser, the app’s callbacks will always compute the data. The data is only cached and transported between callbacks within the session.
  - As such, unlike 1 and 2, this method doesn’t increase the memory footprint of the app.
  - There could be a cost in network transport. If you’re sharing 10MB of data between callbacks, then that data will be transported over the network between each callback.
  - If the network cost is too high, then compute the aggregations upfront and transport those. Your app likely won’t be displaying 10MB of data, it will just be displaying a subset or an aggregation of it.

Reading and writing JSON isn’t that expensive but sending it over the network could be. To get around this, the last point in the outline above is an option:

If the network cost is too high, then compute the aggregations upfront and transport those. Your app likely won’t be displaying 10MB of data, it will just be displaying a subset or an aggregation of it.

For example:

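import json
from io import StringIO

import pandas as pd
from dash.dependencies import Input, Output

# assumes `app`, its layout and the helper functions are defined elsewhere

@app.callback(Output('intermediate-value', 'children'), [Input('dropdown', 'value')])
def clean_data(value):
    # an expensive query step
    cleaned_df = your_expensive_clean_or_compute_step(value)

    # a few filter steps that compute the data
    # as it's needed in the future callbacks
    # ('fruit' is an assumed column name, use whichever column you filter on)
    df_1 = cleaned_df[cleaned_df['fruit'] == 'apples']
    df_2 = cleaned_df[cleaned_df['fruit'] == 'oranges']
    df_3 = cleaned_df[cleaned_df['fruit'] == 'figs']

    # the keys have to be strings, and the dict is dumped to a single
    # JSON string so that it can be stored in the 'children' property
    datasets = {
        'df_1': df_1.to_json(orient='split'),
        'df_2': df_2.to_json(orient='split'),
        'df_3': df_3.to_json(orient='split'),
    }
    return json.dumps(datasets)

# each callback needs its own output, so the graph ids below
# ('graph-1', 'graph-2', 'graph-3') are assumed to exist in the layout

@app.callback(
    Output('graph-1', 'figure'),
    [Input('intermediate-value', 'children')])
def update_graph_1(jsonified_cleaned_data):
    datasets = json.loads(jsonified_cleaned_data)
    dff = pd.read_json(StringIO(datasets['df_1']), orient='split')
    figure = create_figure_1(dff)
    return figure

@app.callback(
    Output('graph-2', 'figure'),
    [Input('intermediate-value', 'children')])
def update_graph_2(jsonified_cleaned_data):
    datasets = json.loads(jsonified_cleaned_data)
    dff = pd.read_json(StringIO(datasets['df_2']), orient='split')
    figure = create_figure_2(dff)
    return figure

@app.callback(
    Output('graph-3', 'figure'),
    [Input('intermediate-value', 'children')])
def update_graph_3(jsonified_cleaned_data):
    datasets = json.loads(jsonified_cleaned_data)
    dff = pd.read_json(StringIO(datasets['df_3']), orient='split')
    figure = create_figure_3(dff)
    return figure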

Your mileage will vary depending on your aggregations and your UI. You could end up reducing a 5M-row dataframe into 3 bar graphs with 100 points each, in which case the transport costs will be really low.
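
To get a rough feel for the difference, here is a small sketch that compares the size of the JSON for some raw rows with the JSON for an aggregation of them. The data is made up (100,000 random rows, not 5M), and the names `rows`, `totals`, `raw_bytes` and `aggregated_bytes` are just for illustration. It uses plain Python rather than pandas, so it runs without any dependencies.

```python
import json
import random
from collections import defaultdict

random.seed(0)
FRUITS = ["apples", "oranges", "figs"]

# Hypothetical raw data: 100,000 rows of (fruit, sales).
rows = [
    {"fruit": random.choice(FRUITS), "sales": random.randint(1, 100)}
    for _ in range(100_000)
]

# The aggregation that a bar chart would actually display.
totals = defaultdict(int)
for row in rows:
    totals[row["fruit"]] += row["sales"]

# Size of what would travel over the network in each case.
raw_bytes = len(json.dumps(rows).encode("utf-8"))
aggregated_bytes = len(json.dumps(totals).encode("utf-8"))

print(f"raw rows:   {raw_bytes:>10,} bytes")
print(f"aggregated: {aggregated_bytes:>10,} bytes")
```

The raw payload scales with the number of rows, while the aggregated one only scales with the number of bars, which is why doing the aggregation before the data leaves the first callback pays off.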
