Storing large datasets in dcc.Store

Hello,

I’m having trouble using large datasets in Dash. My app workflow is as follows:

  1. User inputs a filename for a dataset
  2. The data is loaded and stored in dcc.Store. An image of summary statistics for the data is then calculated and displayed
  3. When the user clicks on a specific portion of the summary image, “local” summary statistics are re-calculated (which require the original dataset), and new figures are displayed in a neighboring plot

The files that get loaded are generally on the order of gigabytes. The issue is that when (3) occurs, retrieving the stored data takes too long and hurts the interactivity of the app (it takes >10 seconds for the app to retrieve the stored data, calculate the local summary statistics, and replot). Pre-calculating the local summary statistics in (3) is out of the question because there are far too many places a user could click, and the user-selected field of view in the summary image from (2) is also taken into account. In short, the calculations in (3) must be done on the fly and cannot be pre-stored.
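
For reference, here is a stripped-down sketch of the pattern I’m describing (the component ids, the CSV loader, and the plotting calls are simplified placeholders, not my real code):

import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html, Input, Output, State

app = Dash(__name__)
app.layout = html.Div([
    dcc.Input(id="filename", type="text", debounce=True),   # (1) user enters a file path
    dcc.Store(id="raw-data"),                                # (2) full dataset lives here
    dcc.Graph(id="summary-image"),                           # (2) summary image
    dcc.Graph(id="local-stats"),                             # (3) local summary statistics
])

@app.callback(Output("raw-data", "data"), Output("summary-image", "figure"),
              Input("filename", "value"), prevent_initial_call=True)
def load_data(path):
    df = pd.read_csv(path)                                   # real files are gigabytes
    return df.to_dict("records"), px.density_heatmap(df, x=df.columns[0], y=df.columns[1])

@app.callback(Output("local-stats", "figure"),
              Input("summary-image", "clickData"), State("raw-data", "data"),
              prevent_initial_call=True)
def local_stats(click_data, data):
    df = pd.DataFrame(data)                                  # (3) this retrieval is the slow step
    return px.histogram(df, x=df.columns[0])                 # stand-in for the local statistics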

I’ve narrowed the bottleneck down to the retrieval of the data from dcc.Store: when I hard-code the file upload and load the data into global variables, the speed of (3) is not an issue.
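
For comparison, the fast hard-coded global-variable version looks roughly like this (the path is just a placeholder):

DF = pd.read_csv("/path/to/dataset.csv")        # loaded once, kept as a module-level global

@app.callback(Output("local-stats", "figure"),
              Input("summary-image", "clickData"), prevent_initial_call=True)
def local_stats(click_data):
    return px.histogram(DF, x=DF.columns[0])    # no store retrieval, so step (3) stays fast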

I’ve tried server side caching and have also used the dash_extensions.enrich module so that the data in dcc.Store do not have to be serialized to JSON; I store them directly as dictionaries. However, it seems that the dash_extensions.callback module from the server side caching solution and the dash_extensions.enrich module (which allows dictionary storage) are not part of the same dash_extensions version, so they cannot be used simultaneously. Moreover, I am still not completely sure whether the server side caching solution will even solve my issue, since the aforementioned incompatibility has kept me from implementing it fully.

Any suggestions or feedback would be greatly appreciated. Thank you so much in advance!

In recent versions of dash-extensions, all enrichment functionality, including server side caching, has been moved to the enrich module. You can find an example using the current syntax in the examples folder.
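
In outline, it looks something like this (a minimal sketch, assuming Dash 2.x and a dash-extensions release where ServersideOutput and ServersideOutputTransform are exposed from dash_extensions.enrich; details vary between versions):

import pandas as pd
from dash import dcc, html
from dash_extensions.enrich import DashProxy, Input, Output, ServersideOutput, ServersideOutputTransform

app = DashProxy(transforms=[ServersideOutputTransform()])
app.layout = html.Div([html.Button("Load", id="btn"), dcc.Store(id="store"), html.Div(id="out")])

@app.callback(ServersideOutput("store", "data"), Input("btn", "n_clicks"), prevent_initial_call=True)
def load(_):
    # the returned DataFrame is cached on the server; only a small reference goes to the browser
    return pd.DataFrame({"value": range(1_000_000)})

@app.callback(Output("out", "children"), Input("store", "data"))
def show(df):
    # the stored object comes back as the original DataFrame, with no JSON round trip
    return f"{len(df)} rows"

if __name__ == "__main__":
    app.run_server()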

If you are already using non-JSON serialization, that means that the server side caching is working as intended. The default storage used is your local disk. For GB-sized data that might become a little slow, depending on your disk speed. Besides getting a faster disk (say, a PCIe NVMe SSD if you don’t have one already), an alternative could be to switch to in-memory storage using e.g. a Redis server. Here is a small example:
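
(A sketch, assuming a dash-extensions version where RedisStore is exposed from dash_extensions.enrich and a Redis server is running locally on the default port.)

from dash_extensions.enrich import DashProxy, ServersideOutputTransform, RedisStore

# keep server-side store payloads in Redis (in memory) instead of on the local disk;
# RedisStore() is assumed to connect to a local Redis instance on the default port
app = DashProxy(transforms=[ServersideOutputTransform(backend=RedisStore())])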

While this solution is faster, it requires you to set up a Redis instance locally, and depending on the amount of data, you might run out of memory (RAM).

Hi! I have been trying to store a large dataset (around 9 lakh, i.e. roughly 900,000, data points) in dcc.Store in my multi-page app. I tried using ServersideOutput to store the data and then generate a data summary from it.

Here’s what my app instantiation looks like:

import logging
import dash
import dash_bootstrap_components as dbc
import pandas as pd
import utils.path_config as path_config
from dash import dcc, html
from dash.exceptions import PreventUpdate
from dash_extensions.enrich import (Output, DashProxy, Input, State, Dash, MultiplexerTransform,
                                    ServersideOutput, ServersideOutputTransform)

external_stylesheets = [dbc.themes.DARKLY]
app = DashProxy(__name__, external_stylesheets=external_stylesheets,
                suppress_callback_exceptions=True, assets_folder=path_config.ASSET_DIR,
                transforms=[MultiplexerTransform(), ServersideOutputTransform()],
                prevent_initial_callbacks=True)
server = app.server

I am trying to upload the dataset and store it using dcc.Store as follows:

index_page = html.Div(children=[
    dcc.Store(id='stored-data', storage_type='memory'),
    dcc.Store(id='training-data', storage_type='memory'),
    dcc.Store(id='intermediate-data', storage_type='memory'),
    dcc.Store(id='processed-data', storage_type='memory'),
    dcc.Store(id='model-data', storage_type='memory'),
    dcc.Location(id='url', refresh=False),
    html.Div(id='page-content', style={'margin': '0px'}),
])
app.layout = index_page
app.validation_layout = html.Div([
    index_page,
    home_page(),          # page layouts defined in other modules of the multi-page app
    data_load_layout(),
    upload_page(),
])

# file upload callback

@app.callback(ServersideOutput('stored-data', 'data'),
              Input('upload-data', 'contents'),
              [State('upload-data', 'filename'),
               State('upload-data', 'last_modified')])
def update_output(list_of_contents, list_of_names, list_of_dates):
    logging.info("upload contents")
    if list_of_contents is not None:
        # decode the uploaded contents into records (project helper)
        data = f_util.parse_contents(list_of_contents, list_of_names, list_of_dates)
        dff = pd.DataFrame(data)
        # the DataFrame is cached server-side; only a reference is sent to the browser
        return dff

# display data summary
@app.callback(Output('data-summary', 'children'),
              [Input('upload_data_btn', 'n_clicks'), Input('stored-data', 'data')])
def output_data_summary(n_clicks, data):
    if data is None or n_clicks is None:
        raise PreventUpdate
    # with ServersideOutput, `data` arrives as the stored DataFrame itself,
    # so this conversion is essentially a no-op
    data = pd.DataFrame(data)
    return f_util.uploaded_data_info(df=data)

In case you want to see what the summary function used in uploaded_data_info(df) looks like, here it is:

from pandas.api.types import is_numeric_dtype

def data_summary(df):
    summary_table = pd.DataFrame(columns=['Column Name', 'Number of unique values',
                                          'Numerical or nominal', 'Range of values'])
    summary_table['Column Name'] = df.columns
    summary_table['Number of unique values'] = [df[col].nunique() for col in df.columns]
    summary_table['Numerical or nominal'] = ['Numerical' if is_numeric_dtype(df[col]) else 'Nominal'
                                             for col in df.columns]
    # use the vectorised Series .min()/.max() rather than the Python built-ins,
    # which iterate element by element and are slow on large columns
    summary_table['Range of values'] = [f"{df[col].min()}-{df[col].max()}" for col in df.columns]
    return summary_table

On running the Dash app from the command prompt and trying to upload data, the terminal keeps printing the same strings over and over in a loop, and the desired output only shows up in the app after around 15 minutes or more.

Can someone explain what is happening and if there is a solution to cut down on the time?
Thanks in advance!

Could you post a complete example? :slight_smile: