How to increase speed when sharing a complex Python instance between callbacks

Hi, I’m new to Dash. I’m building an app to visualize the results of a machine learning model. I’ve read share-data-between-callbacks and decided to use client-side caching.

I saved the model data in a pickle file. I created a dcc.Upload on my app to read the pickle data for a specific user and a dcc.Store to hold that user’s data. Since dcc.Store needs JSON-serializable data, I use jsonpickle to convert my complex Python object to JSON. I then use this stored data to generate several graphs (i.e. it feeds several callbacks). The callbacks are extremely slow, and I suspect serialization is the problem: serializing my complex Python instance to JSON takes about 10 s, whereas pickling takes less than 1 s.

I know pickle has security issues, but my app is designed for internal use. Is there any way to solve the serialization speed issue or improve my pipeline?
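As a side note, this kind of gap is easy to confirm by timing both serializers directly. jsonpickle may not be installed everywhere, so the sketch below mimics its overhead with a manual object-to-dict conversion followed by json; the Node class is a hypothetical stand-in for the model object, not anything from the original app:

```python
import json
import pickle
import time


class Node:
    """Toy stand-in for a complex model object (a chain of nested instances)."""
    def __init__(self, depth):
        self.payload = list(range(100))
        self.child = Node(depth - 1) if depth > 0 else None


def to_jsonable(obj):
    # Recursively turn instances into plain dicts, roughly what
    # jsonpickle has to do under the hood before it can emit JSON.
    if isinstance(obj, Node):
        return {"payload": obj.payload, "child": to_jsonable(obj.child)}
    return obj


obj = Node(depth=200)

t0 = time.perf_counter()
blob = pickle.dumps(obj)          # binary, handled natively by pickle
t1 = time.perf_counter()
text = json.dumps(to_jsonable(obj))  # needs a full conversion pass first
t2 = time.perf_counter()

print(f"pickle: {t1 - t0:.4f}s   json(+conversion): {t2 - t1:.4f}s")
```

The extra traversal and the text encoding are where the JSON route loses time on deeply nested objects; the exact ratio depends on the object, so the timings printed here are only illustrative.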

Thanks in advance.
Max

Hi Max,

I would recommend server-side caching. It will speed up the callbacks since less data is transferred, and it will allow you to use pickle for serialization as well.

Hi Emil,

I use client-side caching because the result depends on the file each user uploads (so it is different for every user). I don’t have much experience with server-side caching. Could you please explain in more detail?

Thanks for your response!

Well, server-side caching just means that you save the data on the server instead of the client. It could be in a file (simple), in a memory cache (fast), in a database, or something else. The main point is that you avoid sending the data back and forth between the client and the server. That the data is different between users is not a problem; it can be handled e.g. by using the session id (unique to each user) to identify the data.
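To make the idea concrete, here is a minimal file-based sketch of per-user server-side caching. All names (the folder, the helper functions) are hypothetical illustrations, not part of any library; in a real Dash app the session id would be generated once per user and kept e.g. in a dcc.Store or a cookie:

```python
import os
import pickle
import uuid

CACHE_DIR = "user_cache"  # hypothetical folder for per-user pickle files
os.makedirs(CACHE_DIR, exist_ok=True)


def new_session_id():
    # One random id per user/session; only the id travels to the client.
    return str(uuid.uuid4())


def save_user_data(session_id, obj):
    # Pickle the (arbitrarily complex) Python object under the user's id.
    with open(os.path.join(CACHE_DIR, session_id + ".pkl"), "wb") as f:
        pickle.dump(obj, f)


def load_user_data(session_id):
    with open(os.path.join(CACHE_DIR, session_id + ".pkl"), "rb") as f:
        return pickle.load(f)


sid = new_session_id()
save_user_data(sid, {"model": "fitted-model-placeholder"})
print(load_user_data(sid))  # each user only ever touches their own file
```

Callbacks then pass the small session id around instead of the data itself, which is what removes the client–server transfer cost.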

I have been working on simplifying the server-side caching workflow, for starters just using files. Here is a small example app (using Plotly dummy data) illustrating my current workflow:

import dash
import dash_core_components as dcc
import dash_html_components as html
import time
import plotly

from dash.dependencies import Output, Input
from dash_extensions.callback import CallbackCache, DiskCache

data_sets = ["gapminder", "tips", "iris", "wind", "election"]
# Create app.
app = dash.Dash(prevent_initial_callbacks=True)
app.layout = html.Div([
    dcc.Dropdown(options=[{"label": ds, "value": ds} for ds in data_sets], id="data_set"),
    html.Div(id="output"), dcc.Loading(dcc.Store(id="store"), fullscreen=True),
])
# Create (server side) disk cache.
cc = CallbackCache(cache=DiskCache(cache_dir="cache_dir"))


@cc.cached_callback(Output("store", "data"), [Input("data_set", "value")])
def fetch_data(key):
    df = plotly.data._get_dataset(key) if key is not None else None
    time.sleep(2)  # sleep to emulate a database call / a long calculation
    return df


@cc.callback(Output("output", "children"), [Input("store", "data")])
def print_head(df):
    return df.iloc[:5].to_json()


cc.register(app)

if __name__ == '__main__':
    app.run_server()

Compared to a standard Dash workflow, there are a few differences:

  • A CallbackCache object is created prior to defining callbacks. It takes a cache argument, which is an object that defines how/where to cache the data. Currently, only a DiskCache is implemented (which writes the data to the cache_dir using pickle), but at some point I might add other options (e.g. in-memory storage, cloud storage, etc.).

  • Callbacks whose output is to be cached are registered on this object with the cached_callback decorator. Note that it is not necessary to convert the output to json when this decorator is used.

  • Callbacks that use the cached data (as input/state) must also be registered on this object (rather than on the Dash app itself). Note that the cached data arrives in the same form as it was cached, i.e. you do not need to do any deserialization.

  • After defining the callbacks, the CallbackCache object must be registered on the Dash app.

If you would like to try it out, the CallbackCache class is available in the (0.0.19rc3) version of the dash-extensions package,

pip install dash-extensions==0.0.19rc3

It is still a work in progress, so any feedback is appreciated :slight_smile:

Thank you, that’s really helpful! I’ll try it out on my app. :smiley:

Great! Let me know how it turns out :slight_smile:

Hi Emil, as I’m implementing, I have a question.

Is there a way to update the cached data in “store”? I want to write a callback that updates my cached data based on user input. This callback needs to take the cached data “store” as input, but I cannot output to “store” because Dash does not allow the same component to be both input and output.

  • Is there a way to update the cached data “store”?

The data are cached based on the MD5 hash of the inputs. Hence, if the user changes any of the inputs, new data will be written to the cache. If a user provides the same inputs as previously, the “old” data will be loaded, provided it is not older than expire_after seconds. It defaults to -1, which means “keep the data forever”. If you want to always reevaluate the data, you can set expire_after to zero:

cc = CallbackCache(cache=DiskCache(cache_dir="cache_dir"), expire_after=0)

  • I want to write a callback based on user input to update my cached data. This callback needs to take the cached data “store” as input. I cannot output to “store” because Dash does not allow the same input and output

It is not possible in Dash for a callback to have the same input and output. But I guess you could achieve what you want like this:

  • Query the “raw” data and save them in a (cached) Store, say with id=“raw_data”
  • Create a callback that takes the raw data as input along with the user selections and outputs the filtered data to another (cached) Store, say with id=“filtered_data”
  • Create callbacks for everything else (graphs, tables, etc.) with the filtered data as input

Thanks for your suggestion. I tried it, but it seems like the cached callback does not work properly the second time. It gives a TypeError: Object of type DataFrame is not JSON serializable error, which is the error you get when using normal callbacks. I modified your example to simulate this.

import dash
import dash_core_components as dcc
import dash_html_components as html
import time
import plotly
import pandas as pd

from dash.dependencies import Output, Input
from dash_extensions.callback import CallbackCache, DiskCache

data_sets = ["gapminder", "tips", "iris", "wind", "election"]
# Create app.
app = dash.Dash(prevent_initial_callbacks=True)
app.layout = html.Div([
    dcc.Dropdown(options=[{"label": ds, "value": ds} for ds in data_sets], id="data_set"),
    html.Div(id="output"),
    html.Div(id="output2"),
    dcc.Loading(dcc.Store(id="store"), fullscreen=True),
    dcc.Store(id="filtered_store"),
    html.Div([
        dcc.Slider(
            id='filter-slider',
            min=0,
            max=100,
            value=100,
            marks={str(num): str(num) + '%' for num in range(10, 100, 10)},
            step=1,
            updatemode='drag'
        ),
    ]
    )
])
# Create (server side) disk cache.
cc = CallbackCache(cache=DiskCache(cache_dir="cache_dir"))


@cc.cached_callback(Output("store", "data"), [Input("data_set", "value")])
def fetch_data(key):
    df = plotly.data._get_dataset(key) if key is not None else None
    time.sleep(1)  # sleep to emulate a database call / a long calculation
    return df


@cc.callback(Output("output", "children"), [Input("store", "data")])
def print_head(df):
    return df.iloc[:5].to_json()

@cc.cached_callback(Output("filtered_store", "data"), [Input("store", "data"), Input("filter-slider", 'value')])
def filter_data(store, filter):
    # simulating updating data, my original code uses raw data and filter, here just use new data for simplicity
    data = {'Name': ['Tom', 'nick', 'krish', 'jack'], 'Age': [20, 21, 19, 18]}
    df = pd.DataFrame(data)
    time.sleep(1)  # sleep to emulate a database call / a long calculation
    return df

@cc.callback(Output("output2", "children"), [Input("filtered_store", "data")])
def print_head2(df):
    return df.to_json()

cc.register(app)

if __name__ == '__main__':
    app.run_server(debug=True)

Ah, yes. In my first take at the implementation, the cached callbacks didn’t support cached inputs. This should be fixed now (0.0.19rc4). If you update to this version, the following filtering example should work:

import dash
import dash_core_components as dcc
import dash_html_components as html
import time
import plotly

from dash.dependencies import Output, Input
from dash_extensions.callback import CallbackCache, DiskCache

data_sets = ["gapminder", "tips", "iris", "wind", "election"]
# Create app.
app = dash.Dash(prevent_initial_callbacks=True)
app.layout = html.Div([
    dcc.Dropdown(options=[{"label": ds, "value": ds} for ds in data_sets], id="data_dd"), dcc.Dropdown(id="filter_dd"),
    html.H1("Raw data"),
    html.Div(id="raw_output"),
    html.H1("Filtered data"),
    html.Div(id="filtered_output"),
    dcc.Loading(dcc.Store(id="raw_store"), fullscreen=True, type="dot"),
    dcc.Loading(dcc.Store(id="filtered_store"), fullscreen=True, type="graph")
])
# Create (server side) disk cache.
cc = CallbackCache(cache=DiskCache(cache_dir="cache_dir"), expire_after=0)


@cc.cached_callback(Output("raw_store", "data"), [Input("data_dd", "value")])
def fetch_data(key):
    df = plotly.data._get_dataset(key) if key is not None else None
    time.sleep(1)  # sleep to emulate a database call / a long calculation
    return df


@cc.callback(Output("raw_output", "children"), [Input("raw_store", "data")])
def print_raw(df):
    return df.iloc[:5].to_json()


@cc.callback([Output("filter_dd", "options"), Output("filter_dd", "value")], [Input("raw_store", "data")])
def update_filter_dd(df):
    return [{"label": col, "value": col} for col in df.columns], None


@cc.cached_callback(Output("filtered_store", "data"), [Input("raw_store", "data"), Input("filter_dd", 'value')])
def filter_data(df, filter):
    time.sleep(1)  # sleep to emulate a database call / a long calculation
    return df if filter is None else df[filter]


@cc.callback(Output("filtered_output", "children"), [Input("filtered_store", "data")])
def print_filtered(df):
    return df.iloc[:5].to_json()


cc.register(app)

if __name__ == '__main__':
    app.run_server()

Hi Emil, thanks! The cached callbacks with cached inputs work great.

I feel like expire_after does not work properly, though. The cached files are still there after I set expire_after to 10 seconds.

Great! Note that I have not implemented any clean-up logic, so by default the files will stay on disk forever; expire_after only controls whether a cached file is reused, not when it is deleted. I guess you could add a clean-up routine server side, or use a temp folder (which is cleaned by the OS) if disk space is an issue.
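Such a clean-up routine could be as simple as scanning the cache folder and deleting files older than some cutoff. A minimal sketch (the folder name and age threshold below are just examples matching the DiskCache setup above; run it periodically, e.g. from a scheduler or at app startup):

```python
import os
import time

CACHE_DIR = "cache_dir"  # the folder passed to DiskCache above
MAX_AGE = 10             # seconds; example threshold


def clean_cache(cache_dir, max_age):
    """Delete cache files whose last modification is older than max_age seconds."""
    now = time.time()
    removed = 0
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age:
            os.remove(path)
            removed += 1
    return removed


os.makedirs(CACHE_DIR, exist_ok=True)
print("removed", clean_cache(CACHE_DIR, MAX_AGE), "stale files")
```

Using the file modification time keeps the routine independent of the cache's internal key format, though it will also evict entries a user might still come back to after the cutoff.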