To get an idea of the performance of the `cached_callback` versus the default Dash callback, I have carried out a few benchmarks. For the purpose of these benchmarks I am using a FileSystemCache, and I consider the following case:
- A data frame with a single column of n rows is created server side (to emulate e.g. a fetch from a database) and inserted into a `Store` component with id `store`. Next, the mean of the column is calculated (to emulate a data processing step) in another callback that takes the `store` as input.
I measure the time from just after the data frame creation until just before the mean operation, i.e. it includes serialization as well as the transfer of data from the server to the client and back. For each value of n, I measured 5 times and took the average (with the standard deviation for error bars). Here are the numbers for my local desktop:
In the first chart we see that the standard callback (blue) works up until around 1 million rows, at which point the operation takes roughly 4 s. At 10 million rows, the browser crashes. The cached callback (yellow), on the other hand, just keeps going. I stopped at 1 billion rows, at which point the operation took around 20 s. By then, the pickle on disk was 8 GB (!).
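An 8 GB cache directory may be more than you want to keep around. If disk usage is a concern, flask-caching's FileSystemCache can bound the number of cached items via its `threshold` argument; the specific values below are illustrative, not taken from the benchmark:

```python
from flask_caching.backends import FileSystemCache

# Keep at most 50 pickles on disk before old entries are pruned
# (threshold=0 disables pruning). default_timeout is the entry
# lifetime in seconds.
fs_cache = FileSystemCache(cache_dir="cache", threshold=50, default_timeout=3600)
```

This cache instance could then be passed to `CallbackCache` in place of the unbounded one used below.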
The second chart shows the ratio between the runtimes. Rather surprisingly, the cached callback is around 50 times faster even for a single-element data frame. Maybe this is due to the pandas serialization to/from JSON being slow? At 1 million rows, the cached callback is more than 200 (!) times faster.
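The JSON hypothesis is easy to probe in isolation. Here is a small micro-benchmark sketch comparing a JSON round trip (what the standard callback pays) with a pickle round trip (roughly what the file system cache pays); the row count and repeat count are arbitrary choices for illustration:

```python
import io
import pickle
import time

import numpy as np
import pandas as pd


def time_roundtrip(df, dump, load, repeats=5):
    """Average seconds for one serialize + deserialize cycle."""
    tic = time.perf_counter()
    for _ in range(repeats):
        load(dump(df))
    return (time.perf_counter() - tic) / repeats


df = pd.DataFrame(data=np.random.rand(100_000), columns=["rnd"])

# JSON round trip, as done by the standard callback.
json_s = time_roundtrip(df, lambda d: d.to_json(), lambda s: pd.read_json(io.StringIO(s)))
# Pickle round trip, as done by the file system cache.
pickle_s = time_roundtrip(df, pickle.dumps, pickle.loads)

print(f"JSON:   {json_s:.4f} s per round trip")
print(f"pickle: {pickle_s:.4f} s per round trip")
```

Note that this isolates serialization only; the full benchmark below also includes the network transfer, which the cached callback avoids.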
Now, this is cool and all, but no one uses their localhost for deployment. So let's move to the cloud (Heroku, free tier):
On Heroku, the standard callback (blue) still works up until around 1 million rows, but the cached callback (yellow) crashed at 100 million rows. From the logs I could see that the dyno ran out of memory, i.e. the limit can probably be pushed (much) further by purchasing a beefier dyno. In the head-to-head comparison, the cached callback is still faster, but the performance gain is reduced to a factor of 10 for small data frames and 100 for large ones.
For reference, here is the benchmark code:
import datetime
import dash
import dash_core_components as dcc
import dash_html_components as html
import numpy as np
import pandas as pd
from dash.dependencies import Output, Input, State
from flask_caching.backends import FileSystemCache
from dash_extensions.callback import CallbackCache
# region Benchmark data definition
options = [{"label": x, "value": x} for x in [1, 10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000]]
def make_data(n):
    return pd.DataFrame(data=np.random.rand(n), columns=["rnd"])
# endregion
# Create app.
app = dash.Dash(prevent_initial_callbacks=True)
server = app.server
app.layout = html.Div([
    # Standard implementation.
    html.Button("Run benchmark (no cache)", id="btn"), dcc.Dropdown(id="dd", options=options, value=1),
    dcc.Store(id="time"), dcc.Loading(dcc.Store(id="store"), fullscreen=True, type="dot"), html.Div(id="log"),
    # Cached implementation.
    html.Button("Run benchmark (with cache)", id="btn_wc"), dcc.Dropdown(id="dd_wc", options=options, value=1),
    dcc.Store(id="time_wc"), dcc.Loading(dcc.Store(id="store_wc"), fullscreen=True, type="dot"), html.Div(id="log_wc")
])
# region Standard implementation
@app.callback([Output("store", "data"), Output("time", "data")], [Input("btn", "n_clicks")], [State("dd", "value")])
def query(n_clicks, value):
    df = make_data(int(value))
    tic = datetime.datetime.now().timestamp()
    return df.to_json(), tic
@app.callback(Output("log", "children"), [Input("store", "data")], [State("time", "data")])
def calc(data, time):
    # Keep the float timestamp; int() would truncate sub-second precision.
    time = datetime.datetime.fromtimestamp(time)
    df = pd.read_json(data)
    toc = datetime.datetime.now()
    mean = df["rnd"].mean()
    return "ELAPSED = {}s (and mean is {:.3f})".format((toc - time).total_seconds(), mean)
# endregion
# region Cached implementation
# Create (server side) cache. Works with any flask caching backend.
cc = CallbackCache(cache=FileSystemCache(cache_dir="cache"))
@cc.cached_callback([Output("store_wc", "data"), Output("time_wc", "data")],
                    [Input("btn_wc", "n_clicks")], [State("dd_wc", "value")])
def query_wc(n_clicks, value):
    df = make_data(int(value))
    return df, datetime.datetime.now()
@cc.callback(Output("log_wc", "children"), [Input("store_wc", "data")], [State("time_wc", "data")])
def calc_wc(df, time):
    toc = datetime.datetime.now()
    mean = df["rnd"].mean()
    return "ELAPSED = {}s (and mean is {:.3f})".format((toc - time).total_seconds(), mean)
# This call registers the callbacks on the application.
cc.register(app)
# endregion
if __name__ == '__main__':
    app.run_server()