Black Lives Matter. Please consider donating to Black Girls Code today.

How to speed up backend data load in Dash?

Hi, I am trying to figure out how to speed up getting data from my sql server when loading my dash app. I currently have a dash app with a couple tabs, where one of the tabs (let’s say tab A), contains dropdowns and shows relevant charts when the user selects an option in the dropdown. When dash initially runs, it retrieves all data I need from the sql server using pyodhc.

The issue is that it takes a couple minutes for this process as there are more than a hundred different queries to the database, puts them respectively into a pandas dataframe, does some very simple calculations, and saves the dfs as global variables (for them to be then called in callbacks). The querying and the dataframe calculations are done in a different file than from where the dash app runs with callbacks. Once dash is fully loaded, going through dropdowns is very fast and the charts are shown immediately when an option is selected.

Is there a solution to optimize the initial load of the data?

Here are some ideas that I’ve thought of:
(1) Only having the most important/prioritized data load first, then having the rest load when user selects the relevant dropdown option. Would this be able to work? If the user selects tab B, then comes back to tab A, I don’t want the data to load again.
(2) After querying, putting it in a pandas dataframe, and doing calculations, I tried converting to json to see if it would help load the page faster. It didn’t make a difference.
(3) I was thinking of saving the data I would need into txt file then having the dash app read off of that upon loading, but not sure of its efficiency.
(4) I looked into utilizing redis and celery but not sure if this is a simple solution.

Thanks.

As a first step, i would try saving the pre processed data to a file, i.e. your option (3). Typically, i use logic along these lines,

import os
import pandas as pd
import time


def some_heavy_calculation():
    time.sleep(2)
    return pd.DataFrame(columns=["a", "b", "c"], data=[[1, 2, 3], [4, 5, 6]])


def get_df(create_df, cache, nuke=None):
    if not nuke and os.path.exists(cache):
        return pd.read_feather(cache)
    df = create_df()
    df.to_feather(cache)
    return df


cached_df = get_df(some_heavy_calculation, "path_to_file")

One general question I would have about this approach is, how “fresh” the data is. If your server runs for weeks, do you regularly need to update the table? if yes, how often?

Saving it to a file is certainly an option. However rather than doing it by hand, I would recommend looking into caching options built into flask (dash uses flask in the background).

https://flask-caching.readthedocs.io/en/latest/

If your server stays live and is just a single process, simple caching in memory is an option. If you have several processes, caching on disk may be a possibility or if you need to share, you can use the REDIS backend. There are several options with varying sophistication that all use the same frontend, so backends are easily switched out. It also supports a timeout, so that eg. the recalculation is done every 12 hours or whatever is appropriate and otherwise the cached result is used.

Thanks for your suggestion. So is some_heavy_calculation() where the sql querying happens get_df() would be where the pandas dataframe from previous function gets converted to feather/txt format?

The data in the sql server gets updated on a daily basis. Not all the data is daily however, but the remaining data still gets updated/imported into the database every night.

I am assuming my server is single process (with multiple users) considering the following code without the multi-processing argument? Sorry, I’m still learning dash.

app.run_server(my_port, debug=True, use_reloader=False)

If you are multi-process, you would use something like gunicorn to start your dash server in order to set up separate workers. This would then have impact on how cashing works (as workers can only share some caching strategies, but not others). In general, I would encourage the use of caching strategies provided in flask as they are well documented and maintained.