How to reuse a big dataframe across callbacks?

Hi

During a callback I need to read a big dataframe and visualize it as a heatmap in dcc.Graph(). Since the dataframe is the same across consecutive callbacks, I don’t want to re-read it on every callback. How can I store this big dataframe after the first callback?

The option I am using so far is a global variable that holds the dataframe so it can be reused across callbacks. However, using a global variable has issues and sometimes causes out-of-memory errors, so I am wondering whether there are more memory-efficient ways to store a repeatedly used dataframe after the first call.

Thanks.

Hi @roudan, you could use dash-extensions and its ServersideOutputTransform:

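Roughly, the pattern looks like this (a minimal sketch: ServersideOutput keeps the dataframe in a server-side cache and sends only a reference to the browser; exact import names vary between dash-extensions versions, and the file name is hypothetical):

import pandas as pd
import plotly.express as px
from dash import dcc, html
from dash_extensions.enrich import (DashProxy, Input, Output,
                                    ServersideOutput, ServersideOutputTransform)

app = DashProxy(transforms=[ServersideOutputTransform()])
app.layout = html.Div([
    dcc.Store(id='df-store'),  # holds only a server-side reference, not the data
    html.Button('Load', id='load-btn'),
    dcc.Graph(id='heatmap'),
])

# Runs once; the dataframe stays in a server-side cache.
@app.callback(ServersideOutput('df-store', 'data'), Input('load-btn', 'n_clicks'))
def load_df(n_clicks):
    return pd.read_csv('big_file.csv')  # hypothetical file

# Later callbacks receive the cached dataframe directly, without re-reading it.
@app.callback(Output('heatmap', 'figure'), Input('df-store', 'data'))
def draw_heatmap(df):
    return px.imshow(df.corr(numeric_only=True))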


Thank you so much, AIMPED. I will take a look at it to see how it works. I appreciate it.

Hi AIMPED, how do I do the following with JupyterDash? Thanks

app = DashProxy(transforms=[ServersideOutputTransform()])

The one I have is:

app = JupyterDash(__name__,
                  external_stylesheets=[
                      'https://stackpath.bootstrapcdn.com/bootswatch/4.5.0/flatly/bootstrap.min.css'
                  ],
                  suppress_callback_exceptions=True
                  )

I think dash-extensions is not available for Jupyter, is it, @Emil?


No, you are correct - I haven’t made an implementation for Jupyter :blush:


Hey everyone,
just a heads up that Dash 2.11 and later support running Dash apps in classic Jupyter Notebooks and in JupyterLab without the need to update the code or use the separate JupyterDash library.
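For example (a minimal sketch assuming Dash 2.11+; the jupyter_mode argument controls how the app renders in the notebook):

from dash import Dash, html

app = Dash(__name__)  # plain Dash; no JupyterDash import needed from 2.11 on
app.layout = html.Div('Hello from a notebook')

# jupyter_mode can be 'inline', 'external', 'tab', or 'jupyterlab'
app.run(jupyter_mode='inline')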


You can try caching your dataframe with flask_caching, for example:

import pandas as pd
from dash import Input, Output
from flask_caching import Cache

# set up a Cache instance backed by the app's Flask server
cache = Cache(app.server, config={'CACHE_TYPE': 'SimpleCache'})

# a cached loader: the CSV is read once, then served from cache for 24 hours
@cache.cached(timeout=3600 * 24)
def get_df():
    return pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/gapminder2007.csv')

# use it in callbacks
@app.callback(
    Output(...),
    Input(...)
)
def create_graph(value):
    df = get_df()
    ...

This way the dataframe is loaded only once per timeout period; subsequent calls within the timeout are served from the cache.
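If the loader takes arguments (say, a file name), cache.memoize caches one result per distinct argument combination instead; a small sketch (the filename parameter is hypothetical):

@cache.memoize(timeout=3600 * 24)
def get_df_for(filename):
    # one cache entry per distinct filename
    return pd.read_csv(filename)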


Also, I recommend applying optimizations at the pandas level, for example:

def get_df():
    # load only the columns that are actually used
    df = pd.read_csv('titanic.csv', usecols=['survived', 'age', 'class', 'who', 'alone'])

    # convert to the most compact suitable dtypes
    df.age = df.age.fillna(0)
    df = df.astype(
        {
            'survived': 'int8',
            'age': 'int8',
            'class': 'category',
            'who': 'category',
            'alone': 'bool'
        }
    )
    return df
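To see how much the conversion saves, you can compare memory usage before and after; a quick check (reusing the optimized loader above):

df_raw = pd.read_csv('titanic.csv', usecols=['survived', 'age', 'class', 'who', 'alone'])
df_opt = get_df()

# deep=True also counts the contents of object/string columns
print(df_raw.memory_usage(deep=True).sum())
print(df_opt.memory_usage(deep=True).sum())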