Dash app crashes when displaying more than 2-3 charts with a larger dataset, despite trying different callback layouts and reducing the dataset size; seeking advice on how to optimize the code to handle larger datasets

Hello everyone,

I’m new to programming and currently learning how to use Dash and Python to build a web application. I’ve been working on a project to build a dashboard that displays 5-6 charts populated with data from a CSV file.

The app works fine with smaller datasets of around 100 rows, but when I increase the size of the dataset, the app crashes when trying to display more than 2-3 charts. The callback that paints the figures skyrockets from around 100-200ms with 2 charts to more than 5000ms when I add 1 more chart, and it keeps increasing until the app crashes.

I’ve tried using one callback for all the charts, as well as an individual callback for each chart, but neither made a difference. I’ve also tried reducing the dataset, and that works. However, I was hoping to handle around 5k rows, so the current limit of around 200 rows is not sufficient.

It seems that at the initial stage of loading and painting more than 200 rows, the app stutters so much that it eventually crashes. I could increase the interval time to 2-3 seconds, but 1 second is already the maximum delay I’m willing to accept.

My main callback, which stalls at the initial load, looks like this:

#main callback

@app.callback(
    Output("candles", "figure"),
    Output("stock_name", "children"),
    Output("latest_price", "children"),
    Output("latest_price_change", "children"),
    Output("latest_price_change", "style"),

    ...

    #Output("stocks3", "figure"),
    #Output("stock_name3", "children"),
    #Output("latest_price3", "children"),
    #Output("latest_price_change3", "children"),
    #Output("latest_price_change3", "style"),

    Input("interval", "n_intervals"),
    Input("zoom_store", "data"),
    Input("candles_slider", "value"),
)

And I just paint my charts like this; they are all mostly the same.

#figure update

def update_figure(n_intervals, zoom_data, num_candles):

    # Data: read the first ticker from the CSV on every callback invocation
    stock = ['BOIL', 'PYPL', 'INTC', 'AAPL', 'AMZN', 'MSFT', 'TSLA']
    filename = 'stock_data.csv'
    data, latest_price, latest_price_change, volume = read_data(filename, stock[0], [1, 2, 3, 4])
    data['x_axis'] = list(range(1, len(data['volume_diff']) + 1))
    pos = data['open'] - data['close'] < 0
    neg = data['open'] - data['close'] > 0
    ymax = data['volume_diff'].max()
    ystd = data['volume_diff'].std()

    # Main candlestick chart
    candles = go.Figure(go.Candlestick(
        x=data.index,
        open=data['open'],
        high=data['high'],
        low=data['low'],
        close=data['close'],
        name=stock[0],
        showlegend=False,
    ))
    with candles.batch_update():
        data = data.tail(num_candles)
        has_valid_high = not math.isnan(data['high'].max()) and data['high'].max() != 0
        candles.update_layout(
            height=400,
            margin=dict(l=0, r=0, t=10, b=20),
            xaxis_rangeslider_visible=False,
            yaxis_range=[
                data['low'].min() - data['high'].std() * 0.5,
                data['high'].max() + data['high'].std() * 3,
            ] if has_valid_high else None,
        )
        
    ...

    # Extra chart 3
    #d4, latest_price3, latest_price_change3, volume = read_data('stock_data.csv', stock[3], [1, 11, 12, 13])        
    #stocks3 = go.Figure(go.Scatter(x=d4.index, y=d4['close'],showlegend=False,marker={'color': 'White'}, name=stock[3]))
    #with stocks3.batch_update():
        #stocks3.update_layout(
            #height=82,
            #margin=dict(l=0, r=0, t=10, b=20),
            #xaxis_rangeslider_visible=False,            
        #)

    if zoom_data is not None:
        candles.update_xaxes(range=zoom_data)

    return candles,stock[0], latest_price, latest_price_change,{'color': '#18b800' if latest_price_change[0] == '+' else '#ff3503'}, \
             ... 
            #stocks3,stock[3], latest_price3, latest_price_change3,{'color': '#18b800' if latest_price_change3[0] == '+' else '#ff3503'},\

I’ve searched through the documentation and found some options like cache loading, clientside callbacks, and data re-parsers, but I’m overwhelmed by the options and not sure how to proceed or what would be a solution that works for me.

I could use some pointers on how to handle large datasets in the callback and reduce the load to a manageable level of around 1000 ms. Any help or advice would be greatly appreciated.

I have also created a minimal working example (MWE) and uploaded it to Pastebin.

Thank you in advance for your time and assistance.

Looking at the code you dropped in Pastebin, the big thing that jumps out at me is that you’re calling your read_data function at the start of each callback. Reading and parsing a CSV into a DataFrame is a comparatively slow operation, so you want to reduce the number of times you call that function.
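(If you want to confirm that this is where the time goes, a quick timing check around that call at the top of the callback will show it; this is just a sketch reusing the names from your snippet.)

import time

t0 = time.perf_counter()
data, latest_price, latest_price_change, volume = read_data(filename, stock[0], [1, 2, 3, 4])
print(f"read_data took {(time.perf_counter() - t0) * 1000:.1f} ms")

There are a few ways you can reduce those calls.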

One is to put a functools.cache decorator on top of read_data; then its body will only execute once for each distinct combination of argument values it’s invoked with, and on subsequent calls the already-computed return value will just be fetched from the cache.
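A rough sketch of that (the body here is a hypothetical stand-in for whatever your real read_data does), with one caveat: functools.cache requires every argument to be hashable, so the column indices have to be passed as a tuple instead of a list:

import functools

import pandas as pd


@functools.cache
def read_data(filename, ticker, columns):
    # Executed at most once per distinct (filename, ticker, columns) combination;
    # repeat calls with the same arguments return the cached result instantly.
    df = pd.read_csv(filename)        # placeholder for your real parsing logic
    df = df.iloc[:, list(columns)]    # hypothetical column selection; `ticker` would
                                      # drive your real row/column filtering here
    return df


# Call it with a tuple, not a list, so the arguments are hashable:
subset = read_data('stock_data.csv', 'BOIL', (1, 2, 3, 4))

One thing to watch out for: the cache hands back the same DataFrame object on every hit, so if the callback adds columns or otherwise mutates it (like your data['x_axis'] = ... line), work on a .copy() so the cached frame stays clean.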

But in this case, I think the simplest might just be to create all DataFrames needed by the app at init time:

STOCK0_DF = read_data('stock_data.csv', stock[0], [1, 2, 3, 4])
STOCK1_DF = read_data('stock_data.csv', stock[1], [1, 2, 3, 4])
STOCK2_DF = read_data('stock_data.csv', stock[2], [1, 2, 3, 4])

And just use the relevant one as needed in the callbacks.
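A trimmed-down sketch of what that could look like in the callback (output/input ids taken from your snippet, everything else simplified; here I unpack read_data’s return tuple into separate names instead of keeping it as one STOCK0_DF tuple, purely for readability):

# Loaded once at import time, before the app starts serving requests.
STOCK0_DATA, STOCK0_PRICE, STOCK0_CHANGE, STOCK0_VOLUME = read_data(
    'stock_data.csv', stock[0], [1, 2, 3, 4]
)


@app.callback(
    Output("candles", "figure"),
    Input("interval", "n_intervals"),
    Input("zoom_store", "data"),
    Input("candles_slider", "value"),
)
def update_figure(n_intervals, zoom_data, num_candles):
    # Work on a copy so the shared module-level DataFrame is never mutated.
    data = STOCK0_DATA.tail(num_candles).copy()
    candles = go.Figure(go.Candlestick(
        x=data.index, open=data['open'], high=data['high'],
        low=data['low'], close=data['close'], showlegend=False,
    ))
    if zoom_data is not None:
        candles.update_xaxes(range=zoom_data)
    return candles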

The other thing you can do after that, which will improve initial app load time (not each callback), is to save your prepared DataFrames ahead of time as parquet files using DataFrame.to_parquet(), and then load those at Dash-app init time with pd.read_parquet(). Reading parquet is much faster than reading CSV, as parquet files contain the data’s schema (i.e. the column types), so you don’t have to spend compute on wrangling and inferring the schema.
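A minimal sketch of that workflow, assuming the DataFrame is the first element of the tuple read_data returns and using 'BOIL' as an example ticker (to_parquet/read_parquet need the pyarrow or fastparquet package installed):

# One-off conversion script, run ahead of time (not inside the Dash app);
# read_data here is your existing CSV loader.
import pandas as pd

df, latest_price, latest_price_change, volume = read_data('stock_data.csv', 'BOIL', [1, 2, 3, 4])
df.to_parquet('BOIL.parquet')    # column dtypes are stored alongside the data


# At Dash-app init time, re-loading this is much cheaper than re-parsing the CSV:
STOCK0_DF = pd.read_parquet('BOIL.parquet')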


Thank you for taking the time to address my post and suggest solutions to my problem. I tried implementing your changes this morning and was able to read more than 200 rows, which is a huge improvement! I noticed that the app now loads much faster and performs more smoothly.

I also took your suggestion to save prepared DataFrames as a parquet file, and it seems like a good solution for handling larger datasets in the future.

Regarding the functools.cache decorator, I’m definitely interested in learning more about it and will look into it further.

Again, thank you for your help - I appreciate it!
