Black Lives Matter. Please consider donating to Black Girls Code today.

Dash Datatable inflating my data size during callback? Causing slow performance / loading

Hi Everyone, long time listener, first time caller. I love dash, and I’ve learned so much from the forums, but I’m encountering a problem that I just cannot make sense of.

I think Dash is inflating the data that I’m using in my callbacks, and I don’t understand why.

I have an app/page that has two datatables, an “underlying data” table (id:my-datatable, and an “overview” data table (id:summary-datatable).

I have a callback that takes in the “data” of my-datatable as well as the “derived-virtual-data” of my-datatable, converts that data into pandas data frames, and then does some manipulations (remove certain duplicates) and calculations (conditional count, conditional sum of specific columns) via pandas, and then outputs those results to summary-datatable

Running on my computer locally, everything is fine. But when I have it running through nginx, I got the following error: [“413 Request Entity Too Large”]

I looked at the size that it was trying to send - it was ~76MB! The data from the data table is only 6-10MB on any given day, so even if I was trying to send in both the data and virtual data, I would think it would only be 12-20 MB max.

I increased the max size for nginx, which allowed it to load. But now it’s loading super slowly, and I think the ~76MB transfers are the reason why.

Please help if you have any thoughts as to why the data transfer of ‘data’ and ‘derived_virtual_data’ would be so inflated?
Note:I excluded the layout divs from the code snippet to minimize the code you have to look at. Below should everything relevant to the problem. Let me know If you think there’s more I need to add.

#query runs and converts to dataframe on initial page load and browser refresh
def get_data():
    sql = open('OAD_Display.sql','r').read()
    df = pd.read_sql(sql, engine)
    return df
df1 = get_data()

dash_table.DataTable(
     id='my-datatable', 
     columns=
     [{'name': ['Context',i], 'id': i, 'editable': False} for i in df1.columns[0:1]] +
     [{'name': ['Context',i], 'id': i, 'editable': False, 'hideable': True} for i in df1.columns[1:12]] +
     [{'name': ['Impact',i], 'id': i, 'editable': False} for i in df1.columns[12:15]] + 
     [{'name': ['Impact',i], 'id': i, 'editable': False, 'type': 'numeric','format': FormatTemplate.money(0)} for i in df1.columns[15:18]] + 
     [{'name': ['Resolution',i], 'id': i, 'editable': True} for i in df1.columns[18:19]] + 
     [{'name': ['Resolution',i], 'id': i, 'editable': True, 'presentation': 'dropdown'} for i in df1.columns[19:20]],
     export_format= 'xlsx',
     merge_duplicate_headers=True,
     data= get_data().to_dict('records'))

dash_table.DataTable(
    id='summary-datatable',
    columns=
    [{"name": ['Cost',"Metric"], "id": 'Cost-Metric', 'editable':False,}]+ 
    [{"name": ['Cost',"Amount"], "id": 'Cost-Amount', 'editable':False,'type': 'numeric','format': FormatTemplate.money(0)}]+
    [{"name": ['Cost',"Percent"], "id": 'Cost-Percent', 'editable':False,'type': 'numeric','format': FormatTemplate.percentage(1)}]+
    [{"name": ['Stats',"Metric"], "id": 'Stats-Metric', 'editable':False,}]+
    [{"name": ['Stats',"Quantity"], "id": 'Stats-Quantity', 'editable':False,}]+
    [{"name": ['Stats',"Percent"], "id": 'Stats-Percent', 'editable':False,'type': 'numeric','format': FormatTemplate.percentage(1)}],
    merge_duplicate_headers=True,
    data= [{}],
    export_format= 'xlsx'
) #summary table is blank on initial load

dcc.Store(id='query_store')# just used to make sure that virtual data pulled into callback is virtual data after filter query is entered by user, not before

@app.callback(
    Output('query_store','data'),
    [Input('my-datatable','filter_query')]
)
def update_query_store(filter_query):
    return filter_query

@app.callback(
    Output('summary-datatable', 'data'), 
    [Input('query_store', 'data')],
    [State('my-datatable', 'derived_virtual_data'),
    State('my-datatable', 'data')])
def update_summary(filter_query, derived_virtual_data,data):
     if derived_virtual_data is None or derived_virtual_data == []:
        return [{}]
    else:
        return_data =[]
        df_user_view_data = pd.DataFrame.from_dict(derived_virtual_data)
        df_original_data = pd.DataFrame.from_dict(data)

        uv_unique_WOID_df = df_user_view_data.drop_duplicates(subset='WarpObjectID',keep='first')
        og_unique_WOID_df = df_original_data.drop_duplicates(subset='WarpObjectID',keep='first')

        uv_Total_Orphan_Cost = uv_unique_WOID_df['Orphan Cost'].sum(axis=0)
        og_Total_Orphan_Cost = og_unique_WOID_df['Orphan Cost'].sum(axis=0)

        uv_Unique_Orphan_Instances = uv_unique_WOID_df['WarpObjectID'].count()
        og_Unique_Orphan_Instances = og_unique_WOID_df['WarpObjectID'].count()

        return_data.append({'Cost-Metric': 'Total Actionable Orphan Cost', 'Cost-Amount': uv_Total_Orphan_Cost, 'Cost-Percent': (uv_Total_Orphan_Cost/ og_Total_Orphan_Cost), 'Stats-Metric': 'Unique Actionable Orphan Instances','Stats-Quantity': uv_Unique_Orphan_Instances, 'Stats-Percent': (uv_Unique_Orphan_Instances/og_Unique_Orphan_Instances)})

        object_types = ['PO Line', 'Work Order', 'Inventory', 'Master Schedule', 'RFPO', 'Released Planned Order']

        for ot in object_types:
                uv_TOC_ot = uv_unique_WOID_df.loc[(uv_unique_WOID_df['Type'] == ot), 'Orphan Cost'].sum(axis=0)
                uv_UOI_ot = uv_unique_WOID_df.loc[(uv_unique_WOID_df['Type'] == ot), 'WarpObjectID'].count()
                return_data.append({'Cost-Metric': 'Total Actionable Orphan Cost - '+ot,'Cost-Amount': uv_TOC_ot,'Cost-Percent':(uv_TOC_ot/og_Total_Orphan_Cost) ,'Stats-Metric': 'Unique Actionable Orphan Instances - '+ot,'Stats-Quantity': uv_UOI_ot,'Stats-Percent':(uv_UOI_ot/og_Unique_Orphan_Instances)})

        return return_data

The query_store is just a workaround to ensure that the derived_virtual_data is the data after the filter has been applied. Without it, the derived virtual data is always one filter query behind. I don’t think this has anything to do with it.

The update_summary function takes the virtual data and original data, creates a few additional dataframes and then performs the same operations to each data frame. Then I add it as a table row to the summary results.

I was thinking that the problem was these additional dataframes, but that doesn’t make sense to me - those data frames aren’t a part of the input or the output of the callback. The input is the original and virtual data (and the filter query string), and the output is a tiny table!

Thanks for your help in advance!

Hi @lbuscher welcome to the forum! Nothing obvious comes to my mind, but here are a couple of suggestions to help you debugging:

  • are you sure the callback is not called more times than what you think, which could result in some back and forth transfer of data?
  • did you open the network tab of your browser developer tools to check that there is indeed a transfer of 76 Mo?

Last but not least it’s possible (not sure I would need to check) that the transfer of the data is not using a different data type than the original pandas object, for example a JSON string of a pandas dataframe can be a larger object than the original dataframe.

I would also suggest printing out the length of the input arguments to update_summary e.g. print(len(derived_virtual_data)) and the length of len(return_data) before you return it to make sure that it is similar to what you expect.

As you mention, if these input arguments are only 6-10MB and there are 3 of them, then I would expect it to be 18-30MB, maybe a little larger because of the JSON serialization as Emma mentions. However, we also gzip things, so that should make the request smaller.

1 Like

Normally, we’d suggest using pagination (https://dash.plotly.com/datatable/callbacks) so that you only transfer and display e.g. 20 rows at a time.

Now that’s harder in your case because your data is regenerated on page load, so we really have session-based data. This means that we can’t refer to a global df variable, we have to keep the data associated with the client.

To keep data associated with the session, we usually rely on the client as you’ve done in your example. This works great for a few MB but starts to slow down when we start to transfer lots of data back and forth over the network via callbacks.

In the longer term, we’d like to formalize a session-based server side storage interface on some shared memory backend like Redis or Plasma. If your organization has a software budget, this would be a great sponsored project: https://plot.ly/products/consulting-and-oem/