How to make a Dash app run faster when it's slowed down by importing large data

I have a large dataset stored in a CSV file (1.7 GB).

My Dash app is too slow; it takes minutes to upload, refresh, and run :grimacing: :grimacing:

I tried dcc.Store, but when I change a value in my dropdown it takes a long time to update my graphs.

Is there any solution? (I have a lot of pages in my app!)

I had the same problem, and one thing that helped quite a bit was a function I found online. You just pass your dataframe as an argument and assign the result to a new variable.



import numpy as np


def reduce_mem_usage(df):
    """Iterate through the columns of a dataframe and downcast each one
    to the smallest dtype that can hold its values, to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024**2

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
            else:
                # Careful: float16 keeps only ~3 significant digits; drop this
                # branch if you need exact values.
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
        else:
            # Repetitive strings are far smaller stored as categories.
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage reduced from {:.2f} MB to {:.2f} MB'.format(start_mem, end_mem))
    return df
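For reference, pandas ships with `pd.to_numeric(..., downcast=...)`, which achieves much the same thing without hand-written bounds checks. A minimal sketch (the column names here are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "small_int": np.arange(100, dtype=np.int64),
    "ratio": np.linspace(0.0, 1.0, 100),   # float64 by default
    "label": ["a", "b"] * 50,              # repetitive strings
})
before = df.memory_usage(deep=True).sum()

# downcast picks the smallest dtype that can hold the values
df["small_int"] = pd.to_numeric(df["small_int"], downcast="integer")  # int8
df["ratio"] = pd.to_numeric(df["ratio"], downcast="float")            # float32
df["label"] = df["label"].astype("category")

after = df.memory_usage(deep=True).sum()
```

Note that `downcast="float"` never goes below float32, so it avoids the float16 precision problem mentioned above.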


You could also try reading your CSV with the engine explicitly set to the faster C parser, like this:


same_bed_dataframe = pd.read_csv(values, header=None, delimiter='\t', engine='c')
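On top of choosing the engine, you can tell `read_csv` the column dtypes up front so pandas doesn't have to infer them from the text. A self-contained sketch using an in-memory CSV (the column names are made up; swap `io.StringIO(csv_text)` for your file path):

```python
import io

import pandas as pd

csv_text = "id\tvalue\n1\t0.5\n2\t0.75\n"

df = pd.read_csv(
    io.StringIO(csv_text),   # stands in for your file path
    delimiter="\t",
    engine="c",
    dtype={"id": "int32", "value": "float32"},  # skip inference, load small
)
```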

My last tip: if it applies to your plots, use Scattergl instead of the regular Scatter; its WebGL rendering handles large numbers of points much better.


You can make the initial data load faster if you save the data using a binary format such as feather or parquet rather than CSV. CSV is text based so when you load it, pandas has to parse all of the strings and convert to numeric values which takes time. Should be as easy as saving your data using df.to_parquet or df.to_feather and then replacing pd.read_csv with pd.read_parquet or pd.read_feather respectively.

As for speeding up the plotting function, it depends what you’re doing. If you provide some more details we might be able to offer some suggestions. You might be able to pre-compute some of the things you want to plot and save some time that way if you’re computing them from the raw data in the callback each time.
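To illustrate the pre-computation idea with a toy example (the column names and keys are invented): do the expensive groupby once at import time, so each callback body reduces to a dictionary lookup.

```python
import pandas as pd

raw = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "year": [2020, 2021, 2020, 2021],
    "sales": [10, 12, 7, 9],
})

# Computed once when the module is imported, not on every callback.
by_region = {region: grp.sort_values("year") for region, grp in raw.groupby("region")}

def update_graph(region):
    """Stand-in for a Dash callback body: a cheap lookup instead of a groupby."""
    return by_region[region]
```

In a real app, `update_graph` would be decorated with `@app.callback` and return a figure built from the pre-computed frame.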


Hey @nuno5645, thank you for the reply.
I used your code:

individu = pd.read_csv(DATA_PATH.joinpath("individu_final.csv"))
# Reduce memory usage :
# reduce_mem_usage(df) exactly as defined in the post above
individu = reduce_mem_usage(individu)

I can't get engine='c' to work with DATA_PATH.joinpath.

My app is still slow.

Could you show your code so I can better see the issue?


This is the structure of my app.
I have 12 pages in rapports_adultes, rapports_agees …

I load the data with pd.read_csv into individu and share it with the other pages.


Wouldn't it work if you add it this way?


individu = pd.read_csv(DATA_PATH.joinpath("individu_final.csv"), engine='c')




It's working! But when I reload the app in the browser, it's still so slow :smiling_face_with_tear:

Thank you @tcbegley, I'm trying your idea (df.to_parquet).

In my app I have a lot of pages with plots … the plots are updated by callbacks using dropdowns or RadioItems.

If you want to speed things up, you could try something like Dask to read your CSV a lot faster than pandas can. It has the same syntax as pandas, just slightly fewer features.

If you want some help, I could look over your code and suggest some performance tips.