Dash & polars; RAM-use keeps increasing

I have made a local Dash app to allow students to efficiently find and study their (measurement) data. To allow for further development, I tried to transition from pandas & DuckDB to Polars. After weeks of work integrating it into this extensive app, I realized that I have run into a major problem.

The app was stable before, but now with Polars, the RAM footprint (of the pythonw.exe process) balloons with each successive callback. While the app starts out around 100 MB, each callback adds something like 5 MB. It doesn't seem to stabilize; at 500 MB it was still growing.

I'm sort of stuck and would really appreciate some pointers on how to resolve this.

Hi @PerovskiteCell and welcome to the Dash community!

It sounds like an interesting app. Just curious, does it use tabs? There was a recent issue posted on GitHub: [BUG] Switching back and forth between dcc.Tabs doesn't seem to release memory in the browser · Issue #2882 · plotly/dash · GitHub.

If not, would it be possible to create a minimal example? It would be helpful to add it to that GitHub issue in case it's related.

Thanks for the suggestion about tabs! However, I don't think that is the issue here. My minimal example (see below) doesn't have any tabs and it still accumulates RAM.

import pathlib, polars as pl

from dash import Dash, dcc, html, Input, Output
import plotly.graph_objects as go

#Constants
folder = pathlib.Path(r'C:\Logs (example)')

#Initialization
app = Dash()
app.layout = html.Div([dcc.Graph(id='graph'),
                       html.Button('update', id='button')])

#Callbacks
@app.callback(
    Output('graph', 'figure'),
    Input('button', 'n_clicks'))
def plot(_):

    #Lazily scan each CSV in the folder and keep only its last row
    lf = (pl.scan_csv(f_path,
                      separator=';').last() for f_path in folder.iterdir())

    lf = pl.concat(lf, how='vertical_relaxed')

    lf = lf.select(pl.col('Date').str.to_date('%m-%d-%Y '),
                   pl.col('Time').str.to_time('%H:%M:%S'))

    df = lf.collect()

    fig = go.Figure()
    trace = go.Scattergl(x=df['Date'], y=df['Time'], mode='lines+markers')

    fig.add_trace(trace)

    return fig

if __name__ == '__main__':
    app.run(debug=False, port=8050)

The folder in the example above contains 75 CSV files with 25 columns each. In total, they amount to 459,245 rows.

Hi @PerovskiteCell

Since there is no data, I didn’t run it… but can you isolate it to Polars? If you run the same app with Pandas, do you see the same issue?

Yes, I can isolate it to Polars. I included a new minimal example below that you should be able to run.

If I run it with polars_check = True, it starts at 98 MB and after 100 iterations it has grown to 261 MB. If I run it with polars_check = False (i.e. pandas), it starts and ends at 98 MB.
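(For reference, this is how I tracked the numbers above: a small helper using psutil, which is my own measuring code and not part of the app itself. I call log_rss() at the end of the callback.)

import os, psutil

process = psutil.Process(os.getpid())

def log_rss():
    #Working-set size of the current process, in MB
    print(f'RSS: {process.memory_info().rss / 1e6:.0f} MB')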

This is beyond my current skill level to debug further. Any idea who, or which community, could figure out a fix?

import pathlib, os, shutil
import polars as pl, pandas as pd, numpy as np, datetime as dt

from dash import Dash, dcc, html, Input, Output
import plotly.graph_objects as go


#Check-input
polars_check = True ### Whether the example runs with polars or with pandas.

if polars_check: #To accommodate the slower data retrieval with pandas.
    interval_time = 5E2
else:
    interval_time = 3E3

#Constants
folder = pathlib.Path(r'C:\PerovskiteCell example')

n_files = 100 #Number of files in folder
n_lines = 500000 #Number of total lines in folder
n_cols = 25


#Generating sample data in example folder (Only once).
if not folder.exists():

    size = int(n_lines / n_files)
    col = np.linspace(-1E3, 1E3, num=size)

    df = pl.DataFrame({f'col{n}': col for n in range(n_cols)})

    # Creating folder & files
    os.makedirs(folder)

    f_path0 = folder.joinpath('0.csv')
    df.write_csv(f_path0)

    for n in range(1, n_files):
        shutil.copy2(f_path0, folder.joinpath(f'{n}.csv'))


#Functions
def pl_data():
    """Retrieves data via the polars route"""

    #Lazily scan each CSV and select only its n-th row
    lf = (pl.scan_csv(folder.joinpath(f'{n}.csv'),
                      schema={f'col{n}': pl.Float64 for n in range(n_cols)})
            .select(pl.all().get(n)) for n in range(n_files))

    lf = pl.concat(lf)
    lf = lf.select('col0', 'col1')

    return lf.collect()


def pd_data():
    """Retrieves data via the pandas route"""

    #Read each CSV and keep only its n-th row
    dfs = (pd.read_csv(folder.joinpath(f'{n}.csv'), usecols=['col0', 'col1']).iloc[n:n+1]
           for n in range(n_files))

    return pd.concat(dfs, ignore_index=True)



#App (initialization)
app = Dash()
app.layout = html.Div([dcc.Graph(id='graph'),
                       dcc.Interval(id='check',
                                    interval=interval_time,
                                    max_intervals=100)])


@app.callback(
    Output('graph', 'figure'),
    Input('check', 'n_intervals'))
def plot(_):

    #Data retrieval
    if polars_check:
        df = pl_data()
    else:
        df = pd_data()

    #Plotting
    fig = go.Figure()
    trace = go.Scattergl(x=list(df['col0']), y=list(df['col1']), mode='lines+markers')

    fig.add_trace(trace)
    fig.update_xaxes(title = str(dt.datetime.now()))

    return fig


if __name__ == '__main__':
    app.run(debug=False, port=8050)

Hi @PerovskiteCell

Try increasing the interval time; currently there may not be enough time to read and process the files between callbacks.

@AnnMarieW,

I don't think the interval time is the issue.

  1. The callback takes (on average) ~250 ms, so an interval of 500 ms should be enough to do the processing (see the timing sketch below).

  2. With an interval of 3 s for both pandas and polars, I get the same result.
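(For reference, this is how I estimated the ~250 ms; the timing wrapper is my own measuring code inserted in the callback, not part of the app itself.)

import time

t0 = time.perf_counter()
df = pl_data()  # or pd_data()
print(f'data retrieval took {(time.perf_counter() - t0) * 1e3:.0f} ms')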

(Sorry for the delayed response; I was on holiday)

There is a possibly related issue posted for Polars:


@davidharris

Thanks for weighing in.

However, I do not think this is related to that post. I'm using Windows, and that poster does not see the issue on Windows (only on Linux).

I think my issue is similar to this recent post on StackOverflow. That poster has an issue using Polars with Flask (which Dash is built on).
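To illustrate, here is a stripped-down, Dash-free sketch along the lines of that post (my own reconstruction, not the poster's code; the data size is arbitrary). Requesting the page repeatedly should reproduce the same kind of per-request RAM growth described there:

import polars as pl
from flask import Flask

app = Flask(__name__)

@app.route('/')
def index():
    #Flask's dev server handles each request on a fresh thread,
    #which is where the per-request memory growth shows up.
    df = pl.DataFrame({'col0': list(range(100_000))}).select(pl.all() * 2)
    return str(df.height)

if __name__ == '__main__':
    app.run(port=5000)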

I think you're correct that the StackOverflow post highlights the root cause here. Unfortunately, if that's the case, there's not much we can do on the Dash side to help (it seems most likely that this needs to be fixed in Polars).

My recommendation would be to open a ticket in the pola-rs/polars repository highlighting the memory-leak issues with Polars under Flask (and therefore Dash).


@nathandrezner

Thanks for your input, Nathan.

I will open a ticket at their repository.

Found the solution by discussing the issue with the people on the Polars GitHub.

Add

os.environ['MIMALLOC_ABANDONED_PAGE_RESET'] = '1'

before importing polars.
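In context, the top of the entry point then looks like this. My understanding from that discussion (not an official statement) is that the option tells mimalloc, the allocator Polars uses on Windows, to reset pages abandoned by terminated threads, such as the per-request threads Flask spawns; it has to be in the environment before polars is first imported:

import os
os.environ['MIMALLOC_ABANDONED_PAGE_RESET'] = '1'  #Must be set before polars is imported

import polars as pl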

Hi @PerovskiteCell

Thanks for following up and posting the solution here! :tada: