Dash & Polars: RAM use keeps increasing

I have made a local Dash app that lets students efficiently find and study their (measurement) data. To allow for further development, I tried to transition from pandas & DuckDB to Polars. After weeks of work integrating it into this extensive app, I realized that I have run into a major problem.

The app was stable before, but now with Polars the RAM footprint (of the pythonw.exe process) balloons with each successive callback. The app starts out around 100 MB, and each callback adds something like 5 MB. It doesn't seem to stabilize; at 500 MB it was still growing.

I'm sort of stuck and would really appreciate some pointers on how to resolve this.

Hi @PerovskiteCell and welcome to the Dash community!

It sounds like an interesting app. Just curious, does it use tabs? There was a recent issue posted on GitHub: [BUG] Switching back and forth between dcc.Tabs doesn't seem to release memory in the browser · Issue #2882 · plotly/dash · GitHub.

If not, would it be possible to create a minimal example? It would be helpful to add it to that GitHub issue in case it's related.

Thanks for the suggestion about tabs! However, I don't think that is the issue here. My minimal example (see below) doesn't have any tabs and still accumulates RAM.

import pathlib, polars as pl

from dash import Dash, dcc, html, Input, Output
import plotly.graph_objects as go

#Constants
folder = pathlib.Path(r'C:\Logs (example)')

#Initialization
app = Dash()
app.layout = html.Div([dcc.Graph(id='graph'),
                       html.Button('update', id='button')])

#Callbacks
@app.callback(
    Output('graph', 'figure'),
    Input('button', 'n_clicks'))
def plot(_):

    lfs = (pl.scan_csv(f_path, separator=';').last()
           for f_path in folder.iterdir())

    lf = pl.concat(lfs, how='vertical_relaxed')

    lf = lf.select(pl.col('Date').str.to_date('%m-%d-%Y '),
                   pl.col('Time').str.to_time('%H:%M:%S'))

    df = lf.collect()

    fig = go.Figure()
    trace = go.Scattergl(x=df['Date'], y=df['Time'], mode='lines+markers')
    fig.add_trace(trace)

    return fig

if __name__ == '__main__':
    app.run(debug=False, port=8050)

The folder in the example above contains 75 csv files with 25 columns each; in total, it amounts to 459,245 rows.

Hi @PerovskiteCell

Since there is no data, I didn't run it... but can you isolate it to Polars? If you run the same app with pandas, do you see the same issue?

Yes, I can isolate it to Polars. I have included a new minimal example below that you should be able to run.

If I run it with polars_check=True, the process starts at 98 MB and after 100 iterations it has grown to 261 MB. If I run it with polars_check=False (i.e. pandas), it starts and ends at 98 MB.
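For reference, I read these numbers off the pythonw.exe process. They could also be logged from inside the app; a minimal sketch, assuming the third-party psutil package is installed:

import os, psutil

process = psutil.Process(os.getpid())

def log_ram():
    """Print the resident set size of the current process in MB."""
    print(f'RSS: {process.memory_info().rss / 1024**2:.0f} MB')

Calling log_ram() at the end of the plot callback makes the per-iteration growth visible in the console.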

This is beyond my current skill level to figure out further. Any idea who/which community could figure out a fix?

import pathlib, os, shutil
import polars as pl, pandas as pd, numpy as np, datetime as dt

from dash import Dash, dcc, html, Input, Output
import plotly.graph_objects as go


#Check-input
polars_check = True ### Whether the example runs with polars or with pandas.

if polars_check: #To accommodate the slower data retrieval with pandas.
    interval_time = 5E2
else:
    interval_time = 3E3

#Constants
folder = pathlib.Path(r'C:\PerovskiteCell example')

n_files = 100 #Number of files in folder
n_lines = 500000 #Number of total lines in folder
n_cols = 25


#Generating sample data in example folder (Only once).
if not folder.exists():

    size = int(n_lines / n_files)
    col = np.linspace(-1E3, 1E3, num=size)

    df = pl.DataFrame({f'col{n}': col for n in range(n_cols)})

    # Creating folder & files
    os.makedirs(folder)

    f_path0 = folder.joinpath('0.csv')
    df.write_csv(f_path0)

    for n in range(1, n_files):
        shutil.copy2(f_path0, folder.joinpath(f'{n}.csv'))


#Functions
def pl_data():
    """Retrieves data via the polars route"""

    lfs = (pl.scan_csv(folder.joinpath(f'{n}.csv'),
                       schema={f'col{n}': pl.Float64 for n in range(n_cols)})
             .select(pl.all().get(n))
           for n in range(n_files))

    lf = pl.concat(lfs)
    lf = lf.select('col0', 'col1')

    return lf.collect()


def pd_data():
    """Retrieves data via the pandas route"""

    dfs = (pd.read_csv(folder.joinpath(f'{n}.csv'), usecols=['col0', 'col1']).iloc[n:n + 1]
           for n in range(n_files))

    return pd.concat(dfs, ignore_index=True)



#App (initialization)
app = Dash()
app.layout = html.Div([dcc.Graph(id='graph'),
                        dcc.Interval(id = 'check', 
                                        interval = interval_time,
                                        max_intervals = 100)])


@app.callback(
    Output('graph', 'figure'),
    Input('check', 'n_intervals'))
def plot(_):

    #Data retrieval
    if polars_check:
        df = pl_data()
    else:
        df = pd_data()

    #Plotting
    fig = go.Figure()
    trace = go.Scattergl(x = list(df['col0']), y=list(df['col1']), mode='lines+markers')

    fig.add_trace(trace)
    fig.update_xaxes(title = str(dt.datetime.now()))

    return fig


if __name__ == '__main__':
    app.run(debug=False, port=8050)

Hi @PerovskiteCell

Try increasing the interval time. Currently, there may not be enough time to read and process the files.

@AnnMarieW,

I donā€™t think the interval time is an issue.

  1. The callback takes (on average) ~250 ms, so an interval of 500 ms should be enough to do the processing (see the timing sketch below).

  2. With an interval of 3 s for both pandas and polars, I get the same result.
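A sketch of how such a timing can be taken; the timed decorator here is a hypothetical helper (not part of Dash), built with only the standard library:

import functools, time

def timed(func):
    """Print how long each call to func takes, in ms."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        result = func(*args, **kwargs)
        print(f'{func.__name__}: {(time.perf_counter() - t0) * 1E3:.0f} ms')
        return result
    return wrapper

Stacking it below @app.callback(...) and above def plot(_): times just the Python work of the callback.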

(Sorry for the delayed response; I was on holiday)

There is a possibly related issue posted for Polars:


@davidharris

Thanks for weighing in.

However, I do not think this is related to that post. I'm on Windows, and that poster does not see the issue on Windows (only on Linux).

I think my issue is similar to this recent post on StackOverflow, where the poster has an issue using Polars with Flask (which Dash also builds on).

I think you're correct that the StackOverflow post highlights the root cause here. Unfortunately, if that's the case, there's not much we can do on the Dash side to help; it seems most likely that this needs to be fixed in Polars.

My recommendation would be to open a ticket in the pola-rs/polars repository highlighting the memory-leak issue with Polars under Flask (and therefore Dash).


@nathandrezner

Thanks for your input, Nathan.

I will open a ticket at their repository.

Found the solution by discussing the issue with the people on the Polars GitHub.

Add

os.environ['MIMALLOC_ABANDONED_PAGE_RESET'] = '1'

before importing polars.
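For completeness, a minimal sketch of the fix in context. My understanding (an assumption on my part, not something stated in the Polars thread) is that the variable has to be in the environment before Polars is first imported, because the bundled mimalloc allocator reads its options when the native extension loads:

import os
os.environ['MIMALLOC_ABANDONED_PAGE_RESET'] = '1' #Must be set before the first polars import

import polars as pl #mimalloc picks the option up from the environment
from dash import Dash, dcc, html, Input, Output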

Hi @PerovskiteCell

Thanks for following up and posting the solution here! :tada: