I have made a local Dash app to allow students to efficiently find and study their (measurement) data. To allow for further development, I tried to transition from pandas & DuckDB to Polars. After weeks of work integrating it into this extensive app, I realized that I have run into a major problem.
The app was stable before, but now with Polars the RAM footprint (of the pythonw.exe process) balloons with each successive callback. The app starts out around 100 MB, and each callback adds something like 5 MB. It doesn't seem to stabilize; at 500 MB it was still growing.
I'm sort of stuck and would really appreciate some pointers on how to resolve this.
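For reference, those numbers come from watching the pythonw.exe process, but the same readout can be logged from inside a callback. A minimal sketch, assuming the psutil package is installed (log_rss is just an illustrative helper, not part of the app):

import os, psutil

_proc = psutil.Process(os.getpid())

def log_rss(tag=''):
    """Prints this process's resident set size (RSS) in MB."""
    print(f'{tag} RSS: {_proc.memory_info().rss / 1e6:.1f} MB')

Calling log_rss() at the end of the callback should make the roughly 5 MB step per call visible in the console.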
Thanks for the suggestion about tabs! However, I don't think that is the issue here. In my minimum example (see below), I don't have any tabs and it still accumulates RAM.
import pathlib, polars as pl
from dash import Dash, dcc, html, Input, Output
import plotly.graph_objects as go
Yes. I can isolate it to polars. I included a new minimum example that you should be able to run.
If I run it with polars_check=True, then it starts at 98 MB and after 100 iterations it has grown to 261 MB. If I run it with polars_check=False (i.e. pandas), then it starts and ends at 98 MB.
This is beyond my current skill level to figure out further. Any idea who, or which community, could figure out a fix?
import pathlib, os, shutil
import polars as pl, pandas as pd, numpy as np, datetime as dt
from dash import Dash, dcc, html, Input, Output
import plotly.graph_objects as go
#Check-input
polars_check = True  ### Whether the example runs with polars or with pandas.
if polars_check:  # To accommodate the slower data retrieval with pandas.
    interval_time = 5E2
else:
    interval_time = 3E3
#Constants
folder = pathlib.Path(r'C:\PerovskiteCell example')
n_files = 100 #Number of files in folder
n_lines = 500000 #Number of total lines in folder
n_cols = 25
#Generating sample data in example folder (only once).
if not folder.exists():
    size = int(n_lines / n_files)
    col = np.linspace(-1E3, 1E3, num=size)
    df = pl.DataFrame({f'col{n}': col for n in range(n_cols)})
    # Creating folder & files
    os.makedirs(folder)
    f_path0 = folder.joinpath('0.csv')
    df.write_csv(f_path0)
    for n in range(1, n_files):
        shutil.copy2(f_path0, folder.joinpath(f'{n}.csv'))
#Functions
def pl_data():
    """Retrieves data via the polars route."""
    lf = (pl.scan_csv(folder.joinpath(f'{n}.csv'),
                      schema={f'col{n}': pl.Float64 for n in range(n_cols)})
            .select(pl.all().get(n))
          for n in range(n_files))
    lf = pl.concat(lf)
    lf = lf.select('col0', 'col1')
    return lf.collect()

def pd_data():
    """Retrieves data via the pandas route."""
    dfs = (pd.read_csv(folder.joinpath(f'{n}.csv'), usecols=['col0', 'col1']).iloc[n:n+1]
           for n in range(n_files))
    return pd.concat(dfs, ignore_index=True)
#App (initialization)
app = Dash()
app.layout = html.Div([dcc.Graph(id='graph'),
                       dcc.Interval(id='check',
                                    interval=interval_time,
                                    max_intervals=100)])

@app.callback(
    Output('graph', 'figure'),
    Input('check', 'n_intervals'))
def plot(_):
    #Data retrieval
    if polars_check:
        df = pl_data()
    else:
        df = pd_data()
    #Plotting
    fig = go.Figure()
    trace = go.Scattergl(x=list(df['col0']), y=list(df['col1']), mode='lines+markers')
    fig.add_trace(trace)
    fig.update_xaxes(title=str(dt.datetime.now()))
    return fig

if __name__ == '__main__':
    app.run(debug=False, port=8050)
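A possible stopgap, in case it helps others hitting this: run the Polars retrieval in a short-lived worker process, so that whatever memory it holds is returned to the OS when the worker exits. A rough sketch using only the standard library (untested in the full app; pl_data is the function from the example above, it assumes the returned DataFrame pickles cleanly across the process boundary, and on Windows the spawned worker re-imports this module, so app.run() must stay behind the __main__ guard as it does here):

from concurrent.futures import ProcessPoolExecutor

def pl_data_isolated():
    """Runs pl_data() in a fresh process; its memory is freed when the pool shuts down."""
    # A new single-worker pool per call keeps the worker short-lived;
    # the resulting DataFrame is pickled back to the parent process.
    with ProcessPoolExecutor(max_workers=1) as pool:
        return pool.submit(pl_data).result()

The price is process start-up cost on every interval, so this only makes sense if the growing footprint itself is the blocker.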
I think you're correct that the StackOverflow post is highlighting the root cause here. Unfortunately, if that's the case there's not much we can do on the Dash side to help (it seems most likely that this needs to be fixed in Polars).
My recommendation would be to open a ticket in the pola-rs/polars repository highlighting the memory leak issues of Polars in Flask (and therefore, Dash).
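For that ticket, a repro that takes Dash and Flask out of the loop entirely would make triage much easier for the Polars maintainers. Something along these lines (a sketch: it reuses pl_data() and the generated CSV folder from the example above, and assumes psutil for the RSS readout):

import gc, os, psutil

proc = psutil.Process(os.getpid())
for i in range(100):
    df = pl_data()  # same scan_csv / concat / collect as in the app
    del df
    gc.collect()  # rule out ordinary Python garbage first
    print(f'iteration {i}: RSS {proc.memory_info().rss / 1e6:.0f} MB')

If the RSS climbs here as well, that is strong evidence for the Polars ticket; if it stays flat, the interaction with Flask/Dash is part of the story.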