I have made a local Dash app to allow students to efficiently find and study their (measurement) data. To allow for further development, I tried to transition from pandas & DuckDB to Polars. After weeks of work integrating it into this extensive app, I realized that I have run into a major problem.
The app was stable before, but now with Polars the RAM footprint (of the pythonw.exe process) balloons with each successive callback. The app starts out around 100 MB, and each callback adds something like 5 MB. It doesn't seem to stabilize; at 500 MB it was still growing.
I'm sort of stuck and would really appreciate some pointers on how to resolve this.
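For reference, those numbers come from watching the pythonw.exe process, but the same readout can be logged from inside a callback. A minimal sketch, assuming the psutil package is installed (log_rss is just an illustrative helper, not part of the app):

import os, psutil

_proc = psutil.Process(os.getpid())

def log_rss(tag=''):
    """Prints this process's resident set size (RSS) in MB."""
    print(f'{tag} RSS: {_proc.memory_info().rss / 1e6:.1f} MB')

Calling log_rss() at the end of the callback should make the roughly 5 MB step per call visible in the console.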
Thanks for the suggestion about tabs! However, I don't think that is the issue here. In my minimum example (see below), I don't have any tabs and it still accumulates RAM.
import pathlib, polars as pl
from dash import Dash, dcc, html, Input, Output
import plotly.graph_objects as go
Yes. I can isolate it to polars. I included a new minimum example that you should be able to run.
If I run it with polars_check=True, then it starts at 98 MB and after 100 iterations it has grown to 261 MB. If I run it with polars_check=False (i.e. pandas), then it starts and ends at 98 MB.
This is beyond my current skill level to figure out further. Any idea who, or which community, could figure out a fix?
import pathlib, os, shutil
import polars as pl, pandas as pd, numpy as np, datetime as dt
from dash import Dash, dcc, html, Input, Output
import plotly.graph_objects as go
#Check-input
polars_check = True  ### Whether the example runs with polars or with pandas.
if polars_check:  # To accommodate the slower data retrieval with pandas.
    interval_time = 5E2
else:
    interval_time = 3E3
#Constants
folder = pathlib.Path(r'C:\PerovskiteCell example')
n_files = 100 #Number of files in folder
n_lines = 500000 #Number of total lines in folder
n_cols = 25
#Generating sample data in example folder (only once).
if not folder.exists():
    size = int(n_lines / n_files)
    col = np.linspace(-1E3, 1E3, num=size)
    df = pl.DataFrame({f'col{n}': col for n in range(n_cols)})
    # Creating folder & files
    os.makedirs(folder)
    f_path0 = folder.joinpath('0.csv')
    df.write_csv(f_path0)
    for n in range(1, n_files):
        shutil.copy2(f_path0, folder.joinpath(f'{n}.csv'))
#Functions
def pl_data():
    """Retrieves data via the polars route."""
    lf = (pl.scan_csv(folder.joinpath(f'{n}.csv'),
                      schema={f'col{n}': pl.Float64 for n in range(n_cols)})
            .select(pl.all().get(n))
          for n in range(n_files))
    lf = pl.concat(lf)
    lf = lf.select('col0', 'col1')
    return lf.collect()

def pd_data():
    """Retrieves data via the pandas route."""
    dfs = (pd.read_csv(folder.joinpath(f'{n}.csv'), usecols=['col0', 'col1']).iloc[n:n+1]
           for n in range(n_files))
    return pd.concat(dfs, ignore_index=True)
#App (initialization)
app = Dash()
app.layout = html.Div([dcc.Graph(id='graph'),
                       dcc.Interval(id='check',
                                    interval=interval_time,
                                    max_intervals=100)])

@app.callback(
    Output('graph', 'figure'),
    Input('check', 'n_intervals'))
def plot(_):
    #Data retrieval
    if polars_check:
        df = pl_data()
    else:
        df = pd_data()
    #Plotting
    fig = go.Figure()
    trace = go.Scattergl(x=list(df['col0']), y=list(df['col1']), mode='lines+markers')
    fig.add_trace(trace)
    fig.update_xaxes(title=str(dt.datetime.now()))
    return fig

if __name__ == '__main__':
    app.run(debug=False, port=8050)
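A possible stopgap, in case it helps others hitting this: run the Polars retrieval in a short-lived worker process, so that whatever memory it holds is returned to the OS when the worker exits. A rough sketch using only the standard library (untested in the full app; pl_data is the function from the example above, it assumes the returned DataFrame pickles cleanly across the process boundary, and on Windows the spawned worker re-imports this module, so app.run() must stay behind the __main__ guard as it does here):

from concurrent.futures import ProcessPoolExecutor

def pl_data_isolated():
    """Runs pl_data() in a fresh process; its memory is freed when the pool shuts down."""
    # A new single-worker pool per call keeps the worker short-lived;
    # the resulting DataFrame is pickled back to the parent process.
    with ProcessPoolExecutor(max_workers=1) as pool:
        return pool.submit(pl_data).result()

The price is process start-up cost on every interval, so this only makes sense if the growing footprint itself is the blocker.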
I think you're correct that the StackOverflow post is highlighting the root cause here. Unfortunately, if that's the case there's not much we can do on the Dash side to help (it seems most likely that this needs to be fixed in Polars).
My recommendation would be to open a ticket in the pola-rs/polars repository highlighting the memory leak issues of Polars in Flask (and therefore, Dash).
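For that ticket, a repro that takes Dash and Flask out of the loop entirely would make triage much easier for the Polars maintainers. Something along these lines (a sketch: it reuses pl_data() and the generated CSV folder from the example above, and assumes psutil for the RSS readout):

import gc, os, psutil

proc = psutil.Process(os.getpid())
for i in range(100):
    df = pl_data()  # same scan_csv / concat / collect as in the app
    del df
    gc.collect()  # rule out ordinary Python garbage first
    print(f'iteration {i}: RSS {proc.memory_info().rss / 1e6:.0f} MB')

If the RSS climbs here as well, that is strong evidence for the Polars ticket; if it stays flat, the interaction with Flask/Dash is part of the story.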