Loading large datasets into dash app

Hi, I am trying to read a large dataset (around 40 million rows in total) into a Dash app, which is taking a long time.

I am currently resorting to reading in a pickled DataFrame, but that still takes around two minutes. Is there a faster way to read in a large dataset?

Hi @ettan
One way to do this is to use Dash with Vaex.
Also, if you’re trying to plot the data, Plotly-resampler component might be useful.

1 Like

Thanks a lot @adamschroeder . From what I can tell, Vaex is used for computation on the data after it has been ingested. However, I am asking about the initial ingestion of the data, which is what is taking a long time.

For example, I currently structure my code as one app per tab, i.e. apps/app1 and apps/app2. Each app has to ingest the original data source once at startup, which makes the initial load of the app quite slow. Is there a better way to do this?

Could you post a few more details on the use case, and preferably some example code demonstrating your current approach? The “best” approach is case-dependent, but here are a few general points:

If the data is static, i.e. it doesn’t change during the lifetime of the app, you could load the data on app initialization, which would mean that the two minutes of load time would only happen when your server starts.
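As a minimal sketch of the static case (the `load_data` helper and its body are placeholders, not the poster's actual code), the expensive read can be memoized so it runs at most once per server process, and every Dash callback then reuses the in-memory result:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def load_data():
    # Placeholder for the expensive read, e.g. pd.read_pickle("big.pkl").
    # With lru_cache(maxsize=1), the cost is paid once per server process.
    return list(range(5))  # stand-in for the 40M-row DataFrame

# First call pays the full load cost; later calls (e.g. from inside
# Dash callbacks) return the cached object immediately.
data = load_data()
assert load_data() is data  # same object, no re-read
```

Alternatively, simply assigning `data = load_data()` at module level achieves the same effect, since module code runs once at server start.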

If the data is dynamic, you’ll (obviously) need to load it on demand. If you don’t need all the data, you could speed up the loading by dividing it into (pre-processed) chunks, and then load only the chunk(s) needed. If your access pattern is complex, you might consider a database instead. If you do need all the data, you might consider using a fast caching mechanism, e.g. an in-memory Redis cache.
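A rough sketch of the chunking idea, using plain pickle files and a tiny stand-in dataset (the file layout and the `load_chunk` helper are assumptions for illustration, not part of the thread): a one-off preprocessing step partitions the data by key, and each request then reads only the chunk it needs.

```python
import pickle
import pathlib
import tempfile

# Assumption: the data can be pre-partitioned by some key (here, a year)
# so a callback only reads the chunk it needs, not the full dataset.
outdir = pathlib.Path(tempfile.mkdtemp())
full = {2021: [1, 2], 2022: [3, 4], 2023: [5, 6]}  # stand-in dataset

# One-off preprocessing step: write each chunk to its own file.
for key, rows in full.items():
    (outdir / f"chunk_{key}.pkl").write_bytes(pickle.dumps(rows))

def load_chunk(key):
    """Load only the requested chunk on demand."""
    return pickle.loads((outdir / f"chunk_{key}.pkl").read_bytes())

print(load_chunk(2022))  # -> [3, 4]
```

With pandas, a columnar format such as Parquet partitioned by the same key would serve the same purpose more efficiently than pickle.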

1 Like

Hi, @Emil, I just came across this post and I have a similar problem to the one you are talking about.

I want to plot a scatterplot of [500K-1M] points from a csv file. The points never vary.

My question is: from what I read, I understand that it is possible to store a preprocessed version of a graph in a file that is only computed the first time the server starts, with every later load reading it from that file. Is that correct?

Do you know of any online example in order to check this out?

I have also tried to boost it with Vaex, but the performance is almost the same as with Pandas.

If the points are static, you can simply load them into a global variable in the app and access them on demand. If the figure is static as well, you can create a variable with the figure (or rather, its JSON representation) in the global scope. This way, the data processing is only performed on app start. I wouldn’t recommend using Vaex for this kind of use case.
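Sketching that pattern with a figure-like dict (`build_figure` stands in for the real `px.scatter` call, and its contents are invented for illustration): the figure is built once at module import, and a callback can return the prebuilt object, or its cached JSON string, without recomputing anything.

```python
import json

def build_figure():
    # Stand-in for the expensive px.scatter(...) construction.
    return {"data": [{"type": "scattergl", "x": [0, 1], "y": [1, 0]}],
            "layout": {"title": {"text": "genome"}}}

# Built once, when the module is imported, i.e. at server start.
FIG = build_figure()
FIG_JSON = json.dumps(FIG)  # optionally cache the serialized form too

def serve_figure():
    # What a Dash callback would return on initial load: no recomputation.
    return FIG
```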

Hi @Emil. Thanks for the feedback. I’ll try to elaborate a bit more on my scenario and then ask about what I am not getting.

I have an endpoint where, on first load, the same graph is always shown. This graph is a genome with the corresponding 24 chromosomes. On the first load, it always has a highlight on the first chromosome, as follows:

  fig.add_vrect(x0=x_min, x1=x_max, fillcolor="green",
                opacity=0.25, line_width=0)

This is a highlighted area of a scatter plot, which is created (with some update_layout options) as follows:

fig = px.scatter(df_in, x="x_col", y="y_col",
                 labels={"x_col": "x"},
                 hover_data={
                     "y_col": True,
                     "x_col": ":,",
                     "color_col": False,
                 })

This graph just updates dynamically, moving the highlighted green area to the given section of the scatter plot. Currently the update operations on the plot are quite fluid, but the first load takes around 15-25 seconds.

The points are already preprocessed and stored in a CSV, which is loaded into the corresponding pandas DataFrame. If I understood you correctly, you mean that, instead of passing a pandas DataFrame to the scatter plot, I should transform the input CSV to JSON with the given x and y points and pass that as the input data? Is that right? I mean, is the slow initial plotting related to transforming the pandas DataFrame to JSON?


Thanks a lot.

Another question that comes to mind is: is it possible to programmatically obtain the output JSON of the graph and feed it back in as the input JSON? Because if I have understood it right, this would save all the computation time, leaving only the time Plotly needs to render the input JSON.

Something like this:


Yes, that’s what I mean by ‘create a variable with the figure’. So basically you just bind the fig variable in the global scope and return it directly on the initial load :slight_smile:
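On the question above about reusing the graph's output JSON: the round trip can be sketched with a figure-like dict and plain `json` (for a real Plotly figure object, `fig.write_json(path)` and `plotly.io.read_json(path)` do the same job). The dict contents here are invented for illustration.

```python
import json
import pathlib
import tempfile

# Figure-like dict standing in for fig.to_dict() of the real scatter plot.
fig = {"data": [{"x": [1, 2], "y": [3, 4]}], "layout": {}}

cache_file = pathlib.Path(tempfile.mkdtemp()) / "fig.json"

# One-off step: persist the computed figure to disk.
cache_file.write_text(json.dumps(fig))

# On later server starts, skip the computation entirely and load the
# stored JSON instead.
restored = json.loads(cache_file.read_text())
assert restored == fig  # identical figure, no recomputation
```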

1 Like