Suggestions for large data in global variable?

Hello. We have a Dash-based application that uses some very large datasets. An initial design used global Python variables for storing some data in memory, which I know is a no-no. In another case, we’re reading a very large file from disk every time we need to calculate some derived data for display. I have implemented a Redis-backed Flask cache and have played around with @cache.memoize, but I’m just realizing that Redis has a maximum size of 512 MB per value, which we will surpass. Does anyone have any suggestions for handling these cases? Logically, keeping the values in memory (which we can do) seems like it would help a lot, but how does one do that in Dash in 2024? I have seen references to Plasma, Arrow, Ray and other tools. Thank you.

Hello,

I think this article could be useful to you: How Polars Can Help You Build Fast Dash Apps for Large Datasets. It shows how you can use Polars to compute derived variables from your huge dataset. :slight_smile:
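
To give an idea of what that looks like, here is a rough sketch of the kind of lazy query Polars enables (the file name and columns are made up):

import polars as pl

# Lazily scan the file; nothing is loaded into memory yet.
derived = (
    pl.scan_parquet("huge_dataset.parquet")
    .filter(pl.col("value") > 0)
    .group_by("category")
    .agg(pl.col("value").mean().alias("mean_value"))
    .collect()  # only the small derived table is materialized
)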

If you have the possibility, you can also set up a proper database (e.g. PostgreSQL) and shift the computation into the database. That way, Dash is only in charge of making the right queries to calculate the derived data.
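
For example, something along these lines (the table, columns and connection string are invented), where the heavy aggregation happens in PostgreSQL and only the small result comes back to Dash:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string.
engine = create_engine("postgresql+psycopg2://user:password@host:5432/mydb")

query = """
    SELECT sensor_id,
           date_trunc('day', ts) AS day,
           avg(value)            AS mean_value
    FROM measurements
    GROUP BY sensor_id, day
"""

# Only the aggregated (small) result is pulled into memory.
derived = pd.read_sql(query, engine)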

I guess it depends on your use case!

Interesting, thank you. I hadn’t heard of Polars. That also looks like a plus for Arrow, since Polars is built on it. We are not currently using parquet files, BTW. Complicated stuff!

Polars is not limited to parquet files (see: https://docs.pola.rs/api/python/stable/reference/io.htm). And if not Polars, other solutions exist, as you mentioned.

Here are some questions to orient the choice.

What are the types of files you have?
Is the data changing? (e.g. newly generated datasets)
How do you perform your computations? (i.e. using pandas?)

For background, I’m a software guy helping some scientists, and I’m still learning some of this (the project and the tools).

We have various files, like TIFF.

Much of the data is static for a given run, like the one that prompted this question. It’d be nice to cache that in memory. But some is not, or it can be updated.

I think we mostly do computations in python using numpy.

Thank you for your assistance!

ETA: We do also make use of pandas.

A bit more detail about my specific goal at the moment…

We are loading a large static TIFF with skimage.io.imread() and getting back an ndarray, which I would like to cache since it’s unchanging and loaded multiple times. Redis is too limiting in terms of size, and I see that Plasma has been deprecated. It doesn’t feel like Polars is right for this. This is all server-side at the moment and on a single node, so would using Python’s built-in shared memory support make sense?
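
Something along these lines is what I have in mind (just a sketch using multiprocessing.shared_memory; the names and shapes are made up):

import numpy as np
from multiprocessing import shared_memory

# Process A: load the TIFF once and copy it into a named shared-memory block.
image = np.zeros((20000, 20000), dtype=np.uint8)  # stand-in for skimage.io.imread(...)
shm = shared_memory.SharedMemory(create=True, size=image.nbytes, name="tiff_cache")
cached = np.ndarray(image.shape, dtype=image.dtype, buffer=shm.buf)
cached[:] = image[:]

# Process B (e.g. another worker): attach to the same block by name, without copying.
existing = shared_memory.SharedMemory(name="tiff_cache")
view = np.ndarray((20000, 20000), dtype=np.uint8, buffer=existing.buf)

# Cleanup once the data is no longer needed.
existing.close()
shm.close()
shm.unlink()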

Thanks again.

I have not yet tried anything, but Ray seems like a promising tool:

https://docs.ray.io/en/master/ray-core/objects/serialization.html#numpy-arrays
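
From the docs, the basic pattern appears to be something like this (I haven’t tried running it yet):

import numpy as np
import ray

ray.init()  # starts a local Ray instance with a shared-memory object store

image = np.zeros((20000, 20000), dtype=np.uint8)  # stand-in for the loaded TIFF
ref = ray.put(image)     # store the array once in the object store
restored = ray.get(ref)  # readers get a zero-copy, read-only view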

If anyone has any experience with it, I’d be happy to hear about it.

I have typically used one of the following approaches:

  1. Load data into memory (global variable on load)
    (+) Fastest option
    (-) Data must be static and fit in memory

  2. Use Redis
    (+) Very fast. Scales well
    (-) Requires some setup, data must fit in memory, can be expensive as IaaS

  3. Cache (parsed) data on disk, e.g. using pickle
    (+) Fast, especially if you have a decent SSD
    (-) May require a bit of manual bookkeeping, and some protocols (e.g. pickle) pose a security risk

It seems to me that you should try out (3).
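
A minimal sketch of (3), assuming you keep the Flask-Caching setup you already have and just swap in a filesystem backend (the cache directory is a placeholder):

from dash import Dash
from flask_caching import Cache
import skimage.io

app = Dash(__name__)

# Filesystem-backed cache attached to the underlying Flask server.
cache = Cache(app.server, config={
    "CACHE_TYPE": "FileSystemCache",
    "CACHE_DIR": "/tmp/tiff-cache",  # placeholder path
    "CACHE_DEFAULT_TIMEOUT": 0,      # 0 = entries never expire
})

@cache.memoize()
def load_image(path):
    # Parsed once per path; later calls read the pickled result back from disk.
    return skimage.io.imread(path)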

Hello, Emil. I am basically trying to do #1, though I’m avoiding the use of Python global variables, as recommended by the docs. Our data is apparently too big for Redis, and the data we’re reading is already on disk. They are fast disks, I’m sure, but it seems senseless to keep loading it into memory. Thanks.

Global variables are OK, if the data is read on application initialization and never modified (and fits in memory, obviously).
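
I.e. something like this, where the data is read once at module import and only ever read afterwards (the file name is a placeholder):

import numpy as np
from dash import Dash, html

# Loaded once when the app starts; treated as read-only from then on.
DATA = np.load("static_data.npy")  # placeholder for e.g. skimage.io.imread(...)

app = Dash(__name__)
app.layout = html.Div(f"Loaded array of shape {DATA.shape}")

if __name__ == "__main__":
    app.run(debug=True)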

There can be a significant difference in speed depending on how the data is stored and read. Try writing the final data structure (i.e. after any processing, filtering, etc.) to disk using pickle, and read it back. Depending on how your data is currently stored, the speedup can be significant.

Hm, I had gotten it into my head that global variables are always bad in a Dash app. I hadn’t thought much about why, or why this case might be okay. In this case, we are reading TIFF files that aren’t modified, but they also aren’t read at initialization; there could be many of them, and they’re read as needed. In cases where they’re re-used, I figure storing them in memory would help. We may not be able to store all of them, so a caching scheme would be useful. Given that context, I’m not sure it’s still a good candidate for a global variable.

Given these are TIFF files, I was assuming that reading them using imread() is about as fast as it will get, but maybe I should question that assumption.

Thanks for the info!

As a rule of thumb, they are bad. However, in certain cases (i.e. when the data is static, you don’t care about the increased startup time of the app, and you need the performance), they can be OK. For your use case, it doesn’t sound like that’s the case (unless you load all files up front, the data structure won’t be static).

Could you provide an example of a TIFF file that loads (too) slowly? That would make it easier for people to help you out with possible performance improvements :slight_smile:

I will keep that in mind about globals. Maybe a situation like that will arise one day.

Unfortunately, I can’t share the data I’m working with. But to give you an idea of the scale, the files I’m using are about 450 MB; others could probably be larger. And “too slow” is, of course, a relative thing; these actually load on the order of seconds, so it’s not terrible, but you know how users are! I was originally hoping that the Redis we were already using could easily be used to knock that down a bit.

For a (parsed) data size of ~500 MB, I would expect loading to take ~0.1-0.2 s on a modern system. I believe that would be acceptable to most users. Here is a small snippet to measure the speed on your hardware:

from datetime import datetime
import numpy as np
import pickle


def create_sample_data():
    """
    Create sample data file of approx 500 MB.
    """
    fn = "sample.pkl"
    sample = np.random.randint(0, 255, (50000, 10000)).astype(np.uint8)
    with open(fn, "wb") as f:
        pickle.dump(sample, f)
    return fn


fn = create_sample_data()
tic = datetime.now()
with open(fn, "rb") as f:
    data = pickle.load(f)
toc = datetime.now()
# Time taken to load sample.pkl: 0.130858 seconds
print(f"Time taken to load {fn}: {(toc - tic).total_seconds()} seconds")

Wow, drives really have gotten fast. That’s impressive. (I see similar times on my development laptop.)

That’s certainly an argument for caching with pickled files. I’ll propose that as a possible improvement. (I’m still looking at Ray, too.)

Thanks again for your help.

Well, this is rather embarrassing, but I need to post it here in case anyone gets the wrong idea.

I’ve been experimenting with a lot of things lately, and I thought I had seen that reading a TIFF file with imread was taking a few seconds. In reality, however, it appears to be just as fast as reading a pickle file. Apologies to everyone!
