
Using dcc.Upload with flask-cache

Hi all,

I have an app where I would like the user to be able to upload a tabular data file, and draw/tweak various graphs based on the data. The dcc.Upload component solves the upload step. I am currently using hidden divs to store the data itself, but I’d like to move away from that.

The file could potentially be quite large, so that transferring it over the network and deserialising it could be a bottleneck. In that case, there is no way to avoid a delay on the upload step (which is fine), but I want to make sure to avoid that delay when the user is merely changing a graph setting. The graph will need to be redrawn, but hopefully the data need not be re-transferred from the browser (hence I want to avoid hidden data divs). I was initially thinking of caching the data server-side using flask-cache, as per Example 4 at https://dash.plot.ly/sharing-data-between-callbacks.

I also noticed the thread Upload Component - file uploaded multiple times, which has example code for writing the dataset to disk using a unique session ID. I think this would solve my problem, but if possible I would like to use memory instead of disk for added speed.

Here’s where I’m a bit confused: I don’t understand how to use flask-cache to “store” the data from dcc.Upload. If I apply @cache.memoize() to a function that deserialises the data, then Input('upload-data', 'contents') will need to be an input to that function, and it sounds like merely passing that input in (even if memoization means it isn’t used) could be an expensive step, since it may involve re-transferring the data over the network?

What I am currently considering doing is:

  • Writing the data to disk, identified by a session_id, as per Upload Component - file uploaded multiple times
  • Reading the data from disk in any callback that needs the data, but memoizing the disk-reading function, so that up to some cache threshold we won’t actually need to read from disk at all. The disk-reading function would take the session_id and Input('upload-data', 'last_modified') as inputs, so that memoization can rapidly return the same DataFrame whenever the data is unchanged on disk.
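The two steps above can be sketched roughly as follows. This is a self-contained toy using functools.lru_cache in place of flask-caching’s @cache.memoize() (which would play the same role in a real Dash app, but shares the cache across worker processes); the function names, file format and cache directory are all illustrative.

```python
# Sketch of the write-then-memoized-read pattern (names are illustrative).
import functools
import json
import os
import tempfile

CACHE_DIR = tempfile.mkdtemp()  # assumption: per-session files live here


def write_data(records, session_id):
    """Write the uploaded data to disk, keyed by the session ID."""
    with open(os.path.join(CACHE_DIR, f"{session_id}.json"), "w") as f:
        json.dump(records, f)


@functools.lru_cache(maxsize=32)
def read_data(session_id, last_modified):
    """last_modified is part of the cache key only: a new upload changes the
    timestamp, bypassing the memoized result so the file is re-read; an
    unchanged (session_id, last_modified) pair returns the in-memory copy."""
    with open(os.path.join(CACHE_DIR, f"{session_id}.json")) as f:
        return tuple(json.load(f))
```

Every callback that needs the data would call read_data(session_id, last_modified); only the first call per (session, upload) pair touches the disk.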

This means that when the data is first uploaded and callbacks triggered, it will get written to disk and then immediately read back, which seems a bit ridiculous, but at least only occurs at upload time rather than repeatedly.

Also, doing it this way, the cache threshold can be set to something quite moderate, since if there are multiple active users and we overflow the cache, the penalty is only a disk read.

Is this a sensible approach or have I missed a simpler option? I feel that it should be possible to use flask-cache directly on the outputs of dcc.Upload, but can’t get my head around how to do so. Possibly I’ve just misunderstood one of the moving parts.



Hi @clare, were you able to find a solution?
I have the exact same issue (with a different callback though) and I’ve been struggling to figure it out for a week.

Hi @srikar_1996 ,

I ended up using basically the solution I outlined, i.e. I’m generating a random session ID, storing the session ID in a div, and writing the data to disk using the session ID as part of the filename. Then any callback that wants the data takes the session ID and the upload timestamp as input and reads the file from disk itself.

In addition (but depending on your performance needs it might not be necessary), I’m using flask-caching’s @cache.memoize() to memoize the disk read/write functions. If the user uploads a new file, the upload timestamp changes, so the memoized function will read the new data from disk. But if the timestamp and session have both been seen before, the data cached in memory can be used.
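The per-session ID part of this can be sketched as below; in the Dash app the ID would be generated in a serve_layout function (so each page load gets a fresh one) and placed in a hidden div, as the commented-out fragment suggests. All names here are illustrative, not the exact code from my app.

```python
# Sketch of per-session ID generation (names are illustrative).
import uuid


def make_session_id():
    """Return a random ID used to key this browser session's file on disk."""
    return uuid.uuid4().hex


# In the Dash app this would look roughly like:
#
# def serve_layout():
#     session_id = make_session_id()  # fresh ID on each page load
#     return html.Div([
#         html.Div(session_id, id="session-id", style={"display": "none"}),
#         dcc.Upload(id="upload-data"),
#         ...
#     ])
#
# app.layout = serve_layout  # passing the function makes Dash call it per load
```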

I was also discussing some of the details of writing to disk in this thread: https://community.plotly.com/t/working-on-large-datasets-comparison-with-shiny

While I was figuring out the method, I made a toy app to try the approach out - much easier to read than my actual app! I’ve turned it into a gist if that helps: https://gist.github.com/claresloggett/6bce7526c450d3fbc5e64df9806769d0


Hi @clare,

Thank you for the reply. I’ve used several of the ideas you outlined.

In my case, I have a URL component from which I need to extract the pathname and accordingly fetch the data. So I’m using @cache.memoize() and storing the dataframe in the filesystem using the session ID. To fetch the data from the file system I use both the session ID and the unique pathname.

This saves a lot of network transfer for me and is much faster. I also don’t need to convert the data to JSON to move it between callbacks, since it lives on the filesystem, which was one of my major concerns.
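The two-part cache key described above can be sketched like this. Again functools.lru_cache stands in for flask-caching’s @cache.memoize(), a dict stands in for the filesystem store, and all names are illustrative:

```python
# Sketch: memoized fetch keyed on BOTH session ID and URL pathname, so
# different pages within one session are cached separately (illustrative).
import functools

FAKE_STORE = {}  # stands in for the filesystem store in this sketch


def save(session_id, pathname, data):
    """Persist the data for one (session, page) pair."""
    FAKE_STORE[(session_id, pathname)] = data


@functools.lru_cache(maxsize=64)
def fetch(session_id, pathname):
    # @cache.memoize() would key on the same two arguments in the real app
    return FAKE_STORE[(session_id, pathname)]
```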

And the gist was really helpful!