Hi all,
I have an app where I would like the user to be able to upload a tabular data file, and draw/tweak various graphs based on the data. The dcc.Upload component solves the upload step. I am currently using hidden divs to store the data itself, but I’d like to move away from that.
The file could potentially be quite large, so that transferring it over the network and deserialising it could be a bottleneck. In that case, there is no way to avoid a delay on the upload step (which is fine), but I want to make sure to avoid that delay when the user is merely changing a graph setting. The graph will need to be redrawn, but hopefully the data need not be re-transferred from the browser (hence I want to avoid hidden data divs). I was initially thinking of caching the data server-side using flask-cache, as per Example 4 at https://dash.plot.ly/sharing-data-between-callbacks .
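For reference, here is a minimal sketch of the memoized-store pattern from that page, using flask_caching (the maintained fork of flask-cache) with an in-memory backend; `expensive_parse` is a placeholder for whatever deserialisation step applies:

```python
import dash
from flask_caching import Cache

app = dash.Dash(__name__)

# In-memory cache living on the Flask server behind the Dash app.
cache = Cache(app.server, config={"CACHE_TYPE": "simple"})

@cache.memoize(timeout=600)
def global_store(raw):
    # The expensive step runs once per distinct `raw` value;
    # repeat calls with the same argument are served from the cache.
    return expensive_parse(raw)  # placeholder for your deserialisation
```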
I also noticed the thread “Upload Component - file uploaded multiple times”, which has example code for writing the dataset to disk using a unique session ID. I think this would solve my problem, but if possible I would like to use memory instead of disk for added speed.
Here’s where I’m a bit confused: I don’t understand how to use flask-cache to “store” the data from dcc.Upload(). If I apply @cache.memoize() to a function that deserialises the data, then Input('upload-data', 'contents') will need to be an input to that function, and it sounds like merely passing that input in (even if memoization means it isn’t used) is itself an expensive step, because the contents would be re-transferred over the network every time the callback fires?
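To make the worry concrete, here is the kind of callback I mean (hypothetical ids and helpers). As I understand it, every Input and State is posted from the browser to the server whenever the callback fires, so the full base64 contents string would travel over the network even when memoization means it isn’t re-parsed:

```python
from dash.dependencies import Input, Output

@app.callback(
    Output("graph", "figure"),
    [Input("upload-data", "contents"),   # full base64 payload: re-sent
     Input("graph-setting", "value")],   # on *every* firing of this callback
)
def update_graph(contents, setting):
    df = global_store(contents)          # memoized parse from the sketch above
    return make_figure(df, setting)      # hypothetical figure builder
```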
What I am currently considering doing is:
- Writing the data to disk, identified by a session_id, as per “Upload Component - file uploaded multiple times”.
- Reading the data from disk in any callback that needs it, but memoizing the disk-reading function, so that up to some cache threshold we won’t need to read from disk at all. The disk-reading function takes the session_id and Input('upload-data', 'last_modified') as inputs, so that memoization can rapidly return the same DataFrame whenever the data on disk is unchanged. (Sketched in code below.)
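Concretely, here is a sketch of that plan, assuming a CSV upload, a session-id held somewhere in the layout (e.g. a hidden div, as in the linked thread), a hypothetical upload-ack div to hang the write callback on, pickle as the on-disk format, and the `app`/`cache` objects from the first sketch; `make_figure` is again a placeholder:

```python
import base64
import io
import os

import pandas as pd
from dash.dependencies import Input, Output, State

FILE_DIR = "file-store"  # hypothetical directory for per-session files
os.makedirs(FILE_DIR, exist_ok=True)

def session_path(session_id):
    return os.path.join(FILE_DIR, f"{session_id}.pkl")

@app.callback(
    Output("upload-ack", "children"),
    [Input("upload-data", "contents")],
    [State("session-id", "children")],
)
def store_upload(contents, session_id):
    # Decode the dcc.Upload payload ("data:<mime>;base64,<data>") and
    # persist it once per upload, keyed by session.
    if contents is not None:
        decoded = base64.b64decode(contents.split(",", 1)[1])
        df = pd.read_csv(io.BytesIO(decoded))
        df.to_pickle(session_path(session_id))
    return ""

@cache.memoize(timeout=600)
def read_session_data(session_id, last_modified):
    # `last_modified` exists only to key the cache: a re-upload changes it
    # and forces a fresh disk read; otherwise the cached frame comes back.
    return pd.read_pickle(session_path(session_id))

@app.callback(
    Output("graph", "figure"),
    [Input("graph-setting", "value"),
     Input("upload-data", "last_modified")],
    [State("session-id", "children")],
)
def redraw(setting, last_modified, session_id):
    # Only small values cross the network here; the DataFrame comes from
    # the memoized reader (memory if cached, disk otherwise).
    df = read_session_data(session_id, last_modified)
    return make_figure(df, setting)  # hypothetical figure builder
```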
This means that when the data is first uploaded and the callbacks are triggered, it will be written to disk and then immediately read back, which seems a bit ridiculous, but at least this only happens at upload time rather than repeatedly.
Also, doing it this way, the cache threshold can be set to something quite moderate, since if multiple users are active and we overflow the cache, the penalty is only a disk read.
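If it helps, flask-caching exposes that threshold directly as a config option; a small example (the number itself is arbitrary):

```python
# Cap the in-memory cache at e.g. 20 entries; an eviction
# only costs one disk read on the next lookup.
cache = Cache(app.server, config={"CACHE_TYPE": "simple", "CACHE_THRESHOLD": 20})
```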
Is this a sensible approach, or have I missed a simpler option? I feel it should be possible to use flask-cache directly on the outputs of dcc.Upload, but I can’t get my head around how to do so. Possibly I’ve just misunderstood one of the moving parts.
Thanks!