We’re looking to set up a simple dashboard via Dash.
Our ‘data lake’ consists of raw time series data in a binary format, which is uploaded to our S3 server on a steady basis (e.g. one new file every 5 min). To display the data in Dash requires basic processing, which is time consuming - for example to process the last 24 hours of data may take 20 seconds .
One way to set this up could be to create pre-processed SQL databases, e.g. one per day. When new data is uploaded to our server, the latest SQL DB would be update accordingly. Our Dash app could then load the SQL databases based on what time period a user selects.
However, we are wondering if there is a smarter way to achieve this - for example using some form of caching functionality in Dash? Ideally we’d like to eliminate the SQL “middle step” from the equation.
If you have any suggestions for what we might look at here, it would be greatly appreciated.
I guess it depends on the type of queries that the clients are doing. Are they typically similar? If so, a caching solution could work. The client would have to wait the first time a new query is made, but subsequent calls will be fast. If they are mostly different on the other hand, a preprocessing pipeline would probably be a better choice.
Hi Emil, thanks for your reply.
I believe the queries would be fairly similar.
To illustrate, an example dashboard would be as follows:
- The plot would show 1 time series parameter at a time, e.g. vehicle speed
- The user may via a drop down select another parameter to plot instead, e.g. engine speed
- Via another dropdown, the user may select a different time period, split by days (so e.g. to show the last 7 days, last 3 days, …) e.g. up to the last 30 days
So there would be a limited set of time series views that would need to be displayed/prepared for this type of interaction, i.e. 20 parameters x 30 days.
Of course, we could pre-process the data e.g. every 30 minutes so that we create a simple database that contains the 20 parameters for the last 30 days. The Dash frontend would then use this database as the source.
However, we’re thinking if there’s some built-in mechanism in Dash to handle this step. For example, if one can define a limited set of “views” or snapshots that may be “cached” by Dash and updates every X minutes. The idea would be that the cached data would be managed as part of built-in Dash functionality, rather than us “re-inventing the wheel” with some custom database pre-processing.
To be clear, we also want to ensure that the “initial load” is fast - so we do not want the dashboard to compute things on-request for the first time, as it may take e.g. 20 seconds to compute a single view. Rather, we’re looking for a methodology similar to ‘sharing data between callbacks’, but where the pre-processing is also handled as part of the Dash functionality.
Would be very happy to hear your thoughts on this - hope my outline makes sense.
If the initial load must be fast, i would say that pre processing (external of the Dash app) is the way to go. Dash is a frontend framework, but since you want to process data before any requests are made from the client, you should rather look at setting up a scheduled job e.g. using schedule.
Depending on the amount of data, the filtering capabilities needed and/or other needs (e.g. access from other non-Dash clients), it might be sufficient to store your preprocessed data in a few object pickled to files rather than setting up a whole database.
Thanks for the suggestion.
So, would it make sense for this type of workflow e.g. to do the following?
- Set up a separate “batch script” that loads raw data (from a local disk or e.g. S3 bucket)
- The data is loaded into a dask dataframe and processed
- The processed result is saved into some format (e.g. SQL, CSV, …) and stored on the disk or S3
- The output will be saved in files separated by day
- The script runs e.g. every 30 min. If new data is uploaded, the output file for the current day is updated (the older days are not changed in this case)
The Dash script is separate and runs in parallel with the scheduled batch script. The Dash script and will load data from the processed output files. By default, the file loads the current day by loading the latest output file from the S3 bucket.
A few questions:
A) Does the above process flow make sense overall - or would one do it in a different way? For example, some examples I’ve seen put the output in an S3 DynamoDB. I guess this might be faster, but it would not be as agnostic (we need the same code to work also for data on local disk and on non-AWS S3 servers)
B) If the batch script is in the middle of updating the latest output file, will that cause issues with e.g. locking the file to the Dash script, so that end users will not be able to see the data in the front-end? It’s probably a basic thing, but I wonder if there’s a “right” way to handle that aspect.
C) Would e.g. SQLite be a sensible format for the output files or would you recommend something else?
D) I guess what I’m trying to understand is if there’s some Python module that helps set this up.
E) Even if we pre-process the data, we still face a challenge with the amount of data. For example, to allow a user to analyze a time series in a 10 min interval, it may require that the data resolution is 1 observation per 0.1 second. But if that’s the general resampling frequency, it yields ~1 mn points for 24 hours and ~30 mn for 1 month. We’re a bit unsure how to best handle this type of situation in Dash and the backend. I’ve looked at datashader, but not sure if it’s the right fit for this. I would expect this to be a fairly general challenge, though.