Thanks for the suggestion.
So, for this type of workflow, would it make sense to do something like the following?
- Set up a separate “batch script” that loads raw data (from a local disk or e.g. an S3 bucket)
- The data is loaded into a Dask DataFrame and processed
- The processed result is saved in some format (e.g. SQL, CSV, …) and stored on disk or in S3
- The output is saved in files partitioned by day
- The script runs e.g. every 30 min. If new data has been uploaded, the output file for the current day is updated (older days are left unchanged). A rough sketch of what I have in mind follows this list.
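Something like this is what I’m imagining for the batch side (a minimal sketch only; the Parquet format, the `timestamp`/`value` column names, and the 100 ms resampling rule are all just assumptions on my part):

```python
import datetime as dt

import dask.dataframe as dd


def run_batch(raw_path: str, out_dir: str) -> None:
    """Load raw data, process it, and (re)write the current day's output file."""
    # dask + fsspec handle local paths and s3:// URLs the same way.
    df = dd.read_csv(raw_path, parse_dates=["timestamp"])  # hypothetical schema
    df = df.set_index("timestamp")

    # Example processing step: downsample to a fixed 100 ms resolution.
    df = df.resample("100ms").mean()

    # Only the current day is rewritten; files for older days stay untouched.
    today = dt.date.today()
    daily = df.loc[str(today)].compute()
    daily.to_parquet(f"{out_dir}/{today:%Y-%m-%d}.parquet")
```

The 30-minute schedule itself could then come from cron, a systemd timer, or similar; nothing in the script would depend on it.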
The Dash script is separate and runs in parallel with the scheduled batch script. The Dash script will load data from the processed output files; by default it loads the current day by reading the latest output file from the S3 bucket.
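On the Dash side, I picture something roughly like this (again only a sketch; `load_latest_day` is a hypothetical helper, the file path is a placeholder, and the one-minute `dcc.Interval` just stands for “re-check periodically”):

```python
import pandas as pd
import plotly.express as px
from dash import Dash, Input, Output, dcc, html

app = Dash(__name__)
app.layout = html.Div([
    dcc.Graph(id="timeseries"),
    # Periodically re-read the latest output so batch updates show up.
    dcc.Interval(id="refresh", interval=60 * 1000),
])


def load_latest_day() -> pd.DataFrame:
    # Hypothetical helper: locate and read the newest daily file,
    # e.g. by globbing the output directory via fsspec/s3fs.
    return pd.read_parquet("s3://my-bucket/processed/latest.parquet")  # placeholder


@app.callback(Output("timeseries", "figure"), Input("refresh", "n_intervals"))
def update_figure(_):
    df = load_latest_day()
    return px.line(df, x=df.index, y="value")  # hypothetical column name


if __name__ == "__main__":
    app.run(debug=True)
```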
A few questions:
A) Does the above process flow make sense overall, or would one do it in a different way? For example, some examples I’ve seen put the output in DynamoDB. I guess this might be faster, but it would not be as agnostic (we need the same code to also work for data on local disk and on non-AWS S3 servers).
B) If the batch script is in the middle of updating the latest output file, will that cause issues, e.g. the file being locked against the Dash script, so that end users can’t see the data in the front-end? It’s probably a basic thing, but I wonder if there’s a “right” way to handle that aspect; a sketch of what I mean follows below.
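To make B) concrete, this is the kind of pattern I have in mind (a sketch; write-to-temp-then-`os.replace` is what I’ve seen suggested for atomic swaps on a local filesystem, and as far as I understand S3 object PUTs are already atomic per object, so this would mainly matter for the local-disk case):

```python
import os
import tempfile

import pandas as pd


def write_atomically(df: pd.DataFrame, dest: str) -> None:
    """Write to a temp file, then atomically swap it into place.

    A reader should always see either the old file or the new one,
    never a half-written file.
    """
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(dest) or ".", suffix=".tmp")
    os.close(fd)
    try:
        df.to_parquet(tmp)
        os.replace(tmp, dest)  # atomic rename within one filesystem
    except BaseException:
        os.remove(tmp)  # clean up the partial temp file
        raise
```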
C) Would e.g. SQLite be a sensible format for the output files or would you recommend something else?
D) I guess what I’m trying to understand is whether there’s some Python module that helps set this up.
E) Even if we pre-process the data, we still face a challenge with the amount of data. For example, to allow a user to analyze a time series over a 10-minute interval, the data resolution may need to be 1 observation per 0.1 seconds. But if that’s the general resampling frequency, it yields ~1 million points for 24 hours and ~30 million for 1 month. We’re a bit unsure how to best handle this type of situation in Dash and the backend. I’ve looked at datashader, but I’m not sure it’s the right fit for this. I would expect this to be a fairly general challenge, though; one idea is sketched below.
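One idea we’ve been toying with for E) (a sketch; the resolution ladder, the point budget, and the column name are all assumptions) is to have the batch script store each day at several pre-aggregated resolutions, so the app can pick the finest one that still keeps the requested window under a point budget:

```python
import os

import numpy as np
import pandas as pd

RESOLUTIONS = ["100ms", "1s", "1min"]  # placeholder ladder, fine -> coarse

# Hypothetical raw data: one day at 10 Hz (~864k rows).
idx = pd.date_range("2024-01-01", periods=864_000, freq="100ms")
raw = pd.DataFrame({"value": np.random.randn(len(idx))}, index=idx)

# Batch side: write the same day once per resolution.
os.makedirs("processed", exist_ok=True)
for rule in RESOLUTIONS:
    raw.resample(rule).mean().to_parquet(f"processed/2024-01-01_{rule}.parquet")


# App side: finest resolution whose point count fits the budget.
def pick_resolution(window: pd.Timedelta, budget: int = 20_000) -> str:
    for rule in RESOLUTIONS:
        if window / pd.Timedelta(rule) <= budget:
            return rule
    return RESOLUTIONS[-1]


print(pick_resolution(pd.Timedelta("10min")))  # -> "100ms" (~6,000 points)
print(pick_resolution(pd.Timedelta("24h")))    # -> "1min" (~1,440 points)
```

Datashader, as far as I understand it, would instead rasterize the full-resolution data per view on the server; the pre-aggregation route keeps everything as plain files, which fits the rest of the pipeline, but maybe there’s a more standard solution.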