Black Lives Matter. Please consider donating to Black Girls Code today.

Data structures for live-updating dashboard

Hi I’m working on my first ‘production’ dashboard app and wanted to get opinions on how to structure the data pipeline… I have experience in data science, but I’ve almost always worked from cleaned up csv files.

I’m planning on analyzing and displaying metrics based on live sensor data paired with relatively static user/project data. The metrics will probably reach up to 50 users and contain up to 10,000 sensor -user pairs summarized into basic statistics.

At the moment I’m planning on using mysql RDS to store the raw data, and write a python script to query the database every 10 minutes to create the transforms/metrics I’d like to display and store them as csv files in a project directory. I’m using AWS, gunicorn, and nginx to serve the app currently.

Then I’d reload the csv files every 10mins in pandas, filter the data based on reactive inputs, and output them with dash.

My reasoning is that I’d think performance would be vastly improved if the data transforms were handled by another process instead of being generated on the fly with a sql query in dash.

Do you think this is a good way to handle data for a ‘live’ updating dash app? Am I missing something? Is there a better way to handle it?

Thanks. I appreciate your feedback.

One thought I have is that it could be worth starting with an initial prototype that queries your SQL directly. If it turned out that the overheads of querying the DB and performing transforms weren’t that bad, then you’ve just saved yourself a lot of work. Also, if you see a lot of the same queries, caching the callback functions (with an expiry value on the cache) will save time on at least some requests.

But if it turns out that there is too much latency or other costs associated with this, then periodically storing the transformed summary statistics is probably a good idea. And yes, having another process is probably a good idea, otherwise you will block the process running Dash (or at least one of the worker processes) from handling requests while it’s busy. You could setup a cron job to run some other script to do this, or use a scheduler like Celery. There’s a discussion on this topic in this thrad. I also suggest there the perhaps somewhat hacky approach of just spinning up another thread to do the batch querying and transforms. Probably not as performant as the other approaches but if you happen to want to keep everything in the same script, it could do the job.

I’m no expert, so I would be interested if somebody believes my approach is flawed.

I have made an app that requires processed data from raw data that I have stored in a Heroku Postgresql database. I use a separate worker process to update a processed_data table in the database. I then query that table directly within the Dash app using cached callback functions.

1 Like

That seems sensible to me. There’s no reason why summary statistics need to be kept in CSV files. A separate table(s) in your RDS that contains transformed information is a common approach.

1 Like

Would really like to learn and reference the code to work on a similar app that I am working on. Is it possible for you to share the git link