Running calculation-heavy process in background?

brokenAlgorithm · December 27, 2018, 11:09am

I have inherited a Dash app which calls a calculation-intensive back-end. In order to isolate user states and to provide intermediate results, I plan to have the following set-up:

- Execute the calculation-heavy process after a button click
- Gradually write the results of the process into an SQLite DB (keyed by user ID to ensure unique concurrent states in case of several users)
- Have Dash read the DB contents at regular intervals in order to provide intermediate results
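
The three steps above could be sketched roughly as follows. This is only a minimal illustration using the standard-library `sqlite3` module; the table layout and the names `open_store`, `write_result` and `read_results` are assumptions made for this example, not part of the app in question.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the results database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        " user_id TEXT NOT NULL,"
        " step INTEGER NOT NULL,"
        " value REAL NOT NULL,"
        " PRIMARY KEY (user_id, step))"
    )
    return conn


def write_result(conn, user_id, step, value):
    """Called by the calculation each time a new intermediate value is ready."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO results (user_id, step, value) VALUES (?, ?, ?)",
            (user_id, step, value),
        )


def read_results(conn, user_id):
    """Called by the front-end at regular intervals; returns this user's rows only."""
    rows = conn.execute(
        "SELECT step, value FROM results WHERE user_id = ? ORDER BY step",
        (user_id,),
    )
    return rows.fetchall()
```

With a database file on disk instead of `:memory:`, the writer and the reader would each open their own connection to the same path.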

I am new to Dash. Is this at all possible with Dash? Is it possible to implement this without running 2 totally separate Python scripts (see this question: Update CSV From A Different Script)? Is it possible to implement this in an elegant way with Dash at all? If not, what would the work-around be? I am currently bound to Dash due to time restrictions.

dustyatx · December 27, 2018, 12:36pm

There is a lot of missing information, so it’s not possible to tell you what the best approach will be for your problem. Can you add more detail, such as how much data you’re processing, what type of system it’s running on (cores, memory, cloud provider), what type of calculations you are running, and what modules you’re using (other than Dash)? Anything that gives the full picture.

IMO this doesn’t sound like a Dash issue and you’ll make a mess if you try to use Dash to solve it. Dash is a web framework that manages the presentation and interactions. What you’re describing is a data processing problem, which is not going to be handled well by any web framework. At best, what you can implement with Dash is an async workflow that says: we’re processing the data, and we’ll send you an alert/notification when it’s done. Acceptable in some cases but not ideal.

Here is where I’d look to solve the problem.
Don’t let this list overwhelm you… Most of these are less than a day’s work to set up even if you don’t have much exposure to them. The tricky part is identifying the best solution for your problem. This list is in the order I’d investigate in.

1. Review your query / data manipulation for items that are slowing it down. For instance, are you running an aggregation on a set of data that could be filtered first?
2. Partition your data so you’re only processing what you need to. If you have a file with 10 years of time series but you typically only use 3 years, break it up by year. Then sub-partition it based on dimensions. Columns with few unique values are a good place to look for that. So an example might be year/state/zipcode/company or company/year/local/department/project etc… The partitioning really depends on what you need to accomplish.
3. You are going to pay a lot of time penalties reading and writing to a SQLite file (file I/O is the slowest of all). Depending on the size of the data, you might need an RDBMS or data platform to offload processing to. Dask is a nice drop-in replacement for Pandas that is easy to scale vertically (more CPU cores = faster); it can scale horizontally too, but that is complicated to set up and not as good as Spark. Spark/pySpark is faster than Dask but more work to set up (Databricks makes it instantly accessible and uses notebooks!) and it has a bit of a learning curve. Google BigQuery is a blazing fast analytics data warehouse. PostgreSQL and MemSQL are also excellent.
4. Batch processing! Look for opportunities to preprocess the data and do so. CRON jobs or a data platform with a scheduler is the way to go. A common design is: data refreshes at midnight, aggregations start at 1am and run one after another (often forming a multi-stage pipeline), depositing data into a repository (database, cloud bucket, filesystem), and by 8am all data has landed and is ready for consumption.
5. Use a message queue and distribute the job to background workers. I prefer RabbitMQ but Celery is ok too. The cloud providers all have their own. This is the most scalable but it’s more complex. You need to write consumers to subscribe to the queue and keep everything from breaking. A bit of overkill for most, but if someone reading this does decide this is the right fit, I highly recommend investigating the Serverless framework (AWS Lambda, Google Functions, etc). It has a fairly steep learning curve but it’s actually simple once you “get” it.
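
To make the partitioning and filter-first points above a bit more concrete, here is a small sketch in plain Python. The record layout `(year, value)` and the function names are made up for illustration; with real data this would more likely be done in Pandas, Dask or the database itself.

```python
from collections import defaultdict


def partition_by_year(records):
    """Group (year, value) records so later steps can load only the years they need."""
    partitions = defaultdict(list)
    for year, value in records:
        partitions[year].append(value)
    return partitions


def mean_for_years(partitions, years):
    """Aggregate over the selected partitions only, skipping all other data."""
    selected = [v for year in years for v in partitions.get(year, [])]
    return sum(selected) / len(selected) if selected else None
```

For example, `mean_for_years(partition_by_year(records), [2017, 2018])` never touches the values from other years.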

brokenAlgorithm · December 27, 2018, 2:45pm

Hi, first of all thanks for the quick reply!

Some background: The bottleneck is really complexity more than space (heavy financial sim). It uses python & C and has optional multiprocessing implemented. So not much gain in terms of back-end optimizations to be made. It is currently meant to be a “simple” prototype with a web interface for presentation purposes, hence no full-stack overnight batch processes, cloud computations etc. Optimizations of this kind will come at a later stage.

Goal
Build a simple web-based front-end for the model

Prerequisites

I have a working version. Set parameters, press a query button, compute.
However: It takes about 15 minutes to compute the entire result. It would be better if the user receives some visual progress indicator.
I have a function to retrieve gradual results by multiple method calls (imagine the final time series slowly being filled with data points with each call)
The model & Dash app run on a Linux server with Gunicorn. Max 5 users. Constraints: No Redis (Big corp, complicated IT, you know the deal…). Simple caching or (hence…) a DIY SQLite workaround on the server.

Current Idea
User presses query. Calls the “compute” method. Compute method calls my “gradualResult” method multiple times. Have the results be written into some form of cache. Have Dash regularly query the cache and display the contents to the user. The user will see the final result gradually build up over the 15 minutes computation time. Since same parameters produce same results, the cache should be shared among users.
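
A rough sketch of that idea, assuming a background thread and an in-memory dictionary as the cache. `gradual_result` here is a hypothetical stand-in for the real method, and `start_compute` / `poll` are names invented for this example.

```python
import threading

_cache = {}  # params -> {"points": [...], "done": bool}
_cache_lock = threading.Lock()


def gradual_result(params, step):
    """Hypothetical stand-in for the model's incremental method."""
    return params[0] * step


def _compute(params, n_steps, entry):
    """Run the whole computation, appending each new point to the shared entry."""
    for step in range(n_steps):
        point = gradual_result(params, step)
        with _cache_lock:
            entry["points"].append(point)
    with _cache_lock:
        entry["done"] = True


def start_compute(params, n_steps=5):
    """Start the computation for params unless it is already cached or running."""
    with _cache_lock:
        if params in _cache:
            return None
        entry = {"points": [], "done": False}
        _cache[params] = entry
    thread = threading.Thread(
        target=_compute, args=(params, n_steps, entry), daemon=True
    )
    thread.start()
    return thread


def poll(params):
    """Return a snapshot (points so far, finished flag) for the front-end to display."""
    with _cache_lock:
        entry = _cache.get(params)
        if entry is None:
            return [], False
        return list(entry["points"]), entry["done"]
```

Note that an in-memory dictionary is only visible inside one process. With several Gunicorn workers, each worker is a separate process, so the cache would have to live somewhere shared (for instance a file or database on the server).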

I’m specifically interested in whether Dash can handle this kind of query natively (is there a commonly used architecture for this…?). If not, how could I achieve something similar? In the end, it’s all about user feedback: trigger a Dash query, have the user see intermediate steps of the calculation.

I understand async would be the most natural way to go (have the compute method signal dash each time a new datapoint was delivered), but unfortunately dash doesn’t support this as far as I’m aware…!

EDIT
Formatting

dustyatx · December 27, 2018, 3:36pm

Got what you’re trying to accomplish. This doesn’t sound like a prototype to me. Prototype is the least possible thing you can do to get to proof of concept. If you need a POC then drop all the bells and whistles, you really don’t need callbacks and interactivity. If you want to put a few in to showcase capability that’s cool. What you’re trying to solve for shouldn’t be considered until you get sign-off and product/user requirements. Otherwise you can easily waste a lot of time solving an issue that doesn’t need solving.

But to give you more insight: I think where you are going to run into an issue is that you are trying to use short-term actions for long-running jobs. This isn’t really reactive programming. You can probably get it to work, but it’ll be a hack and it breaks UX best practices as I understand them.

Think of the callback update as immediate feedback that guides the user as to what to do next (form validation error), provides delight (I changed a thing and other things changed too!) or gives them the value they were looking for quickly (I filtered data and the table updated!). Typically we want all three happening at the same time if possible.

Interact with the app, wait 15 mins, then see a change. See the problem there?
Most likely what a user would do is start the process, start doing something else, then check back later to see if it’s done. Until you can figure out where this fits in the user’s workflow, you’re going to be guessing as to what they need.

IMO when the time is right, you already have the correct solution: this should be handled with messaging. If you can’t use Redis, Celery is a Python-based alternative. Start the job, then send them an email and/or update the view when the job has finished.

It does look like someone has gotten async working with Dash…

github.com

plotly/dash-recipes/blob/master/dash-asynchronous.py

import dash
from dash.dependencies import Input, Output, Event
import dash_core_components as dcc
import dash_html_components as html

import datetime
import time


class Semaphore:
    def __init__(self, filename='semaphore.txt'):
        self.filename = filename
        with open(self.filename, 'w') as f:
            f.write('done')

    def lock(self):
        with open(self.filename, 'w') as f:
            f.write('working')

    def unlock(self):

This file has been truncated.

Another thing to consider: can you do batch processing? Can you preprocess the data so it’s ready before the user hits the process button?

brokenAlgorithm · December 27, 2018, 3:40pm

Just as a reply to my own comment, to further clear things up:

I see 2 possible solutions:

1. Have the server run 2 parallel Python sessions: one for the Dash app & Gunicorn workers, one for the model calculation. A button press sends the model parameters to the model process (I don’t know how yet, but there should be a solution). The model process has a queue of incoming calls and runs them through, writing intermediate and final results into some DB (granted, it need not be SQLite).

2. Have just one Python process which runs both the Dash app & Gunicorn workers. A button press triggers a function which continuously runs in the background while writing gradual results to a DB. The question is: Can Dash handle a process which runs in the background? Can Dash query results from a DB while another part of its code is computing results? Or do I have to implement some strange callback loop?
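
The "queue of incoming calls" from the first option might look something like the sketch below. It uses a worker thread and the standard-library `queue.Queue` purely as a stand-in for a separate model process; `run_model`, `submit` and the `results` dictionary are hypothetical names for this example.

```python
import queue
import threading

jobs = queue.Queue()  # incoming (job_id, params) tuples
results = {}  # job_id -> list of intermediate values
results_lock = threading.Lock()


def run_model(params):
    """Hypothetical model: yields intermediate values one at a time."""
    total = 0
    for x in params:
        total += x
        yield total


def worker():
    """Take jobs off the queue one by one; stop when a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        job_id, params = job
        for value in run_model(params):
            with results_lock:
                results.setdefault(job_id, []).append(value)
        jobs.task_done()


def submit(job_id, params):
    """What a button press would do: enqueue the job and return immediately."""
    jobs.put((job_id, params))
```

The worker would be started once, e.g. `threading.Thread(target=worker, daemon=True).start()`, and the front-end would only ever call `submit` and read from `results`.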

dustyatx · December 27, 2018, 4:14pm

1. It’s pretty common to send a job to a queue (Celery) and spawn a sub-process to run the job in the background. I’m sure you can find information on how to do this with Flask.
2. Yes, possible, but this is going to get complicated. You wouldn’t want to couple processing the data, depositing it to a database and updating the view. That’s a lot of work for little return.

It’s tough to tell you the “right” way here because it’s dependent on your data and what your user needs to do with it.

I think you’re on the right track with the background workers and the database. But since this is a long running process I wouldn’t try to automatically populate the dashboard.

I will give you a real-world version of this problem. I had a Node.js dashboard connected to hundreds of datasets with anywhere from 500 to 30 million records each. We needed to run aggregations and calculations to populate the dashboards, but they took anywhere from 5 secs to 25 mins to run. That would cause the dashboards with the most data to fail to render.

So we moved the processing from a user-triggered action to a batch job. We gathered query time and file size data and then used this to create a prioritized queue. This queue was sorted by our most important customers, then by largest dashboards. We fired off a batch job to process these dashboards and cached the results in Redis, a cloud bucket or BigQuery (depending on use case and file size). I was able to scale this up because I could assign a worker to each dashboard that needed to be built (AWS Batch, running a few hundred jobs); these workers got their jobs from a RabbitMQ queue.
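
As an illustration only (the original system is not shown here), a queue ordered by customer importance and then by dashboard size could be sketched with the standard-library `heapq` module. The tuple layout and the convention that rank 1 is the most important customer are assumptions of this example.

```python
import heapq


def build_queue(dashboards):
    """dashboards: iterable of (name, customer_rank, size); rank 1 = most important."""
    heap = []
    for name, customer_rank, size in dashboards:
        # Smaller tuples pop first: best rank, then largest size (negated), then name.
        heapq.heappush(heap, (customer_rank, -size, name))
    return heap


def drain(heap):
    """Yield dashboard names in processing order."""
    while heap:
        _, _, name = heapq.heappop(heap)
        yield name
```

Each worker would then take the next name from this ordering and build that dashboard.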

That way when the user got to the dashboard the core data was ready. We then used the frontend & backend to filter that data and update the view in real time.

nedned · December 28, 2018, 2:57pm

You’ll probably find the new loading states API, which is still in pre-release, helpful here.

As for the long-running task side of things, there are examples of people using libraries like Celery if you search in the forums (I think) for this kind of thing. But this is less a Dash-specific issue, and rather one that you’d expect to see cropping up in any web framework where an incoming web request triggers a long-running task. So you could google around for doing this while handling Flask requests. The challenge is going to be updating the app based on the results of your long-running task completing. It’s not elegant, but the only way that’s coming to mind right now is to use a dcc.Interval component to poll for any updates. You’ll have to manage session IDs somehow to do that though.
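
A minimal sketch of the session-ID bookkeeping such polling would need, written as plain functions. The idea is that a callback fired by the `dcc.Interval` component would call something like `poll` on each tick; all names here (`new_session`, `record`, `poll`) are hypothetical, and the Dash wiring itself is left out.

```python
import uuid

_progress = {}  # session_id -> list of values produced so far


def new_session():
    """Create a session ID, e.g. to keep in a hidden component when the page loads."""
    session_id = str(uuid.uuid4())
    _progress[session_id] = []
    return session_id


def record(session_id, value):
    """Called by the long-running task whenever it has a new value."""
    _progress[session_id].append(value)


def poll(session_id, already_seen):
    """Body of a polling callback: return only the values the client has not seen yet."""
    values = _progress.get(session_id, [])
    return values[already_seen:]
```

As with any in-memory state, this only works if the task and the polling callback run in the same process.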

If you can get away with running your app with more workers and allowing one of them to be occupied for a while, that’s certainly going to be simpler, but it may not be feasible.

Using the lru_cache decorator from functools (or functools32 for Python 2), as described in the Dash docs, is also an easy win for caching the results of callbacks.
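
For reference, a small example of `functools.lru_cache` on its own; the function `expensive` is just a placeholder for a slow calculation called from a callback.

```python
from functools import lru_cache


@lru_cache(maxsize=32)
def expensive(a, b):
    """Placeholder for a slow calculation; the arguments must be hashable."""
    return a ** b


first = expensive(2, 10)   # computed and stored
second = expensive(2, 10)  # served from the cache
info = expensive.cache_info()  # named tuple with hits, misses, maxsize, currsize
```

The cache lives inside one Python process, so with several Gunicorn workers each worker keeps its own copy.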

nedned · December 29, 2018, 3:11am

Ooh, this project from the Dash Show and Tell thread looks like it could be helpful:
