📔 Dash Enterprise Fieldnotes #4 - Job Queues

This is the 4th post in my series on Dash Enterprise. We kicked off the Dash Enterprise Fieldnotes series with Why Dash Enterprise?, followed by an essay on the App Manager and then one on Dash Design Kit.
The next essay in the series will be on the motivation and people behind Dash Enterprise’s HPC capabilities.

One of the companies on the Dash Enterprise product advisory board is an investment bank (over $2B in AUM). When I was writing Dash in 2016, we worked with this organization to retool their portfolio analysis stack for their institutional investor customer base.

The complete portfolio analysis routine took an hour to run. It fetched data from a dozen financial APIs and ran a set of proprietary financial optimization routines written in Python. Eventually, hundreds of institutional investment advisors would use this tool on a daily basis to analyze their portfolios.

The infrastructure and libraries we built for this client in 2016 became the foundation of the Job Queue infrastructure that Dash Enterprise customers use today. As our customers bring Dash applications into production, we find that 80% of them transition to this Job Queue architecture.

That is, our customers’ Dash app deployments often start like this:

Then, to scale in performance and reliability, they evolve to this:

Long-Running Tasks

For Dash applications with long-running tasks and hundreds (or thousands) of end users, you’re going to need the Dash Enterprise Job Queue. Anytime a computation takes longer than 15 seconds, it should be sent to the Job Queue and run asynchronously.

If you don’t use the Job Queue, these long-running tasks will tie up the web processes that run your callbacks, and the Dash application will become unresponsive. No fun for your Dash app end users.
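Here’s a minimal sketch of the pattern using plain Celery and Redis rather than the Dash Enterprise Job Queue API itself; the task name, component IDs, and the sixty-second sleep are all illustrative stand-ins:

```python
import time

from celery import Celery
from dash import Dash, Input, Output, State, dcc, html

celery_app = Celery(
    __name__,
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

@celery_app.task
def run_analysis(portfolio_id):
    time.sleep(60)  # stand-in for the hour-long optimization routine
    return f"Results ready for {portfolio_id}"

app = Dash(__name__)
app.layout = html.Div([
    html.Button("Run analysis", id="run"),
    dcc.Store(id="task-id"),
    dcc.Interval(id="poll", interval=2000),  # poll every 2 seconds
    html.Div(id="status"),
])

@app.callback(
    Output("task-id", "data"),
    Input("run", "n_clicks"),
    prevent_initial_call=True,
)
def submit(n_clicks):
    # The web process only enqueues the job and returns immediately,
    # so it stays free to serve other users.
    return run_analysis.delay("portfolio-123").id

@app.callback(
    Output("status", "children"),
    Input("poll", "n_intervals"),
    State("task-id", "data"),
)
def check(n_intervals, task_id):
    if not task_id:
        return "Idle"
    result = celery_app.AsyncResult(task_id)
    return result.result if result.ready() else "Running..."
```

The callback that submits the job and the callback that reports on it are decoupled: the web process never blocks on the computation itself.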

Refreshing Data in the Background

In addition to Dash application performance, the second use case for a Job Queue is running periodic jobs in the background. If fetching juicy, fresh data for your Dash application takes longer than 3-5 seconds (it likely will!), then you’ll want to fetch that data in a background job that runs on a schedule. The schedule is configurable: it might run every minute, every 15 minutes, every hour, every morning, every Sunday at sunset, etc.

Historically, periodic scheduling of Python jobs has been done with cron or Apache Airflow, but the Dash Enterprise Job Queue aims for a new level of no-configuration simplicity: one or two lines of Python code inside your Dash application and boom - your job is scheduled.
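I won’t reproduce the Dash Enterprise API here, but as a rough sketch of the idea in plain Celery “beat” (the task name and schedule entries below are illustrative, not Dash Enterprise’s actual configuration):

```python
from celery import Celery
from celery.schedules import crontab

celery_app = Celery(__name__, broker="redis://localhost:6379/0")

@celery_app.task(name="refresh_data")
def refresh_data():
    ...  # fetch from the upstream APIs and write the result to the cache

celery_app.conf.beat_schedule = {
    "every-15-minutes": {
        "task": "refresh_data",
        "schedule": 15 * 60,  # seconds
    },
    "sunday-evenings": {
        "task": "refresh_data",
        "schedule": crontab(hour=19, minute=0, day_of_week="sunday"),
    },
}
```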

When the periodic task runs in the background, it saves its result to Dash Enterprise’s onboard Postgres or Redis data cache. The Dash app then reads directly from that storage instead of from the original data source.
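A sketch of that write/read split, assuming a plain Redis client; the key name and fetch function are made up for illustration:

```python
import json

import pandas as pd
import redis

store = redis.Redis(host="localhost", port=6379, db=2)

def fetch_from_upstream_apis():
    # Stand-in for the dozen slow financial API calls.
    return pd.DataFrame({"ticker": ["AAPL", "MSFT"], "weight": [0.6, 0.4]})

def refresh_data():
    # Runs inside the Job Queue container, on a schedule.
    df = fetch_from_upstream_apis()
    store.set("latest-portfolio-data", df.to_json(orient="records"))

def load_data():
    # Runs inside the Dash app's web process: a millisecond cache read,
    # no matter how slow the upstream APIs are.
    raw = store.get("latest-portfolio-data")
    return pd.DataFrame(json.loads(raw)) if raw else pd.DataFrame()
```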

Reliability

In addition to periodic job scheduling and pedal-to-the-metal app performance, the Dash Enterprise Job Queue is also a win for Dash application uptime. We designed Dash Enterprise to have zero-downtime deploys that don’t lose any in-flight computations. On Dash Enterprise, your Dash application callbacks are like babies to us - they will never be dropped.

With the Dash Enterprise Job Queue, the information about the job (which function it should call, the function’s input arguments, etc.) is stored in a Redis cache with persistent backup storage. This information is only cleared once the job has successfully finished, so if there’s a hiccup (API service failure, memory crash, etc.), the saved job metadata allows the job to be retried.
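In plain Celery terms, that retry behavior looks roughly like the task options below. Dash Enterprise wires this up for you, so treat these flags as an illustration of the idea rather than its actual settings:

```python
from celery import Celery

celery_app = Celery(__name__, broker="redis://localhost:6379/0")

@celery_app.task(
    acks_late=True,              # broker keeps the message until the task
                                 # finishes, so it's re-queued if the worker dies
    autoretry_for=(Exception,),  # retry on hiccups like API service failures
    retry_backoff=True,          # wait progressively longer between attempts
    max_retries=3,
)
def run_analysis(portfolio_id):
    ...
```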

During a deployment, we hot-swap the containers that run the Job Queue code. Since the metadata about the jobs is stored in a persistent database, no data or in-flight work is lost when the previous Job Queue container is killed.

In addition to the Job Queue input parameters, the outputs of the Job Queue are also stored in persistent storage. This way, the Dash app web processes can read from this database without depending on the uptime of the Job Queue containers. The Dash app web processes and worker processes don’t communicate directly with each other: they communicate through reliable, persistent database caches. This allows us to hot-swap containers during deploys, restarts, or server configuration changes.

Scalability & High Availability

The stateless design of Dash makes scaling an infrastructure problem, not a software problem. When you need to process more tasks, simply scale up the number of Job Queue containers. Each Job Queue container reads from the same database that each Dash app submits jobs to, so there are no communication problems when scaling up Dash Enterprise VMs. Since every aspect of this architecture is isolated, we can run these containers on different VMs (even in different regions). The Job Queue was designed with our stateless, Kubernetes-based architecture in mind.
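In concrete terms (plain Celery again, with an illustrative internal hostname), every container shares nothing but the broker and backend URLs, so scaling up is just starting more workers:

```python
from celery import Celery

# Every web and worker container, on any VM, points at the same URLs:
celery_app = Celery(
    __name__,
    broker="redis://queue.internal:6379/0",   # illustrative hostname
    backend="redis://queue.internal:6379/1",
)

# Scaling up is then purely infrastructural; on each additional VM:
#   $ celery -A tasks worker --concurrency=4
# The web processes keep calling task.delay(...) exactly as before.
```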

You’re going to need a Job Queue

Analytic Apps like Dash apps are different from most Web apps that you use on a daily basis. Most Web apps, when they make a request to the server, are requesting something simple by design: log in a user, do a database lookup, load a new page, load the next song, sign out, etc. Common Web app requests like these take less than a few seconds and are processed synchronously.

The same goes for dashboards (like Tableau or Power BI) that display simple charts and summary statistics. When the dashboard loads, summary statistics can be computed on the fly because they’re simple, subsecond calculations on commodity CPU hardware.

With Analytic Apps, the story is different. Analytic Apps are the face of an ML, AI, or data science model. The HTTP request to the server that kicks off one of these processes may last several seconds, minutes, or hours, even on HPC hardware. For this, my friends, you will need a Job Queue to run the process in the background, save the process result to a cache, and scale these processes horizontally.

Hard Tech

If you’d enjoy making gnarly engineering solutions like the Job Queue available to the world’s blue chip companies, you may enjoy work-life at Plotly.

In the next Dash Enterprise Field Notes essay, I’ll cover the story behind Dash Enterprise’s HPC capabilities and the types of advanced Python analytics that they’re enabling.

The Dash Enterprise Fieldnotes Series:

  1. Why Dash Enterprise?
  2. App Manager
  3. Dash Design Kit
  4. Job Queues (you are here)