Gunicorn >10x slower

I just finished going through the docs about using gunicorn, redis, celery and flower. I’ve set everything up to the point where I can see the tasks being queued in Flower. Cool. Dev environment is an M2 with 8 cores. uvloop is installed. orjson is installed. Python versions I tested with: 3.10.0, 3.11.2. Dash version is the latest one.

I have an app that runs locally, a portfolio optimizer that runs some heavy computations and returns a JSON to a dcc.Store. Some plotting callbacks then pull data from there and graph stuff out. Using tqdm on the loops inside the app, I see that for scenario X, I have 50 iterations/second. I’m using polars dataframes so I can see all cores are at 100%.

Adding server=app.server in app.py and running the server with gunicorn -w 2 app:server I can see the performance drop to a maximum of 2 iterations/second. CPUs are mostly idle. Changing the number of workers to 2*cores + 1 (17) yields the exact same results. I guess it’s important to mention here that I don’t return anything until the end of the computation.

To test this differently I created a main.py that spawns a FastAPI() server. I have a GET method there that just replicates some compute logic from the Dash app. I call that method from my dash app with a background_callback since it takes longer than 30s.

Using uvicorn main:app --log-level info --workers 17 --port 8001 for FastAPI(), the iterations are the same as before i.e. ~50 and everything runs error-free, CPUs at full blast.

Using the gunicorn production server gunicorn -w 17 -k uvicorn.workers.UvicornWorker -b '127.0.0.1:8001' main:app as suggested in their production-ready documentation (using gunicorn with Uvicorn workers), this again slows down to a max of 2 iterations per second, not to mention I get timeouts and errors like resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown.

I’ve tried all sorts of combinations from this forum and SO with workers types (including gevent), threads, and the results are the same. I’m sure I’m missing something here.

I understand from docs and forums that the dev servers are not configured to have workers and just use everything that’s available, but I don’t see any performance increase or drop when fiddling with the -w option in gunicorn.
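For reference, the gunicorn flags used above can also be kept in a config file, which makes it easier to compare runs. This is only a sketch: the port and the 300-second timeout are assumptions chosen for illustration, and nothing here addresses the slowdown itself. The timeout is included because gunicorn's default is 30 seconds, after which a silent sync worker is killed and restarted.

```python
# gunicorn.conf.py
# gunicorn reads this file automatically when it sits in the working directory.
import multiprocessing

# Address and port to listen on (assumed; adjust to your setup).
bind = "127.0.0.1:8050"

# The 2 * cores + 1 rule of thumb mentioned above.
workers = multiprocessing.cpu_count() * 2 + 1

# Seconds a worker may stay silent before gunicorn restarts it.
# The default is 30, which a multi-minute computation would exceed.
timeout = 300
```

With this file in place, the server would be started with just gunicorn app:server.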

Hello @alex.dumitriu,

Thank you for the details. :slight_smile: And welcome to the community!

Do you also have nginx working in front of the gunicorn?

I think the difference may be that FastAPI operates async while Flask operates sync. The workers should be picking up the difference, but I am wondering whether the work is just getting tied to a single instance. 50 iterations per second is pretty quick; have you thought about using websockets instead of performing POST requests?

Hi there. Thanks for the reply.

No nginx. FastAPI was merely another test to see if performance dropped when using gunicorn with different colors (uvicorn workers). Ideally I would like to avoid FastAPI altogether.

The iterations I’m referring to are not requests but rather portfolio scenarios (just math operations visible through tqdm) and depending on the number of stocks in a portfolio they range from a few iterations to hundreds. I just set the portfolio to a fixed number of assets and started testing.

At this point I’ve reduced everything to a background_callback that takes a button as an Input, sends a request to a FastAPI function (no async in the signature, no need to) and runs some CPU bound logic. After a few minutes I get a JSON back from the request and store it in dcc.Store. Before going the API route this logic was just another plain old function in another module within the dash project. As soon as I start these 2 with gunicorn I can feel my beard growing. :smiling_face_with_tear:

I’m at the point where I would just like to see similar results between using the development server and gunicorn on my local machine. Same code, sync, nothing fancy.

Thanks for the support!

Do you have a basic MRE (minimal reproducible example)?

This is very interesting. :slight_smile:

You can replace the FastAPI calls with a websocket, using the WebSocket component from dash-extensions.

Ok so here’s an example I put together this morning:

app.py:

import os
import secrets

import diskcache
import ujson as json
from celery import Celery
from dash import Dash
from dash import DiskcacheManager, CeleryManager, Input, Output, html, dcc

from computations.computations import compute

secret = secrets.token_urlsafe(32)

if 'REDIS_URL' in os.environ:
    # Use Redis & Celery if REDIS_URL set as an env variable
    celery_app = Celery(__name__,
                        broker='redis://127.0.0.1:6379/0',
                        backend='redis://127.0.0.1:6379/1')

    background_callback_manager = CeleryManager(celery_app)

else:
    # Diskcache for non-production apps when developing locally
    cache = diskcache.Cache("./cache")
    background_callback_manager = DiskcacheManager(cache)

app = Dash(__name__, background_callback_manager=background_callback_manager)

app.layout = html.Div(
    [
        dcc.Store(id="intermediate-value", storage_type="session"),
        html.Button('Submit', id='submit-button'),
        html.Button('Cancel', id='cancel-button'),
    ]
)

app.secret_key = secret
server = app.server


@app.callback(
    output=Output("intermediate-value", "data"),
    inputs=[Input("submit-button", "n_clicks")],
    background=True,
    prevent_initial_call=True,
    running=[(Output("submit-button", "disabled"), True, False)],
    cancel=Input("cancel-button", "n_clicks"),
)
def gen_results(n_clicks: int) -> str:
    print(f"n_clicks: {n_clicks}")
    result = compute()
    print(f"result: {result}")
    return json.dumps(result)


if __name__ == "__main__":
    app.run(port=8050, debug=True)

computations/computations.py - dummy code to simulate work:

from tqdm import tqdm
import random

def compute() -> dict:
    L = range(1_000_000)
    amount = 1_000_000
    random_array = [random.choice(L) for _ in range(amount)]
    
    # Rewrite the file on every iteration to simulate slow work
    for value in tqdm(random_array):
        with open("testfile.txt", "w") as file:
            file.write(f"Hello World{value}")

    return {"job": "finished"}

Start server with: gunicorn app:server
Start celery with: celery -A app.celery_app worker --loglevel=INFO

In this example it seems I am getting the same speeds for both werkzeug and gunicorn. Weirdly enough I’m seeing the tqdm progress bar inside celery and not the server terminal. I think I need a break :melting_face:. I don’t understand how only changing one line of code from python app.py to gunicorn app:server can cut so much power.

One question for this example though: should I see performance differences when changing the number of workers in gunicorn? Right now, regardless of whether I use 1 or 17 workers, the number of iterations/sec I get from compute() is the same.
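On the worker question: as far as I understand it, a gunicorn worker handles each request from start to finish, so one long computation is never split across workers. With CeleryManager the body of the background callback also runs inside the Celery worker process rather than in gunicorn, which would explain why the tqdm bar shows up in the Celery terminal. A toy model of that scheduling (the function name and the 60-second figure are made up for illustration, and CPU contention between workers is ignored):

```python
import math


def wall_time(n_requests: int, n_workers: int, seconds_per_request: float) -> float:
    """Toy model: each request runs start to finish on exactly one worker."""
    # Requests are served in "waves" of at most n_workers at a time.
    waves = math.ceil(n_requests / n_workers)
    return waves * seconds_per_request


# One button click is one request: extra workers change nothing.
print(wall_time(1, 1, 60.0))    # 60.0
print(wall_time(1, 17, 60.0))   # 60.0

# Workers only help when several requests arrive at the same time.
print(wall_time(17, 1, 60.0))   # 1020.0
print(wall_time(17, 17, 60.0))  # 60.0
```

So the -w setting should matter for how many users can be served concurrently, not for how fast a single compute() call finishes.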

Quick update. I recently changed to polars from pandas. The speed gains on the DF operations themselves have been significant in my case (>10x), but after loading up an old branch that used pandas, everything seems to run fine in terms of server behaviour. Identical speeds for werkzeug and gunicorn. Nothing changes by tuning the number of workers though.

The reduced performance when switching to gunicorn seems to be happening only with polars frames. Chopping up the code I see there’s also some weird behavior with background_callbacks and polars.

This hangs without any errors.

@app.callback(
    Output("intermediate-value", "data"),
    Input("submit-button", "n_clicks"),
    background=True,
    prevent_initial_call=True,
    running=[(Output("submit-button", "disabled"), True, False)],
    cancel=Input("cancel-button", "n_clicks"),
)
def test(n_clicks: int):
    data = pl.read_parquet("file.parquet")
    return data.write_json()

This doesn’t. I’ve added parallel='none' to read_parquet():

@app.callback(
    Output("intermediate-value", "data"),
    Input("submit-button", "n_clicks"),
    background=True,
    prevent_initial_call=True,
    running=[(Output("submit-button", "disabled"), True, False)],
    cancel=Input("cancel-button", "n_clicks"),
)
def test(n_clicks: int):
    data = pl.read_parquet("file.parquet", parallel='none')
    return data.write_json()

Making this a regular callback however works with the default parallel='auto':

@app.callback(
    output=Output("intermediate-value", "data"),
    inputs=Input("submit-button", "n_clicks"),
)
def test(n_clicks: int):
    data = pl.read_parquet("file.parquet")
    return data.write_json()

After reading the files with parallel='none', I join the DFs using .join and once again in a background callback it chokes without errors. The task in celery is active but the processor is idle. I have to cancel the task.

Something to be mindful of. I’m not sure how to get around this. Since FastAPI also uses gunicorn I’m not sure how I can continue using polars. Any suggestions? How are you guys using polars in dash?
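One possible explanation, which is an assumption and not confirmed anywhere in this thread: Celery's default prefork pool runs tasks in forked child processes, and the Polars documentation warns that Polars can deadlock silently when used in a process created with fork. If that is what is happening, one workaround is to run the Polars work in a freshly started interpreter instead of a forked one. A minimal sketch using only the standard library, where run_in_fresh_interpreter and CHILD_CODE are hypothetical names and the summation stands in for the real read_parquet and join work:

```python
import json
import subprocess
import sys

# Hypothetical child script. In a real app, `import polars as pl` and the
# heavy DataFrame work would go here, so that the thread pool is created in
# a process that was started fresh rather than forked.
CHILD_CODE = """
import json, sys
payload = json.load(sys.stdin)
result = {"total": sum(payload["values"])}  # stand-in for the heavy work
json.dump(result, sys.stdout)
"""


def run_in_fresh_interpreter(payload: dict) -> dict:
    """Run CHILD_CODE in a new Python process and return its JSON output."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child fails
    )
    return json.loads(proc.stdout)


print(run_in_fresh_interpreter({"values": [1, 2, 3]}))  # {'total': 6}
```

A lighter test of the same hypothesis would be to start the Celery worker with --pool=solo or --pool=threads, neither of which forks a child per task, and see whether the hang disappears.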

I actually don’t use polars, because I handle a bunch of SQL queries. When I last looked, it seemed that reading from a SQL query required the variables to be inserted directly into the query string.

This is a big no-no when the values can be user-defined, because it can lead to SQL injection attacks.
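To illustrate the concern, here is a small sketch of string interpolation versus a parameterized query. It uses the standard-library sqlite3 module as a stand-in, with a made-up table and made-up values, and says nothing about what Polars' own SQL readers support today.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (ticker TEXT, close REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?)",
    [("AAPL", 170.0), ("MSFT", 310.0)],
)

# Imagine this string comes from a text input in the app.
user_input = "AAPL' OR '1'='1"

# Unsafe: the value is pasted into the SQL text, so the quote in the input
# closes the string literal and the OR clause matches every row.
unsafe_rows = conn.execute(
    f"SELECT ticker FROM prices WHERE ticker = '{user_input}'"
).fetchall()

# Safe: the driver passes the value separately from the SQL text, so it is
# only ever compared as data.
safe_rows = conn.execute(
    "SELECT ticker FROM prices WHERE ticker = ?", (user_input,)
).fetchall()

print(len(unsafe_rows))  # 2
print(safe_rows)         # []
```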

FastAPI can also run on uvicorn alone, without gunicorn, as in your first test.