Line Plot for Large Dataset

Hello,

I would like to use Plotly + Dash to inspect a large dataset of vectors describing some spectral data, but px.line does not seem suitable for plotting moderately large datasets. Is there anything that can be done to speed up examining large datasets to the point where callbacks will be functional, or do I need to learn a new tool? Below is a toy example that generates a dataset large enough to make Plotly unusable; you can set n_rows to something small to verify the script works.

import plotly.express as px
import numpy as np
import pandas as pd


n_vars = 4000
x_range = np.linspace(0, 5 * np.pi, n_vars)

n_rows = 2500
total_x = np.empty((n_rows, n_vars), dtype=x_range.dtype)
classes = np.empty(n_rows, dtype=object)
ids = np.empty(n_rows, dtype=object)
# The loop below could be vectorized by generating all of the `choice`
# values at once and using `np.where`, but generating the data is
# plenty fast without that.
for i in range(n_rows):
    choice = np.random.randint(low=0, high=2)
    if choice == 1:
        response = (np.random.randint(1, 10) * np.random.rand()) + np.sin(x_range)
        c = 'sin'
    else:
        response = (np.random.randint(1, 10) * np.random.rand()) + np.cos(x_range)
        c = 'cos'
    total_x[i, :] = response
    classes[i] = c
    ids[i] = 'sample_' + str(i)

df = pd.DataFrame(total_x)
df['class'] = classes
df['ids'] = ids
df_m = df.melt(id_vars=['class', 'ids'], var_name='index', value_name='response')

px.line(data_frame=df_m, x='index', y='response', line_group='ids', color='class')

Any help is appreciated.

Thank you

Are you sure plotly is the issue and not pandas?

Plotly should not have issues with large datasets. See "WebGL with many traces" in the WebGL vs SVG documentation.

I would guess pandas is the issue. If so, consider using Vaex instead.

It seems like the issue is Plotly. I re-implemented the plot using Scattergl, with an option (use_numpy) to build the filtered vectors with NumPy instead of pandas during the trace-addition loop. That loop was actually slightly faster with pandas (use_numpy = False), but I do not think that is relevant: regardless of how fast I make the filtering, once all of the traces are added and fig.show() is called, my 'frontend' Python is out of the picture if the plot does not include callbacks, right? The vectors added with fig.add_trace() can be deleted or overridden afterwards, so the data structure used to filter out each vector does not matter. I am fairly certain this also means Vaex will not solve my problem. I tried to get it to work, but my environment does not have the correct dependency structure. Perhaps you suggested it because you thought the issue was my PC running out of RAM? However, I am on a workstation-level PC with plenty of RAM.


import plotly.express as px
import numpy as np
import pandas as pd
import plotly.graph_objects as go
from tqdm import tqdm
import vaex


n_vars = 4000
x_range = np.linspace(0, 5 * np.pi, n_vars)
n_rows = 5
total_x = np.empty((n_rows, n_vars), dtype=x_range.dtype)
classes = np.empty(n_rows, dtype=object)
ids = np.empty(n_rows, dtype=object)
# The loop below could be vectorized by generating all of the `choice`
# values at once and using `np.where`, but generating the data is
# plenty fast without that.
for i in range(n_rows):
    choice = np.random.randint(low=0, high=2)
    if choice == 1:
        response = (np.random.randint(1, 10) * np.random.rand()) + np.sin(x_range)
        c = 'sin'
    else:
        response = (np.random.randint(1, 10) * np.random.rand()) + np.cos(x_range)
        c = 'cos'
    total_x[i, :] = response
    classes[i] = c
    ids[i] = 'sample_' + str(i)
df = pd.DataFrame(total_x)
df['class'] = classes
df['ids'] = ids
df_m = df.melt(id_vars=['class', 'ids'], var_name='index', value_name='response')
vals = df_m.values
use_numpy = False
try:
    df_v = vaex.from_pandas(df_m)
except TypeError as e:
    print(f"\nVaex failed with:\n{e}\n")
fig = go.Figure()
for i in tqdm(np.unique(ids)):
    if use_numpy:
        plot_ar = vals[vals[:, 1] == i]
        x = plot_ar[:, 2]
        y = plot_ar[:, 3]
    else:
        plot_df = df_m.loc[df_m['ids'] == i, :]
        x = plot_df['index'].values
        y = plot_df['response'].values
    fig.add_trace(
        go.Scattergl(
            x=x,
            y=y
        )
    )
    # Delete x and y to demonstrate that Plotly does not reference the
    # input objects when plotting; it consolidates the traces added to
    # the `_data` object maintained by `BaseFigure`.
    del x, y
fig.update_layout(showlegend=False)
fig.show()

Hi again,

What do you consider to be too slow? I just tested your code and the plot was rendered in my browser in a second. Is one second too slow for you? This should be no problem at all for showing graphs through dcc.Graph. See my example.
FYI, I use Plotly 5.5.0. Additional speed can be gained by installing orjson.

[Animation: GIF of the example plot rendering]

Thanks for your reply; the info that the code ran quickly for you was very useful. Your .gif shows just a couple of traces, but I am guessing it ran reasonably well with many traces too? My issue turned out to be that WebGL is somehow not finding my GPU (Linux). I've been trying the solutions in some threads and blogs, but haven't been able to resolve it yet. When I rebooted the machine into Windows, WebGL worked out of the box, and the same code was reasonably fast with 800 traces. I guess this is resolved as far as Plotly goes, but I'll update this comment once I figure out how to solve my Linux blues.