Huge size of histogram and Jupyter Notebook file

I have a Jupyter Notebook (.ipynb) file that contains a Matplotlib bar plot and a histogram. The file is about 100 KB in size. The same notebook with the plots done in Plotly Express is 25.5 MB in size.

The JSON export of the histogram is 3 MB while the barplot export is about 7 KB.

My questions:

  • Why is the Plotly histogram JSON more than 400 times bigger than the bar plot JSON?

  • Why is the Plotly notebook more than 8 times bigger than the two exported JSON files together?

  • Why is the Plotly notebook more than 250 times bigger than the Matplotlib notebook?

Hi @thorsten,

My first guess would be that you are plotting a histogram with a fairly large number of points. The thing is that when you plot a histogram via Plotly, it stores all the original data in the JSON and does the binning and counting on the JavaScript side. If you want to shrink the size of your plot, you can calculate the bins and counts with NumPy before you feed the data to a bar plot. Like this:

import plotly.express as px
import numpy as np

df = px.data.tips()
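# compute the counts per bin in Python (bin edges 0, 5, ..., 55)
counts, bins = np.histogram(df.total_bill, bins=range(0, 60, 5))
# replace the bin edges by the bin centers, so each bar sits in the middle of its bin
bins = 0.5 * (bins[:-1] + bins[1:])

fig = px.bar(x=bins, y=counts, labels={'x':'total_bill', 'y':'count'})
# the figure now holds one value per bin instead of one per data point
len(fig.data[0].x)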
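
To get a feeling for why pre-binning shrinks the output so much, here is a rough sketch that compares the size of the serialized raw values with the size of the serialized bin centers and counts. It uses plain `json.dumps` as a stand-in and synthetic data, so it is not Plotly's actual serializer and the numbers are only indicative.

```python
import json

import numpy as np

# Synthetic stand-in for a large data column (assumption: 100,000 float values)
rng = np.random.default_rng(0)
values = rng.normal(loc=20.0, scale=5.0, size=100_000)

# Roughly what a histogram trace has to carry: every raw value
raw_payload = json.dumps({"x": values.tolist()})

# Roughly what a pre-binned bar trace has to carry: one center and one count per bin
counts, edges = np.histogram(values, bins=50)
centers = 0.5 * (edges[:-1] + edges[1:])
binned_payload = json.dumps({"x": centers.tolist(), "y": counts.tolist()})

print(f"raw:    {len(raw_payload):>9} characters")
print(f"binned: {len(binned_payload):>9} characters")
```

The raw payload grows with the number of data points, while the binned payload only grows with the number of bins.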

hope this helps, Alex-

@Alexboiboi thanks, very interesting and helpful.

You are right, the dataset is pretty large (1.5 million values). Doing the counting in Python (NumPy or Pandas) decreases the notebook size from 25.5 MB to 3.5 MB and the time for the "histogramming" from 7.5 sec to 0.5 sec. (Matplotlib takes about 2.3 seconds to do the counting.)

Nevertheless, the exported Plotly (JSON) file is 3 MB in size while the notebook is 22 MB. I would expect the notebook to be JSON + JavaScript (maybe 4 MB), so that still doesn't add up.

The reason for the size difference seems to be that Plotly minifies the JSON export (verified by removing the JavaScript code from the notebook file and minifying the rest).
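
The effect of minifying can be sketched with the standard library alone. The payload below is hypothetical (two integer lists shaped like a trace's `x` and `y` arrays), and the assumption is that the notebook file stores its JSON indented with one array element per line, while the export is written without whitespace.

```python
import json

# Hypothetical payload, similar in shape to the data array of a bar trace
payload = {
    "data": [
        {"type": "bar", "x": list(range(10_000)), "y": list(range(10_000))}
    ]
}

# Minified: no whitespace between tokens
minified = json.dumps(payload, separators=(",", ":"))

# Indented: every array element on its own line, with leading spaces
indented = json.dumps(payload, indent=1)

print(len(minified), len(indented))
```

Both strings decode to the same object. The indented one is larger only because of the newlines and leading spaces added for every element.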

Thanks again, Thorsten

Sorry for the necropost, but this was really helpful to me, thanks @Alexboiboi !

I wanted to do something similar to your example, but also wanted to use the 'color' parameter of the px.histogram function. So I wrote my own function that bins the histogram manually and plots the groups as a stacked bar plot. Posting it here in case it helps anyone in the future:

import numpy as np
import pandas as pd
import plotly.graph_objects as go


def plotly_histogram(df: pd.DataFrame, data_column: str, colour_column: str, nbins: int) -> go.Figure:
    """
    Create a stacked histogram with Plotly using data from a DataFrame.

    Args:
        df (pd.DataFrame): The DataFrame containing the data.
        data_column (str): The column containing the data to plot.
        colour_column (str): The column used to distinguish groups for stacking.
        nbins (int): Number of bins for the histogram.

    Returns:
        go.Figure: A Plotly Figure object representing the histogram.
    """
    fig = go.Figure()
    groups = df.groupby(colour_column)
    
    # Calculate shared bin edges so all traces use the same bins.
    # nbins bins need nbins + 1 edges.
    data_bins = np.linspace(df[data_column].min(), df[data_column].max(), nbins + 1)
    width = data_bins[1] - data_bins[0]
    
    # Loop through each group (colour)
    for colour, group in groups:
        counts, bins = np.histogram(group[data_column], bins=data_bins)

        # Add one trace per group; go.Bar centers each bar on its x value,
        # so use the bin centers rather than the left bin edges
        fig.add_trace(go.Bar(
            x=0.5 * (bins[:-1] + bins[1:]),
            y=counts,
            name=f'Group {colour}',
            width=width
        ))

    # Formatting
    fig.update_layout(
        barmode="stack",
        title="Stacked Histogram",
        xaxis_title=data_column,
        yaxis_title="Frequency",
    )
    fig.update_traces(marker_line_width=0)

    return fig
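
A small sanity check for the shared-bin-edges idea, as a sketch with synthetic data and hypothetical group labels: when every group is binned with the same edges, the per-group counts add up to the histogram of the whole column, which is what the stacked bars are supposed to show.

```python
import numpy as np

# Synthetic data and three hypothetical groups (assumptions for illustration)
rng = np.random.default_rng(1)
values = rng.uniform(0.0, 10.0, size=1_000)
labels = rng.integers(0, 3, size=1_000)

nbins = 20
# nbins + 1 edges give nbins bins; computed once from the full data
edges = np.linspace(values.min(), values.max(), nbins + 1)

# Histogram of everything, and one histogram per group on the same edges
total_counts, _ = np.histogram(values, bins=edges)
group_counts = [
    np.histogram(values[labels == g], bins=edges)[0] for g in np.unique(labels)
]

# Stacking the groups bin by bin reproduces the overall histogram
stacked = np.sum(group_counts, axis=0)
print(np.array_equal(stacked, total_counts))
```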