
Boxplot quartiles seem wrong

I realised that the quartiles calculated by the plotly boxplot do not match the ones I expected. Here is my Python code:
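import numpy as np
from plotly.graph_objs import Box, Figure, Layout
from plotly.offline import iplot

array = [0, 2, 3, 5, 8, 9, 10]
a = np.array(array)
print("Q1: " + str(np.percentile(a, 25)))      # Q1: 2.5
print("median: " + str(np.percentile(a, 50)))  # median: 5.0
print("Q3: " + str(np.percentile(a, 75)))      # Q3: 8.5

# Box plot of the same data, with every point drawn beside the box
trace = Box(
    y=array,
    boxpoints='all',
    jitter=0.3,
    pointpos=-1.8,
    name="test"
)
layout = Layout(
    title="title",
    yaxis=dict(
        title='Error in %',
        nticks=20
    )
)
data = [trace]
fig = Figure(data=data, layout=layout)

iplot(fig)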

I declared a simple array of data and used numpy to calculate Q1 and Q3, then I used a plotly box plot to see the results. Here is the output:

My question now is: why don’t the plotly values correspond to the results calculated with numpy?

This is a rather contentious topic, as it turns out. See e.g. http://www.amstat.org/publications/jse/v14n3/langford.html, which shows there are at least 15 different ways you might choose to calculate quartiles!

They all generally give quartiles that are close to each other, and I don’t think it’s possible to say that one is right and the others are wrong. The one we chose is based on the idea that often statistics are based on a sample from a much larger distribution. Taking the sample 1, 2, 3, 4, 5 as an example, the larger distribution would be a uniform spread from 0.5 to 5.5. The quartiles for that would then be at 0.5+5.0/4 = 1.75, and 0.5+5.0*3/4 = 4.25.

  • citing Alex J.
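To make the difference concrete for the data in the original question, here is a minimal sketch that compares the two position rules. It assumes the approach quoted above corresponds to method 10 of the Langford paper (as a later reply in this thread suggests), and the helper names `quantile_at`, `numpy_style` and `midpoint_style` are made up for illustration; they are not part of numpy or plotly.

```python
import math

def quantile_at(sorted_values, position):
    """Linearly interpolate at a fractional 0-based position, clamped to the ends."""
    last = len(sorted_values) - 1
    position = min(max(position, 0.0), float(last))
    lower = math.floor(position)
    upper = min(lower + 1, last)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

def numpy_style(values, p):
    # numpy's default rule puts quantile p at 0-based position (n - 1) * p.
    data = sorted(values)
    return quantile_at(data, (len(data) - 1) * p)

def midpoint_style(values, p):
    # Assumed rule (Langford method 10): 1-based position n * p + 0.5,
    # which is 0-based position n * p - 0.5.
    data = sorted(values)
    return quantile_at(data, len(data) * p - 0.5)

data = [0, 2, 3, 5, 8, 9, 10]
for p in (0.25, 0.5, 0.75):
    print(p, numpy_style(data, p), midpoint_style(data, p))
# 0.25 2.5 2.25
# 0.5 5.0 5.0
# 0.75 8.5 8.75
```

Under that assumption the median agrees, while Q1 and Q3 each shift by a quarter of the gap between neighbouring points. The same function gives 1.75 and 4.25 for 1, 2, 3, 4, 5.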

Is it possible to provide a parameter to decide on quartile calculation formula?

This is an old thread, but I am having the exact same issue. Were you able to find either a solution or a workaround? I have to recreate someone else’s work in plotly, and they used a different quartile calculation.

Hi, I didn’t really look further, sorry. I just saw this other post that might help you: “Box Plots - manually supply median and quartiles (performance for large sample sizes)”.

I have been able to recreate the shapes themselves with lines, rectangles and points, so it basically looks and works like I want it to. It is missing one thing, though: the Boxplot shows all of its values at once when the mouse hovers over it, and I would really like to find out how to get hover text to behave the same way.

I understand the various methods to calculate the quartiles - but how can one extract the quartiles computed by Plotly? Perhaps another way around the same question: what existing python functions will calculate quartiles in the exact same way as Plotly does?

I tried all possible interpolation methods in np.percentile(), scipy.stats.iqr(), and pandas.quantile(). None seemed to match what Plotly plots.

Thanks!

To give an updated answer to @OliverBrace, the box trace now supports manually providing parameters: https://plotly.com/python/box-plots/#box-plot-with-precomputed-quartiles
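For anyone who wants the box to show numpy-style quartiles, here is a rough sketch of that route. It assumes a plotly version recent enough for the box trace to accept `q1`, `median`, `q3`, `lowerfence` and `upperfence`, as described on the linked docs page. The helpers `precomputed_box_stats` and `make_figure` are hypothetical names, and using the min and max as fences is just an illustrative choice. The quartiles come from the standard library's `statistics.quantiles` with `method="inclusive"`, which for this data gives the same values as numpy's default.

```python
from statistics import quantiles

def precomputed_box_stats(values):
    """Quartiles from the stdlib 'inclusive' method, one box per list entry."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "q1": [q1],
        "median": [median],
        "q3": [q3],
        # Illustrative choice: whiskers drawn at the min and max of the data.
        "lowerfence": [data[0]],
        "upperfence": [data[-1]],
    }

def make_figure(values, name="test"):
    # Imported here so the statistics can be checked without plotly installed.
    import plotly.graph_objects as go
    return go.Figure(go.Box(name=name, **precomputed_box_stats(values)))

stats = precomputed_box_stats([0, 2, 3, 5, 8, 9, 10])
print(stats["q1"], stats["median"], stats["q3"])
# [2.5] [5.0] [8.5]
```

Calling `make_figure([0, 2, 3, 5, 8, 9, 10]).show()` should then draw a box whose edges sit at the supplied values rather than at the ones plotly would compute itself.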

To answer @ahmedhosny, check out this part of the docs for a description of the default box-plot quartile calculation method, as well as the two built-in variants: https://plotly.com/python/box-plots/#choosing-the-algorithm-for-computing-quartiles
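As I read that docs section, the two built-in variants split the sorted data at the median and take the median of each half; for an odd number of points, "exclusive" leaves the middle value out of both halves and "inclusive" puts it in both. The sketch below is my own reading of that description, not plotly's code, and `median_of_halves` is a made-up name.

```python
from statistics import median

def median_of_halves(values, include_median):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    odd = len(data) % 2 == 1
    if odd and include_median:
        # "inclusive" reading: the middle value belongs to both halves.
        lower, upper = data[: half + 1], data[half:]
    else:
        # Even length, or "exclusive" reading: the middle value is left out.
        lower, upper = data[:half], data[len(data) - half:]
    return median(lower), median(upper)

data = [0, 2, 3, 5, 8, 9, 10]
print(median_of_halves(data, include_median=False))  # (2, 9)
print(median_of_halves(data, include_median=True))   # (2.5, 8.5)
```

In plotly itself the choice is made on the trace, e.g. `Box(y=data, quartilemethod="exclusive")`, with "linear" being the default. Note that for this data the "inclusive" reading gives the same Q1 and Q3 as numpy's default.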


Many thanks @nicolaskruchten - things are clear now. I guess my confusion came from the naming. The “linear” in scipy/numpy refers to the interpolation method, while the “linear” in Plotly seems to refer to the higher-level quantile calculation method (which itself also uses linear interpolation, despite producing a different result from scipy/numpy). Anyway, I have included a snippet below to highlight this for anyone in the future. Great work on this library, it really makes data viz much more enjoyable!

import numpy as np
from scipy.stats import iqr

def plotly_linear_quantiles(y, quantile):
    """
        Based on #10 here: http://jse.amstat.org/v14n3/langford.html
        METHOD 10 (“H&L-2”): The Pth percentile value is found by taking that 
        value with #(np + 0.5). If this is not an integer, take the interpolated
        value between 'the floor' and 'the ceiling of that value'. As an example, 
        if S5 = (1, 2, 3, 4, 5) and p = 0.25, then #(np + 0.5) = #(1.75) and so Q1 = 1.75.
        
        args: 
        y: list to calculate quantile for
        quantile: requested quantile value between 0 and 1
    """
    # -1 because we count starting at 0
    interp_val_x = len(y)*quantile + 0.5 - 1 
    if interp_val_x.is_integer():
        # int() to remove decimal
        return sorted(y)[int(interp_val_x)]
    else:
        return np.interp(interp_val_x, [x for x in range(len(y))], sorted(y))

    
def plotly_linear_IQR(y):
    return plotly_linear_quantiles(y, 0.75) - plotly_linear_quantiles(y, 0.25)


# linear by default in numpy and scipy, but included for clarity
l='linear' 

y = [1,2,3,4]
plotly_linear_IQR(y) # 2.0
iqr(y, interpolation=l) # 1.5
np.percentile(y, 75, interpolation=l) - np.percentile(y, 25, interpolation=l) # 1.5

y = [1,2,3,4,5]
plotly_linear_IQR(y) # 2.5
iqr(y, interpolation=l) # 2.0
np.percentile(y, 75, interpolation=l) - np.percentile(y, 25, interpolation=l) # 2.0

y = [1,2,3,4,5,6]
plotly_linear_IQR(y) # 3.0
iqr(y, interpolation=l) # 2.5
np.percentile(y, 75, interpolation=l) - np.percentile(y, 25, interpolation=l) # 2.5

y = [1,2,3,4,5,6,7]
plotly_linear_IQR(y) # 3.5
iqr(y, interpolation=l) # 3.0
np.percentile(y, 75, interpolation=l) - np.percentile(y, 25, interpolation=l) # 3.0