Spikes in reversed ecdf

Hello,
I’m using a reversed ECDF for data analysis, and the resulting diagrams contain some strange spikes.
Looks like a bug.

Any idea or suggestion?
Thanks a lot.

Hi @enric.pastor, welcome to the forums.

This could be anything. Maybe you could add some more information about the chart? What does the data look like, and how did you create the chart?

The first thing you could check is the dtype of the x and y axes, especially the x axis.

It is hard to add the data, as it is generated through a complex process.
The strange thing is that the spikes only appear when the reversed distribution is requested.

The code per se is fairly simple:

fig = px.ecdf(df3, group_labels, ecdfmode="reversed", marginal="box")
fig.update_layout(margin_b= 0, margin_l= 0, margin_r= 0, width= 1000)
fig.update_xaxes(title_text='Time(s)', range=[0, 80])
fig.update_yaxes(title_text='CDF')
fig.update_layout(title_text='Spurious Caution Alert')
fig.update_layout(legend=dict(
    yanchor="top",
    y=0.99,
    xanchor="right",
    x=0.99
))

Did you check the dtype for the columns of df3?

Sorry. I forgot about that.
Time to CPA and the durations are int64.
The predictions are float64.

I have extracted the dataframe, as it is not large in this test case.

group_labels = ['Time to CPA', 'Spurious Caution duration', 'First Caution duration', "Predicted Time to LoWC", "First Predicted Time to LoWC"]

[[ 13. 4. 4. 38.52282079 38.52282079]
[ 13. 4. 4. 38.52282079 38.52282079]
[ 21. 2. 2. 36.40388629 36.40388629]
[ 21. 2. 2. 36.40388629 36.40388629]
[ 2. 6. 6. 43.17394253 43.17394253]
[ 2. 6. 6. 43.17394253 43.17394253]
[-19. 49. 5. 23.15611685 39.88214533]
[-19. 49. 5. 23.15611685 39.88214533]
[ 28. 44. 4. 44.64795822 38.85908162]
[ 28. 44. 4. 44.64795822 38.85908162]
[ -3. 3. 3. 25.68102444 25.68102444]
[ -3. 3. 3. 25.68102444 25.68102444]
[ 21. 5. 5. 34.04343412 34.04343412]
[ 21. 5. 5. 34.04343412 34.04343412]
[ 12. 18. 1. 36.11286997 28.71333269]
[ 12. 18. 1. 36.11286997 28.71333269]]
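Note that every row in this dump appears twice, so each column contains repeated values. A quick way to confirm that for the first column is sketched below in plain Python; the values are copied by hand from the dump above, and the variable names are made up for this example.

```python
from collections import Counter

# 'Time to CPA' column, copied from the rows printed above.
time_to_cpa = [13, 13, 21, 21, 2, 2, -19, -19, 28, 28, -3, -3, 21, 21, 12, 12]

# Count how often each value occurs and keep only the repeated ones.
counts = Counter(time_to_cpa)
repeated = {value: n for value, n in counts.items() if n > 1}
print(repeated)
```

Every value shows up at least twice, and 21 shows up four times.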

I believe I found the bug.

The issue seems to come from the way plotly actually “reverses” the data.

The example in the ecdf documentation (Empirical cumulative distribution plots in Python) uses x=[1,2,3,4] as data. This is misleading, because that data has no repeated values. If we use x=[1,2,2,2,2,3,4] instead, we encounter the bug discussed here, as shown below:

Code:

import plotly.express as px

# x is passed directly as a list, so no dataframe is needed
fig_reversed = px.ecdf(x=[1, 2, 2, 2, 2, 3, 4], markers=True, ecdfmode="reversed",
                       title="ecdfmode='reversed' (Y=fraction at or above X value)")
fig_reversed.show()

Output:

Let’s take a look at what goes on under the hood with both the “standard” and “reversed” option.

Code:

fig_standard = px.ecdf(x=[1, 2, 2, 2, 2, 3, 4], markers=True, ecdfmode="standard",
                       title="ecdfmode='standard' (Y=fraction at or below X value, this is the default)")

print(fig_standard.data[0], fig_reversed.data[0])

Output:

Scatter({
    'hovertemplate': 'x=%{x}<br>probability=%{y}<extra></extra>',
    'legendgroup': '',
    'line': {'dash': 'solid', 'shape': 'hv'},
    'marker': {'color': '#636efa', 'symbol': 'circle'},
    'mode': 'lines+markers',
    'name': '',
    'orientation': 'v',
    'showlegend': False,
    'x': array([1, 2, 2, 2, 2, 3, 4]),
    'xaxis': 'x',
    'y': array([0.14285714, 0.28571429, 0.42857143, 0.57142857, 0.71428571, 0.85714286,
                1.        ]),
    'yaxis': 'y'
})
Scatter({
    'hovertemplate': 'x=%{x}<br>probability=%{y}<extra></extra>',
    'legendgroup': '',
    'line': {'dash': 'solid', 'shape': 'vh'},
    'marker': {'color': '#636efa', 'symbol': 'circle'},
    'mode': 'lines+markers',
    'name': '',
    'orientation': 'v',
    'showlegend': False,
    'x': array([1, 2, 2, 2, 2, 3, 4]),
    'xaxis': 'x',
    'y': array([1.        , 0.42857143, 0.57142857, 0.71428571, 0.85714286, 0.28571429,
                0.14285714]),
    'yaxis': 'y'
})

The interesting parts are the y attributes of the two Scatter graph objects. I will call the y attribute of the plot with the standard option y_std, and that of the plot with the reversed option y_rev. So we have:

    y_std=[0.14285714, 0.28571429, 0.42857143, 0.57142857, 0.71428571, 0.85714286,1.]

    y_rev=[1., 0.42857143, 0.57142857, 0.71428571, 0.85714286, 0.28571429,0.14285714]

As we can see, both contain the same values, but y_rev is not y_std in reverse order, as it should be. As a consequence, the line in the scatter connects the point at (1, 1) to the point at (2, 0.42857143) instead of (2, 0.85714286). In my opinion this is a mistake, and it is misleading: the graph points to the wrong proportion of data at or above 2. The real proportion is 6/7 = 0.85714286.
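The 6/7 figure can be checked directly from the definition of the reversed ECDF (the fraction of samples at or above x). Below is a minimal sketch in plain Python, independent of plotly; the helper name fraction_at_or_above is made up for illustration.

```python
from fractions import Fraction

def fraction_at_or_above(data, x):
    """Return the share of samples in `data` that are >= x."""
    return Fraction(sum(1 for v in data if v >= x), len(data))

data = [1, 2, 2, 2, 2, 3, 4]

# One value per distinct x: this is what the reversed ECDF should show.
expected = {x: fraction_at_or_above(data, x) for x in sorted(set(data))}
for x, frac in expected.items():
    print(x, frac, round(float(frac), 8))
```

For x = 2 this gives 6/7, since six of the seven samples are 2 or larger.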

So what we need now is for y_rev to be y_std in reverse order. This can be obtained with np.flip(y_std). To get a figure with the correct reversed representation, we have to change the data manually:

import numpy as np
fig_reversed.data[0].y = np.flip(y_std)  # assign to the trace's y attribute, not to the trace itself

Calling fig_reversed.show() then gives the right result:

There should probably be a pull request on that topic.

hi @Cbr1
Thank you for bringing this up.

Can you please report this alleged bug as a new issue in the plotly.py repo?

My Goodness.

Thanks a lot for tracking this. Never got time to dig into the problem, but the code I have that uses this function is still being used… and in fact I’m currently performing further analysis.

I will try your workaround and let you know.

Again, my infinite appreciation.

I am fairly new to coding, so I have never done such a thing. Give me some time and I will try to report this bug. I have actually found a fix that does not need the workaround mentioned above. In short, in the packages/python/plotly/plotly/express/_core.py file, line 2258, you need to add the following inside the if statement: group = group.groupby(by='x', group_keys=False).apply(lambda x: x.iloc[::-1]). Again, I will try to find time to both report the bug with a detailed explanation of its cause and make a pull request.
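To illustrate why reversing the row order within each group of equal x values removes the spikes, here is a minimal sketch in plain Python. It assumes (this is not confirmed in this thread) that the reversed mode sorts by x descending, takes a cumulative sum, and then sorts back to ascending x while leaving tied rows in their descending-pass order. The helper names are hypothetical, and this is not plotly's actual code.

```python
from itertools import groupby

def reversed_ecdf_assumed(xs):
    """Assumed procedure: sort descending, accumulate, sort back ascending."""
    n = len(xs)
    desc = sorted(xs, reverse=True)
    # Cumulative probability along the descending pass.
    rows = [(x, (i + 1) / n) for i, x in enumerate(desc)]
    # Python's sort is stable, so tied rows keep their descending-pass order
    # and y increases inside each run of equal x.
    return sorted(rows, key=lambda row: row[0])

def reverse_within_ties(rows):
    """Reverse the row order inside each run of equal x (the idea behind the fix)."""
    fixed = []
    for _, run in groupby(rows, key=lambda row: row[0]):
        fixed.extend(reversed(list(run)))
    return fixed

rows = reversed_ecdf_assumed([1, 2, 2, 2, 2, 3, 4])
print([round(y, 8) for _, y in rows])                       # same order as y_rev above
print([round(y, 8) for _, y in reverse_within_ties(rows)])  # monotonically decreasing
```

Under that assumption, the first list reproduces the y_rev values shown earlier, and the second one decreases from 1 down to 1/7 without jumping back up at x = 2.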

Glad I could help!

Quick question. Given that I am planning on making the pull request myself, is it still necessary to report the bug where you suggested? It seems I am already going to give an in-depth explanation of it at some point during the pull request.

Thanks

hi @Cbr1
You should be fine with a pull request only.

Thank you.

Perfect, thank you.