Spikes in reversed ecdf

enric.pastor · March 21, 2023, 11:17pm

Hello,
I’m using a reversed ECDF for data analysis, and the resulting diagrams contain some strange spikes.
It looks like a bug.

Any idea or suggestion?
Thanks a lot.

AIMPED · March 21, 2023, 11:31pm

Hi @enric.pastor, welcome to the forums.

This could be anything. Maybe you could add some more information about the chart? What does the data look like, and how did you create the chart?

The first thing you could check is the dtype of the data on the x and y axes, especially the x axis.

enric.pastor · March 22, 2023, 9:32am

It is hard to add the data as it is generated through a complex process.
The strange thing is that the spikes only appear when the reversed distribution is requested.

The code per se is fairly simple:

fig = px.ecdf(df3, group_labels, ecdfmode="reversed", marginal="box")
fig.update_layout(margin_b= 0, margin_l= 0, margin_r= 0, width= 1000)
fig.update_xaxes(title_text='Time(s)', range=[0, 80])
fig.update_yaxes(title_text='CDF')
fig.update_layout(title_text='Spurious Caution Alert')
fig.update_layout(legend=dict(
    yanchor="top",
    y=0.99,
    xanchor="right",
    x=0.99
))

AIMPED · March 22, 2023, 10:34am

Did you check the dtype for the columns of df3?

enric.pastor · March 22, 2023, 11:12am

Sorry. I forgot about that.
Time to CPA and the durations are int64.
The predictions are float64.

I have extracted the dataframe, as it is not large in this test case.

group_labels = ['Time to CPA', 'Spurious Caution duration', 'First Caution duration', "Predicted Time to LoWC", "First Predicted Time to LoWC"]

[[ 13. 4. 4. 38.52282079 38.52282079]
[ 13. 4. 4. 38.52282079 38.52282079]
[ 21. 2. 2. 36.40388629 36.40388629]
[ 21. 2. 2. 36.40388629 36.40388629]
[ 2. 6. 6. 43.17394253 43.17394253]
[ 2. 6. 6. 43.17394253 43.17394253]
[-19. 49. 5. 23.15611685 39.88214533]
[-19. 49. 5. 23.15611685 39.88214533]
[ 28. 44. 4. 44.64795822 38.85908162]
[ 28. 44. 4. 44.64795822 38.85908162]
[ -3. 3. 3. 25.68102444 25.68102444]
[ -3. 3. 3. 25.68102444 25.68102444]
[ 21. 5. 5. 34.04343412 34.04343412]
[ 21. 5. 5. 34.04343412 34.04343412]
[ 12. 18. 1. 36.11286997 28.71333269]
[ 12. 18. 1. 36.11286997 28.71333269]]

Cbr1 · August 27, 2024, 3:17pm

I believe I found the bug.

The issue seems to come from the way plotly actually “reverses” the data.

The example in the ecdf documentation (Empirical cumulative distribution plots in Python) uses x=[1,2,3,4] as data. This is misleading, because if we use x=[1,2,2,2,2,3,4] instead, which contains repeated values, we encounter the bug discussed here, as shown below:

Code:

import plotly.express as px

# Repeated x values (the four 2s) are what trigger the spike.
fig_reversed = px.ecdf(x=[1,2,2,2,2,3,4], markers=True, ecdfmode="reversed",
              title="ecdfmode='reversed' (Y=fraction at or above X value)")
fig_reversed.show()

Output:

Let’s take a look at what goes on under the hood with both the “standard” and “reversed” options.

Code:

fig_standard = px.ecdf(x=[1,2,2,2,2,3,4], markers=True, ecdfmode="standard",
              title="ecdfmode='standard' (Y=fraction at or below X value, this is the default)")

print(fig_standard.data[0], fig_reversed.data[0])

Output:

Scatter({
    'hovertemplate': 'x=%{x}<br>probability=%{y}<extra></extra>',
    'legendgroup': '',
    'line': {'dash': 'solid', 'shape': 'hv'},
    'marker': {'color': '#636efa', 'symbol': 'circle'},
    'mode': 'lines+markers',
    'name': '',
    'orientation': 'v',
    'showlegend': False,
    'x': array([1, 2, 2, 2, 2, 3, 4]),
    'xaxis': 'x',
    'y': array([0.14285714, 0.28571429, 0.42857143, 0.57142857, 0.71428571, 0.85714286,
                1.        ]),
    'yaxis': 'y'
})
Scatter({
    'hovertemplate': 'x=%{x}<br>probability=%{y}<extra></extra>',
    'legendgroup': '',
    'line': {'dash': 'solid', 'shape': 'vh'},
    'marker': {'color': '#636efa', 'symbol': 'circle'},
    'mode': 'lines+markers',
    'name': '',
    'orientation': 'v',
    'showlegend': False,
    'x': array([1, 2, 2, 2, 2, 3, 4]),
    'xaxis': 'x',
    'y': array([1.        , 0.42857143, 0.57142857, 0.71428571, 0.85714286, 0.28571429,
                0.14285714]),
    'yaxis': 'y'
})

The interesting parts are the y attributes of the two scatter graph objects. I will call the y attribute of the plot made with the standard option y_std and the one made with the reversed option y_rev. So we have:

    y_std=[0.14285714, 0.28571429, 0.42857143, 0.57142857, 0.71428571, 0.85714286,1.]

    y_rev=[1., 0.42857143, 0.57142857, 0.71428571, 0.85714286, 0.28571429,0.14285714]

As we can see, both contain the same values, but the order of y_rev is not the reverse of y_std, as it should be. As a consequence, the line in the scatter plot connects the dot at (1, 1) to the dot at (2, 0.42857143) instead of (2, 0.85714286). In my opinion this is a mistake, and it misleads the reader, because the graph points to the wrong proportion of data being at or above 2. The real proportion is 6/7 = 0.85714286.
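As a quick sanity check on that 6/7 figure, here is a minimal sketch in plain Python that counts the fraction of values at or above each distinct x. The helper name fraction_at_or_above is made up for illustration and is not part of plotly.

```python
def fraction_at_or_above(values, threshold):
    """Return the share of `values` that are >= `threshold`."""
    return sum(v >= threshold for v in values) / len(values)


x = [1, 2, 2, 2, 2, 3, 4]

# One entry per distinct x value: what a reversed ECDF should report there.
expected = {t: fraction_at_or_above(x, t) for t in sorted(set(x))}

for t, frac in expected.items():
    print(t, round(frac, 8))
```

Six of the seven values are at or above 2, so the curve should sit at 6/7 when it reaches x=2, not at 3/7.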

So what we have to do now is make y_rev the actual reverse of y_std, which can be obtained with np.flip(y_std). To get a figure with the correct reversed representation, we have to change the y data of the trace manually:

fig_reversed.data[0].y = np.flip(y_std)

Then calling fig_reversed.show() gives the right result:
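For figures with several traces (for example a wide-form call with a marginal plot), the same idea could be wrapped in a small helper. This is only a sketch: the names descending_y and patch_reversed_ecdf are hypothetical, and it assumes that each ECDF trace is vertical and already has its x values sorted in ascending order, as in the printed trace above.

```python
import numpy as np


def descending_y(y):
    """Return the y values sorted from largest to smallest."""
    return np.sort(np.asarray(y, dtype=float))[::-1]


def patch_reversed_ecdf(fig):
    """Reorder y on every scatter trace of `fig` (hypothetical helper).

    Assumes a vertical ECDF whose x values are already in ascending order,
    so a correct reversed curve must have non-increasing y values.
    """
    for trace in fig.data:
        # Leave marginal traces (box, violin, ...) untouched.
        if trace.type in ("scatter", "scattergl"):
            trace.y = descending_y(trace.y)
    return fig


# y values of the reversed trace printed earlier in this post.
y_rev = [1.0, 0.42857143, 0.57142857, 0.71428571,
         0.85714286, 0.28571429, 0.14285714]
fixed = descending_y(y_rev)
print(fixed)
```

With a real figure the call would be patch_reversed_ecdf(fig_reversed), followed by fig_reversed.show().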

There should probably be a pull request on that topic.

adamschroeder · August 27, 2024, 8:23pm

hi @Cbr1
Thank you for bringing this up.

Can you please report this alleged bug as a new issue in the plotly.py repo?

enric.pastor · August 27, 2024, 9:55pm

My Goodness.

Thanks a lot for tracking this. Never got time to dig into the problem, but the code I have that uses this function is still being used… and in fact I’m currently performing further analysis.

I will try your workaround and let you know.

Again, my infinite appreciation.

Cbr1 · August 28, 2024, 7:04am

I am fairly new to coding, so I have never done such a thing. If you give me some time, I will try to report this bug. I have actually found a fix that does not need the workaround mentioned above. In short, in the packages/python/plotly/plotly/express/_core.py file, line 2258, you need to add within the if statement group = group.groupby(by='x', group_keys=False).apply(lambda x: x.iloc[::-1]). Again, I will try to find time both to report the bug with a detailed explanation of its cause and to make a pull request.
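To illustrate the idea on a toy frame, here is a rough sketch with pandas. The sort / cumulative sum / re-sort steps are my guess at how the ordering seen in y_rev comes about, not a copy of plotly's code. For the reordering itself the sketch sorts ties by descending y rather than using the groupby/apply line, because the way apply treats the grouping column can differ between pandas versions. For this data both approaches should give the same order.

```python
import pandas as pd

# Toy data from this thread: one row per observation, equal weights.
group = pd.DataFrame({"x": [1, 2, 2, 2, 2, 3, 4]})
group["y"] = 1 / len(group)

# Assumed mechanism: sort descending, accumulate, then sort back ascending.
# The stable re-sort keeps tied x values in their accumulated order, so y
# increases inside the block of 2s instead of decreasing.
buggy = group.sort_values(by="x", ascending=False, kind="stable")
buggy["y"] = buggy["y"].cumsum()
buggy = buggy.sort_values(by="x", ascending=True, kind="stable")

# Swapped-in alternative to the groupby/apply line: within equal x,
# put the largest y first.
fixed = buggy.sort_values(by=["x", "y"], ascending=[True, False])

print(buggy["y"].round(4).tolist())
print(fixed["y"].round(4).tolist())
```

The first printed list reproduces the order of y_rev from the earlier post, and the second one decreases monotonically, which is what the reversed curve needs.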

Cbr1 · August 28, 2024, 7:07am

Glad I could help!

Cbr1 · August 29, 2024, 11:19am

Quick question. Given that I am planning to make the pull request myself, is it still necessary to report the bug where you suggested? It seems I am already going to give an in-depth explanation of it at some point during the pull request.

Thx

adamschroeder · August 29, 2024, 12:14pm

hi @Cbr1
You should be fine with a pull request only.

Thank you.

Cbr1 · August 29, 2024, 12:25pm

Perfect, thank you.
