Issues with ff.create_distplot()

ursus · April 9, 2019, 9:03am

I have a number of issues with ff.create_distplot():

1. Consider the following data:

import numpy as np

import plotly.graph_objs as go
import plotly.figure_factory as ff

m = np.random.normal(loc=0.08, scale=0.0008, size=5000)

Histogram of the data:

fig = go.FigureWidget()
fig.add_histogram(x=m)
fig

However, when I try to produce a density plot using the figure factory, it does not produce what I want:

hist_data = [m]

group_labels = ['m1']
colors = ['#333F44']

# Create distplot
fig = go.FigureWidget(ff.create_distplot(hist_data, group_labels, show_hist=True, colors=colors))
fig.layout.update(title='Density curve')
fig

I can perhaps tinker with it until it gives me the right plot, but I think there is an issue there.

If I set show_hist=False, the plot looks much better:

The problem seems to be with the bins of the histogram. If we set scale=0.08 we can see that the histogram is displayed only in one bin:
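One likely cause, assuming the installed plotly version uses the documented default of bin_size=1.0 in ff.create_distplot(): a bin one unit wide swallows data whose whole range is a few thousandths. The sketch below derives a bin width from the data with the Freedman-Diaconis rule and passes it in. That rule is a choice made for this example, not something the figure factory does itself, and the helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.normal(loc=0.08, scale=0.0008, size=5000)

def freedman_diaconis_width(x):
    """Bin width 2 * IQR / n^(1/3), a common rule of thumb."""
    q75, q25 = np.percentile(x, [75, 25])
    return 2.0 * (q75 - q25) / len(x) ** (1.0 / 3.0)

bin_size = freedman_diaconis_width(m)
n_bins = int(np.ceil((m.max() - m.min()) / bin_size))

def make_distplot(data, label, width):
    # plotly (and scipy, which the figure factory needs) are imported here
    # so the numeric part above runs without them.
    import plotly.figure_factory as ff
    return ff.create_distplot([data], [label], bin_size=width, show_hist=True)

# Usage (requires plotly and scipy):
# fig = make_distplot(m, 'm1', bin_size)
# fig.show()
```

For this data the computed width is on the order of 1e-4, so the histogram is spread over dozens of bins rather than one.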

2. Even though the histnorm is set to probability density by default, I did not manage to make it look like a probability density. It looks more like a frequency “distplot”.

3. The curve_type is set to kde. What kind of KDE is being used? I would like to try the epanechnikov kernel for instance.
Is a kde curve type meant to produce something like the density function in R?
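For reference, an Epanechnikov estimate is simple enough to sketch in plain NumPy. This is a minimal illustration, not what ff.create_distplot() does internally; the bandwidth h is hand-picked for this data and the function name is hypothetical.

```python
import numpy as np

def epanechnikov_kde(samples, grid, bandwidth):
    """Evaluate a KDE with kernel K(u) = 0.75 * (1 - u^2) on |u| <= 1."""
    u = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)
    return kernel.sum(axis=1) / (len(samples) * bandwidth)

rng = np.random.default_rng(0)
m = rng.normal(loc=0.08, scale=0.0008, size=5000)

h = 0.0003  # hand-picked bandwidth, an assumption for illustration
grid = np.linspace(m.min() - h, m.max() + h, 400)
density = epanechnikov_kde(m, grid, h)

# Trapezoidal check that the estimate integrates to about 1
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))
```

The resulting grid and density arrays can be drawn with fig.add_scatter(x=grid, y=density), much like the curve R's density function returns.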

4. When several distplots are combined, e.g.:

hist_data = [m, m+0.001]

group_labels = ['m1', 'm2']
colors = ['#333F44', '#37AA9C']

# Create distplot
fig = go.FigureWidget(ff.create_distplot(hist_data, group_labels, show_hist=False, colors=colors))
fig.layout.update(title='Density curve')
fig

The rug plot as well as the legend do not appear in the logical sequence.
Sure, we can set

fig.layout.update(legend=dict(traceorder='normal'))

but I think the default should be the order in which they were added.

I also think that the distance between the rug plots is disproportionately big.

jmmease · April 9, 2019, 9:49am

Thanks for the detailed description of the issues you’re having with distplot @ursus,

I haven’t actually dug into how this figure factory works yet, so unfortunately I don’t have much guidance to offer at the moment.

@nicolaskruchten, what do you think about eventually adding a px.kde or px.distplot function to plotly_express (https://github.com/plotly/plotly_express) to handle the distplot use case?

@ursus, if we decide this is something that makes sense to implement in plotly_express we’ll likely direct our efforts there since plotly_express provides a much more unified and powerful API than the distplot figure factory currently does.

Thanks,
-Jon

ursus · April 9, 2019, 10:34am

Thank you!

plotly_express is great!

nicolaskruchten · April 9, 2019, 12:36pm

We could do a few things with px here.

1. We could add a marginal kwarg to px.histogram to get the ability to do rug, violin and box marginals, similar to what we have in px.scatter and px.density_contour.
2. We could add a ‘px.kde’ function that leverages go.Violin under the hood and uses its built-in points system to get the rug. (With an optional marginal kwarg too, why not!)
3. We could convince the JS guys to add a KDE option to go.Histogram.
4. We could convince the JS guys to add points to go.Histogram.

ursus · April 9, 2019, 5:46pm

I have mentioned R’s density function.
Mathematica also has a similar command which is really nice: SmoothKernelDistribution.

nicolaskruchten · April 24, 2019, 1:23am

I’ve implemented idea 1 above: px.histogram() now has a marginal option so you can add the rug there. Still no KDE option though. Toying with the idea of a new kde trace type in plotly.js at the moment… basically a blend of violin and histogram minus histfunc. Would also allow for smooth cumulative density functions which would be nice.
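A minimal sketch of the marginal option described above, assuming a plotly version that ships plotly.express. The helper name rug_histogram and the nbins value are illustrative, not part of the library.

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.normal(loc=0.08, scale=0.0008, size=5000)

def rug_histogram(values, nbins=60):
    # plotly.express is imported here so the data setup above runs without it.
    import plotly.express as px
    # marginal="rug" adds the rug above the histogram;
    # histnorm puts the bars on a probability-density scale.
    return px.histogram(x=values, nbins=nbins, marginal="rug",
                        histnorm="probability density")

# Usage (requires plotly):
# fig = rug_histogram(m)
# fig.show()
```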

Mike3 · May 13, 2019, 11:38pm

That would be extremely welcome, as I have been experiencing issues with ff.create_distplot() as well here.

Could scipy help out in providing robust default kde solvers?

nicolaskruchten · May 14, 2019, 12:36am

Could scipy help out in providing robust default kde solvers?

Yes, but this would run against one of the most important Plotly Express design goals, which is to do as little work in Python as possible, deferring to the JS layer for almost everything. Notable exceptions include the OLS and LOWESS trendlines, but that’s mostly because the JS layer doesn’t support those. In the case of KDE, the JS layer already has an implementation in the violin trace type so Plotly Express uses that instead so as not to duplicate work.
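To illustrate the violin-based KDE mentioned above, here is a rough sketch with go.Violin. Setting side='positive' draws only one half of the violin so it reads like a density curve, and points='all' with jitter=0 lays the raw points out as a rug-like strip. The helper name kde_violin and the pointpos value are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.normal(loc=0.08, scale=0.0008, size=5000)

def kde_violin(values, name="m1"):
    # plotly is imported here so the data setup above runs without it.
    import plotly.graph_objs as go
    trace = go.Violin(
        x=values,          # horizontal violin: density drawn along the x axis
        side="positive",   # keep only the upper half
        points="all",      # show every sample
        jitter=0,          # no vertical scatter, so points form a strip
        pointpos=-0.3,     # place the strip just below the baseline
        name=name,
    )
    return go.Figure(data=[trace])

# Usage (requires plotly):
# fig = kde_violin(m)
# fig.show()
```

The KDE is computed by plotly.js in the browser here, so no scipy or scikit-learn is involved.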

ursus · May 14, 2019, 11:35pm

I forgot that there are also other packages.

Scipy is indeed the package used under the hood in ff.create_distplot(). However, its gaussian_kde only supports the Gaussian kernel.

The KDE implementation in scikit-learn is richer.

For my case:

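from sklearn.neighbors import KernelDensity

# X and X_plot were not defined in the original post; these are assumed definitions.
# scikit-learn expects 2-D arrays of shape (n_samples, n_features).
X = m[:, np.newaxis]
X_plot = np.linspace(m.min(), m.max(), 1000)[:, np.newaxis]

# Rule-of-thumb bandwidth, shared by all kernels
bandwidth = (4 * np.std(m)**5 / (3 * len(m)))**(1/5)

fig = go.FigureWidget()
for kernel in ['gaussian', 'tophat', 'epanechnikov',
               'exponential', 'linear', 'cosine']:
    kde = KernelDensity(kernel=kernel, bandwidth=bandwidth).fit(X)
    log_dens = kde.score_samples(X_plot)  # log of the estimated density
    fig.add_scatter(x=X_plot[:, 0], y=np.exp(log_dens), line=dict(width=1.5), name=kernel, showlegend=True)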
hist = fig.add_histogram(x=m,  xbins=dict(start=m.min(), end=m.max(), size=0.0002), 
                         opacity=0.6, 
                         marker=dict(color='rgb(0, 0, 100)'), showlegend=False)
rug = fig.add_scatter(x=m, y=np.zeros(len(m)), mode='markers', 
                      marker=dict(symbol='line-ns-open', color='rgb(0, 0, 100)'),
                      yaxis='y2',
                      showlegend=False
                     )

fig.layout = dict(
            xaxis1=dict(domain=[0.0, 1.0],
                        anchor='y2',
                        zeroline=False),
            yaxis1=dict(domain=[0.15, 1],
                        anchor='free',
                        zeroline=False,
                        position=0.0),
            yaxis2=dict(domain=[0, 0.15],
                        anchor='x1',
                        zeroline=False,
                        dtick=1,
                        showticklabels=False)
)
fig

Scikit-learn requires you to set the bandwidth. Bandwidth estimation is on their TODO list.

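I have used the normal-reference rule of thumb, (4σ^5/(3n))^(1/5) ≈ 1.06·σ·n^(-1/5), which is close to scipy's default (Scott's rule, σ·n^(-1/5)) but about 6% wider.
Notice that the same value is not suitable as the bin size of the histogram, so I adjusted the bin size by hand until it fit nicely. With 5000 samples and a bin size of 0.0002, each count equals n × bin size × density = density, which is why the raw counts line up with the KDE curves. That might not be the right approach. The problem above has revealed that the default bin size in Histogram is not suitable for arrays with low standard deviation (often I also find it to be too large).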
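For comparison, the rule used in the snippet above and Scott's rule (which, as far as I know, is what scipy.stats.gaussian_kde applies when bw_method is left at its default) can be put side by side. This is a small numeric sketch; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.normal(loc=0.08, scale=0.0008, size=5000)

n = len(m)
sigma = np.std(m, ddof=1)

# Scott's rule for 1-D data: h = sigma * n^(-1/5)
h_scott = sigma * n ** (-1 / 5)

# Normal-reference rule: h = (4 * sigma^5 / (3 * n))^(1/5)
h_normal_ref = (4 * sigma ** 5 / (3 * n)) ** (1 / 5)

# The two differ only by the constant factor (4/3)^(1/5), about 1.059
ratio = h_normal_ref / h_scott
```

Either value can be passed as bandwidth to KernelDensity; the difference is small compared with the effect of switching kernels.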

I’ve just found this plotly page which shows how it should be done

Also: my second objection above is invalid. A probability density can exceed 1 as long as it integrates to 1, and here stats.norm.pdf(0, scale=0.0008) ≈ 498.68.
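The figure is easy to check with the standard library alone. This is a small sketch; NormalDist stands in for scipy's stats.norm here, and the integration range of ±6σ is an arbitrary choice.

```python
import math
from statistics import NormalDist

sigma = 0.0008
dist = NormalDist(mu=0.0, sigma=sigma)

# Peak of the density at the mean
peak = dist.pdf(0.0)
closed_form = 1.0 / (sigma * math.sqrt(2.0 * math.pi))

# Riemann sum of the density over +/- 6 sigma: the area stays 1
# even though the peak is in the hundreds.
step = sigma / 100
xs = [-6 * sigma + i * step for i in range(1201)]
area = sum(dist.pdf(x) for x in xs) * step
```

Large y-axis values on a 'probability density' histogram are therefore expected whenever the data span much less than one unit.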

ursus · May 18, 2019, 11:34am

@nicolaskruchten if you consider adding a lightweight kde curve to px.histogram you may want to look at the KDEpy package.

The reported speed is orders of magnitude better.

dpraz · November 19, 2023, 1:55pm

@nicolaskruchten Just wanted to chime in and see if there is still hope for px.kde() to be implemented? I’m also quite frustrated with ff.create_distplot(), and it is currently the only option for me to add a marginal kde() to my other plotly figures. Currently, there’s no option for a vertical orientation, and I’m not able to achieve a PDF curve as I would like.

I think px.kde() would be an elegant solution and add another excellent option for robust statistical, graphical representation to plotly express.
