Subplot for Kernel Density Estimator

Hello all, I am trying to plot the kernel density estimate (KDE) for each feature of my dataset. The dataset has a binary target feature, "Class", which is either 0 or 1. Each KDE plot has one line for Class = 0 and one line for Class = 1. I have 55 features, so 55 plots, and I would like 5 plots per row, so 11 rows and 5 columns. Yet when I build the subplot, all 55 plots seem to be stacked on top of each other. Any help would be appreciated. I am new to Plotly and English is not my native language, forgive me for that.

Here is my code:

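import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy.stats import gaussian_kde

# Isolate each target class as a new dataframe, remove the categorical variable EJ
# and remove the Class target.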
class_0 = df[df["Class"]==0].drop(["Class", "EJ"], axis=1).dropna()
class_1 = df[df["Class"]==1].drop(["Class", "EJ"], axis=1).dropna()

# 55 plots since I have 55 features, 5 plots per row
fig = make_subplots(rows=11, cols=5)

# Iterate through both dataframes
for (col_name0, col_data0), (col_name1, col_data1) in zip(class_0.items(), class_1.items()):

    # Index to know where to place each plot on the subplot
    index = 0

    # Class 0 KDE
    kde_data0 = gaussian_kde(class_0[col_name0])
    kde_range0 = np.linspace(
        class_0[col_name0].min() - class_0[col_name0].max() * 0.1,
        class_0[col_name0].max() + class_0[col_name0].max() * 0.1,
        len(class_0[col_name0]),
    )

    estimated_values0 = kde_data0.evaluate(kde_range0)
    estimated_values_cum0 = np.cumsum(estimated_values0)
    estimated_values_cum0 /= estimated_values_cum0.max()


    # Class 1 KDE
    kde_data1 = gaussian_kde(class_1[col_name1])
    kde_range1 = np.linspace(
        class_1[col_name1].min() - class_1[col_name1].max() * 0.1,
        class_1[col_name1].max() + class_1[col_name1].max() * 0.1,
        len(class_1[col_name1]),
    )

    estimated_values1 = kde_data1.evaluate(kde_range1)
    estimated_values_cum1 = np.cumsum(estimated_values1)
    estimated_values_cum1 /= estimated_values_cum1.max()

    # Get the correct row number and col number to place each kde plot on the subplot
    row_num = (index % 11) + 1
    col_num = (index // 11) + 1

    # First KDE plot with class 0
    kde_plot = go.Scatter(name='Class 0', x=kde_range0, y=estimated_values0)
    fig.add_trace(kde_plot, row=row_num, col=col_num)

    # We add on the same plot the KDE for class 1
    kde_plot = go.Scatter(name='Class 1', x=kde_range1, y=estimated_values1)

    # We add the kde plot to the subplot
    fig.add_trace(kde_plot, row=row_num, col=col_num)

    # Increment the index
    index += 1

fig.update_layout(template='plotly_dark', height=2000, title_text='Kernel Density Estimation for Each Feature')
fig.show()

Here is the result of this code:

And here is a single KDE plot:


Hi @ldeco, welcome to the community.

Here is an example of how to create your graph. I am not sure yet where exactly the error in your code is, though.

import plotly.graph_objects as go
from plotly.subplots import make_subplots
import numpy as np

rows = 5
cols = 10

def create_trace():
    return go.Scatter(x=np.arange(10), y=np.random.randint(1, 20, size=10))
    
fig = make_subplots(rows=rows, cols=cols)

for r in range(1, rows+1):
    for c in range(1, cols+1):
        fig.add_trace(create_trace(), row=r, col=c)
        fig.add_trace(create_trace(), row=r, col=c)
fig.show()


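This is the problem: the line index = 0.

Within your loop you reset your index on every iteration, so row_num and col_num are always 1 and every trace lands in the first cell. Move this line out of the loop, above the for statement. Maybe it's just this.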
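
A minimal sketch of that fix, assuming the same 11-row, 5-column grid filled column by column as in the question. The helper subplot_position and the placeholder feature names are hypothetical and not part of Plotly; enumerate() is used here instead of a hand-maintained counter so that nothing inside the loop body can reset it.

```python
def subplot_position(index, n_rows=11):
    """Map a 0-based plot index to a 1-based (row, col) pair.

    Cells are filled column by column, matching the question's
    row_num = (index % 11) + 1 and col_num = (index // 11) + 1.
    """
    col, row = divmod(index, n_rows)
    return row + 1, col + 1


# Placeholder names standing in for the 55 feature columns (assumption).
features = [f"feature_{i}" for i in range(55)]

# enumerate() supplies the counter, so it advances on every iteration.
positions = [subplot_position(i) for i, _ in enumerate(features)]

print(positions[0], positions[11], positions[54])
```

In the original loop, the returned pair would be passed as row= and col= to both fig.add_trace calls.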

Thanks for this. I was trying to create a Plotly version of a KDE plot with overlays grouped by a column in the dataframe, and I tried to simplify it. The only downside is that it takes about 10 minutes on a dataset of 0.5 million rows.

import numpy as np
import plotly.graph_objects as go
from scipy.stats import gaussian_kde

def kde_plot_plotly(data, target, group_col):
  '''
  Plot one KDE curve per group on a single figure.

  target    = continuous variable
  group_col = categorical column
  '''
  # fig = make_subplots(rows=1, cols=1)
  fig = go.Figure()

  for item in data[group_col].unique():
    print(item,'\n-------')
    kde_df = data[data[group_col]==item][[target]].dropna()
    kde_data = gaussian_kde(kde_df[target])
    kde_range = np.linspace(
        kde_df[target].min() - kde_df[target].max() * 0.1,
        kde_df[target].max() + kde_df[target].max() * 0.1,
        len(kde_df[target])
        )

    # this step takes 95% of the total time
    estimated_values = kde_data.evaluate(kde_range)

    kde_plot = go.Scatter(name=item, x=kde_range, y=estimated_values)
    fig.add_trace(kde_plot)

  fig.update_layout(template='plotly_dark', height=400, title_text=f'Kernel Density Estimation for each {target}')
  fig.show()
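
One likely reason for the long runtime, offered as a sketch rather than a measured result: kde_range above has as many points as the data itself, and gaussian_kde sums a kernel over every data point for each evaluation point, so the work grows roughly with the square of the row count. Evaluating on a fixed-size grid should keep the curve smooth while cutting the cost considerably. The helper name kde_on_grid, the grid size of 200 and the padding by a fraction of the data range are assumptions for illustration, not taken from the function above.

```python
import numpy as np
from scipy.stats import gaussian_kde


def kde_on_grid(values, n_points=200, pad=0.1):
    """Evaluate a Gaussian KDE on a fixed-size grid (hypothetical helper).

    n_points and pad are illustrative defaults, not tuned values.
    """
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    margin = (hi - lo) * pad  # pad by a fraction of the data range
    grid = np.linspace(lo - margin, hi + margin, n_points)
    # Cost is now n_points * len(values) instead of len(values) ** 2.
    density = gaussian_kde(values).evaluate(grid)
    return grid, density


# Synthetic data standing in for one group of the target column.
rng = np.random.default_rng(0)
sample = rng.normal(size=5_000)
grid, density = kde_on_grid(sample)
```

Inside kde_plot_plotly, grid and density would take the place of kde_range and estimated_values in the go.Scatter call.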