How does create_dendrogram define similarity?

I found a really great tutorial on the website about making dendrogram plots with heatmaps:

However, the create_dendrogram function is very much a black box, and the documentation doesn’t describe how distances between samples are actually computed (e.g., Euclidean distance?). It works both when I plug in a correlation matrix and when I plug in the data itself (see below, where df is a pandas DataFrame of my data). Is it computing the pairwise Euclidean distance for all rows and using those as the distances in the dendrogram?

Correlation matrix:
FF.create_dendrogram(df.corr(), orientation='bottom', labels=labels)

Just the data itself:
FF.create_dendrogram(df.T, orientation='bottom', labels=labels)

Thanks!
chris

Hi Chris,

I’m aware this is a little late. I wanted to know the same things (actually, I wanted to know how to change them), so I dug through the plotly code to find out. For reference, the create_dendrogram function is in tools.py. As you might expect, the code uses scipy to calculate the underlying clustering used in the dendrogram, with the relevant bits being:

import scipy.cluster.hierarchy as sch
import scipy.spatial as scs

and

d = scs.distance.pdist(X)
Z = sch.linkage(d, method='complete')
P = sch.dendrogram(Z, orientation=self.orientation, labels=self.labels, no_plot=True)

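What’s clear from this is that when creating d (the condensed pairwise distance matrix), no distance metric is specified, so pdist falls back to scipy’s default (Euclidean), and the linkage method used to calculate Z (the clustering) is hard-coded to 'complete'. So to answer your question: yes, it computes the pairwise Euclidean distance between the rows of whatever you pass in. That also means that if you pass df.corr(), the rows of the correlation matrix are themselves treated as observations, and the distances are Euclidean distances between those rows rather than the correlations. So far as I can tell, there is currently no way to change these parameters without actually changing the plotly base code, which I’ve admittedly done before, as I needed a different metric (Pearson correlation).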
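
If it helps, here is a rough sketch of the same scipy steps run outside plotly, so you can see the default metric and try a different one. This is not plotly's code: the random array X, the labels, and the choice of 'average' linkage are all made up for illustration. Only pdist, linkage, and dendrogram are the real scipy calls from the snippet above.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
import scipy.spatial.distance as ssd

# Hypothetical data: 6 samples (rows) with 10 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 10))
labels = [f"s{i}" for i in range(X.shape[0])]

# With no metric argument, pdist uses Euclidean distance between rows.
d_default = ssd.pdist(X)
d_euclid = ssd.pdist(X, metric="euclidean")

# Correlation distance between rows: 1 - Pearson r.
d_corr = ssd.pdist(X, metric="correlation")

# Cluster on the correlation distances. 'average' linkage is an arbitrary
# choice for this sketch; plotly's snippet above uses 'complete'.
Z = sch.linkage(d_corr, method="average")

# no_plot=True returns the dendrogram layout as a dict without drawing it.
P = sch.dendrogram(Z, labels=labels, no_plot=True)
```

The entries of d_corr follow the row pairs in order (0, 1), (0, 2), and so on, so d_corr[0] is the correlation distance between the first two rows. P['ivl'] holds the leaf labels in the order the dendrogram would draw them.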