Proper Data Format for Sankey

I am trying to feed a dataframe into a Sankey diagram. The counts and the paths look right, but the labels are wrong or undefined. I’m new to this data structure, so I’m confused about how the labels are getting mixed up or left undefined.

To see where I was going wrong, I tried replicating the Scottish_df example here (with the limited data it shows), https://plot.ly/~alishobeiri/1591/plotly-sankey-diagrams/#/, by copying the Python code and running it.

It seems to have the same problem: some of the nodes that are in the data are missing from the graph, and the links end in “undefined” instead of the Yes/No in the example.

Any pointers would be super helpful!

Hi @imadsen,

Welcome to the forums!

Could you add the code and dataset for what you tried? I didn’t see a reference to the full dataset in the example that you linked to. (To add code to a forum post, put it inside a fenced code block as described at https://help.github.com/en/articles/creating-and-highlighting-code-blocks.)

If possible, please also add a screenshot of the result that you’re getting. (You can add an image to a forum post by dragging the image file into the text area.)

-Jon

Thanks for the fast response and welcome Jon!

That actually might be part of the problem: I’m not sure what data I need to get the desired result.

code:

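from plotly.offline import init_notebook_mode, iplot

# scottish_df is the five-row dataframe shown below
init_notebook_mode(connected=True)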
data_trace = dict(
    type='sankey',
    domain = dict(
      x =  [0,1],
      y =  [0,1]
    ),
    orientation = "h",
    valueformat = ".0f",
    node = dict(
      pad = 10,
      thickness = 0,
      line = dict(
        color = "black",
        width = 0
      ),
      label =  scottish_df['Node, Label'].dropna(axis=0, how='any'),
      color = scottish_df['Color']
    ),
    link = dict(
      source = scottish_df['Source'].dropna(axis=0, how='any'),
      target = scottish_df['Target'].dropna(axis=0, how='any'),
      value = scottish_df['Value'].dropna(axis=0, how='any'),
      color = scottish_df['Link Color'].dropna(axis=0, how='any'),
  )
)

layout =  dict(
    title = "Scottish Referendum Voters who now want Independence",
    height = 900,
    font = dict(
      size = 10
    ),    
)

fig = dict(data=[data_trace], layout=layout)
iplot(fig, validate=False)

The data is (same as the link):

| |Source|Target|Value|Color|Node, Label|Link Color|
|---|---|---|---|---|---|---|
|1|0|5|20|#F27420|Remain+No – 28|rgba(253, 227, 212, 0.5)|
|2|0|6|3|#4994CE|Leave+No – 16|rgba(242, 116, 32, 1)|
|3|0|7|50|#FABC13|Remain+Yes – 21|rgba(253, 227, 212, 0.5)|
|4|1|5|14|#7FC241|Leave+Yes – 14|rgba(219, 233, 246, 0.5)|
|5|1|6|50|#D3D3D3|Didn’t vote in at least one referendum – 21|rgba(73, 148, 206, 1)|

Ideal outcome:

My outcome:

Assuming it doesn’t look right because of missing data, I think what I’m confused about is how the labels get associated with the right source/target. For example, the target shows as undefined, but how do I define it?

Hi @imadsen,

Could it be the case that the dataset you’re working with isn’t the full dataset from the example? Your diagram has only 5 links, matching the five rows you printed out from the dataframe. The original example has ~15 links, so it must have been created from a larger dataframe.

The link.source and link.target properties are arrays of integer indices into the node.label array of strings. What’s happening in your example, I think, is that the indices in the link.target array are all >= 5, and the list of node.label strings only has 5 elements, so that’s why the target nodes are undefined.
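As a rough illustration of that index relationship, here is a minimal sketch in plain Python that looks for link indices with no matching label. The helper name `find_undefined_nodes` is made up for this example and is not part of plotly; the lists are copied from the five rows posted above.

```python
def find_undefined_nodes(labels, sources, targets):
    """Return the node indices used by links that have no entry in `labels`.

    Hypothetical helper for illustration only; not a plotly function.
    """
    n_labels = len(labels)
    used = set(sources) | set(targets)
    # A valid index must satisfy 0 <= i < len(labels).
    return sorted(i for i in used if i < 0 or i >= n_labels)


# Values copied from the five rows in the post above.
labels = [
    "Remain+No – 28",
    "Leave+No – 16",
    "Remain+Yes – 21",
    "Leave+Yes – 14",
    "Didn’t vote in at least one referendum – 21",
]
sources = [0, 0, 0, 1, 1]
targets = [5, 6, 7, 5, 6]

missing = find_undefined_nodes(labels, sources, targets)
print(missing)  # [5, 6, 7]
```

With only five labels, the valid indices are 0 through 4, so every target in this data points past the end of the label list.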

Hope that helps!
-Jon


Hey Jon,
I believe that was part 1 of my issue, thanks for explaining!

The other thing I was primarily stuck on was how labels are assigned to nodes. I found that it’s by position: each link’s source ID and target ID is the index of a node in the label array. (This is different from, say, ipysankeywidget, which uses source NAME and target NAME instead of IDs, and then uses those names as the node labels.)
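For anyone starting from name-based links, here is a minimal sketch of converting them to the index-based form described in this thread. The function name `links_to_indices` and the node names "A", "B", "X", "Y" are invented for illustration and do not come from the dataset or from either library.

```python
def links_to_indices(named_links):
    """Convert (source_name, target_name, value) tuples to index-based lists.

    Hypothetical helper for illustration only. Each distinct name gets the
    next free position in `labels`, and links refer to those positions.
    """
    labels = []
    index_of = {}
    sources, targets, values = [], [], []
    for src, tgt, val in named_links:
        for name in (src, tgt):
            if name not in index_of:
                index_of[name] = len(labels)
                labels.append(name)
        sources.append(index_of[src])
        targets.append(index_of[tgt])
        values.append(val)
    return labels, sources, targets, values


# Made-up links, just to show the shape of the result.
named_links = [("A", "X", 20), ("A", "Y", 3), ("B", "X", 50)]
labels, sources, targets, values = links_to_indices(named_links)

print(labels)   # ['A', 'X', 'Y', 'B']
print(sources)  # [0, 0, 3]
print(targets)  # [1, 2, 1]
```

The four lists would then go into node.label, link.source, link.target, and link.value respectively, so every index is guaranteed to have a label.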

Hope this dialog is helpful for anyone else new to plotly sankey diagrams!

Thanks Jon!
