Sankey / Parcat: pros & cons, would it be possible to combine their pros in a new kind of chart?

Hi there,

Every time I start working with a dataset, I first look for discrepancies such as missing values or values that should not be present.

Often, tables have records whose valid values depend on at least one other field: a value might make sense for a record of type A, but not for a record of type B.

One way to catch such discrepancies is to build many tables with groupby and count.
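A minimal sketch of that groupby-and-count approach, assuming pandas and a made-up table whose column names (`group`, `type`, `subtype`) and values are purely illustrative:

```python
import pandas as pd

# Hypothetical records: "group" decides which other fields are expected.
df = pd.DataFrame(
    {
        "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
        "type": [None, None, None, "T1", "T1", "T1", "T2", None],
        "subtype": [None, None, "S1", None, "S1", None, None, None],
    }
)

# Reduce each field to "filled or not" so the table stays small
# whatever the raw values are.
flags = df.assign(
    has_type=df["type"].notna(),
    has_subtype=df["subtype"].notna(),
)

# One row per observed combination, with the number of records in it.
combos = (
    flags.groupby(["group", "has_type", "has_subtype"])
    .size()
    .reset_index(name="n_records")
)
print(combos)
```

Each row of `combos` is one combination of filled/empty fields per group, which is readable for two or three fields but grows quickly with more of them.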

Another way is to plot a Sankey diagram or a parcat, after having remapped the expected values to “Valid” and the other ones to “Invalid”, for instance, to make the output more concise.
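The remapping step could look like the sketch below. It assumes pandas; the `status` column and the set of expected values are made up for the illustration:

```python
import pandas as pd

# Hypothetical column and hypothetical set of expected values.
status = pd.Series(["open", "closed", "opne", None, "closed"], name="status")
expected = {"open", "closed"}

# Anything outside the expected set, missing values included, becomes "Invalid".
remapped = status.isin(expected).map({True: "Valid", False: "Invalid"})
print(remapped.value_counts())
```

The remapped column can then be used as a Sankey node or a parcat dimension instead of the raw values.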

Sankey diagrams are great because it is possible to color specific nodes, or specific links between nodes. However, if there are too many “levels”, the information about which share of a parent node ends up with a value XYZ in a subnode is lost. A recent example of this was posted yesterday here: Show and tell: Sankey plot with dash - #3 by David22
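To illustrate the per-link coloring, here is a small sketch. The flows, the labels and the "unexpected" rule are all invented; the only plotly pieces used are `go.Sankey` with its `node.label` and `link.source` / `link.target` / `link.value` / `link.color` attributes. The data preparation is plain Python, and plotly is only imported inside the hypothetical `build_sankey` helper:

```python
# Hypothetical flows: (source label, target label, number of records).
flows = [
    ("Group A", "No type", 40),
    ("Group A", "Has type", 2),   # unexpected for group A in this made-up example
    ("Group B", "Has type", 50),
    ("Group B", "No type", 8),    # unexpected for group B
]
# Hypothetical rule: the (source, target) pairs that should not occur.
unexpected = {("Group A", "Has type"), ("Group B", "No type")}

# Sankey links refer to nodes by position in the label list.
labels = sorted({name for src, dst, _ in flows for name in (src, dst)})
index = {name: i for i, name in enumerate(labels)}

link = {
    "source": [index[src] for src, _, _ in flows],
    "target": [index[dst] for _, dst, _ in flows],
    "value": [n for _, _, n in flows],
    # One color per link: red for the unexpected ones, light gray otherwise.
    "color": [
        "rgba(214, 39, 40, 0.8)" if (src, dst) in unexpected
        else "rgba(180, 180, 180, 0.4)"
        for src, dst, _ in flows
    ],
}


def build_sankey():
    """Return a plotly Figure (plotly is imported here, not at module level)."""
    import plotly.graph_objects as go

    return go.Figure(go.Sankey(node={"label": labels}, link=link))
```

Calling `build_sankey().show()` displays the diagram. With only two levels the red links are unambiguous; with more levels, the limitation described above applies.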

In the example provided in the initial post, it is not possible to know how many of the applications made via “Indeed”, “Linkedin” or “Other” ended with a “No Response”. We only know that x% of the Applications ended with a “No Response”.

With a Parcat, we don’t lose this part of the information: a parcat shows the entire path for each group of records. But it is not possible (as far as I know) to color either a specific “node” or a specific line based on its value. I know it is possible to color a whole path/line based on the value taken by one category, but it wouldn’t be possible to plot something like this:

Such a chart would have the merit of immediately showing that, among the records in group B belonging to a Category, many don’t have any sub-category. Or that some of the records with a type lack their subtype.

It might be normal for the records in group A not to have any Type or SubType; as such, there is no need to highlight anything there. However, I would like to show a red line if some records in group A actually have a subtype by mistake.
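For reference, the whole-path coloring that a parcat does allow can be sketched as follows. The records and the `is_suspect` rule are hypothetical; the plotly side only relies on `go.Parcats` with `dimensions` and `line.color` / `line.colorscale` / `line.cmin` / `line.cmax`. This colors an entire path red when the record breaks the rule, which is not the per-segment highlighting described above:

```python
# Hypothetical records as (group, type flag, subtype flag).
records = [
    ("A", "No type", "No subtype"),
    ("A", "No type", "Has subtype"),   # breaks the made-up rule below
    ("B", "Has type", "Has subtype"),
    ("B", "Has type", "No subtype"),
    ("B", "No type", "No subtype"),    # breaks the made-up rule below
]


def is_suspect(group, type_flag, subtype_flag):
    """Hypothetical rule: group A must have no subtype, group B must have a type."""
    if group == "A":
        return subtype_flag == "Has subtype"
    return type_flag == "No type"


# 1 for records that break the rule, 0 otherwise (one value per record).
suspect = [int(is_suspect(*rec)) for rec in records]


def build_parcats():
    """Return a plotly Figure (plotly is imported here, not at module level)."""
    import plotly.graph_objects as go

    dimensions = [
        {"label": label, "values": [rec[i] for rec in records]}
        for i, label in enumerate(["Group", "Type", "SubType"])
    ]
    # Map 0 to gray and 1 to red; cmin/cmax pin the scale even if no record is suspect.
    line = {
        "color": suspect,
        "colorscale": [[0, "lightgray"], [1, "red"]],
        "cmin": 0,
        "cmax": 1,
    }
    return go.Figure(go.Parcats(dimensions=dimensions, line=line))
```

Since the color is baked into the paths rather than shown on hover, it should also survive in a printed report.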

In a perfect world, there would be processes in place to check that data inputs feeding a data warehouse make sense, but that is not always the case. Then, often with a delay, someone starts thinking that “the data should be cleaned, because the outputs are noisy”.

Naively counting the NaN/None/Missing values/Default records won’t be enough because, as written above, for some records it is expected and normal to not have any value. Moreover, when different teams are in charge of the input process, they might not fill the same fields, and there might be different combinations of filled values for the same kind of records, simply because someone forgot to make field X mandatory.
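To make the difference with a naive count concrete, here is a sketch of a rule-based check in plain Python. The `RULES` table, the field names and the definition of "missing" are all assumptions made for the example:

```python
# Hypothetical rules: per group, fields that must be filled and fields that must stay empty.
RULES = {
    "A": {"required": [], "forbidden": ["type", "subtype"]},
    "B": {"required": ["type"], "forbidden": []},
}


def is_missing(value):
    """Treat None and empty strings as missing (an assumption; adapt to your defaults)."""
    return value is None or value == ""


def check(record):
    """Return the list of problems found in one record (a dict)."""
    rule = RULES[record["group"]]
    problems = [f"missing {f}" for f in rule["required"] if is_missing(record.get(f))]
    problems += [
        f"unexpected {f}" for f in rule["forbidden"] if not is_missing(record.get(f))
    ]
    return problems


records = [
    {"group": "A", "type": None, "subtype": None},   # fine: nothing is expected for A
    {"group": "A", "type": None, "subtype": "S1"},   # subtype filled by mistake
    {"group": "B", "type": None, "subtype": None},   # type is required for B
    {"group": "B", "type": "T1", "subtype": None},   # fine
]
for rec in records:
    print(rec["group"], check(rec))
```

A naive count would report missing values in three of these four records, while only two of them actually break a rule.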

So, long story short:

  1. What kind of charts do you usually use to show/catch data discrepancies when there are dependencies between the different fields?
  2. Is there any chance for the Parcats charts to become a bit more customizable? The goal is to highlight the issues if there are some.

One more thing: I’m generating reports that are going to be printed, which means that I lose everything displayed “onhover” or via any texttemplate.