WordCloud in DASH

Thank you very much, @eliasdabbas.

This is not reasonable. In real programs, it is quite likely that several words have the same frequency.

Then they would have the same size in the word cloud.

Why would that be unreasonable?

The reason I gave was not right; I had not thought about it deeply. But I still think the approach is not reasonable. A word cloud is meant to fill a given area with words. In your approach, the more words there are, the more they overlap, so the result is more like a bubble chart. To solve this problem you need more space, that is, larger x and y axes. Traditional word clouds save space. Also consider this: if one word has a very high frequency and another a very low one, and they happen to land close together, the small one will disappear completely. Still, your idea is the best one so far.

This line of code ensures that the size of each word lies between 15 and 35.
So nothing is going to be smaller than 15 or larger than 35 in size. These limits can be changed, of course.

I’ve pointed out that you can’t solve the overlap problem.
Just plot a data sample with a large variance and you will notice what I am talking about.

Just plot it and resize the browser window, and you will see what I mean.

Having data with big differences is common, and there are many ways to deal with it when plotting. One of the main ones is normalizing the numbers to a certain range.

Maybe if you can share an example?

Here is one example. Even if you normalize the numbers, the words will overlap when you change the window size, because a scatter plot does not try to fill the area: the smaller the window, the closer together the texts are.

Besides, the hover info is not accurate, because the x values of the texts are so close together. I find that the hover info is based on differences in the x value rather than on the text itself. When you hover over one text, the info shown may belong to another one, because their x values are similar.

My English is poor; I hope I am making myself understood.

import pandas as pd
import plotly as py
import plotly.graph_objs as go
import random

words = ['征信', '拍拍贷', '查询', '报告', '贷款', '个人', '怎么', '信用卡', '逾期', '被拒', '如何', '中心', '信用', '网贷', '人人', '分期', '注册', '好信', '手机', '钱包', '个人信用', '借呗', '平安', '捷信', '微粒贷', '借钱', '记录', '用钱', '可以', '花呗', '身份证', '拍拍', '现金', '微信', '还款', '问问', '产品', '51', '信而富', '什么', '黑名单', '360', '17', '黑户', '怎么办', '金融', '帮你贷', '消除', '密码', '账号', '怎样', '分期乐', '拒绝', '申请']
frequency = [1083, 393, 353, 167, 123, 119, 83, 64, 57, 46, 44, 40, 37, 31, 29, 29, 28, 26, 25, 23, 23, 22, 21, 19, 18, 18, 18, 18, 18, 17, 15, 15, 15, 15, 14, 14, 13, 13, 13, 13, 13, 13, 12, 12, 11, 11, 11, 11, 10, 10, 10, 10, 10, 10]
percent = [0.362086258776329, 0.13139418254764293, 0.11802072885322636, 0.055834169174189235, 0.041123370110330994, 0.03978602474088933, 0.02774991641591441, 0.02139752591106653, 0.01905717151454363, 0.015379471748579069, 0.01471079906385824, 0.013373453694416584, 0.012370444667335341, 0.010364426613172852, 0.009695753928452023, 0.009695753928452023, 0.009361417586091608, 0.008692744901370779, 0.008358408559010365, 0.0076897358742895345, 0.0076897358742895345, 0.00735539953192912, 0.007021063189568706, 0.006352390504847877, 0.006018054162487462, 0.006018054162487462, 0.006018054162487462, 0.006018054162487462, 0.006018054162487462, 0.0056837178201270475, 0.005015045135406218, 0.005015045135406218, 0.005015045135406218, 0.005015045135406218, 0.004680708793045804, 0.004680708793045804, 0.0043463724506853894, 0.0043463724506853894, 0.0043463724506853894, 0.0043463724506853894, 0.0043463724506853894, 0.0043463724506853894, 0.004012036108324975, 0.004012036108324975, 0.00367769976596456, 0.00367769976596456, 0.00367769976596456, 0.00367769976596456, 0.003343363423604146, 0.003343363423604146, 0.003343363423604146, 0.003343363423604146, 0.003343363423604146, 0.003343363423604146]
lenth = len(words)
colors = [py.colors.DEFAULT_PLOTLY_COLORS[random.randrange(1, 10)] for i in range(lenth)]

data = go.Scatter(
    x=random.choices(range(lenth), k=lenth),
    y=random.choices(range(lenth), k=lenth),
    mode='text',
    text=words,
    hovertext=['{0}{1}{2}'.format(w, f, format(p, '.2%')) for w, f, p in zip(words, frequency, percent)],
    hoverinfo='text',
    # raw frequencies are used as font sizes here, which is what makes the words overlap
    textfont={'size': frequency, 'color': colors})
layout = go.Layout({'xaxis': {'showgrid': False, 'showticklabels': False, 'zeroline': False},
                    'yaxis': {'showgrid': False, 'showticklabels': False, 'zeroline': False}})
fig = go.Figure(data=[data], layout=layout)

py.offline.plot(fig)

Two things can help with this:

  1. Normalizing the numbers as I mentioned. In this case I normalized them between 15 and 45 as follows:

    frequency = [1083, 393, 353, 167, 123, 119, 83, 64, 57, 46, 44, 40, 37, 31, 29, 29, 28, 26, 25, 23, 23, 22, 21, 19, 18, 18, 18, 18, 18, 17, 15, 15, 15, 15, 14, 14, 13, 13, 13, 13, 13, 13, 12, 12, 11, 11, 11, 11, 10, 10, 10, 10, 10, 10]

    lower, upper = 15, 45
    frequency = [((x - min(frequency)) / (max(frequency) - min(frequency))) * (upper - lower) + lower for x in frequency]

  2. X axis problem: I suggest you don’t use random numbers for the X axis, because labels are more likely to overlap. So the suggested approach is to simply use range(len(data)) for the X axis values. This does NOT completely solve it, but it’s good enough in the majority of cases.

Resizing the window will change the positions, of course. Plotly is flexible and adjusts, but there is a limit. If you make the window very small, the words will eventually overlap :)
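The normalization in step 1 can also be wrapped in a small helper. This is only a sketch: the name `scale_sizes` and the midpoint fallback are assumptions added here, not part of the code in this thread. The fallback covers the case raised earlier where all words have the same frequency, which would otherwise divide by zero.

```python
def scale_sizes(values, lower=15, upper=45):
    """Min-max scale numbers into [lower, upper] so they can be used as font sizes."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All frequencies are equal: avoid dividing by zero and use the midpoint.
        return [(lower + upper) / 2 for _ in values]
    span = hi - lo
    return [(v - lo) / span * (upper - lower) + lower for v in values]


# The most frequent word gets the upper size, the least frequent the lower size.
sizes = scale_sizes([1083, 393, 353, 10])
```

The result can be passed as `textfont={'size': sizes, ...}` while the raw counts stay available for the hover text.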

Full modified code:

import pandas as pd
import plotly as py
import plotly.graph_objs as go
import random

words = ['征信', '拍拍贷', '查询', '报告', '贷款', '个人', '怎么', '信用卡', '逾期', '被拒', '如何', '中心', '信用', '网贷', '人人', '分期', '注册', '好信', '手机', '钱包', '个人信用', '借呗', '平安', '捷信', '微粒贷', '借钱', '记录', '用钱', '可以', '花呗', '身份证', '拍拍', '现金', '微信', '还款', '问问', '产品', '51', '信而富', '什么', '黑名单', '360', '17', '黑户', '怎么办', '金融', '帮你贷', '消除', '密码', '账号', '怎样', '分期乐', '拒绝', '申请']


frequency = [1083, 393, 353, 167, 123, 119, 83, 64, 57, 46, 44, 40, 37, 31, 29, 29, 28, 26, 25, 23, 23, 22, 21, 19, 18, 18, 18, 18, 18, 17, 15, 15, 15, 15, 14, 14, 13, 13, 13, 13, 13, 13, 12, 12, 11, 11, 11, 11, 10, 10, 10, 10, 10, 10]

lower, upper = 15, 45
frequency = [((x - min(frequency)) / (max(frequency) - min(frequency))) * (upper - lower) + lower for x in frequency]


percent = [0.362086258776329, 0.13139418254764293, 0.11802072885322636, 0.055834169174189235, 0.041123370110330994, 0.03978602474088933, 0.02774991641591441, 0.02139752591106653, 0.01905717151454363, 0.015379471748579069, 0.01471079906385824, 0.013373453694416584, 0.012370444667335341, 0.010364426613172852, 0.009695753928452023, 0.009695753928452023, 0.009361417586091608, 0.008692744901370779, 0.008358408559010365, 0.0076897358742895345, 0.0076897358742895345, 0.00735539953192912, 0.007021063189568706, 0.006352390504847877, 0.006018054162487462, 0.006018054162487462, 0.006018054162487462, 0.006018054162487462, 0.006018054162487462, 0.0056837178201270475, 0.005015045135406218, 0.005015045135406218, 0.005015045135406218, 0.005015045135406218, 0.004680708793045804, 0.004680708793045804, 0.0043463724506853894, 0.0043463724506853894, 0.0043463724506853894, 0.0043463724506853894, 0.0043463724506853894, 0.0043463724506853894, 0.004012036108324975, 0.004012036108324975, 0.00367769976596456, 0.00367769976596456, 0.00367769976596456, 0.00367769976596456, 0.003343363423604146, 0.003343363423604146, 0.003343363423604146, 0.003343363423604146, 0.003343363423604146, 0.003343363423604146]

lenth = len(words)
colors = [py.colors.DEFAULT_PLOTLY_COLORS[random.randrange(1, 10)] for i in range(lenth)]

data = go.Scatter(
    x=list(range(lenth)),
    y=random.choices(range(lenth), k=lenth),
    mode='text',
    text=words,
    hovertext=['{0}{1}{2}'.format(w, f, format(p, '.2%')) for w, f, p in zip(words, frequency, percent)],
    hoverinfo='text',
    textfont={'size': frequency, 'color': colors})
layout = go.Layout({'xaxis': {'showgrid': False, 'showticklabels': False, 'zeroline': False},
                    'yaxis': {'showgrid': False, 'showticklabels': False, 'zeroline': False}})

fig = go.Figure(data=[data], layout=layout)

py.offline.plot(fig)

Result:

Since random.choices is only found in Python 3.6 and above, do you have any alternate solution for folks running Python 3.5 or below with random.choice (the ‘s’ is missing)?

random.shuffle should work:

import random
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print('before:', x)
random.shuffle(x)
print('after: ', x)
before: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
after:  [2, 1, 8, 6, 5, 9, 4, 7, 10, 3]
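Note that the two are not exact equivalents: `random.choices` samples with replacement, while `random.shuffle` produces a permutation. A minimal sketch of both options for Python 3.5 and below follows; the variable names are illustrative only.

```python
import random

n = 10

# Closest match to random.choices(range(n), k=n):
# sampling WITH replacement, so values can repeat.
with_repeats = [random.randrange(n) for _ in range(n)]

# Sampling WITHOUT replacement: each position is used exactly once,
# so no two words share the same coordinate on this axis.
positions = list(range(n))
random.shuffle(positions)
```

For a word cloud, the permutation is arguably the better fit, since repeated coordinates make overlaps more likely.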

So, based on your code, I wrote this function that plots a Plotly word cloud from an input text.

from wordcloud import WordCloud, STOPWORDS
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot

def plotly_wordcloud(text):
    wc = WordCloud(stopwords = set(STOPWORDS),
                   max_words = 200,
                   max_font_size = 100)
    wc.generate(text)
    
    word_list=[]
    freq_list=[]
    fontsize_list=[]
    position_list=[]
    orientation_list=[]
    color_list=[]

    for (word, freq), fontsize, position, orientation, color in wc.layout_:
        word_list.append(word)
        freq_list.append(freq)
        fontsize_list.append(fontsize)
        position_list.append(position)
        orientation_list.append(orientation)
        color_list.append(color)
        
    # get the positions
    x=[]
    y=[]
    for i in position_list:
        x.append(i[0])
        y.append(i[1])
            
    # scale the relative occurrence frequencies to use them as font sizes
    new_freq_list = []
    for i in freq_list:
        new_freq_list.append(i*100)
    
    trace = go.Scatter(x=x, 
                       y=y, 
                       textfont = dict(size=new_freq_list,
                                       color=color_list),
                       hoverinfo='text',
                       hovertext=['{0}{1}'.format(w, f) for w, f in zip(word_list, freq_list)],
                       mode="text",  
                       text=word_list
                      )
    
    layout = go.Layout(
                       xaxis=dict(showgrid=False, 
                                  showticklabels=False,
                                  zeroline=False,
                                  automargin=True),
                       yaxis=dict(showgrid=False,
                                  showticklabels=False,
                                  zeroline=False,
                                  automargin=True)
                      )
    
    fig = go.Figure(data=[trace], layout=layout)
    
    return fig

text = "Wikipedia was launched on January 15, 2001, by Jimmy Wales and Larry Sanger.[10] Sanger coined its name,[11][12] as a portmanteau of wiki[notes 3] and 'encyclopedia'. Initially an English-language encyclopedia, versions in other languages were quickly developed. With 5,748,461 articles,[notes 4] the English Wikipedia is the largest of the more than 290 Wikipedia encyclopedias. Overall, Wikipedia comprises more than 40 million articles in 301 different languages[14] and by February 2014 it had reached 18 billion page views and nearly 500 million unique visitors per month.[15] In 2005, Nature published a peer review comparing 42 science articles from Encyclopædia Britannica and Wikipedia and found that Wikipedia's level of accuracy approached that of Britannica.[16] Time magazine stated that the open-door policy of allowing anyone to edit had made Wikipedia the biggest and possibly the best encyclopedia in the world and it was testament to the vision of Jimmy Wales.[17] Wikipedia has been criticized for exhibiting systemic bias, for presenting a mixture of 'truths, half truths, and some falsehoods',[18] and for being subject to manipulation and spin in controversial topics.[19] In 2017, Facebook announced that it would help readers detect fake news by suitable links to Wikipedia articles. YouTube announced a similar plan in 2018."

init_notebook_mode(connected=True)
iplot(plotly_wordcloud(text))

And this plots OK, but some parts of the figure get cut off:

I played around with the different layout parameters like autosize, automargin, pad, etc., like so:

layout = go.Layout(autosize=True,
                   xaxis=dict(showgrid=False, 
                              showticklabels=False,
                              zeroline=False,
                              automargin=True),
                   yaxis=dict(showgrid=False,
                              showticklabels=False,
                              zeroline=False,
                              automargin=True),
                   margin=go.layout.Margin(pad=1000),
                  )

But it doesn’t make any difference.

Also, as can be seen in the image above, there is word overlap. I tried y=random.shuffle(y) when defining the trace in go.Scatter, but that didn’t make any difference.

Any suggestions on how to fix these?
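A side note on the `random.shuffle` attempt: `random.shuffle` shuffles the list in place and returns `None`, so `y=random.shuffle(y)` hands `None` to the trace instead of a shuffled list. A short sketch with made-up sample values:

```python
import random

y = [3, 1, 4, 1, 5]

result = random.shuffle(y)  # y itself is now shuffled in place
print(result)               # prints None

# To get a shuffled copy and leave the original list untouched:
y_shuffled = random.sample(y, k=len(y))
```

So either call `random.shuffle(y)` on its own line before building the trace, or pass a shuffled copy such as `y_shuffled`.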

Short answer:
Don’t do word clouds :) I think they are misleading, and not very useful. They show sizes without numbers. It’s also difficult to compare words that have the same size but different numbers of letters. If “war” and “tremendous” had the same size, the longer word might look “bigger”.
I suggest you do a horizontal bar chart, which shows the most words, is natural to read, and you can add numbers so it’s clear which is bigger / smaller.
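A minimal sketch of that bar chart idea, written as a plain figure dictionary so it has no dependencies. The function name `top_words_bar`, the simple regex tokenizer, and the chart title are assumptions for illustration; stop words are not filtered here.

```python
import re
from collections import Counter


def top_words_bar(text, n=10):
    """Build a Plotly-style figure dict: horizontal bars for the n most common words."""
    words = re.findall(r"[a-z']+", text.lower())
    top = Counter(words).most_common(n)
    top.reverse()  # the first category is drawn at the bottom, so put the largest last
    labels = [word for word, _ in top]
    counts = [count for _, count in top]
    return {
        "data": [{
            "type": "bar",
            "orientation": "h",
            "x": counts,
            "y": labels,
            "text": counts,           # show the number on each bar
            "textposition": "auto",
        }],
        "layout": {"title": {"text": "Most frequent words"}},
    }


fig = top_words_bar("the cat and the hat and the bat", n=2)
```

The dictionary can be passed to `go.Figure(fig)` or used as the `figure` of a `dcc.Graph`.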

Longer answer:

  • Try running random.shuffle several times until you get one where things look good.
  • Since plotly gives an interactive chart users can zoom and pan, so it’s not a major issue.
  • I suggest you also remove the biggest words, because they take up too much space and they are already known. An article about Wikipedia will most likely have that word as the top word. It’s more interesting to know the second- and third-level words. This way, you will have a more evenly distributed set of words that is easier to read and has fewer overlaps.
  • Having zero overlaps is very complicated to implement, because words differ in length, letter shapes, and sizes. This solution is not perfect, but if you remove the biggest 2-3 words you should be able to get something that is 90% acceptable in most cases, with a few minor overlaps.
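One more thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: the wordcloud library appears to store each position in `layout_` as (row, column) measured from the top-left corner of the image, and it draws each word at the computed `fontsize`. The function above uses the first element as x and sizes the text by frequency, which could contribute to both the overlap and the clipping. A sketch of a conversion under that assumption follows; `layout_to_scatter_args` is a hypothetical helper, and the sample tuples are made up.

```python
def layout_to_scatter_args(layout):
    """Turn WordCloud.layout_-style tuples into keyword arguments for go.Scatter.

    Assumes each position is (row, col) from the image's top-left corner.
    Rotated words (the orientation field) are not handled.
    """
    words, x, y, sizes, colors = [], [], [], [], []
    for (word, _freq), fontsize, (row, col), _orientation, color in layout:
        words.append(word)
        x.append(col)    # column maps to the horizontal axis
        y.append(-row)   # image rows grow downward, plot y grows upward
        sizes.append(fontsize)
        colors.append(color)
    return dict(
        x=x, y=y, text=words, mode="text",
        textposition="bottom right",  # approximate a top-left anchor for each word
        textfont=dict(size=sizes, color=colors),
    )


# Made-up layout entries in the same shape that the loop above unpacks.
sample_layout = [
    (("wikipedia", 1.0), 80, (10, 20), None, "rgb(68, 1, 84)"),
    (("articles", 0.5), 40, (120, 60), None, "rgb(33, 145, 140)"),
]
scatter_args = layout_to_scatter_args(sample_layout)
```

With a real cloud this would be used as `go.Scatter(**layout_to_scatter_args(wc.layout_))`. Font sizes are in pixels and do not scale with the axes, so overlaps can still appear unless the figure is about the same size as the word cloud image.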

Good luck!

Hi kristada619,

I don’t know if you are interested, but I’ve used a third-party option for a word cloud. You can generate the word cloud with amueller’s wordcloud library, write the image to a file, and use Dash to publish the image.

I’ve tried the word clouds suggested in this post as well, but they didn’t work out for me.

See below link for amueller’s wordcloud.

I made a dash app which does this!

Yeah, I know how to display static images in plotly, but I don’t want my wordcloud to be a static image. I want to show an interactive wordcloud, where I would like to add features such as, say, hovering on a word shows the percentage of sentences in the document where the word appears, and upon clicking on a word displays the sentence(s) containing that word, etc.
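For what it’s worth, the data behind such hover and click features can be prepared independently of how the cloud is drawn. Below is a rough sketch under simple assumptions: sentences are split on ., ! and ?, words are lowercase letter runs, and `sentence_index` is a hypothetical helper name.

```python
import re
from collections import defaultdict


def sentence_index(text):
    """Map each word to the sentences containing it, plus its share of all sentences."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    index = defaultdict(list)
    for sentence in sentences:
        # set() so a word repeated within one sentence is counted once
        for word in set(re.findall(r"[a-z']+", sentence.lower())):
            index[word].append(sentence)
    share = {word: len(hits) / len(sentences) for word, hits in index.items()}
    return index, share


index, share = sentence_index("Dash is great. Plotly is great too. Hello world.")
```

`share[word]` could be formatted into `hovertext`, and `index[word]` could be shown from a callback that reads the graph’s `clickData`.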

Really awesome seeing all of the activity in this thread! Another set of solutions to explore would be to create your own Dash component. Dash components are frequently wrappers around existing React components or D3 graphs and it looks like there are some good components out there already: https://www.npmjs.com/package/react-d3-cloud, https://www.npmjs.com/package/react-wordcloud, https://github.com/jasondavies/d3-cloud.

We have many guides for creating components, see:

Wanted to add this quick solution for generating word clouds in Dash with the Python library wordcloud. This code generates a word cloud from a dictionary of frequencies. The wordcloud object is converted to an image, and the image is passed to Dash without having to save it to disk.

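import base64
from io import BytesIO

from dash import Dash, html  # Dash >= 2.0; on older versions: import dash_html_components as html
from wordcloud import WordCloud

# word -> frequency
di = {'abc': 10, 'def': 20, 'ghi': 2, 'jkl': 55}
wc = WordCloud().generate_from_frequencies(frequencies=di)
wc_img = wc.to_image()  # PIL image

# encode the PNG in memory instead of saving it to disk
with BytesIO() as buffer:
    wc_img.save(buffer, 'png')
    img2 = base64.b64encode(buffer.getvalue()).decode()

app = Dash(__name__)
app.layout = html.Div(children=[
    html.Img(src="data:image/png;base64," + img2)
])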