Figure Friday 2024 - week 38

Week 38 of Figure-Friday brings us data on H-1B visas in the US.

Every year, a random drawing determines which skilled foreigners get permission to work in the US. But according to Bloomberg, some companies don’t play by the rules. Read more at Eric Fan’s post or the full story published by Bloomberg.

We used the FY2021 data set for the sample figure below. Check out BloombergGraphics’ GitHub Repository for additional years.

Sample figure:

Code for sample figure:
import pandas as pd
import plotly.express as px

# Download data - https://github.com/plotly/Figure-Friday/blob/main/2024/week-38/fy2021.zip
df = pd.read_csv("TRK_13139_FY2021.csv") 
top_5_fields = df['BEN_PFIELD_OF_STUDY'].value_counts().head(5).index
filtered_df = df[df['BEN_PFIELD_OF_STUDY'].isin(top_5_fields)]
gender_breakdown = filtered_df.groupby(['BEN_PFIELD_OF_STUDY', 'gender']).size().unstack(fill_value=0).reset_index()
gender_breakdown_melted = gender_breakdown.melt(id_vars='BEN_PFIELD_OF_STUDY', var_name='gender', value_name='Count')

fig = px.bar(gender_breakdown_melted,
             x='BEN_PFIELD_OF_STUDY',
             y='Count',
             color='gender',
             barmode='group',
             labels={'BEN_PFIELD_OF_STUDY': 'Field of Study'},
             title='H-1B Visa Selection - Gender Breakdown of Top 5 Fields of Study')

fig.show()

Things to consider:

  • can you improve the sample figure built?
  • would a different figure tell the data story better?
  • what other columns in the data set can you explore?

Participation Instructions:

  • Create - use the weekly data set to build your own Plotly visualization or Dash app. Or, enhance the sample figure provided in this post, using Plotly or Dash.
  • Submit - post your creation to LinkedIn or Twitter with the hashtags #FigureFriday and #plotly by midnight Thursday, your time zone. Please also submit your visualization as a new post in this thread.
  • Celebrate - join the Figure Friday sessions to showcase your creation and receive feedback from the community.

:point_right: If you prefer to collaborate with others on Discord, join the Plotly Discord channel.

Thank you to BloombergGraphics for the data.

I am convinced there are many ways to improve the sample plot, but before posting any plot, I want to raise some issues I found in the dataset:

  • WAGE_AMT and BEN_COMP_PAID are missing in all but 47 records.
  • bcn, a column that could serve as an identifier (and could be obfuscated to prevent personal information from being extracted), contains only two values, so it cannot be used to differentiate or group by beneficiary.
  • ben_multi_reg_ind can be used to identify beneficiaries that appear more than once in an H-1B application, but it cannot be used to remove duplicates.
  • BEN_PFIELD_OF_STUDY, which holds the beneficiary’s field of study, contains inconsistent entries, for example: COMPUTER SCIENCE, COMP SCI, CS, COMPUTER SC, etc.

I believe the dataset needs to be cleaned before doing any plotting. What do you think?
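
A quick audit along these lines can confirm issues like the ones listed in this post before any cleaning starts. This is only a sketch on a toy frame: the column names come from the thread, but every row (including the two `bcn` values) is an invented placeholder, not real data.

```python
import pandas as pd

# Toy stand-in for the H-1B file; all values are invented for illustration.
df = pd.DataFrame(
    {
        "WAGE_AMT": [None, 85000.0, None, None],
        "BEN_COMP_PAID": [None, 90000.0, None, None],
        "bcn": ["x", "x", "y", "x"],  # placeholder identifier values
        "BEN_PFIELD_OF_STUDY": ["COMPUTER SCIENCE", "COMP SCI", "CS", "COMPUTER SC"],
    }
)

# One row per column: how many values are present, missing, and distinct.
audit = pd.DataFrame(
    {
        "non_null": df.notna().sum(),
        "missing": df.isna().sum(),
        "unique": df.nunique(),  # nunique() ignores missing values by default
    }
)
print(audit)
```

On the real file, the same three counts would show at a glance which columns are mostly empty and which have far more distinct values than expected.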

3 Likes

I think that’s a good idea, @hebertodelrio .

Do you already know which graph you’d like to use or were you planning to clean the data first and then figure out the plotting process?

1 Like

I just finished cleaning up ‘BEN_PFIELD_OF_STUDY’, which contained 13,460 unique values; I brought that down to 7,484, out of a total of 269,377 records. I plan to explore the single applications, those whose ‘ben_multi_reg_ind’ is not 1.
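
For anyone who wants to try a similar cleanup, here is a minimal sketch of one possible approach: normalize case, punctuation, and whitespace, then map known variants to one canonical name. The `ALIASES` table and the `clean_field` helper are hypothetical; this is not necessarily the method used for the counts quoted here, and a real alias table would have to be built by inspecting the data.

```python
import pandas as pd

# Hypothetical alias table mapping variants to one canonical spelling.
ALIASES = {
    "COMP SCI": "COMPUTER SCIENCE",
    "CS": "COMPUTER SCIENCE",
    "COMPUTER SC": "COMPUTER SCIENCE",
}

def clean_field(series: pd.Series) -> pd.Series:
    """Upper-case, strip punctuation, collapse spaces, then apply ALIASES."""
    normalized = (
        series.str.upper()
        .str.replace(r"[^A-Z0-9 ]", " ", regex=True)  # punctuation -> space
        .str.replace(r"\s+", " ", regex=True)         # collapse repeated spaces
        .str.strip()
    )
    # Values without an alias are kept unchanged; missing values are skipped.
    return normalized.map(lambda v: ALIASES.get(v, v), na_action="ignore")

raw = pd.Series(["Computer Science", "comp. sci", "CS", " COMPUTER  SC ", "Physics"])
cleaned = clean_field(raw)
print(raw.nunique(), "->", cleaned.nunique())
```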

I also used the I-129 Dictionary of Occupational Titles (DOT) Codes to convert the ‘DOT_CODE’ column from a floating-point number (it should be a 3-digit code) to an occupation group.
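
The float-to-code conversion could look roughly like the sketch below. The `DOT_GROUPS` lookup and the `to_dot_code` helper are hypothetical, and the group labels are placeholders rather than real DOT entries; the real lookup would come from the I-129 DOT code list.

```python
import pandas as pd

# Placeholder lookup: these labels are NOT real DOT entries.
DOT_GROUPS = {
    "030": "group A (placeholder)",
    "161": "group B (placeholder)",
}

def to_dot_code(value):
    """Turn a float such as 30.0 into the 3-digit text code '030'."""
    if pd.isna(value):
        return None
    return f"{int(value):03d}"

raw_codes = pd.Series([30.0, 161.0, None, 30.0, 7.0])
dot_codes = raw_codes.map(to_dot_code)

# Codes that are missing or not in the lookup end up as "Unknown".
groups = dot_codes.map(DOT_GROUPS).fillna("Unknown")
print(groups.value_counts())
```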

I would have loved to transform the ‘NAICS_CODE’ column, which contains the industry classification, but unfortunately that is a bit complicated.

My intention is to compare industry with occupation group.

4 Likes

Hi, what @hebertodelrio mentioned is true. I think that analyzing these datasets is complex, but I propose a couple of visualizations that would allow for a general overview of the employers, the volume of requests per year, and the origin of the beneficiaries.



Application code

3 Likes

I’ve kept within @adamschroeder's overall theme of a gender breakdown of different fields, but given the extremely messy state of ‘BEN_PFIELD_OF_STUDY’ that @hebertodelrio also notes, I’ve used the ‘DOT_CODE’ occupation codes, which limits the analysis to successful applicants. Rather than use a top_n approach, I transformed ‘DOT_CODE’ to occupation categories using the Job Codes sheet that Bloomberg supplied, and collapsed all but the two largest of these job categories. I think tidying the data this way reveals an interesting trend.
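
The collapsing step can be sketched as follows. This is an illustration on invented rows, not the code behind the figure: the `occupation` column name, the category names, and the gender labels are all assumptions.

```python
import pandas as pd

# Toy data, invented for illustration.
df = pd.DataFrame(
    {
        "occupation": ["Computer"] * 6 + ["Architecture"] * 3 + ["Medicine"] * 2 + ["Law"],
        "gender": [
            "male", "male", "male", "male", "female", "female",  # Computer
            "male", "female", "female",                          # Architecture
            "female", "male",                                    # Medicine
            "female",                                            # Law
        ],
    }
)

# Keep the two largest categories and collapse everything else.
top2 = df["occupation"].value_counts().nlargest(2).index
df["occupation_group"] = df["occupation"].where(
    df["occupation"].isin(top2), "Other occupations"
)

# Counts per group and gender, plus row-wise percentages.
counts = df.groupby(["occupation_group", "gender"]).size().unstack(fill_value=0)
shares = counts.div(counts.sum(axis=1), axis=0) * 100
print(counts)
print(shares.round(1))
```

Keeping both `counts` and `shares` makes it possible to plot proportions while still showing the absolute numbers as labels or hover data.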

As I couldn’t include hover templates in the PNG, I’ve included the counts for each group as data labels.

Figure Friday is a great idea - thanks for posing an interesting challenge!

3 Likes

Update Sept 28:
Added visualizations to show H-1B visa approvals in a location-approximate grid, inspired by contributions from @empet. One of these visualizations uses go.Scatter, the other uses go.Bar. The data is normalized to show H-1B visa approvals per 1 million residents of each state, with 2020 census population data from Kaggle.

These updates are added to the previous code that produced choropleth maps with hover data of states and years.

Here are 2 screenshots of the new visualizations:

image

image

These screenshots were included in my original post for week 38:

Here is the code:

import polars as pl
import polars.selectors as cs
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import numpy as np
import us

#------------------------------------------------------------------------------#
#     with us library, make dataframe of state abbreviations and names         #
#------------------------------------------------------------------------------#
df_state_names = (
    pl.concat(  # use concat to add a row for Washington DC
        [
            pl.DataFrame(us.states.mapping('abbr', 'name'))
            .transpose(include_header=True)
            .rename({'column': 'STATE_ABBR', 'column_0': 'STATE'})
            ,
            pl.DataFrame(
                {
                    'STATE_ABBR' : 'DC',
                    'STATE' : 'Washington DC'
                }
            )
        ]
    )
    .sort('STATE_ABBR')
)

#------------------------------------------------------------------------------#
#     use kaggle data for state population, join with state names              #
#------------------------------------------------------------------------------#
df_state_population = (
    pl.read_csv('kaggle_us_pop_by_state.csv')
    .rename({'state_code':'STATE_ABBR', '2020_census':'POPULATION'})
    .drop_nulls('rank')
    .select(pl.col('STATE_ABBR', 'POPULATION'))
    .sort('STATE_ABBR')
)
df_state_names = (
    df_state_names
    .join(
        df_state_population,
        on='STATE_ABBR',
        how='left'
    )
)

#------------------------------------------------------------------------------#
#     Row and Col #s used with plotly make_subplots. The long rows in          #
#     this dataframe definition align with columns in the subplots of states   #
#------------------------------------------------------------------------------#
df_state_xy = (
    pl.DataFrame(
        {
            'INFO'  : [
                ['AK', 1, 1], ['WA', 3, 1],   ['OR', 4, 1],  ['CA', 5, 1], ['HI', 8, 1] ,
                ['ID', 3, 2],  ['NV', 4, 2],  ['UT', 5, 2],  ['AZ', 6, 2],
                ['MT', 3, 3],  ['WY', 4, 3],  ['CO', 5, 3],  ['NM', 6, 3],
                ['ND', 3, 4],  ['SD', 4, 4],  ['NE', 5, 4],  ['KS', 6, 4], ['OK', 7, 4], ['TX', 8, 4],
                ['MN', 3, 5],  ['IA', 4, 5],  ['MO', 5, 5],  ['AR', 6, 5], ['LA', 7, 5], 
                ['WI', 2, 6],  ['IL', 3, 6],  ['IN', 4, 6],  ['KY', 5, 6], ['TN', 6, 6], ['MS', 7, 6],
                ['MI', 3, 7],  ['OH', 4, 7],  ['WV', 5, 7],  ['SC', 6, 7], ['AL', 7, 7],
                ['PA', 4, 8],  ['VA', 5, 8],  ['NC', 6, 8],  ['GA', 7, 8],  
                ['NY', 3, 9],  ['NJ', 4, 9],  ['MD', 5, 9],  ['DC', 6, 9], ['FL', 8, 9],
                ['VT', 2, 10], ['MA', 3, 10], ['CT', 4, 10], ['DE', 5, 10],
                ['ME', 1, 11], ['NH', 2, 11], ['RI', 4, 11]
            ],
        },
        strict=False
    )
    # next 3 lines unpack the list column, followed by list column deletion
    .with_columns(STATE_ABBR = pl.col('INFO').list.get(0))
    .with_columns(ROW = pl.col('INFO').list.get(1).cast(pl.UInt8))
    .with_columns(COL = pl.col('INFO').list.get(2).cast(pl.UInt8))
    .drop('INFO')
)

#------------------------------------------------------------------------------#
#     load data provided for this exercise into polars lazy frames             #
#------------------------------------------------------------------------------#
df_2021 = (
    pl.scan_csv('./Data_Set/TRK_13139_FY2021.csv',ignore_errors=True)
    .filter(pl.col('FIRST_DECISION').str.to_uppercase() ==  'APPROVED')
    .with_columns(YEAR = pl.lit('2021'))
    .with_columns(pl.col('WAGE_AMT').cast(pl.Int64))
    .with_columns(pl.col('BEN_COMP_PAID').cast(pl.Int64))     
)

df_2022 = (
    pl.scan_csv('./Data_Set/TRK_13139_FY2022.csv',ignore_errors=True)
    .filter(pl.col('FIRST_DECISION').str.to_uppercase() ==  'APPROVED')
    .with_columns(YEAR = pl.lit('2022'))
    .with_columns(pl.col('WAGE_AMT').cast(pl.Int64))
    .with_columns(pl.col('BEN_COMP_PAID').cast(pl.Int64))    
)

df_2023 = (
    pl.scan_csv('./Data_Set/TRK_13139_FY2023.csv',ignore_errors=True)
    .filter(pl.col('FIRST_DECISION').str.to_uppercase() ==  'APPROVED')
    .with_columns(YEAR = pl.lit('2023'))
    .with_columns(pl.col('WAGE_AMT').cast(pl.Int64))
    .with_columns(pl.col('BEN_COMP_PAID').cast(pl.Int64))    
)

df_2024_multi = (
    pl.scan_csv('./Data_Set/TRK_13139_FY2024_multi_reg.csv',ignore_errors=True)
    .filter(pl.col('FIRST_DECISION').str.to_uppercase() ==  'APPROVED')
    .with_columns(YEAR = pl.lit('2024_MULTI'))
    .with_columns(pl.col('WAGE_AMT').cast(pl.Int64))
    .with_columns(pl.col('BEN_COMP_PAID').cast(pl.Int64))    
)

df_2024_single = (
    pl.scan_csv('./Data_Set/TRK_13139_FY2024_single_reg.csv',ignore_errors=True)
    .filter(pl.col('FIRST_DECISION').str.to_uppercase() ==  'APPROVED')
    .with_columns(YEAR = pl.lit('2024_SINGLE'))
    .with_columns(pl.col('WAGE_AMT').cast(pl.Int64))
    .with_columns(pl.col('BEN_COMP_PAID').cast(pl.Int64))    
)

#------------------------------------------------------------------------------#
#     convert lazy frames to dataframes within the concat block                #
#------------------------------------------------------------------------------#
df_all = (
    pl.concat(
        [
            df_2021.collect(),
            df_2022.collect(),
            df_2023.collect(),
            df_2024_multi.collect(),
            df_2024_single.collect(),

        ],
    )
)

#------------------------------------------------------------------------------#
#     Group by state, calc total of each year, get state name from abbr        #
#------------------------------------------------------------------------------#
df_by_state = (
    df_all
    .rename({'state': 'STATE_ABBR'})
    .group_by('STATE_ABBR', 'YEAR')
    .agg(pl.len())
    .pivot(on='YEAR', index='STATE_ABBR')
    # fill nulls before summing, so a state missing from one 2024 file keeps its other count
    .fill_null(strategy="zero")
    .with_columns(TOT_2024 = (pl.col('2024_SINGLE') + pl.col('2024_MULTI')))
    .drop('2024_SINGLE', '2024_MULTI')
    .rename({'TOT_2024':'2024'})
    .with_columns(TOTAL = pl.sum_horizontal(cs.integer()))
    
    .join(
        df_state_names,
        on='STATE_ABBR',
        how='left'
    )
    .select(pl.col('STATE', 'STATE_ABBR', '2021', '2022', '2023', '2024', 'TOTAL' ))
    .sort('STATE')
)

#------------------------------------------------------------------------------#
#     Assemble hover data                                                      #
#------------------------------------------------------------------------------#
customdata=np.stack(
    (
        df_by_state['STATE'],   #  customdata[0]
        df_by_state['2021'],    #  customdata[1]
        df_by_state['2022'],    #  customdata[2]
        df_by_state['2023'],    #  customdata[3]
        df_by_state['2024'],    #  customdata[4]
        df_by_state['TOTAL'],   #  customdata[5]
        ), 
        axis=-1
    )

#------------------------------------------------------------------------------#
#     choropleth of total approvals by state, hover shows per-year counts      #
#------------------------------------------------------------------------------#

fig = px.choropleth(
    df_by_state,
    locationmode='USA-states',
    locations='STATE_ABBR',
    color='TOTAL',
    scope="usa",
    custom_data=['STATE', '2021', '2022', '2023', '2024', 'TOTAL',],
)
#------------------------------------------------------------------------------#
#     Update trace with hovertemplate                                          #
#------------------------------------------------------------------------------#
fig.update_traces(
    hovertemplate =
        '%{customdata[0]}<br>' + 
        '2021:   %{customdata[1]:,}<br>' +
        '2022:   %{customdata[2]:,}<br>' + 
        '2023:   %{customdata[3]:,}<br>' +
        '2024:   %{customdata[4]:,}<br>' + 
        'TOTAL:   %{customdata[5]:,}<br>' +
        '<extra></extra>'
)
fig.update_layout(
    margin={"r":1, "t":1, "l":1, "b":1},
    showlegend=False
    )

fig.write_html('Immigration_Map_by_State_Map.html')
fig.show()

#------------------------------------------------------------------------------#
#     Prep for subplot display, with go.Scatter of each state                  #
#------------------------------------------------------------------------------#
df_by_state = (
    df_all
    .rename({'state': 'STATE_ABBR'})
    .group_by('STATE_ABBR', 'YEAR')
    .agg(pl.len())
    .pivot(on='YEAR', index='STATE_ABBR')
    # fill nulls before summing, so a state missing from one 2024 file keeps its other count
    .fill_null(strategy="zero")
    .with_columns(TOT_2024 = (pl.col('2024_SINGLE') + pl.col('2024_MULTI')))
    .drop('2024_SINGLE', '2024_MULTI')
    .rename({'TOT_2024':'2024'})
    .join(
        df_state_names,
        on='STATE_ABBR',
        how='left'
    )
    # calculate totals per 1 Million residents of each state
    .with_columns(PER_M_2021 = 1e6*  pl.col('2021')/pl.col('POPULATION'))
    .with_columns(PER_M_2022 = 1e6 * pl.col('2022')/pl.col('POPULATION'))
    .with_columns(PER_M_2023 = 1e6 * pl.col('2023')/pl.col('POPULATION'))
    .with_columns(PER_M_2024 = 1e6 * pl.col('2024')/pl.col('POPULATION'))

    .select(pl.col('STATE', 'STATE_ABBR', 'PER_M_2021', 'PER_M_2022', 'PER_M_2023', 'PER_M_2024'))
    .rename({'PER_M_2021' : '2021', 'PER_M_2022': '2022', 'PER_M_2023': '2023', 'PER_M_2024': '2024' })
    .sort('STATE')
    .drop('STATE')
    .transpose(include_header=True,column_names='STATE_ABBR',header_name='YEAR')
    .with_columns(pl.col('YEAR').cast(pl.UInt16))
)

#------------------------------------------------------------------------------#
#     Make subplots by state using go.Scatter                                  #
#------------------------------------------------------------------------------#
fig = make_subplots(rows=8, cols=11)
for state in df_state_xy['STATE_ABBR']:
    if True:  # state not in ['DC']:
        my_row = df_state_xy.filter(pl.col('STATE_ABBR') == state)['ROW'][0]
        my_col = df_state_xy.filter(pl.col('STATE_ABBR') == state)['COL'][0]
        # add_trace replaces the deprecated append_trace
        fig.add_trace(go.Scatter(x=df_by_state['YEAR'], y=df_by_state[state]), row=my_row, col=my_col)
        fig.update_xaxes(showgrid=False,row=my_row, col=my_col)
        fig.update_yaxes(range=[-100,1200],row=my_row, col=my_col)
        fig.add_annotation(
            xref='x domain',
            yref='y domain', 
            showarrow=False, 
            x=0.5, 
            y=1.2, 
            text='<b>' + state + '</b>', 
            row=my_row, 
            col=my_col,
            )
        if state in ['AK','AZ', 'WI', 'ME', 'VT', 'WA', 'OR', 'CA', 'OK', 'TX', 'HI',  'FL', 'NY']:
            fig.update_yaxes(showticklabels=True, row=my_row, col=my_col,)
        else:
            fig.update_yaxes(showticklabels=False, row=my_row, col=my_col,)

my_title = 'Approved H-1B Visas per Million Residents, 2021 to 2024'
my_title += '<br><sup>USCIS data from Bloomberg, 2020 Census population data from Kaggle</sup>'
fig.update_layout(
    title = my_title, 
    template='plotly_white',
    showlegend=False, 
    height=700, 
    width=1000
    )
fig.update_xaxes(showticklabels=False)
fig.show()

#------------------------------------------------------------------------------#
#     Make subplots by state using go.Bar                                      #
#------------------------------------------------------------------------------#
fig = make_subplots(rows=8, cols=11)
for state in df_state_xy['STATE_ABBR']:
    if True:  # state not in ['DC']:
        my_row = df_state_xy.filter(pl.col('STATE_ABBR') == state)['ROW'][0]
        my_col = df_state_xy.filter(pl.col('STATE_ABBR') == state)['COL'][0]
        # add_trace replaces the deprecated append_trace
        fig.add_trace(go.Bar(x=df_by_state['YEAR'], y=df_by_state[state]), row=my_row, col=my_col)
        fig.update_xaxes(showgrid=False,row=my_row, col=my_col)
        fig.update_yaxes(range=[-100,1200],row=my_row, col=my_col)
        fig.add_annotation(
            xref='x domain',
            yref='y domain', 
            showarrow=False, 
            x=0.5, 
            y=1.2, 
            text='<b>' + state + '</b>', 
            row=my_row, 
            col=my_col,
            )
        if state in ['AK','AZ', 'WI', 'ME', 'VT', 'WA', 'OR', 'CA', 'OK', 'TX', 'HI',  'FL', 'NY']:
            fig.update_yaxes(showticklabels=True, row=my_row, col=my_col,)
        else:
            fig.update_yaxes(showticklabels=False, row=my_row, col=my_col,)

fig.update_layout(
    title = my_title, 
    template='plotly_white',
    showlegend=False, 
    height=700, 
    width=1000
    )
fig.update_xaxes(showticklabels=False)
fig.show()
4 Likes

Awesome visualizations, @U-Danny. I like those arrows at the top. Never seen that type before.
Is this the respective code: className="fa-solid fa-circle-arrow-left fa-2xl text-dark" ?

1 Like

Nice one, @Mike_Purtell
I know you said you plan to work on the state colors. One interesting idea could be to have each state’s color represent the share (as a percentage) of approved visa applications. That is, the percentage would be the number of approved records divided by the state’s population or workforce.
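
As a rough sketch of that calculation, assuming one table of approved counts and one of population per state (all state codes and numbers below are invented placeholders):

```python
import pandas as pd

# Toy numbers, invented for illustration only.
approved = pd.DataFrame(
    {"STATE_ABBR": ["AA", "BB", "CC"], "APPROVED": [500, 1200, 30]}
)
population = pd.DataFrame(
    {"STATE_ABBR": ["AA", "BB"], "POPULATION": [1_000_000, 4_000_000]}
)

# A left join keeps states that have no population row (their share is NaN).
merged = approved.merge(population, on="STATE_ABBR", how="left")
merged["APPROVED_PCT"] = 100 * merged["APPROVED"] / merged["POPULATION"]
print(merged)
```

The resulting `APPROVED_PCT` column could then be passed as the `color` argument of the choropleth in place of the raw total.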

1 Like

Hi @adamschroeder, yes, they are Font Awesome icons. The design is simple and lets you add independent visualizations while maintaining the concept of “stories with data”, which I find interesting for an overview of the data.

3 Likes

Hi :grinning:
I’m glad to join the Figure Friday challenge!
My graph is based on the columns ‘i129_employer_name’ and ‘gender’.
The number of records to build the graph is 99,984 which corresponds to the total number of the receipts (column ‘RECEIPT_NUMBER’).

newplot (7)

code on GitHub

4 Likes

I agree with @adamschroeder - I really like the clean design of this and the toggles on the top :star2:

For the scatter chart - it looks great, but I found it a bit tricky to understand/analyze, especially if the image is static. Most of the bubbles are small and crowded below 1,000 registrations, which makes it hard to read. Given that the employers are only visible on hover, this visual might not be suitable if you also want to use it as a static image. So depending on what you want to highlight, maybe a bar chart could make things clearer by showing the top N employers based on registration numbers :slight_smile: You could also show the bottom N employers the same way, if that’s important. This might make the data easier to understand and really highlight the key info.

But really amazing job! :rocket:

2 Likes

Looks like @adamschroeder wasn’t just aiming to challenge our data viz skills this week, but also our data wrangling abilities :laughing:. Great job spotting all these issues and teaching us about the potential pitfalls if we don’t address them!

2 Likes

Hey @_dave - I really like your idea of aggregating the data and grouping the rest under “Other occupations.” It does a great job of highlighting the top 2 categories if that’s the story you want to tell. I also appreciate your consideration of the downsides of static images.

However, I was wondering if it might be clearer to place the totals for male/female separately next to the chart, or to use percentage labels instead of absolute numbers. Readers typically scan from left to right and top to bottom, so I initially saw the labels and wondered why computer-related occupations appeared in similar length as architecture, even though the total numbers are quite different (46,533 vs. 8,895). It was only after noticing the x-axis scale that I understood the chart was using percentages, with absolute numbers provided as additional info. To avoid this confusion, displaying percentages as labels might be more effective. If you still want to show the absolute numbers, you could include them subtly next to the percentages :slight_smile:
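
One way to build such labels is sketched below, with the percentage first and the absolute count in parentheses. The group names and counts are invented, and `make_label` is a hypothetical helper.

```python
import pandas as pd

# Toy counts per occupation group and gender, invented for illustration.
counts = pd.DataFrame(
    {"female": [3000, 450], "male": [7000, 550]},
    index=["Group A", "Group B"],
)
shares = counts.div(counts.sum(axis=1), axis=0) * 100

def make_label(pct, n):
    """Percentage first, absolute count as secondary information."""
    return f"{pct:.0f}% ({int(n):,})"

labels = pd.DataFrame(
    {
        col: [make_label(p, n) for p, n in zip(shares[col], counts[col])]
        for col in counts.columns
    },
    index=counts.index,
)
print(labels)
```

Each column of `labels` could then be passed as the `text` of the matching bar trace.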

By the way, if you ever want to share your Plotly chart with all its interactive features, pyCafe could be a great option. I have an example here: PyCafe - Solara - Interactive Iris Plot

Code: PyCafe - Solara - Interactive Iris Plot

While they don’t natively support single Plotly charts yet, you can make it work by embedding it in a solara component. It works like a charm!

2 Likes

These look fantastic, @natatsypora! :rocket: I love how you focused on gender distribution and kept your color choices consistent. The bar chart is especially intriguing—I’ve never seen this kind of implementation in Plotly before. It makes it easy to compare the shares for females and males since both start at the baseline of 0, which is usually not the case for a typical stacked bar chart.

I have one optional suggestion to enhance your narrative: you could add a reference line, such as the average female share (33%), to your bar charts if you want to further emphasize the gender story. This would allow for a comparison between individual employers and the average. However, this is just a suggestion. Your chart is already tidy and easy to understand, and adding a reference line might introduce unnecessary clutter depending on the story you want to tell :slight_smile:

3 Likes

Hi, thanks for your observations. After a closer review, I have changed the information shown as well as the graph.



1 Like

Thanks @li.nguyen - that’s great feedback! To be honest, the data labels were a quick afterthought when I realised I couldn’t embed an html here. I agree they should be elsewhere. The confusion they create actually reflects a bigger underlying issue though that I couldn’t solve with the visualisation - of how to communicate absolute numbers and proportions effectively in the one visualisation. The message I saw in the data was the skew in gender proportions was driven by the two largest occupation groups - to show that really well feels like it needs both proportions and counts. Thanks also for the suggestion of PyCafe - I’ll check it out!

2 Likes

This looks fantastic and neat! :star2: You could consider replacing the line chart with a stacked area chart to differentiate between selected and non-selected. This might make it clearer that selected plus non-selected equals the total beneficiaries. :blush:

I truly love the aesthetics of your chart though. It’s excellent for storytelling through data visualization and reminds me of Tableau’s story feature. I haven’t seen this done in Plotly before, so it’s really nice! I’ve bookmarked your post :bookmark:

1 Like

Nice work @_dave. In MS Windows, you can take a screenshot of the active window with the hover data visible by pressing Alt+Print Screen, then paste it somewhere else with Ctrl+V. I usually paste these into MS Word to trim the edges before sharing with this community. The limitation is that you can only show one set of hover data at a time.

I am a big fan of customized hover info. It makes a big difference when sharing plotly-generated html files with colleagues.

1 Like

Thanks @Mike_Purtell for the handy tips! I’ll give that a try next time. And yes, you can do a lot with a good hover! Nice work also on your visualisations!