dcc.Upload PDF file

Hi there! I’m struggling with the following situation:

I have a dcc.Upload that is meant to receive a PDF file. I want to parse it with the camelot Python module to extract the tables in the PDF and then convert them to pandas DataFrames to create graphs…

I have this function, which is part of the Dash layout:


def GetInput3():
    # UI text is in Spanish: "Drag and drop or select the credit bureau report in PDF format"
    return html.Div([
        dcc.Upload(
            id='upload-data3',
            children=html.Div(id='drag_drop', children=[
                'Arrastra y suelta o ',
                html.A('selecciona el buró de crédito en formato PDF')
            ], style={'color': 'white'}),
            multiple=True
        ),
    ])

The drag-and-drop upload is meant to yield the tables extracted by the camelot module, so I defined a function that tries to get the main table and turn it into a DataFrame. I guess the problem is here, mainly in the way I decode the PDF; the same approach works for me in another process that handles Excel or CSV files, but not PDF files.

def parse_contents3(contents, filename, date):
    content_type, content_string = contents.split(',')

    decoded = base64.b64decode(content_string)
    try:
        if 'pdf' in filename:
            # Assume that the user uploaded a PDF file
            try:
                tables = camelot.read_pdf(io.StringIO(
                    decoded.decode('utf-8')), pages='all')
            except Exception as e:
                print(e)
                return html.Div([
                    'There was an error processing this file.'])
        else:
            return html.Div([
                'Try to set a PDF file.'
            ])

    except Exception as e:
        print(e)
        return html.Div([
            'There was an error processing this file.'
        ])

This function is called by the callback that stores the main DataFrame, which is going to be the source for several graphs in the layout:

@app.callback(Output('c-store3', 'data'),
              [Input('upload-data3', 'contents')],
              [State('upload-data3', 'filename'),
               State('upload-data3', 'last_modified')])
def update_output(list_of_contents, list_of_names, list_of_dates):
    if list_of_contents is not None:
        children = [
            parse_contents3(c, n, d) for c, n, d in
            zip(list_of_contents, list_of_names, list_of_dates)]

        return children
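
One thing to keep in mind with this callback, offered as a sketch rather than a confirmed diagnosis: as written, `parse_contents3` does not return the extracted tables on success, and the `data` property of a `dcc.Store` needs to be JSON-serializable, which a pandas DataFrame is not by itself. Assuming each extracted table exposes a DataFrame as `.df` (as camelot tables do), a hypothetical helper such as `tables_to_store_data` could convert them to lists of row dictionaries before they are returned to the store:

```python
def tables_to_store_data(tables):
    """Convert camelot-style tables into JSON-serializable lists of row dicts.

    Hypothetical helper: each element of `tables` is assumed to expose
    a pandas DataFrame through a `.df` attribute.
    """
    # to_dict("records") yields one plain dict per row, which JSON can encode.
    return [table.df.to_dict("records") for table in tables]
```

In a downstream callback, each list of records can be turned back into a DataFrame with `pd.DataFrame(records)`.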

Where do you see the issue?
Thanks in advance… To repeat, this process works for me with a CSV or XLSX file, but it didn’t work with a PDF file.
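
A likely cause, offered as a sketch under assumptions rather than a confirmed diagnosis: a PDF is binary, so `decoded.decode('utf-8')` will usually raise a `UnicodeDecodeError`, and `camelot.read_pdf` expects a file path rather than a file-like object such as `io.StringIO`. One workaround is to write the decoded bytes to a temporary file and pass its path to camelot. The helper names `save_upload_to_tempfile` and `read_tables_from_upload` are hypothetical.

```python
import base64
import os
import tempfile


def save_upload_to_tempfile(contents, suffix=".pdf"):
    """Decode a dcc.Upload `contents` string and write the raw bytes to a temp file.

    Returns the path of the temporary file; the caller should delete it when done.
    """
    # dcc.Upload delivers "data:<mime>;base64,<payload>"; split on the first comma only.
    _content_type, content_string = contents.split(",", 1)
    decoded = base64.b64decode(content_string)  # keep as bytes, no .decode('utf-8')

    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(decoded)
    return path


def read_tables_from_upload(contents, filename):
    """Return a list of DataFrames extracted by camelot, or None if not a PDF."""
    if not filename.lower().endswith(".pdf"):
        return None

    import camelot  # imported lazily so the helper above works without camelot

    path = save_upload_to_tempfile(contents)
    try:
        tables = camelot.read_pdf(path, pages="all")
        return [table.df for table in tables]
    finally:
        os.remove(path)  # clean up the temporary file
```

Inside `parse_contents3`, the `camelot.read_pdf(io.StringIO(...))` call could then be replaced by `read_tables_from_upload(contents, filename)`, with the resulting DataFrames returned on success.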

Hi @vnavdulov

This is a great article! Lots of useful examples, including how to get metadata from a PDF. Nice summary at the end too:

Conclusion

Today we looked at one direction a data scientist can take to upload a PDF to a Plotly dashboard and display certain content to a user. This can be super useful for a person who has to pull specific information from PDFs (e.g. customer email, name, and phone number) and needs to analyze that information. While this dashboard is basic, it can be tweaked for different styles of PDFs. For example, maybe you want to create a parsing dashboard that displays the citations in a research paper PDF you are reading so you can either save or use those citations to find other papers. Try this code out today and if you add any cool functions, let me know!
