dcc.Upload to handle a Word docx file

vnavdulov · February 2, 2023, 4:58am

Hello all, I have a number of Word files, transcripts from MS Teams. I have a script that parses them into a pandas df and breaks it down by speaker, text, and time. I would like to use dcc.Upload to upload, parse, store, and extract some insight with an NLP pipeline from these transcripts. The example provided by Plotly, naturally, deals with the csv and xlsx formats. Any ideas on how I would approach this? Any help is much appreciated!

def get_data_from_word(path_to_file):
    from docx import Document

    # Open the Word file in binary mode; the context manager closes it afterwards
    with open(path_to_file, "rb") as doc_object:
        # Create the Word reader object
        doc_reader = Document(doc_object)

    # Collect the text of every paragraph, one per line
    data = ""
    for p in doc_reader.paragraphs:
        data += p.text + "\n"

    return data


def get_csv(paragraphs):
    # Each paragraph is a [time, speaker, text] list
    combined_paragraphs = []
    speaker_text = []

    for x in range(len(paragraphs)):
        speaker = paragraphs[x][1]
        speaker_text.append(paragraphs[x][2])

        # The last paragraph has no successor, so it always closes the current run
        is_last = x == len(paragraphs) - 1
        if is_last or paragraphs[x + 1][1] != speaker:
            text = ''.join(speaker_text)
            combined_paragraphs.append([speaker, text])
            speaker_text = []

    return combined_paragraphs
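
For context on the upload side: dcc.Upload passes the file to the callback as a `contents` string of the form "data:<mime type>;base64,<payload>", not as a path on disk, so a reader built around `open(path_to_file)` needs a file-like object instead. A minimal sketch of that decoding step, using only the standard library (the helper name `upload_to_buffer` is made up for illustration):

```python
import base64
import io


def upload_to_buffer(contents):
    """Turn a dcc.Upload `contents` string into an in-memory binary file."""
    # Everything after the first comma is the base64-encoded file;
    # the part before it is the data-URL header with the MIME type.
    _header, _, payload = contents.partition(",")
    return io.BytesIO(base64.b64decode(payload))
```

The returned buffer can then be handed to `docx.Document(...)`, which accepts a file-like object as well as a path.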

AIMPED · February 2, 2023, 8:42pm

Hi @vnavdulov, where exactly do you need help?

vnavdulov · February 2, 2023, 9:04pm

Hello @AIMPED, I ended up figuring it out. Thank you! If anyone wants to import an MS Teams transcript into a Plotly Dash app and do some analysis on it, here is what I got. It parses the Word file into time, speaker, and text columns. I’m working on a function that will combine the text per speaker, since Teams supplies multiple lines for a speaker before the next one speaks (for whatever reason).

        elif 'docx' in filename:
            # Imports kept local to this branch; they can also live at module level
            import docx
            import pandas as pd

            try:
                # Get the text from the MS Word file
                file = io.BytesIO(decoded)
                doc = docx.Document(file)
                text = ""
                for para in doc.paragraphs:
                    text += para.text + "\n"

                # Parse the MS Teams transcript: every three lines form one
                # (time, speaker, text) record; leftover lines are dropped
                file_data = text.split('\n')
                paragraphs = [
                    file_data[i * 3:i * 3 + 3]
                    for i in range(len(file_data) // 3)
                ]
                df = pd.DataFrame(paragraphs)
                df.columns = ['time', 'speaker', 'text']

            except Exception as e:
                print(e)
                return html.Div([
                    'There was an error parsing this file.'
                ])
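
The per-speaker merge mentioned above could be sketched roughly like this. It assumes rows shaped like the [time, speaker, text] records built in the snippet; the function name `merge_consecutive_speakers` and the sample rows are made up for illustration, and joining with a space is just one possible choice.

```python
from itertools import groupby


def merge_consecutive_speakers(rows):
    """Collapse runs of consecutive rows by the same speaker into one row."""
    merged = []
    # groupby only groups adjacent rows, so a speaker who talks again later
    # starts a new run instead of being merged with the earlier one.
    for speaker, run in groupby(rows, key=lambda row: row[1]):
        run = list(run)
        start_time = run[0][0]  # keep the timestamp of the first line in the run
        text = " ".join(row[2] for row in run)
        merged.append([start_time, speaker, text])
    return merged


# Made-up sample rows in [time, speaker, text] order
rows = [
    ["0:00:01", "Alice", "Hello everyone."],
    ["0:00:04", "Alice", "Shall we start?"],
    ["0:00:09", "Bob", "Yes, go ahead."],
    ["0:00:12", "Alice", "Great."],
]
merged_rows = merge_consecutive_speakers(rows)
```

With the DataFrame from the snippet, the rows could come from `df.values.tolist()`, and the result can be turned back into a frame with `pd.DataFrame(merged_rows, columns=['time', 'speaker', 'text'])`.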

Lesson learned: start with the documentation, then go to ChatGPT.
