dcc.Upload to handle a Word docx file

Hello all, I have a number of Word files, transcripts from MS Teams. I have a script that parses them into a pandas df and breaks it down by speaker, text, and time. I would like to use dcc.Upload to upload, parse, store, and extract some insight with an NLP pipeline from these transcripts. The example provided by Plotly, naturally, deals with the csv and xlsx formats. Any ideas on how I would approach this? Any help is much appreciated!

def get_data_from_word(path_to_file):
    """Return the text of a .docx file, one paragraph per line."""
    from docx import Document  # provided by the python-docx package

    data = ""
    # Open in binary mode; the context manager closes the file afterwards
    with open(path_to_file, "rb") as doc_object:
        doc_reader = Document(doc_object)
        for p in doc_reader.paragraphs:
            data += p.text + "\n"

    return data


def get_csv(paragraphs):
    # Each row is expected to be [time, speaker, text]
    combined_paragraphs = []
    speaker_text = []

    for x in range(len(paragraphs)):
        speaker = paragraphs[x][1]
        speaker_text.append(paragraphs[x][2])

        # Flush the collected text when the next row belongs to another
        # speaker, or when this is the last row
        is_last = x == len(paragraphs) - 1
        if is_last or paragraphs[x + 1][1] != speaker:
            text = ''.join(speaker_text)
            combined_paragraphs.append([speaker, text])
            speaker_text = []

    return combined_paragraphs
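
One possible alternative for the merging step, as a rough sketch only: `itertools.groupby` groups adjacent rows by speaker, so no index arithmetic is needed. The helper name `merge_consecutive_speakers` is made up for illustration, and it assumes each row is a `[time, speaker, text]` list. Unlike the function above, it keeps the timestamp of the first row in each run and joins the texts with a space.

```python
from itertools import groupby
from operator import itemgetter


def merge_consecutive_speakers(rows):
    """Merge adjacent [time, speaker, text] rows spoken by the same person.

    Hypothetical helper: keeps the timestamp of the first row in each run
    and joins the texts with a single space.
    """
    merged = []
    # groupby only groups *adjacent* items, so a speaker who talks again
    # later in the meeting starts a new entry
    for speaker, run in groupby(rows, key=itemgetter(1)):
        run = list(run)
        merged.append([run[0][0], speaker, " ".join(r[2] for r in run)])
    return merged
```

The result is still a list of three-item rows, so it could be passed to `pd.DataFrame` with the same `time`, `speaker`, `text` columns.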

Hi @vnavdulov, where exactly do you need help?


Hello @AIMPED, I ended up figuring it out. Thank you! If anyone wants to import an MS Teams transcript into a Plotly Dash app and do some analysis on it, here is what I've got. It parses the Word file into time, speaker, and text columns. I'm working on a function that will combine the text per speaker, since Teams emits multiple lines for a speaker before the next one speaks (for whatever reason).

        elif 'docx' in filename:
            import docx  # provided by the python-docx package
            import pandas as pd
            try:
                # Get the text from the MS Word file
                file = io.BytesIO(decoded)
                doc = docx.Document(file)
                text = ""
                for para in doc.paragraphs:
                    text += para.text + "\n"

                # Parse the MS Teams transcript: each entry spans three
                # consecutive lines (time, speaker, text); an incomplete
                # group at the end is dropped
                file_data = text.split('\n')
                paragraphs = [file_data[i:i + 3]
                              for i in range(0, len(file_data) - 2, 3)]
                df = pd.DataFrame(paragraphs,
                                  columns=['time', 'speaker', 'text'])

            except Exception as e:
                print(e)
                return html.Div([
                    'There was an error parsing this file.'
                ])
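
For anyone wondering where `decoded` comes from: the `contents` string that dcc.Upload passes to the callback has the form `data:<mime type>;base64,<payload>`, so it has to be split and base64-decoded first. Below is a minimal sketch of that step plus the three-line chunking, using only the standard library. The helper names `decode_upload` and `lines_to_rows` are made up for illustration and are not part of Dash.

```python
import base64
import io


def decode_upload(contents):
    """Turn the `contents` string from dcc.Upload into a binary file object."""
    # contents looks like "data:<mime type>;base64,<payload>"
    _content_type, content_string = contents.split(",", 1)
    return io.BytesIO(base64.b64decode(content_string))


def lines_to_rows(lines, width=3):
    """Chunk a flat list of lines into [time, speaker, text] rows."""
    # Any incomplete group at the end is dropped, like the `// 3` loop
    usable = len(lines) - len(lines) % width
    return [lines[i:i + width] for i in range(0, usable, width)]
```

In the callback, `docx.Document(decode_upload(contents))` would then stand in for the `io.BytesIO(decoded)` step above.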

Lesson learned: start with the documentation, then go to ChatGPT.
