How to reproduce the 'contents' output of dcc.Upload?

I am not able to reproduce the exact output (‘contents’) of the dcc.Upload component.

If I upload the file my_excel.xlsx to the dcc.Upload component, my callback receives a “base64 encoded string” (according to the dcc.Upload documentation). I don’t know how to reproduce the exact same string without the dcc.Upload component.

My current approach:

with open('tests/data/my_excel.xlsx', 'rb') as file:
    raw_data = file.read()
	
_, content_string = raw_data.split(',')  # this fails

I get the error TypeError: a bytes-like object is required, not 'str'

If I add

raw_data = base64.b64encode(raw_data)

before the split, I get the same error.

How do I get the exact same “base64 encoded string” without the dcc.Upload Component?

Thanks very much in advance

Solution:

import base64
import io
import pandas as pd
import magic  # third-party package: python-magic

filepath = 'tests/data/my_excel.xlsx'

# Reproduce the output of the dcc.Upload component
with open(filepath, "rb") as file:
    raw_bytes = file.read()  # raw file content, not yet encoded
content_bytes = base64.b64encode(raw_bytes)  # b64encode returns bytes
content_string = content_bytes.decode("utf-8")  # convert bytes to str

mime = magic.Magic(mime=True)
mime_type = mime.from_file(filepath)
content_type = "".join(["data:", mime_type, ";base64"])

contents = "".join([content_type, ",", content_string])

# And now revert: convert contents back to a binary file stream.
# Split only on the first comma, which separates the header from the payload.
content_type, content_string = contents.split(",", 1)
decoded = base64.b64decode(content_string)
df = pd.read_excel(io.BytesIO(decoded))

(based on this SO reply)
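If installing python-magic is not an option, a rough alternative is to guess the MIME type from the file extension with the standard library. The sketch below is only an assumption about how that could look: the helper names to_upload_contents and from_upload_contents are made up for illustration, and the MIME type reported by a browser may differ from what mimetypes guesses, so the header part is not guaranteed to match dcc.Upload exactly.

```python
import base64
import mimetypes
import os
import tempfile


def to_upload_contents(filepath):
    """Hypothetical helper: build a 'data:<mime>;base64,<payload>' string."""
    with open(filepath, "rb") as file:
        raw_bytes = file.read()
    # Guess the MIME type from the extension; fall back to a generic type.
    mime_type, _ = mimetypes.guess_type(filepath)
    if mime_type is None:
        mime_type = "application/octet-stream"
    payload = base64.b64encode(raw_bytes).decode("ascii")
    return f"data:{mime_type};base64,{payload}"


def from_upload_contents(contents):
    """Hypothetical helper: split the header off and decode the payload."""
    header, payload = contents.split(",", 1)
    return header, base64.b64decode(payload)


# Round trip with a small temporary file (a stand-in for my_excel.xlsx).
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "wb") as sample:
        sample.write(b"a,b\n1,2\n")
    contents = to_upload_contents(sample_path)
    header, decoded = from_upload_contents(contents)
```

In a test, the string returned by to_upload_contents can be passed to the callback function in place of the value dcc.Upload would supply.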

The reason for this error is that in Python 3 strings are Unicode, whereas data read from a file in 'rb' mode or transmitted over the network is bytes. Both file.read() and base64.b64encode() return bytes, so raw_data.split(',') calls a bytes method with a str separator, and Python raises the TypeError. You can convert bytes to a string with the decode() method of the bytes class, so you need to decode the bytes object before splitting on a str. In Python 3 the default encoding is “utf-8”, so you can write:

b"python byte to string".decode("utf-8")
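To make the type mismatch concrete, here is a small sketch that triggers the same TypeError and shows two ways around it. The sample value is invented for illustration and is not real upload data.

```python
raw_data = b"header,payload"  # invented sample bytes, not a real upload

# Splitting bytes with a str separator raises the TypeError from the question.
try:
    raw_data.split(",")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Option 1: stay in bytes and use a bytes separator.
parts_bytes = raw_data.split(b",")

# Option 2: decode to str first, then use a str separator.
parts_str = raw_data.decode("utf-8").split(",")
```

Either option works; which one fits depends on whether the rest of the code expects bytes or str.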

Python makes a clear distinction between bytes and strings. Bytes objects contain raw data (a sequence of octets), whereas strings are Unicode sequences. Conversion between these two types is explicit: you encode a string to get bytes, specifying an encoding (which defaults to UTF-8), and you decode bytes to get a string. Callers of these functions should be aware that such conversions may fail, and should consider how failures are handled.
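A short sketch of the explicit conversions and of one way to handle a failing decode. The sample values are invented, and using errors="replace" is just one possible policy, not a recommendation for every case.

```python
text = "café"
data = text.encode("utf-8")        # str -> bytes
round_trip = data.decode("utf-8")  # bytes -> str

bad = b"\xff\xfe"  # invented sample: not valid UTF-8

# A strict decode raises UnicodeDecodeError on invalid input.
try:
    bad.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# A lenient decode substitutes U+FFFD for each undecodable byte.
lenient = bad.decode("utf-8", errors="replace")
```

Base64 output only contains ASCII characters, so decoding the result of base64.b64encode() with "utf-8" or "ascii" does not hit this failure case.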