Dash Upload Component - Decoding Large Files

Currently it seems like the upload object automatically encodes and copies the data that is uploaded which makes it impossible to process large files. Is there any way around this?

There isn’t any way around this. What about encoding the files makes parsing them impossible? Can you share a small, reproducible example?

Sure, the following example works fine (newest Firefox on Ubuntu 17.10) with files up to around 150MB, but anything larger will just hang. Occasionally I can get the tab to crash, but mostly the upload will progress a bit (evidenced by watching the CPU/memory usage) and then stop with no indication of what happened.

import dash
import dash_core_components as dcc
import dash_html_components as html

app = dash.Dash()
app.css.config.serve_locally = True
app.scripts.config.serve_locally = True

app.layout = html.Div([
    dcc.Upload(
        id='upload-data',
        children=html.Div([
            'Drag and Drop or ',
            html.A('Select Files')
        ]),
        style={
            'width': '100%',
            'height': '60px',
            'lineHeight': '60px',
            'borderWidth': '1px',
            'borderStyle': 'dashed',
            'borderRadius': '5px',
            'textAlign': 'center',
            'margin': '10px'
        },
        max_size=-1,
        # Allow multiple files to be uploaded
        multiple=False
    ),
    html.Div(id='output-data-upload'),
])

@app.callback(dash.dependencies.Output('output-data-upload', 'children'),
             [dash.dependencies.Input('upload-data', 'contents'),
              dash.dependencies.Input('upload-data', 'filename')])
def update_output(contents, filename):
    if contents is not None:
        # do something with contents
        children='processed data from file ' + filename
        return children
    else:
        return 'no contents'


if __name__ == '__main__':
    app.run_server(debug=True)
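
For reference, the `contents` value that reaches the callback is a base64 data URL string of the form `data:<mime type>;base64,<payload>`, so the "do something with contents" step usually starts by decoding it. A minimal sketch, where `decode_upload_contents` is a hypothetical helper name and the sample string is made up for illustration:

```python
import base64


def decode_upload_contents(contents):
    """Split a data URL into its header and the decoded raw bytes."""
    # Everything before the first comma is the header, e.g. "data:text/csv;base64"
    header, encoded = contents.split(',', 1)
    return header, base64.b64decode(encoded)


# Made-up example of what the browser would send for a tiny CSV file
sample = 'data:text/csv;base64,' + base64.b64encode(b'a,b\n1,2\n').decode('ascii')
header, raw = decode_upload_contents(sample)
print(header)  # data:text/csv;base64
print(raw)     # b'a,b\n1,2\n'
```

Note that the whole file has to exist as one string before this function can run, which is exactly the part that becomes a problem for large files.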

There’s an issue about this on the Dropzone issue tracker (Dropzone is the React component that the Dash Upload component uses). The suggestion there is to increase the timeout. However, according to the changelog, this option was only added in 4.4, and Dash currently uses 4.2.3.

The issue also mentions a chunked upload option which sounds like it would solve this problem. Unfortunately this is an even newer feature, only just in the latest version of Dropzone.

So it sounds like upgrading to the latest version of Dropzone is going to be the way to go. I’ve created an issue here.

Edit: so much for fixing things with a simple upgrade. I had forgotten that everything is going through Dash’s reactive API.

Yeah, this is a limitation in Dash right now. In Dash, all of the data (the store) is stored in the web browser. This data is accessed whenever any of the input elements change and then sent to the server.

In this case, ~150MB ends up being a lot for the web browser to handle. It’s hard to say why the browser is crashing / freezing. It could be one of many things:
1 - The process of “uploading” the data and converting it into memory on the client is CPU intensive and causes things to freeze
2 - The process of making a 150MB HTTP request is perhaps CPU intensive and causes things to freeze
3 - Converting the file into memory on the client (the browser) takes up a lot more memory than 150MB and causes the machine to run out of memory
4 - Converting the file to memory on the Dash server (in a dev environment the client and server will be on the same machine) causes the machine to run out of memory
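
Point 3 is plausible partly because of the encoding itself: base64 represents every 3 bytes of input as 4 ASCII characters, so the encoded string is about a third larger than the file before any further copies are made. A small sketch of the arithmetic (the 3 MiB payload is an arbitrary stand-in, not a measurement from the app above):

```python
import base64
import os

raw = os.urandom(3 * 1024 * 1024)  # 3 MiB of arbitrary bytes standing in for a file
encoded = base64.b64encode(raw)

# 4 output characters for every 3 input bytes
ratio = len(encoded) / len(raw)
print(len(raw), len(encoded), round(ratio, 3))
```

By this arithmetic a 150MB file becomes roughly 200MB of text, and any additional copies held by the browser or the server come on top of that.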

We could solve 2 and 4 by supporting streamed requests. In Flask, this would look like https://blog.pelicandd.com/article/80. This would require some updates in dash-renderer, which is Dash’s JS front-end.

If the issue is 3, then we need to work on memory management in Dash’s front-end. I’m sure there is lots of low hanging fruit here. However, due to Dash’s architecture, we need to keep the file contents around in memory in the client as they need to be accessible whenever a Dash callback might need them. This is unlike, say, Google Drive where uploading a file just passes it through the browser from your computer to their servers - it doesn’t persist on your browser.

So, this brings us to a question about what the underlying use case is. Perhaps “uploading a very large file” is slightly outside of the reactive dash paradigm. The Dash upload component could simply stream the data to the Dash Flask server (without keeping it in memory) and the Dash developer could refer to that data on the disk by its filename and the user’s session if they needed to access it.

In pseudocode:

def serve_layout():
    session_id = rand_id()
    return html.Div([
        html.Div(session_id, id='session-id', style={'display': 'none'}),
        dcc.Dropdown(id='dropdown'),
        # `endpoint` and `session` are proposed properties, not existing ones
        dcc.Upload(endpoint='/streaming-upload', session=session_id)
    ])

# save the file, bypass Dash
@app.server.route('/streaming-upload', methods=['POST'])
def save_streaming_upload():
    # the session ID is assumed to arrive as a query-string parameter
    user_session = flask.request.args['session']
    with open(user_session, 'wb') as f:
        chunk_size = 4096
        while True:
            chunk = flask.request.stream.read(chunk_size)
            if len(chunk) == 0:
                break

            f.write(chunk)
    # a Flask view has to return a response, even an empty one
    return ''

@app.callback(Output('...'), [Input('dropdown', 'value'), Input('session-id', 'children')])
def filter_data(value, session):
    df = pd.read_csv(session)
    ...

In this case, we’re identifying the user’s particular upload by setting some type of session ID, perhaps as part of the dcc.Upload component. The dcc.Upload component would be responsible for making a streaming request to the endpoint parameter.

Perhaps something like this would work.
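
The chunked write at the heart of that idea can be tried on its own, independent of Flask or Dash. A minimal sketch, assuming any file-like object with a `read` method: `save_stream` is a hypothetical helper name, and `io.BytesIO` stands in here for `flask.request.stream`.

```python
import io
import os
import tempfile


def save_stream(stream, path, chunk_size=4096):
    """Copy a file-like object to `path` chunk by chunk; return bytes written."""
    written = 0
    with open(path, 'wb') as f:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:  # empty bytes means the stream is exhausted
                break
            f.write(chunk)
            written += len(chunk)
    return written


# Stand-in for an incoming request body: 10,000 bytes held in memory
payload = b'x' * 10000
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'upload.bin')
    total = save_stream(io.BytesIO(payload), target)
    size_on_disk = os.path.getsize(target)

print(total, size_on_disk)
```

Only one chunk is held in memory at a time, which is the property that matters for large uploads. The standard library's `shutil.copyfileobj` does the same job if the byte count isn't needed.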

In any case, I would be curious to hear more about your use case @sophotrope. What would you like to do with these huge files? Are you just saving them? Or are you creating dynamic UIs based off of them? How do you expect your app to work with multiple, concurrent users?

Thanks for all the attention here. It’s very helpful.

My use case: I’m doing something which may not be ideal for Dash… I’m making a small application which ingests a dataset from the user and then performs some fancy machine learning tasks to analyze the user’s submitted data. The program populates a number of interactive plots. These plots also have parameters that can change and require updating. There is a data flow hierarchy here, so the reactivity of the whole thing is a nice paradigm. And because my computational backend is Python, it seemed natural to try to use Dash.

In the future this program would likely be useful if deployed on a server in the cloud. In that case the user would be uploading their data to some persistent filesystem. However, I do not foresee a use case in which several users are being served the same content concurrently. Even in the cloud it would be an instance-per-user type of application.

For an MVP all I would really want is the ability for the user to go through a kind of ‘file selection’ button in order to search their local filesystem for a datafile and for that file (address or handle) be passed off to the backend for processing – not copied and encoded on the server. I’m not very knowledgeable about this stuff, but I think I’ve read that HTML5 exposes this kind of capability in the file API, but I think it hasn’t been translated through to the dash API. Correct me if I’m wrong. Thanks in advance.

Note that if the user is also running the app, then you can just display a dropdown with a list of the user’s files that they could select. You could just populate this dropdown’s options by listing files using glob.glob or something like that. Pseudocode:

app.layout = html.Div([
     dcc.Dropdown(id='file-list', options=[{'label': i, 'value': i} for i in glob.glob('*.csv')])
])

This won’t work in a client-server pattern (where the app is running on a different machine than the user who is viewing it) but it would work fine for development.
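
The option-building part can be checked without starting an app. A minimal sketch, where `csv_dropdown_options` is a hypothetical helper name and a temporary folder stands in for the user's working directory:

```python
import glob
import os
import tempfile


def csv_dropdown_options(folder):
    """Return Dropdown-style options for every .csv file in `folder`."""
    paths = sorted(glob.glob(os.path.join(folder, '*.csv')))
    # Show the bare file name as the label but keep the full path as the value
    return [{'label': os.path.basename(p), 'value': p} for p in paths]


with tempfile.TemporaryDirectory() as tmp:
    for name in ('b.csv', 'a.csv', 'notes.txt'):
        open(os.path.join(tmp, name), 'w').close()
    options = csv_dropdown_options(tmp)
    labels = [o['label'] for o in options]

print(labels)  # ['a.csv', 'b.csv']
```

The returned list has the same shape as the `options` argument in the layout above, so it could be passed to `dcc.Dropdown` directly.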

To make @chriddyp’s idea for an MVP support a client server setup, you could also embed an iframe within your dash app that points to a page with a super simple vanilla html upload form, thereby escaping the Dash app just for the upload. Then you’d need a ‘refresh’ button or some such which triggers a callback that updates the dropdown with the values of the filesystem glob now that the file has been uploaded.

Although now you’re just bypassing the Upload component entirely, so you obviously lose all the nice features it supports.

See Show And Tell -- Dash Resumable Upload for a new approach to uploading large files :tada:

Did you have any success building the intended app, @sophotrope? I am trying to build something similar, and it would be great to know whether your application worked out. If so, could you share the key learnings from your project?

Thanks in advance for all the help!

AP

The idea of using Flask instead of Dash for the upload works for me. I have tested it with a 1GB file. Here is my code:

import os

import dash
import dash_html_components as html
from flask import request
from werkzeug.utils import secure_filename

# Save uploads to a "www" folder under the current working directory
UPLOAD_FOLDER = os.path.join(os.getcwd(), 'www')
os.makedirs(UPLOAD_FOLDER, exist_ok=True)

external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']
app = dash.Dash(__name__, external_stylesheets=external_stylesheets)
# Keep the setting on the underlying Flask server's config
app.server.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER

app.layout = html.Div(
    children=[
        html.Iframe(id='iframe-upload', src='/upload'),
        html.Div(id='output')
    ]
)


@app.server.route('/upload', methods=['GET', 'POST'])
def upload_file():
    if request.method == 'POST':
        file = request.files['file']
        # secure_filename strips path separators and other unsafe characters
        filename = secure_filename(file.filename)
        file.save(os.path.join(app.server.config['UPLOAD_FOLDER'], filename))

    return '''
    <form method=post enctype=multipart/form-data>
      <input type=file name=file>
      <input type=submit value=Upload>
    </form>
    '''


if __name__ == '__main__':
    app.run_server(debug=True)

A new issue for this solution is how to make the flask part interactive with Dash components. For example, how to show the name of the uploaded file in ‘output’ block. I can do this by monitoring the filenames in the upload folder, but there should be some more straightforward way.
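
The folder-monitoring workaround mentioned above can be kept fairly small. A minimal sketch, assuming the most recently modified file is the one of interest: `newest_file` is a hypothetical helper name that a Dash callback (triggered by a refresh button, for instance) could call to fill the ‘output’ block.

```python
import os
import tempfile


def newest_file(folder):
    """Return the name of the most recently modified file in `folder`, or None."""
    files = [entry for entry in os.scandir(folder) if entry.is_file()]
    if not files:
        return None
    return max(files, key=lambda entry: entry.stat().st_mtime).name


with tempfile.TemporaryDirectory() as tmp:
    empty_result = newest_file(tmp)
    for i, name in enumerate(['old.csv', 'new.csv']):
        path = os.path.join(tmp, name)
        open(path, 'w').close()
        # Set explicit modification times so the ordering is deterministic
        os.utime(path, (1000 + i, 1000 + i))
    latest = newest_file(tmp)

print(empty_result, latest)  # None new.csv
```

This only reports what is on disk. It doesn't tell the app which user uploaded the file, so it fits the single-user setup described earlier in the thread better than a shared deployment.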

Hello,

I’m currently writing my master’s thesis in the field of Topological Data Analysis in combination with the classification of observations. It includes implementing the interactive analytics process as a dashboard web application.

The user should be allowed to upload a CSV file containing the required data. As the whole process is implemented in Python, I have chosen Dash to implement the user interface. I use three datasets in my thesis, consisting of 1k, 280k, and 800k observations, respectively. The two large files are around 150MB and 210MB in size, respectively.

Problems occur when uploading the large files to show the first 10 records for feature selection. To make the dataset usable for the application, I use read_csv from the pandas module to read the dataset into a pandas dataframe. This dataframe is serialized as a JSON string and saved in a hidden Div. The JSON serialization of the dataframe is stored in the browser’s cache. When I upload one of the larger files, an out-of-memory error occurs in my browser and the app doesn’t update. This behavior seems to be independent of the web browser used.

My first approach was to replace the upload component from the dash-core-components module with the upload component from the dash-resumable-upload module, which didn’t solve the problem: the out-of-memory error no longer occurs, but the page is not updated with a preview of the first ten records. The next approach was to use a Flask cache on the filesystem, but the problem persists and the browser shows an out-of-memory error. The last approach was to replace the hidden Div with a Store component from the dash-core-components module, which didn’t solve my problem either. If I use just a subset of 50k observations, everything works fine.

What can I do to make Dash work with a larger amount of data? Thank you very much in advance for your responses.
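
One thing that may be worth trying, since the preview only needs the first ten records: read just those rows for the browser and leave the full dataset on the server, rather than serializing the whole dataframe to JSON. A minimal sketch using only the standard library (`preview_rows` is a hypothetical helper name; with pandas, the `nrows` argument of `read_csv` serves the same purpose):

```python
import csv
import io
from itertools import islice


def preview_rows(csv_file, n=10):
    """Return the header and at most `n` data rows from an open CSV file."""
    reader = csv.reader(csv_file)
    header = next(reader)
    # islice stops after n rows, so the rest of the file is never parsed
    return header, list(islice(reader, n))


# Stand-in for a large upload: 1,000 generated rows
data = io.StringIO('id,value\n' + ''.join(f'{i},{i * i}\n' for i in range(1000)))
header, rows = preview_rows(data)
print(header, len(rows), rows[-1])  # ['id', 'value'] 10 ['9', '81']
```

Only the header and ten rows would then need to travel to the browser, regardless of how large the file is.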

Has the problem of handling large datasets been solved by now? We use SQL for data handling and it works fine, but with datasets larger than 50MB we face the same issues: the web browser crashes. Could you give an update on this, please?

I have done the same thing,
but the data I am using comes from a local SQL database server and not from the cloud.
Like @sophotrope said, we need to use the cloud instead of a local SQL server.
Will it be possible to do that?

Hey, just wanted a clarification.

So using Flask to save the file… does that change the type of file that is being uploaded? Will ‘Sample.xlsx’ still be ‘Sample.xlsx’? And how do you refer to it for later use?

Thanks,

I am also wondering if there has been any update on uploading large files with Dash without them being fully in memory in both the browser and server. I did see some of the other threads about dash-resumable-upload, but it appears that is no longer being maintained.

Thanks!

Is there any update on uploading large files (more than ~150MB) using dcc.Upload?
Is there any way to get around this problem?

@jauerb Any luck finding a solution?

@maulberto3 I have made an upload component that can handle large files without problems. It is a fork of the dash-resumable-upload which utilizes Resumable.js. You can check the documentation on the github page: dash-uploader

Installing

pip install dash-uploader

Simple Example

import dash
import dash_html_components as html
import dash_uploader as du

app = dash.Dash(__name__)

# 1) configure the upload folder
du.configure_upload(app, r"C:\tmp\Uploads")

# 2) Use the Upload component
app.layout = html.Div([
    du.Upload(),
])

if __name__ == '__main__':
    app.run_server(debug=True)

Hi @np8 Thanks, will definitely look at it. Thanks again.