Dash Upload to AWS S3 Bucket

I saw a few posts asking how to upload files from a Dash app to an AWS S3 bucket that didn’t get answered, so I created a working example of one way to do this. AWS S3 has a feature called presigned URLs that lets you securely and quickly upload any kind of file directly to your S3 bucket while keeping the bucket private.

To do this, you will need an AWS account, a beginner’s knowledge of how IAM works in AWS, and the boto3 Python package.

app.py

from dash import Dash, html, dcc, Output, Input, callback, State
import base64
import boto3
import logging
from botocore.exceptions import ClientError
import requests

app = Dash(__name__)

app.layout = html.Div([dcc.Upload(
                id='upload',
                children=html.Div([
                    'Drag and Drop'
                ]),
                style={
                    'lineHeight': '60px',
                    'borderWidth': '1px',
                    'borderStyle': 'dashed',
                    'borderRadius': '5px',
                    'textAlign': 'center'
                },
            ),
            html.Div(id='output-upload')
])


@callback(Output('output-upload', 'children'),
          Input('upload', 'contents'),
          State('upload', 'filename'),
          State('upload', 'last_modified'),
          prevent_initial_call=True)
def update_output(content, name, date):

    # the content needs to be split. It contains the type and the real content
    content_type, content_string = content.split(',')
    # Decode the base64 string
    content_decoded = base64.b64decode(content_string)

    message = upload_file(content_decoded, name, date)

    if message is not None:
        return f"{message.status_code} - {message.reason}"


def create_presigned_post(bucket_name, object_name, expiration=3600):
    """Generate a presigned URL S3 POST request to upload a file

    :param bucket_name: string
    :param object_name: string
    :param expiration: Time in seconds for the presigned URL to remain valid
    :return: Dictionary with the following keys:
        url: URL to post to
        fields: Dictionary of form fields and values to submit with the POST
    :return: None if error.
    """

    # Generate a presigned S3 POST URL
    s3_client = boto3.client('s3')

    try:
        response = s3_client.generate_presigned_post(bucket_name,
                                                     object_name,
                                                     ExpiresIn=expiration)
    except ClientError as e:
        logging.error(e)
        return None

    # The response contains the presigned URL and required fields
    return response


def upload_file(contents, filename, date):

    result = create_presigned_post("whatever-you-called-your-bucket", filename)

    if result is not None:
        # Upload the file to S3 using the presigned URL
        files = {'file': contents}
        r = requests.post(result['url'], data=result['fields'], files=files)

        return r


if __name__ == '__main__':
    app.run_server(
        debug=True
    )

If you are using Render to deploy your app, you can add the access keys of your AWS user (the one this app uses to work with S3) to the environment variables of your web service on Render.
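For reference, boto3 picks the credentials up from the standard environment variables automatically, so nothing needs to be hard-coded. A minimal sketch, assuming you set the variables in the Render dashboard (the variable names are the standard ones boto3 looks for):

# Set these as environment variables on the Render web service
# (values are the IAM user's keys; never commit them to the repo):
#   AWS_ACCESS_KEY_ID
#   AWS_SECRET_ACCESS_KEY
#   AWS_DEFAULT_REGION
import boto3

# boto3 reads the variables above automatically, so no credentials appear in code
s3_client = boto3.client("s3")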


Nice suggestion, though this solution is not enough if your app is running on a remote server. Since you execute the upload from a callback, the uploaded data still has to reach the server before it can be sent to S3.

For example, AWS Lambda imposes a 6 MB limit on each invocation payload. So when your app runs on an AWS Lambda, you have to make sure that the data being transferred from the browser to the server and back stays within that limit, which means that whatever you upload via dcc.Upload has to fit inside it.

It would be great if we could use S3 to circumvent this size limit and send large data files directly from the browser to S3. Unfortunately, I don’t have a solution for this yet.

What if we converted your example to a clientside callback? I guess that should work. Would it be secure to do that, though, or would it leave your S3 bucket vulnerable to attacks?

Does anyone have a solution to this problem?
Thanks!

Sorry, I’ve been MIA for a bit. Following up on this topic…
I don’t believe that 6 MB size limit is accurate, do you have some docs on it? The only size limitation I know of is the 5 GB S3 PutObject API limit, which can also be worked around if you use multipart uploads as described here.
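For reference, here is a minimal sketch of the multipart route, assuming you upload from the server side with boto3 rather than through a presigned POST; the file and bucket names are placeholders. boto3’s managed transfer API switches to a multipart upload automatically once a file exceeds the configured threshold:

import boto3
from boto3.s3.transfer import TransferConfig

# Files larger than the threshold are split into a multipart upload automatically.
config = TransferConfig(multipart_threshold=100 * 1024 * 1024)  # 100 MB

s3 = boto3.client("s3")
s3.upload_file(
    "path/to/large_file.csv",           # local file (placeholder)
    "whatever-you-called-your-bucket",  # bucket name (placeholder)
    "large_file.csv",                   # object key (placeholder)
    Config=config,
)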

I’d have to test to see what AWS allows me to do for this scenario, but 5 GB is plenty for my use case. Security-wise, without a presigned post my bucket is left a little vulnerable since the bucket itself is public. I’ll revisit the security side of this topic to get a better understanding of public S3 bucket vulnerabilities.

Data-wise, 5 GB isn’t a ton, but there are different ways to do this if you need larger data sets to be transferred.

@xdxd The solution I built works for my usage. Are you trying something different? What is your use case?


The limit exists if you run your tool on an AWS Lambda. Lambdas have a payload limit of 6 MB; see for example this post: amazon web services - Request payload limit with AWS API Gateway - Stack Overflow, and check the invocation payload (request and response) entry in the table here: Lambda quotas - AWS Lambda.

This means that every single HTTP request going in or out of the Lambda must adhere to this limit. So if you upload data to the server, it can be at most 6 MB in size. Similarly, your callbacks cannot output more than 6 MB of data, or the Lambda will throw error 413 (payload too large).

With regard to dcc.Upload, what happens is that inside the browser the data gets converted to a base64 string and that string gets uploaded to the server. There, your callback processes the string and perhaps sends it back to the browser to store in a dcc.Store. If so, you might have other callbacks that listen to that Store, and once the Store has new data, they will take it, parse it and load it into other components in your dashboard.

  1. This means that the data in the Store has to be transferred to the server multiple times, so the whole data structure cannot be larger than 6 MB.
  2. Converting data to a base64 string increases its size by roughly a third; I have found that files which are 5 MB on my local PC exceed the 6 MB limit once converted to base64 (as the snippet below illustrates). Whether your data hits the limit therefore needs to be tested.
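As a quick standalone illustration of that second point (this snippet is not part of the tool, just the arithmetic):

import base64
import os

raw = os.urandom(5 * 1024 * 1024)   # 5 MB of arbitrary binary data
encoded = base64.b64encode(raw)

print(len(raw) / 1024 ** 2)         # 5.0 MB
print(len(encoded) / 1024 ** 2)     # ~6.67 MB, already over the 6 MB limit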

To come back to the topic, @xdxd, in the end I was able to create a solution that works very well. I was planning to upload it to a GitHub repository so I could share it, but it is not a small solution and it contains a lot of custom features related to my work environment. To share it I would have to strip all of that out, and I haven’t found the time to do that. But I can give you a general step-by-step guide on how to do it.

Disclaimer: I designed my solution for a Dash tool that runs on an AWS Lambda which sits behind an AWS API Gateway.

Summary of the procedure: we create a clientside_callback that handles the upload of the data in the browser. The clientside_callback runs when data is uploaded through a dcc.Upload and does two things: 1) it makes a custom request to the server asking for a presigned post URL for our S3 bucket, and 2) it uploads the data to S3 using that presigned post URL. This procedure sends the data directly from the browser to S3.

While the summary above makes it sound simple, there are a couple of things that need to be set up to make it work. In a little more detail:

  1. To run your application on AWS, you usually create a lambda_handler function, which is the entry point for your Lambda. In this function you need to parse the event and check whether the incoming request was sent to your custom URL or whether it was a regular Dash request. To give an example: I assign a custom URL to my tools, my-domain.com/my-tool/dash, so the custom upload URL could then be my-domain.com/my-tool/upload-data. In the lambda_handler you can then check the event with
    if event.get("pathParameters", {}).get("proxy", "unknown") == "upload-data":
        # create a presigned-post URL and return it in the response
        ...
    else:
        # handle the request as a regular Dash request, as before
        ...
    
  2. Generating a presigned post can be done like:
    response = boto3.client("s3").generate_presigned_post(
                 bucket_name,
                 required_name_for_file_to_be_uploaded,
                 Fields={"Content-Type": required_content_type_for_file_to_be_uploaded},
                 Conditions=[{"Content-Type": required_content_type_for_file_to_be_uploaded}],
                 ExpiresIn=desired_minutes * 60,  # time in seconds
             )
    
  3. In the clientside callback, request the presigned post with the following (if you set up an authorizer, also include the authorization token):
    const post_url = await fetch("my-domain.com/my-tool/upload-data", {
        headers: {
            "Authorization": <user_token> 
        },
        method: "GET"
    })
    .then(response => {return response.json()});
    
  4. Upload the data to the presigned post url:
    const data = new FormData();
    Object.entries(post_url.fields).forEach(([field, value]) => {
        data.append(field, value);
    });
    data.append("file", data_to_upload);
    
    var post_response = await fetch(post_url.url, {
        method: "POST",
        body: data,
    }).then(response => {return response})
    
  5. Check for success
    if (post_response.status == 204) {
        // upload successful
        return {
            "s3_file_path": post_url.fields.key,
        }
    }
    
  6. Store this file path in a dcc.Store and pass it around to every callback that needs to access the uploaded data (a wiring sketch follows after this list).
  7. Whenever a callback needs the uploaded data, download it during callback execution:
    # on a Lambda, a writable location such as /tmp is typically used here
    tmp_path_on_lambda = os.path.join(location, os.path.basename(s3_file_path))
    boto3.client("s3").download_file(bucket_name, s3_file_path, tmp_path_on_lambda)
    with open(tmp_path_on_lambda, "r") as f:
        contents = f.read()  # parse the file as needed
    
    
    

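For illustration, here is a rough, untested sketch of how the pieces could be wired together in Dash; the component ids, the Store id and the /my-tool/upload-data endpoint are placeholders, the JavaScript condenses steps 3 to 6 into one clientside_callback, and it assumes a Dash version that supports promise-returning clientside callbacks. Treat it as a starting point, not as my production code:

from dash import Dash, dcc, html, Output, Input, State, clientside_callback

app = Dash(__name__)

app.layout = html.Div([
    dcc.Upload(id="upload", children=html.Div(["Drag and Drop"])),
    dcc.Store(id="s3-file-path"),  # holds only the S3 key, never the file contents
])

clientside_callback(
    """
    async function(contents, filename) {
        if (!contents) { return window.dash_clientside.no_update; }

        // step 3: ask the server (our custom endpoint) for a presigned POST
        const post_url = await fetch("/my-tool/upload-data", {method: "GET"})
            .then(response => response.json());

        // step 4: build the form and send the file straight to S3
        const data = new FormData();
        Object.entries(post_url.fields).forEach(([field, value]) => {
            data.append(field, value);
        });
        // dcc.Upload provides a base64 data URL; convert it back to a Blob first
        const blob = await fetch(contents).then(response => response.blob());
        data.append("file", blob, filename);

        const post_response = await fetch(post_url.url, {method: "POST", body: data});

        // steps 5 and 6: on success, hand only the S3 key back to the Store
        if (post_response.status == 204) {
            return post_url.fields.key;
        }
        return window.dash_clientside.no_update;
    }
    """,
    Output("s3-file-path", "data"),
    Input("upload", "contents"),
    State("upload", "filename"),
)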
These steps should give you an idea of how to get started. You will have to add security checks around each of these steps, and you will have to give the Lambda permission to interact with the accompanying S3 bucket, so you will have to set up template.yaml correctly as well.

Secondly, while playing around with the above, check the contents of all the responses that are returned by these functions to get an understanding of what data is being passed around.

Thirdly, some of the processing/safeguarding will have to be done in the clientside_callback (in JavaScript) and some can be done by Python functions in the server/callbacks.

If you have more questions, feel free to ask them, and I will see if I can answer them.

Ahh, you are correct, I was not thinking about Lambda limits. Your solution is the right one if Lambda needs to be involved; effectively:
[schematic: the browser requests a presigned URL from the Lambda, then uploads the file directly to S3]

My solution didn’t involve Lambda. I am curious now; I’ll have to try a bunch of huge files and see what happens.

Solid solution!!

The schematic looks right indeed. A side note here, it is possible to use any lambda that is able to generate a presigned-post url. It doesn’t need to be the same one as the lambda that is running the tool (I do that for convenience and maintainability).

And thanks, it took a lot of effort to figure out how to make everything work, especially with the JavaScript in between and the browser CORS errors caused by not formatting the HTTP POST request correctly in the clientside_callback (I am no JavaScript programmer). There was plenty of frustration to be had. If I ever get around to making a nice MWE out of my setup, I will share it on the forum.

In any case, with my setup I was able to upload files of up to 180 MB without issue (I didn’t try much more). My guess is that the limit will be determined either by what AWS S3 allows or by browser memory (because dcc.Upload first stores the base64-converted data in memory). Guessing a bit more, I think browser memory will be the limiting factor.