How to Upload Files Larger than 5GB to Amazon S3: A Guide

Amazon S3 is a widely used cloud storage service, ideal for storing and retrieving large amounts of data. But a single PUT request can upload an object of at most 5GB, so if you’ve ever tried to upload a larger file in one shot, you’ve hit a hard limit. In this guide, we’ll explain how to get past it using multipart upload, which supports objects up to 5TB.

What Is Multipart Upload?

Multipart upload allows you to upload a single object as a set of parts, each a contiguous portion of the object’s data. You can upload these parts independently and in any order, and if transmission of any part fails, you can re-upload just that part without affecting the others. You can also pause and resume an upload, which is handy on unreliable networks or for very long transfers.

Steps to Perform Multipart Upload

Step 1: Initiate Multipart Upload

Before uploading parts, you must initiate a multipart upload and get an upload ID. The upload ID is a unique identifier that associates all the parts in the current multipart upload.

import boto3

s3_client = boto3.client('s3')

# Start the multipart upload; S3 returns an upload ID that ties
# all subsequent parts together
response = s3_client.create_multipart_upload(
    Bucket='your_bucket_name',
    Key='your_object_key'
)

upload_id = response['UploadId']

Step 2: Upload Parts

Now you can upload the parts. Each part must be at least 5MB (only the last part can be smaller), no part can exceed 5GB, and a single upload can contain at most 10,000 parts.

file_path = 'path_to_your_large_file'
part_size = 5 * 1024 * 1024  # 5MB, the minimum; use larger parts to stay under the 10,000-part cap
part_number = 1
part_info = {'Parts': []}

with open(file_path, 'rb') as data:
    while True:
        part_data = data.read(part_size)

        # read() returns an empty bytes object at end of file
        if not part_data:
            break

        part = s3_client.upload_part(
            Bucket='your_bucket_name',
            Key='your_object_key',
            UploadId=upload_id,
            PartNumber=part_number,
            Body=part_data
        )

        # Record each part's number and ETag; both are required
        # to complete the upload in Step 3
        part_info['Parts'].append({
            'PartNumber': part_number,
            'ETag': part['ETag']
        })

        part_number += 1

Step 3: Complete Multipart Upload

After uploading all parts, you must complete the multipart upload by sending S3 the part numbers and ETags you collected, so it can assemble the parts into a single object.

# Assemble the uploaded parts into the final object
s3_client.complete_multipart_upload(
    Bucket='your_bucket_name',
    Key='your_object_key',
    UploadId=upload_id,
    MultipartUpload=part_info
)
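One caveat: if your script dies partway through, the parts you’ve already uploaded stay in S3 and continue to accrue storage charges until you either complete or abort the upload. Below is a minimal cleanup sketch using the same placeholder bucket, key, and upload ID as above; in practice you’d call it from an except block wrapped around Steps 2 and 3.

# Abandon the upload and delete any parts already stored in S3
s3_client.abort_multipart_upload(
    Bucket='your_bucket_name',
    Key='your_object_key',
    UploadId=upload_id
)

You can also configure a bucket lifecycle rule to expire incomplete multipart uploads automatically.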

Benefits of Multipart Upload

Using the multipart upload API has several advantages:

  • Efficiency: By uploading parts in parallel, you can significantly reduce the total time of your upload (see the sketch after this list).
  • Flexibility: You can upload parts in any order, and you can even pause and resume uploads.
  • Resiliency: If a part fails to upload, you can just re-upload that part without affecting others.
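To make the efficiency point concrete, here’s a sketch of uploading parts in parallel with a thread pool. It assumes the upload was already initiated as in Step 1 (so upload_id exists) and reuses the placeholder file_path, bucket, and key; the 100MB part size and worker count are illustrative, not recommendations.

import os
from concurrent.futures import ThreadPoolExecutor

part_size = 100 * 1024 * 1024  # larger parts keep the count well under the 10,000-part cap

def upload_one_part(job):
    part_number, offset = job
    # Each worker opens its own file handle so concurrent seeks don't collide
    with open(file_path, 'rb') as f:
        f.seek(offset)
        body = f.read(part_size)
    part = s3_client.upload_part(
        Bucket='your_bucket_name',
        Key='your_object_key',
        UploadId=upload_id,
        PartNumber=part_number,
        Body=body
    )
    return {'PartNumber': part_number, 'ETag': part['ETag']}

file_size = os.path.getsize(file_path)
num_parts = (file_size + part_size - 1) // part_size
jobs = [(i + 1, i * part_size) for i in range(num_parts)]

# boto3 clients are thread-safe, so the workers can share s3_client
with ThreadPoolExecutor(max_workers=8) as pool:
    part_info = {'Parts': list(pool.map(upload_one_part, jobs))}

Because pool.map preserves input order, part_info comes back already sorted by part number, which is what complete_multipart_upload requires.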

Conclusion

Uploading large files to Amazon S3 can be a challenge, but with the multipart upload API, it’s a breeze. It allows you to upload files in parts, providing efficiency, flexibility, and resiliency. So the next time you need to upload a file larger than 5GB to Amazon S3, remember: multipart upload is your friend.
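One last tip: boto3 also ships a higher-level transfer manager that performs multipart uploads for you automatically once a file crosses a configurable size threshold, handling the initiate, upload, and complete steps (and parallelism) behind the scenes. A minimal sketch, with the threshold, chunk size, and concurrency values chosen purely for illustration:

import boto3
from boto3.s3.transfer import TransferConfig

s3_client = boto3.client('s3')

# Files above multipart_threshold are split into multipart_chunksize parts,
# uploaded by up to max_concurrency threads in parallel
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,
    multipart_chunksize=100 * 1024 * 1024,
    max_concurrency=8
)

s3_client.upload_file('path_to_your_large_file', 'your_bucket_name',
                      'your_object_key', Config=config)

The manual approach in Steps 1-3 is still worth knowing whenever you need control over individual parts, for example to resume an interrupted upload.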

Keywords: Amazon S3, multipart upload, large file upload, boto3, cloud storage

Disclaimer: This article assumes a basic understanding of Python and the boto3 library. Always remember to secure your AWS credentials and avoid exposing them in your scripts.


About Saturn Cloud

Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Join today and get 150 hours of free compute per month.