How to Automatically Sync Two Amazon S3 Buckets Without Using s3cmd: A Guide

In the world of data science, syncing data between Amazon S3 buckets is a common task. Although s3cmd is a popular tool for this job, it’s not the only option. This blog post presents a step-by-step guide on how to automatically sync two Amazon S3 buckets without using s3cmd. We’ll be using AWS CLI and AWS Lambda for this purpose.
What Is Amazon S3 Sync?
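Amazon S3 Sync refers to the sync command of the AWS CLI (aws s3 sync), which recursively copies new and updated files from a source to a destination. The source and destination can be two S3 buckets, two prefixes within a single bucket, or a local directory and a bucket. It provides a reliable and efficient way to back up data or keep data consistent across environments.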
Prerequisites
To follow along, you need:
- An AWS account
- AWS CLI installed and configured
- Basic knowledge of Python and AWS Lambda
Step 1: Install and Configure AWS CLI
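AWS CLI is a unified tool to manage AWS services. It brings the power of AWS to your terminal, enabling automation through scripts. If you haven’t installed it yet, you can install and configure it by running the following commands (installing through pip gives you version 1 of the AWS CLI; version 2 is distributed through AWS's own installers):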
pip install awscli --upgrade --user
aws configure
During the configuration process, you’ll be prompted to provide your AWS Access Key ID, AWS Secret Access Key, Default region name, and Default output format. You create the access key pair in the AWS IAM console; the region (for example, us-east-1) and the output format (for example, json) are your choice.
Step 2: Sync S3 Buckets Using AWS CLI
You can sync two S3 buckets using the sync command in AWS CLI. Here’s the basic syntax:
aws s3 sync s3://source-bucket s3://destination-bucket
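This command synchronizes the source bucket to the destination bucket. It only copies new and modified files, making it an efficient way to sync data. By default it does not delete objects from the destination that no longer exist in the source; add the --delete flag if you want the destination to mirror removals as well.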
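To make "new and modified" concrete, here is a minimal Python sketch of the kind of comparison a sync performs. It assumes, as a simplification, that an object needs copying when it is missing from the destination, differs in size, or has a newer modification time in the source. ObjectInfo and plan_sync are hypothetical names used only for illustration; they are not part of the AWS CLI.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ObjectInfo:
    """Hypothetical summary of one S3 object: size in bytes and last-modified time."""
    size: int
    last_modified: datetime


def plan_sync(source, destination):
    """Return the sorted keys that would need copying from source to destination.

    Both arguments map object key -> ObjectInfo. Assumption: a key is copied when
    it is missing from the destination, its size differs, or the source copy is
    newer. Keys that exist only in the destination are left alone (no deletion).
    """
    to_copy = []
    for key, src in source.items():
        dst = destination.get(key)
        if dst is None or dst.size != src.size or src.last_modified > dst.last_modified:
            to_copy.append(key)
    return sorted(to_copy)


if __name__ == "__main__":
    t1 = datetime(2023, 1, 1, tzinfo=timezone.utc)
    t2 = datetime(2023, 6, 1, tzinfo=timezone.utc)
    source = {
        "a.csv": ObjectInfo(100, t1),  # unchanged
        "b.csv": ObjectInfo(250, t2),  # modified since the last sync
        "c.csv": ObjectInfo(10, t2),   # new
    }
    destination = {
        "a.csv": ObjectInfo(100, t1),
        "b.csv": ObjectInfo(200, t1),
        "old.csv": ObjectInfo(5, t1),  # only in the destination; not touched
    }
    print(plan_sync(source, destination))  # ['b.csv', 'c.csv']
```

Only b.csv and c.csv are selected; a.csv is skipped because nothing about it changed, and old.csv stays in the destination because this sketch, like the default sync, never deletes.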
Step 3: Automate the Sync Process Using AWS Lambda
While the AWS CLI is powerful, someone (or an always-on machine running a scheduler) still has to run the command. To fully automate the syncing process, we’ll use AWS Lambda, a serverless computing service that lets you run your code without provisioning or managing servers.
Create a New Lambda Function
Go to the AWS Lambda console and create a new function. Choose “Author from scratch,” provide a name, select Python as the runtime, and choose an IAM role that has permissions to read from the source bucket and write to the destination bucket.
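As an illustration of what "permissions to read from the source bucket and write to the destination bucket" can look like, the following Python sketch builds a policy document you could attach to the role. The function name build_sync_policy and the Sid values are hypothetical, and the bucket names are placeholders. Treat it as a starting point: your setup may need more (for example, permissions for encrypted buckets, or for writing the function's logs).

```python
import json


def build_sync_policy(source_bucket, destination_bucket):
    """Build a narrowly scoped IAM policy document for a bucket-to-bucket copy.

    The bucket names are supplied by the caller; nothing here talks to AWS.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so the resource is the bucket ARN.
                "Sid": "ListBothBuckets",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{source_bucket}",
                    f"arn:aws:s3:::{destination_bucket}",
                ],
            },
            {
                # Reading objects is an object-level action, hence the /* suffix.
                "Sid": "ReadSourceObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{source_bucket}/*"],
            },
            {
                "Sid": "WriteDestinationObjects",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{destination_bucket}/*"],
            },
        ],
    }


if __name__ == "__main__":
    policy = build_sync_policy("source-bucket", "destination-bucket")
    print(json.dumps(policy, indent=2))
```

Running the script prints JSON that can be pasted into the IAM policy editor.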
Add the Sync Code
In the function code section, add the following Python code:
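import boto3

def lambda_handler(event, context):
    # Resource-level interface to S3
    s3 = boto3.resource('s3')
    # Bucket and key of the object to copy
    copy_source = {
        'Bucket': 'source-bucket',
        'Key': 'object-key'
    }
    # Copy the object into the destination bucket under the same key
    bucket = s3.Bucket('destination-bucket')
    bucket.copy(copy_source, 'object-key')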
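Replace 'source-bucket', 'destination-bucket', and 'object-key' with your bucket names and object key. Note that this function copies a single object; to mirror an entire bucket, the function has to list the objects in the source bucket and copy each one.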
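For a whole-bucket copy, one possible shape is sketched below. It is not the only way to do this and makes some loud assumptions: list_objects and sync_buckets are hypothetical helper names, an object is considered up to date when a destination object with the same key and size exists (so a same-size edit would be missed), and nothing is ever deleted. The boto3 calls used are the list_objects_v2 paginator and the S3 client's copy method.

```python
def list_objects(client, bucket):
    """Return {key: size} for every object in the bucket, following pagination."""
    sizes = {}
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # "Contents" is absent from a page when the bucket has no objects.
        for obj in page.get("Contents", []):
            sizes[obj["Key"]] = obj["Size"]
    return sizes


def sync_buckets(client, source_bucket, destination_bucket):
    """Copy objects that are missing from the destination or differ in size.

    Returns the list of keys that were copied. Nothing is deleted.
    """
    existing = list_objects(client, destination_bucket)
    copied = []
    for key, size in list_objects(client, source_bucket).items():
        if existing.get(key) != size:
            client.copy(
                CopySource={"Bucket": source_bucket, "Key": key},
                Bucket=destination_bucket,
                Key=key,
            )
            copied.append(key)
    return copied


def lambda_handler(event, context):
    # boto3 is imported here so the helpers above can be tested without it.
    import boto3

    client = boto3.client("s3")
    # Placeholder bucket names; replace with your own.
    copied = sync_buckets(client, "source-bucket", "destination-bucket")
    return {"copied": len(copied)}
```

Passing the client in as a parameter keeps the copy logic testable with a stand-in object. Keep in mind that a Lambda invocation can run for at most 15 minutes, so a very large bucket may not fit in a single run with this approach.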
Test the Lambda Function
Save and test the function. If everything is set up correctly, the object appears in the destination bucket after the Lambda function runs.
Step 4: Schedule the Lambda Function
To make the function run automatically, you can use Amazon CloudWatch Events (now part of Amazon EventBridge). Create a new rule that triggers on a schedule, and set the schedule expression based on your needs. For example, to run the sync daily at noon UTC, use the schedule expression cron(0 12 * * ? *).
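If you prefer to create the schedule from code rather than the console, the rule can be sketched with boto3's events client, using its put_rule and put_targets calls. The helper names daily_cron and schedule_daily, the target Id "sync-lambda", and the rule name in the usage note are hypothetical choices for this example.

```python
def daily_cron(hour_utc, minute=0):
    """Build a schedule expression that fires once a day (times are in UTC)."""
    if not (0 <= hour_utc <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    # Fields: minutes, hours, day-of-month, month, day-of-week, year.
    return f"cron({minute} {hour_utc} * * ? *)"


def schedule_daily(events_client, rule_name, lambda_arn, hour_utc=12):
    """Create or update a scheduled rule and point it at the Lambda function.

    Returns the ARN of the rule as reported by put_rule.
    """
    response = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=daily_cron(hour_utc),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "sync-lambda", "Arn": lambda_arn}],
    )
    return response["RuleArn"]
```

With real credentials you would call it as schedule_daily(boto3.client("events"), "daily-s3-sync", your_function_arn). When the rule is created through the API instead of the console, the function also needs a resource-based permission that lets events.amazonaws.com invoke it, which can be granted with the Lambda client's add_permission call.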
Conclusion
With AWS CLI and AWS Lambda, you can automate the process of syncing two Amazon S3 buckets without using s3cmd. The initial setup might take some effort, but the reward is a robust, scalable, and fully automated solution. Happy syncing!