How to Work With PyCharm and AWS SageMaker Using the AWS SageMaker Python SDK

In this blog, we are going to discuss how to make use of AWS SageMaker services locally on PyCharm using the AWS SageMaker Python SDK.

Amazon SageMaker is a fully managed ML service that helps organizations put their ML ideas into production faster and makes data science teams more productive. Teams can quickly train models, tune them for better results, and deploy them to production-ready environments.

Many developers and data scientists, however, want the full advantage of SageMaker Studio's services while still writing Python code in their preferred local IDE, such as PyCharm or Visual Studio Code. Combining the two lets them optimize both their productivity and the results of their projects.

In our last post on using PyCharm with AWS SageMaker, we discussed how to connect the two over SSH using the Remote Development Gateway plugin.

This post shows how to manage your training jobs and experiments on AWS using the AWS SageMaker Python SDK, with PyCharm as the local IDE. The same code works in any other IDE with no changes.

AWS SageMaker Python SDK

The AWS SageMaker Python SDK provides several high-level abstractions for working with AWS SageMaker. These are:

  1. Estimators: Encapsulate training on AWS SageMaker. There is also a generic Estimator that runs SageMaker-compatible custom Docker containers, so you can run your own ML algorithms through the SageMaker Python SDK.

  2. Models: Encapsulate built ML models. The SDK also provides built-in algorithms with pre-trained models from popular open-source model hubs, such as TensorFlow Hub, PyTorch Hub, and Hugging Face. You can deploy these pre-trained models as-is, or first fine-tune them on a custom dataset and then deploy them to a SageMaker endpoint for inference.

  3. Predictors: Provide real-time inference and transformation using Python data types against a SageMaker endpoint.

  4. Session: Provides a collection of methods for working with SageMaker resources.

  5. Transformers: Encapsulate batch transform jobs for inference on AWS SageMaker.

  6. Processors: Encapsulate processing jobs for data processing on AWS SageMaker.

The AWS SageMaker Python SDK supports local mode, which lets you create estimators and deploy them in your local environment. This is a good way to test your scripts before running them in SageMaker-managed training or hosting environments.

With SageMaker local mode, the images for the managed frameworks (TensorFlow, MXNet, Chainer, PyTorch, and Scikit-Learn) and any images you supply yourself are downloaded to your local computer and show up in Docker. These are the same Docker images used in the SageMaker-managed training and hosting environments, so you can debug your code locally with a much shorter feedback loop.
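As a minimal sketch of local mode, assuming SageMaker Python SDK v2 and a running Docker daemon, the helper below collects the arguments for a scikit-learn estimator that trains on your own machine. The script name, role ARN, data folder, and framework version are placeholders for illustration, not values from this post.

```python
def local_estimator_kwargs(entry_point, role_arn, hyperparameters=None):
    """Collect constructor arguments for a scikit-learn estimator in local mode."""
    return {
        "entry_point": entry_point,      # your training script (hypothetical name)
        "role": role_arn,                # execution role ARN (placeholder)
        "instance_type": "local",        # run the container on this machine via Docker
        "instance_count": 1,
        "framework_version": "1.2-1",    # assumed version; use one your SDK supports
        "py_version": "py3",
        "hyperparameters": hyperparameters or {},
    }


def train_locally(entry_point, role_arn, data_dir):
    """Run a local-mode training job. Needs `pip install sagemaker` and Docker."""
    # Imported lazily so the helper above works without the SDK installed.
    from sagemaker.sklearn.estimator import SKLearn

    estimator = SKLearn(**local_estimator_kwargs(entry_point, role_arn))
    # In local mode, "file://" inputs are read from local disk instead of S3.
    estimator.fit({"training": f"file://{data_dir}"})
    return estimator
```

Calling something like `train_locally("train.py", "<your-role-arn>", "./data")` should pull the framework image and run the script in a container. Switching `instance_type` to a managed instance type and the input to an S3 URI is, in principle, the only change needed to move the same job to SageMaker.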

Setup

To get started, complete the following steps:

  1. In your AWS account, create a new user with programmatic access, which provides an access key ID and a secret access key for the AWS CLI.

  2. Attach the AmazonSageMakerFullAccess and AmazonS3FullAccess policies to the new user and, if possible, limit them to specific Amazon S3 buckets.

  3. Create an execution role with the SageMaker permissions above. SageMaker uses this role to perform operations on your behalf on the AWS hardware that it manages.

  4. Install the AWS CLI on your local computer and run a quick configuration with aws configure:

    $ aws configure
    AWS Access Key ID [None]: AKIAI*********EXAMPLE
    AWS Secret Access Key [None]: wJal********EXAMPLEKEY
    Default region name [None]: eu-west-1
    Default output format [None]: json

For more information on the configuration, see Configuring the AWS CLI.
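To confirm that the configuration was written, you can inspect the files that aws configure creates. The sketch below uses only the standard library and assumes the default file locations (`~/.aws/credentials` and `~/.aws/config`); the function name `read_aws_profile` is made up for this example. It reports whether keys are present without printing them.

```python
import configparser
from pathlib import Path


def read_aws_profile(profile="default", aws_dir=None):
    """Return the non-secret settings stored by `aws configure` for one profile."""
    aws_dir = Path(aws_dir) if aws_dir else Path.home() / ".aws"

    credentials = configparser.ConfigParser()
    credentials.read(aws_dir / "credentials")  # missing files are silently skipped
    config = configparser.ConfigParser()
    config.read(aws_dir / "config")

    # In the config file, non-default profiles are stored as "[profile name]".
    config_section = profile if profile == "default" else f"profile {profile}"

    return {
        "has_access_key": credentials.has_option(profile, "aws_access_key_id"),
        "has_secret_key": credentials.has_option(profile, "aws_secret_access_key"),
        "region": config.get(config_section, "region", fallback=None),
        "output": config.get(config_section, "output", fallback=None),
    }
```

With the example configuration above, `read_aws_profile()` should report both keys as present, with region `eu-west-1` and output `json`.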

After completing the steps above:

  1. Install Docker if you have not yet installed it on your local computer.

  2. Make sure that you have all the required Python libraries to run your code locally.

  3. Add the SageMaker Python SDK to your local libraries. You can run pip install sagemaker directly, or create a virtual environment with venv for your project and install the SDK inside it.
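A quick way to check these prerequisites from the interpreter that PyCharm uses is sketched below. It only tests whether the `aws` and `docker` executables are on the PATH and whether the `sagemaker` package can be found; it does not verify that Docker is running or that your credentials are valid. The function name is hypothetical.

```python
import importlib.util
import shutil


def check_local_setup():
    """Report which local prerequisites are visible from this Python interpreter."""
    return {
        "aws_cli": shutil.which("aws") is not None,
        "docker": shutil.which("docker") is not None,
        "sagemaker_sdk": importlib.util.find_spec("sagemaker") is not None,
    }


for name, found in check_local_setup().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If `sagemaker_sdk` is reported as missing inside PyCharm but not in your terminal, the project interpreter is probably a different virtual environment from the one where you installed the SDK.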

With your environment ready for developing and training ML algorithms with AWS SageMaker from your local IDE, here are the important things to keep in mind while developing.

Making Your Code SageMaker Compatible

There are certain conventions you must follow to make your code compatible with SageMaker, for example when reading input data and writing output models and other artifacts.

The script will be very similar to the one you might run outside SageMaker, but you can access useful properties about the training environment through various environment variables.

The following are some important environment variables that SageMaker uses to manage the infrastructure.

The input data location for each channel is given by SM_CHANNEL_{channel_name}:

SM_CHANNEL_TRAINING=/opt/ml/input/data/training

SM_CHANNEL_VALIDATION=/opt/ml/input/data/validation

SM_CHANNEL_TESTING=/opt/ml/input/data/testing
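Inside a training script, a channel name can be mapped to its directory with a small helper. This is a sketch, not part of the SDK: the names `channel_dir` and `list_channel_files` and the local fallback folder `data` are assumptions for illustration.

```python
import os
from pathlib import Path


def channel_dir(channel_name, fallback="data"):
    """Return the directory of an input channel, e.g. 'training' -> SM_CHANNEL_TRAINING."""
    env_name = f"SM_CHANNEL_{channel_name.upper()}"
    # Outside SageMaker the variable is unset, so fall back to a local folder.
    return Path(os.environ.get(env_name, fallback))


def list_channel_files(channel_name, fallback="data"):
    """List the files placed in the channel directory (empty if it does not exist)."""
    directory = channel_dir(channel_name, fallback)
    if not directory.is_dir():
        return []
    return sorted(p for p in directory.iterdir() if p.is_file())
```

The channel name is whatever key you pass to the estimator's fit call, so a key of `validation` would surface as `SM_CHANNEL_VALIDATION` inside the container.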

The model artifact is saved to the model output location:

SM_MODEL_DIR=/opt/ml/model

Non-model training artifacts are written to the output data location:

SM_OUTPUT_DATA_DIR=/opt/ml/output/data

With the code below, the SageMaker environment variables become the default values of command-line arguments, so you can still run the script outside SageMaker:

import argparse
import os

# SageMaker default SM_MODEL_DIR=/opt/ml/model
if os.getenv("SM_MODEL_DIR") is None:
    os.environ["SM_MODEL_DIR"] = os.path.join(os.getcwd(), "model")

# SageMaker default SM_OUTPUT_DATA_DIR=/opt/ml/output/data
if os.getenv("SM_OUTPUT_DATA_DIR") is None:
    os.environ["SM_OUTPUT_DATA_DIR"] = os.path.join(os.getcwd(), "output")

# SageMaker default SM_CHANNEL_TRAINING=/opt/ml/input/data/training
if os.getenv("SM_CHANNEL_TRAINING") is None:
    os.environ["SM_CHANNEL_TRAINING"] = os.path.join(os.getcwd(), "data")

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # Each path defaults to the SageMaker variable, or to the local fallback set above.
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAINING'))
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--output_dir', type=str, default=os.environ.get('SM_OUTPUT_DATA_DIR'))
    args = parser.parse_args()
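Once the three paths are known, the rest of the script can treat them as ordinary folders. The sketch below is a toy stand-in for a real training routine: it counts the rows in the CSV files of the training channel, pickles a placeholder "model", and writes a metrics file. The function name `run_training` and the file names `model.pkl` and `metrics.json` are assumptions for illustration.

```python
import json
import pickle
from pathlib import Path


def run_training(train_dir, model_dir, output_dir):
    """Toy training routine: count rows, then save a model file and a metrics file."""
    train_dir, model_dir, output_dir = Path(train_dir), Path(model_dir), Path(output_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    output_dir.mkdir(parents=True, exist_ok=True)

    # "Training": count the lines in every CSV file of the training channel.
    n_rows = sum(len(f.read_text().splitlines()) for f in sorted(train_dir.glob("*.csv")))
    model = {"n_rows_seen": n_rows}  # placeholder for a real fitted model

    # Files written to the model directory are packaged as the model artifact.
    model_path = model_dir / "model.pkl"
    model_path.write_bytes(pickle.dumps(model))

    # Non-model artifacts (metrics, plots, logs) go to the output data directory.
    metrics_path = output_dir / "metrics.json"
    metrics_path.write_text(json.dumps({"n_rows_seen": n_rows}))
    return model_path, metrics_path
```

In the script above, this could be called with the parsed `--train`, `--model_dir`, and `--output_dir` values. The same call then works unchanged on your laptop, in local mode, and in a managed training job.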

AWS SageMaker Experiments for Organizing, Tracking, and Comparing ML Training Runs

Amazon SageMaker Experiments helps you group, organize, and track your ML iterations when you have many experiments to run with different preprocessing configurations, different hyperparameters, or even different ML algorithms.

AWS SageMaker Experiments automatically tracks the inputs, parameters, configurations, and results of your iterations as trials. You can assign, group, and organize these trials into experiments. It is integrated with SageMaker Studio, which provides a visual interface to browse your active and past experiments, compare trials on key performance metrics, and identify the best-performing models.
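Tracking can also be driven from your local script. The sketch below assumes a recent SageMaker Python SDK v2 release in which the `sagemaker.experiments` module exposes a `Run` class (there, what this post calls a trial is called a run). The helper names, experiment name, and logged values are placeholders.

```python
def log_iteration(run, params, metrics):
    """Record one training iteration on an open SageMaker Experiments run."""
    run.log_parameters(params)                  # e.g. {"max_depth": 5}
    for name, value in metrics.items():
        run.log_metric(name=name, value=value)  # e.g. "accuracy": 0.93


def track_iteration(experiment_name, run_name, params, metrics):
    """Open a run and log to it. Needs AWS credentials and `pip install sagemaker`."""
    # Imported lazily so log_iteration can be exercised without the SDK.
    from sagemaker.experiments.run import Run

    with Run(experiment_name=experiment_name, run_name=run_name) as run:
        log_iteration(run, params, metrics)
```

Keeping the logging in a separate function that accepts any object with `log_parameters` and `log_metric` makes it possible to test the script locally with a stand-in object before sending anything to AWS.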

Conclusion

This blog post showed how to use the AWS SageMaker Python SDK with your preferred local IDE (PyCharm, in this case) to take full advantage of AWS SageMaker when developing, training, and testing ML algorithms.

We also introduced AWS SageMaker Experiments, which helps you organize and track different experiments in SageMaker Studio.
