Setting up JupyterHub Securely on AWS

How to secure JupyterHub for your data science team

In our previous blog post on JupyterHub, we walked through the basic deployment steps for The Littlest JupyterHub (TLJH) and Zero-to-JupyterHub (ZTJH). Our recommendation for anyone looking to deploy JupyterHub as a data science platform in production was to use ZTJH. We’ll assume you’re using that for this blog post.

Once you have Zero-to-JupyterHub up and running, security is the top priority. You should feel confident that your data science platform is safe and that your users can access it easily. In this post, we strive to show not only how to secure your JupyterHub, but also why each of these steps is important. When we’re done, you will have the most common security measures in place to keep the bad actors out.

Reminder: the helm upgrade command

As described in the previous post, Helm is the Kubernetes package manager that we use to install and update JupyterHub on our Kubernetes cluster, which in our case runs on AWS EKS.

When we update config.yaml, we will run the helm upgrade command, given below. We will refer back to it throughout the blog post:

helm upgrade --cleanup-on-fail \
  <your-release-name> jupyterhub/jupyterhub \
  --namespace <your-namespace> \
  --version=<JH-helm-chart-version> \
  --values config.yaml

NOTE: In our previous post, we recommended that you save your values, those in brackets <...>, as comments in your config.yaml.

  • <your-release-name> - given that the same “chart” (package) can be installed multiple times on the same Kubernetes cluster, this release name is simply a way of distinguishing between those different installations.
    • In our case, we used ztjh-release.
  • <your-namespace> - this is the Kubernetes namespace that JupyterHub is installed in. It should already exist from the initial installation; Helm only creates a missing namespace when you also pass the --create-namespace flag.
    • In our case, we went with ztjh.
  • <JH-helm-chart-version> - each version of JupyterHub is associated with a Helm chart version. Reference this document for more details.
    • In our case, because we are deploying JupyterHub version 1.5, we use Helm chart version 1.2.0.
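
Since we reuse this command throughout the post, one option is to keep the three values in shell variables and assemble the command from them. The sketch below is only an illustration: the helper name helm_upgrade_cmd is hypothetical, the values are the ones from our example, and it prints the command for review rather than running it.

```shell
#!/bin/sh
# Example values from this post; replace them with your own.
RELEASE="ztjh-release"
NAMESPACE="ztjh"
CHART_VERSION="1.2.0"

# helm_upgrade_cmd RELEASE NAMESPACE CHART_VERSION VALUES_FILE
# Hypothetical helper: prints the helm upgrade command, it does not run it.
helm_upgrade_cmd() {
  printf 'helm upgrade --cleanup-on-fail %s jupyterhub/jupyterhub --namespace %s --version=%s --values %s\n' \
    "$1" "$2" "$3" "$4"
}

# Print the command so it can be reviewed before it is run.
helm_upgrade_cmd "$RELEASE" "$NAMESPACE" "$CHART_VERSION" config.yaml
```

Once the printed command looks right, you can copy it into your terminal to run it.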

Security and HTTPS

From our first blog post, our ZTJH deployment is up and running, but in its most basic form. To log in as a user, we have to navigate to the EXTERNAL-IP, which is a long and confusing hostname that AWS provided. Let’s use a friendlier domain name instead.

We will first get a new domain name that is short and easy to remember. Then we will set up automatic HTTPS by creating a Let’s Encrypt certificate, which auto-renews every few months. This will keep our friendly domain name secure behind HTTPS.

HTTP stands for Hypertext Transfer Protocol. It is the standard protocol used to transfer data over the web. HTTPS is simply the encrypted, or secured (hence the “S”), extension of HTTP. Using HTTPS prevents third parties from reading the traffic on the connection. The secure connection is established using Transport Layer Security, or TLS.

Register your domain name

The JupyterHub documentation for this step is quite sparse. This is because of how many different domain providers there are. To give you a sense of how to do this with your provider, we will walk through each step of the process with hover.com. First buy the domain name you would like to use. In our case, we chose “demohub.tech”, which at the time of this writing was on sale for five bucks.

1. Create a CNAME record for your domain

With a newly purchased domain, create a “CNAME” record that points to the EXTERNAL-IP. A “CNAME”, or Canonical Name, is a DNS record that points to another domain name (in our case the one provided by AWS), whereas an A record points to an IP address. How you do this depends on which domain provider you’re using.

First navigate to the “DNS” tab, then select “ADD A RECORD”.

For the DNS record, use these options:

  • “TYPE”, select “CNAME”
  • “HOSTNAME”, choose a hostname. In our case, we selected “my.demohub.tech”
    • If you’d like to use the domain name without any prefix, enter “@”, though be aware that many DNS providers do not allow a CNAME record on the bare domain.
  • “TARGET”, paste the EXTERNAL-IP URL.

2. Wait for the DNS to propagate

DNS records take time to be updated on the servers, so be patient while that happens over the next few minutes (or hours in some cases). For those interested in learning more about how DNS works, have a read through this amusing comic.

You will know that the DNS changes have propagated when you can access your JupyterHub from your new domain.

NOTE: It’s CRITICAL that you wait for these changes to propagate before proceeding.
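
If you would rather check propagation from the terminal than keep refreshing the browser, a sketch like the following may help. It assumes the dig tool is installed; the function names resolve_cname and cname_matches are our own and not part of any tool.

```shell
#!/bin/sh
# resolve_cname HOSTNAME
# Thin wrapper around dig (assumed to be installed); prints the CNAME target.
resolve_cname() {
  dig +short CNAME "$1"
}

# cname_matches HOSTNAME EXPECTED_TARGET
# Succeeds when the CNAME for HOSTNAME equals EXPECTED_TARGET,
# ignoring the trailing dot that dig appends to names.
cname_matches() {
  actual=$(resolve_cname "$1" | sed 's/\.$//')
  expected=$(printf '%s\n' "$2" | sed 's/\.$//')
  [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Usage (paste your own EXTERNAL-IP value as the second argument):
#   cname_matches my.demohub.tech <EXTERNAL-IP> && echo "DNS has propagated"
```

An empty answer is treated as "not propagated yet", so the check can simply be repeated every few minutes.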

Add Let’s Encrypt certificate

Now that we can access our JupyterHub from an easy domain name, we’ll add a TLS certificate to secure the connection. Just as the JupyterHub docs outline, we will use Let’s Encrypt to issue and renew the certificate for us.

1. Update the config

Update the config.yaml that you used for your initial deployment by adding the following:

proxy:
  https:
    enabled: true
    hosts:
      - <your-domain-name>
    letsencrypt:
      contactEmail: <your-email-address>

In our example, our domain name is my.demohub.tech.

2. Run helm upgrade

Run the helm upgrade command.

Wait a few minutes and then navigate to your domain. You should see that your JupyterHub is further secured by TLS, represented by the little lock symbol next to your domain name in the browser. You may also notice that the address no longer starts with http, but instead is https.
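
The same check can be made from the command line. The sketch below relies on the fact that curl verifies the server certificate by default and exits with a non-zero status when verification fails; the function name https_ok is hypothetical.

```shell
#!/bin/sh
# https_ok DOMAIN
# Succeeds when an HTTPS request to DOMAIN completes. curl verifies the
# certificate by default, so an invalid certificate makes this fail.
https_ok() {
  curl --silent --show-error --head --output /dev/null "https://$1"
}

# Usage:
#   https_ok my.demohub.tech && echo "HTTPS is working"
```

If the certificate has not been issued yet, curl reports a certificate error; waiting a few more minutes and trying again is usually enough.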

Managing users using OAuth 2.0

To add users to your JupyterHub, you currently need to add them to the config.yaml and have them set a password upon first login. Although this is better than nothing, we can go a step further and configure JupyterHub to use an OAuth 2.0 provider (from here on referred to simply as OAuth).

One of the most obvious benefits of using OAuth is that you get single sign-on (SSO). Your users no longer need to remember an additional username and password to log in. With an OAuth provider like GitHub or Google, users only need the credentials for accounts they already use regularly. Making it easier for your users to log in securely is a security benefit in itself. Also, multi-factor authentication (MFA) can be enabled for these providers if desired.

Besides easy logins, there are many technical reasons why OAuth is the industry standard protocol for authenticating users. At a high level, they include the use of tokens, which limit the scope of user information that is shared, and the fact that the authentication server is also required to use TLS to keep the data encrypted.

The ZTJH docs detail how to configure and set up OAuth for a variety of providers, including GitHub, Google, Azure Active Directory, Auth0, etc. The steps needed to set up an OAuth application differ slightly for each provider, but the overall procedure is similar. We will walk through the steps on GitHub to give a detailed example.

GitHub OAuth setup

Before getting started, you will need a GitHub account if you don’t have one already. It’s free and the process of setting up the OAuth application is fairly straightforward.

1. Create the OAuth application in GitHub.

Once logged in, navigate to the Settings page by clicking on your profile picture in the top-right of the screen.

Then click Developer Settings at the bottom left. Select OAuth Apps, and then click New OAuth App.

On this screen, we need to change the following values:

  • “Application name” - give your OAuth application a memorable name.
    • We went with ztjh-oauth.
  • “Homepage URL” - enter your domain name.
    • Our domain is https://my.demohub.tech.
  • “Authorization callback URL” - it’s important this is configured correctly for the authorization process to work. Enter https://<your-domain-name>/hub/oauth_callback.
    • In our case, https://my.demohub.tech/hub/oauth_callback.

Click “Register application” when you’re done.

Now just copy the “Client ID”, and create and copy a “Client secret”. You will use these in the next step.

2. Update the helm configuration file

With the GitHub OAuth application created, and the Client ID and Client secret in hand, update your config.yaml accordingly.

  ```yaml
  hub:
    config:
      GitHubOAuthenticator:
        client_id: <your-client-id>
        client_secret: <your-client-secret>
        oauth_callback_url: https://<your-domain-name>/hub/oauth_callback
      JupyterHub:
        authenticator_class: github
  ```

You don’t need to delete anything (such as the Authenticator key). Simply ensure that the fields shown above are populated.

3. Run the helm upgrade command for the changes to take effect.

This may take a minute or two, but once the changes are in, you can navigate to <your-domain-name> to find a “Sign in with GitHub” button.

Upon your first login, you will be redirected to GitHub and asked to log in.

If you encounter an issue like “400: Bad Request” or similar, try accessing your JupyterHub in a private browser session. It’s also worthwhile double-checking that the oauth_callback_url in the config.yaml matches what you have configured in the GitHub OAuth application.
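
Since a mismatched callback URL is an easy mistake to make, here is a small sketch that reads the value out of config.yaml and compares it with the URL expected for your domain. The function names are hypothetical, and the match is a plain-text one rather than a real YAML parse, so it assumes the key sits on a single line as in the snippet above.

```shell
#!/bin/sh
# callback_url_in_config FILE
# Prints the value of the first oauth_callback_url key found in FILE.
# Plain-text match only; surrounding quotes are stripped.
callback_url_in_config() {
  sed -n 's/^[[:space:]]*oauth_callback_url:[[:space:]]*//p' "$1" |
    head -n 1 |
    tr -d "\"'"
}

# callback_url_matches FILE DOMAIN
# Succeeds when the configured URL equals https://DOMAIN/hub/oauth_callback.
callback_url_matches() {
  [ "$(callback_url_in_config "$1")" = "https://$2/hub/oauth_callback" ]
}

# Usage:
#   callback_url_matches config.yaml my.demohub.tech || echo "callback URL mismatch"
```

The value printed by callback_url_in_config is also the one to compare against the “Authorization callback URL” field in the GitHub OAuth application.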

Handling secrets for private image registries

The last, and likely the most advanced, security topic that we will cover is the secrets used to pull container images from private registries. In our specific case, we will cover how to pull private Docker images stored on Amazon Elastic Container Registry (ECR).

There are many reasons why you might want to use private container images on your JupyterHub. Perhaps you need to create a bespoke environment for your data science team, outfitted with all the tools they need to get their work done. Or, from the security perspective, you want broader control over the kinds of packages that are installed and the ability to regularly perform additional security scans. Custom workspace options like these can be accomplished by creating images with specific packages and configurations, storing those images, and launching them when users log into your JupyterHub.

Be sure to install these two prerequisites before proceeding:

  • aws-cli - the command line interface to interact with AWS
  • docker - a tool to build and push images to a private registry
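
A quick way to confirm that both tools are available is to look them up on your PATH. This is a minimal sketch; the function name missing_tools is our own.

```shell
#!/bin/sh
# missing_tools TOOL...
# Prints the name of each tool that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing=$(missing_tools aws docker)
if [ -n "$missing" ]; then
  echo "Install these before continuing:" $missing
else
  echo "aws and docker are both installed"
fi
```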

Prepare private image registry

We will start by creating an ECR repository for our private images and then configure our JupyterHub to pull a particular image when a user launches their workspace, aka JupyterLab.

1. Create ECR repo from AWS console

Log into the AWS console and navigate to the ECR service. For a more detailed tutorial, see the AWS ECR docs: Creating a private repository.

Click “Create repository”:

Then give this new image repo a name and click “Create”. In our example, we chose ztjh-image-repo.
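
If you prefer the command line to the console, the repository can also be created with the aws-cli. The sketch below assumes the CLI is already configured with credentials that are allowed to create ECR repositories; the wrapper name create_ecr_repo is hypothetical and the region in the usage line is only an example.

```shell
#!/bin/sh
# create_ecr_repo REPO_NAME REGION
# Creates a private ECR repository using the aws CLI.
create_ecr_repo() {
  aws ecr create-repository --repository-name "$1" --region "$2"
}

# Usage (replace the region with your own):
#   create_ecr_repo ztjh-image-repo us-east-1
```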

2. Prepare the private image you want to use

To keep things simple for this blog, we will pull a publicly available JupyterHub image from Docker Hub to use for the rest of these steps. Keep in mind that for your production environment, you might want to customize your image by either creating it from scratch or by modifying an existing image.

Pull the jupyterhub/singleuser image from Docker Hub.

docker pull jupyterhub/singleuser

To make modifications to this image you will need to edit the Dockerfile and build it locally before pushing it up to ECR. The Dockerfile for the jupyterhub/singleuser image can be found in the JupyterHub GitHub repo.

3. Create and push your images up to ECR

Now that you have an image you would like to use, push it up to ECR. The steps are below, but for a more detailed tutorial, see the AWS ECR docs: Pushing a Docker image.

The first step is passing your AWS login information to the docker CLI so that it can handle the push. You’ll need your AWS region and aws_account_id.

aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <aws_account_id>.dkr.ecr.<region>.amazonaws.com

Next, tag your image appropriately. In our case, <your-ecr-repo-name> is ztjh-image-repo and <tag> is singleuser.

docker tag jupyterhub/singleuser <aws_account_id>.dkr.ecr.<region>.amazonaws.com/<your-ecr-repo-name>:<tag>

Finally, push your docker image up to ECR. This process will take a few minutes depending on how large your image is and how fast your network upload speeds are.

docker push <aws_account_id>.dkr.ecr.<region>.amazonaws.com/<your-ecr-repo-name>:<tag>

You can log into AWS and navigate to the ECR repo you created to verify that it was successfully pushed up.
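
The login, tag, and push commands all repeat the same registry address, so it can be convenient to build it once. The following sketch wraps the three steps in hypothetical helper functions; the account ID and region in the usage line are made-up examples.

```shell
#!/bin/sh
# ecr_registry ACCOUNT_ID REGION
# Prints the registry hostname.
ecr_registry() {
  printf '%s.dkr.ecr.%s.amazonaws.com\n' "$1" "$2"
}

# ecr_image_uri ACCOUNT_ID REGION REPO TAG
# Prints the full image reference.
ecr_image_uri() {
  printf '%s/%s:%s\n' "$(ecr_registry "$1" "$2")" "$3" "$4"
}

# push_to_ecr ACCOUNT_ID REGION REPO TAG
# Logs in, tags jupyterhub/singleuser, and pushes it,
# stopping at the first step that fails.
push_to_ecr() {
  uri=$(ecr_image_uri "$1" "$2" "$3" "$4")
  aws ecr get-login-password --region "$2" |
    docker login --username AWS --password-stdin "$(ecr_registry "$1" "$2")" &&
    docker tag jupyterhub/singleuser "$uri" &&
    docker push "$uri"
}

# Usage (made-up account ID and region):
#   push_to_ecr 123456789012 us-east-1 ztjh-image-repo singleuser
```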

Create secret to access private image registry

At this point, you have created and pushed the image you would like to be used whenever your users launch JupyterHub. You might be wondering how your JupyterHub cluster has access to the private image registry. This is where Kubernetes ‘secrets’ come into play. To use them, we will need to update the config.yaml and redeploy.

The ZTJH docs cover all of the configuration options available in config.yaml, but for our purposes we are interested in two sections: imagePullSecret and singleuser.image.

1. Add imagePullSecret to the config.yaml

In your existing config.yaml add the following section.

imagePullSecret:
  create: true
  registry: <aws_account_id>.dkr.ecr.<region>.amazonaws.com/<your-ecr-repo-name>
  email: <your-email-address>
  username: aws
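  password: <ecr-password>

Note that <ecr-password> must be the output of the command aws ecr get-login-password --region <region>, not the command itself, because Helm does not run shell commands found in config.yaml. ECR passwords are valid for 12 hours, so the secret has to be refreshed with a new password once it expires.

This section creates the Kubernetes secret used to log in to AWS ECR whenever the private image needs to be pulled. The secret is named image-pull-secret and, once created, can be viewed using a kubectl command, which we will show below.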
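
Because the ECR password is produced by a command, one way to avoid pasting it into config.yaml is to generate it at upgrade time and pass it with Helm’s --set flag. This is a sketch rather than the documented ZTJH procedure: the function name upgrade_with_ecr_token is hypothetical, and it assumes the password key is left out of config.yaml and that the token contains no commas, which --set treats as separators.

```shell
#!/bin/sh
# upgrade_with_ecr_token RELEASE NAMESPACE CHART_VERSION REGION
# Fetches a fresh ECR password and hands it to Helm with --set, so the
# password does not have to be written into config.yaml.
upgrade_with_ecr_token() {
  token=$(aws ecr get-login-password --region "$4") || return 1
  helm upgrade --cleanup-on-fail \
    "$1" jupyterhub/jupyterhub \
    --namespace "$2" \
    --version="$3" \
    --values config.yaml \
    --set imagePullSecret.password="$token"
}

# Usage (the region is an example):
#   upgrade_with_ecr_token ztjh-release ztjh 1.2.0 us-east-1
```

Keep in mind that the password briefly appears on the command line this way, so this trades one exposure for another; choose whichever fits your environment.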

2. Add singleuser.image to the config.yaml

Now we will update the config.yaml to use the private image we referenced above whenever a new user launches JupyterHub. Add the following section.

singleuser:
  image:
    name: <aws_account_id>.dkr.ecr.<region>.amazonaws.com/<your-ecr-repo-name>
    tag: <tag>

3. Run the helm upgrade command

For both of these changes to take effect, we will need to run the helm upgrade command. This might take a few minutes, so be patient.

Once complete, you can verify that the changes are working by logging in and launching JupyterHub.

When your JupyterLab has finally launched, you may also notice that the UI for this particular image, jupyterhub/singleuser, is different from what we started with. This is another indication that the process worked.

For those interested, you can view the image-pull-secret using the following command:

kubectl describe secrets image-pull-secret -n <your-namespace>

You might notice that the actual ‘secret’ itself, i.e. the username and password, is hidden.

Conclusion

We covered a few of the most important and common security topics that should be considered for any JupyterHub deployment. We could not cover all security topics, so feel free to review the security section of the Zero-to-JupyterHub docs.

Ultimately, we hope this blog helped you understand the steps needed to provide a base level of security and some of the reasons each piece helps to keep your JupyterHub safe. It is important to consider security early in your deployment so that you can establish the necessary protocols before your users log in. By properly securing your data science platform, you can prevent vulnerabilities that bad actors can exploit.
