GitHub Actions and Continuous Integration for Data Scientists

GitHub Actions and continuous integration are invaluable tools for keeping your code consistently high quality and error free. This post walks through a very basic code formatting, linting, and testing configuration you can use for your projects.

Continuous integration (CI) is pretty critical when working with code to ensure that

  • Code is always consistently formatted
  • Any automated testing is always passing
  • Code is error free

These checks provide a great quality-of-life improvement to any data scientist working with the codebase. They give team members confidence that the code is mostly free of errors and bugs. Automating continuous integration is critical because once tests aren’t passing, it can take a lot of work to bring a repository back into compliance.

This blog post does not cover

  • Deploying ML models with CI
  • Applying Git to Jupyter notebooks

Both would be excellent follow-up topics, though.

A very simple Makefile for your projects

We use Makefiles to manage all of our projects. You don’t have to, but they’re pretty useful. You also don’t need to understand or read Makefile syntax. For the most part you can just grab this Makefile, replace “MY_PROJECT” with the directory of your project, and modify the rest of the commands as you see fit.

.PHONY: format
format:
	@printf '\n\nFormatting with Black and isort...\n'
	black --line-length 100 --exclude '/(\.vscode|node_modules)/' .
	isort tests MY_PROJECT

.PHONY: flake8
flake8:
	# If you make changes here, also edit .pre-commit-config.yaml to match
	@printf '\n\nFlake8 linting...\n'
	flake8 MY_PROJECT
	flake8 tests

.PHONY: mypy
mypy:
	mypy --config-file mypy.ini ./

.PHONY: isort
isort:
	isort MY_PROJECT tests --check

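# Check-only counterpart to `make format`: fails if any file would be reformatted
.PHONY: black
black:
	black --check --line-length 100 --exclude '/(\.vscode|node_modules)/' .

.PHONY: lint
lint: \
	black \
	flake8 \
	mypy \
	isort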

.PHONY: test
test:
	pytest -n auto --cov-report term-missing --cov=MY_PROJECT/ -s


.PHONY: pipeline
pipeline: \
    format \
    lint \
    test

What are these operations?

make format

This uses the Black code formatter to automatically standardize all code formatting, and isort to standardize module import order. This sounds insignificant, but the primary value for teams is removing the mental burden of ever having to think about or discuss code formatting again.
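As a rough illustration, a module that has been through make format tends to look like the sketch below: imports sorted by isort, double-quoted strings, and two blank lines before top-level functions. The function label_entropy is a hypothetical example, not part of any project mentioned here.

```python
# isort places plain `import x` lines before `from x import y` lines
# within the standard-library section.
import math
from collections import Counter


def label_entropy(labels: list[str]) -> float:
    """Return the Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    # Black normalizes quotes to double quotes and keeps lines under the limit.
    return -sum((count / total) * math.log2(count / total) for count in counts.values())


if __name__ == "__main__":
    print(f"entropy: {label_entropy(['cat', 'dog', 'cat', 'dog'])}")
```

Nothing in this file was laid out by hand; the point is that nobody on the team has to decide where imports go or how long a line may be.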

make flake8 mypy

Flake8 and mypy are two invaluable tools. Flake8 can detect common errors in code, such as an undefined (often misspelled) variable name, an unused import, or an accidentally redefined function. Mypy can automatically detect all sorts of type-related bugs, including a misspelled object attribute, assuming your code uses type annotations pretty heavily. Both of these tools have saved me tons of time by catching bugs before I ever kicked off another 3-hour job that would fail with some stupid bug in the code.
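To make that concrete, here is a small sketch of the kind of mistakes each tool tends to catch. The class and function names are made up for illustration, the buggy lines are left as comments so the file still runs, and the messages are paraphrased from memory rather than copied from real output.

```python
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    learning_rate: float
    epochs: int


def total_steps(config: TrainingConfig, batches_per_epoch: int) -> int:
    """Number of optimizer steps for a full training run."""
    return config.epochs * batches_per_epoch


config = TrainingConfig(learning_rate=0.01, epochs=3)

# Flake8 would report an undefined name (F821) for the typo `confg`:
#     total_steps(confg, 100)
#
# mypy would report an incompatible argument type (str where int is expected):
#     total_steps(config, "100")
#
# mypy would report that TrainingConfig has no attribute "epoch":
#     config.epoch * 100

if __name__ == "__main__":
    print(total_steps(config, 100))
```

Each of the commented-out lines would only blow up at runtime, possibly hours into a job. The linters flag them in seconds without running anything.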

make test

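This Makefile uses pytest to automatically run unit tests on your code. The -n auto and --cov flags come from the pytest-xdist and pytest-cov plugins, so both need to be installed in your environment. We strongly recommend pytest. More on that in a future article.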
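If you have not written pytest tests before, a minimal sketch looks like this. By default, pytest collects functions named test_* from files named test_*.py, so this could live in something like tests/test_split.py. The helper train_test_split_sizes is a hypothetical function standing in for your own code.

```python
def train_test_split_sizes(n_rows: int, test_fraction: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a dataset with n_rows rows."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    n_test = round(n_rows * test_fraction)
    return n_rows - n_test, n_test


def test_sizes_add_up() -> None:
    n_train, n_test = train_test_split_sizes(100, 0.2)
    assert (n_train, n_test) == (80, 20)


def test_invalid_fraction_is_rejected() -> None:
    # With pytest installed, `with pytest.raises(ValueError):` is the usual idiom;
    # plain try/except keeps this sketch free of imports.
    try:
        train_test_split_sizes(100, 1.5)
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```

In a real project the function under test would be imported from MY_PROJECT rather than defined next to the tests.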

GitHub Actions

GitHub Actions is a very convenient place to run CI. Create a file in your repository at “.github/workflows/ci.yml”.

name: test
on: push
jobs:
  test:
    name: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: mamba-org/setup-micromamba@v1
        name: Set up micromamba
        with:
          environment-file: environment.yml
          init-shell: >-
            bash
          cache-environment: true
      - name: pythonpath
        run: echo "PYTHONPATH=${GITHUB_WORKSPACE}" >> $GITHUB_ENV
      - name: path
        run: echo "PATH=/home/runner/micromamba/envs/operations/bin:/home/runner/micromamba-bin/:${PATH}" >> $GITHUB_ENV
      - name: test
        run: make pipeline

This section assumes you are OK with using conda (really, micromamba). If you are using pip and would like to use this template, you can convert your requirements.txt file into an equivalent environment.yml like this:

dependencies:
- pip
- pip:
  - pip-pkg1
  - pip-pkg2
  - pip-pkg3
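If your requirements.txt is long, the conversion can be scripted. The following is a minimal sketch that only handles simple requirement lines (names with optional version specifiers); it skips comments, blank lines, and pip options such as -r or -e. The function name and the env_name parameter are assumptions for illustration, and the default of operations is used only because the PATH step in the workflow above points at an environment with that name.

```python
def requirements_to_environment(requirements_text: str, env_name: str = "operations") -> str:
    """Build environment.yml text that installs every requirement through pip."""
    packages = []
    for raw_line in requirements_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        # Skip blank lines and pip options like "-r other.txt" or "-e .".
        if not line or line.startswith("-"):
            continue
        packages.append(line)

    lines = [f"name: {env_name}", "dependencies:", "- pip", "- pip:"]
    lines.extend(f"  - {package}" for package in packages)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = "pandas>=1.5\nscikit-learn  # modelling\n"
    print(requirements_to_environment(example))
```

Requirements that use URLs or environment markers would need more careful handling than this sketch provides, so check the generated file before committing it.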

The above ci.yml is configured to run on every push (on: push). The steps are as follows.

Step 1. Checkout the code

Nothing much to say here. The actions/checkout action checks out your repository so the later steps can use it.

Step 2. Set up your conda environment

This uses micromamba to set up a conda environment based on an environment.yml. It also caches the environment, which speeds up subsequent runs of your CI.

Step 3. Set your PYTHONPATH and PATH

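These two steps ensure that any commands in your Makefile pick up the right Python environment and can import your codebase. Note that the PATH entry in the workflow is hard-coded to a conda environment named operations; replace that with the name of your own environment.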
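When a CI run fails with "command not found" or an import error, it is usually one of these two variables. A small sanity-check script, sketched below, can help confirm what the job actually sees. The function name and report layout are hypothetical; it relies only on the standard library.

```python
import importlib.util
import shutil
import sys


def environment_report(package: str, tools: list[str]) -> dict[str, object]:
    """Summarize which interpreter, package, and command-line tools are visible."""
    return {
        # Should point inside the conda environment if PATH is set correctly.
        "python": sys.executable,
        # True only if the top-level package can be found on sys.path (PYTHONPATH).
        "package_importable": importlib.util.find_spec(package) is not None,
        # Full path to each tool, or None if it is not on PATH.
        "tools": {tool: shutil.which(tool) for tool in tools},
    }


if __name__ == "__main__":
    # Replace MY_PROJECT with your own top-level package name.
    print(environment_report("MY_PROJECT", ["black", "flake8", "mypy", "isort", "pytest"]))
```

Running it as an extra workflow step before make pipeline prints the interpreter path and any tool that resolves to None, which points directly at a PATH problem.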

Step 4. Run your linting and test pipeline from the Makefile

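The last step does the heavy lifting: it runs make pipeline, which formats, lints, and tests the code. If any command exits with an error, the step fails and GitHub marks the run as failed.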
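The pass/fail mechanism is nothing more than exit codes: make stops at the first command that returns a non-zero status, and GitHub Actions fails the step when make itself exits non-zero. The sketch below imitates that behavior in Python with stand-in commands. It is an illustration of the idea, not how make is implemented.

```python
import subprocess
import sys


def run_pipeline(commands: list[list[str]]) -> int:
    """Run commands in order and stop at the first non-zero exit code."""
    for command in commands:
        result = subprocess.run(command)
        if result.returncode != 0:
            # Later commands never run, just as with a failing make target.
            return result.returncode
    return 0


# Stand-in commands (assumptions for illustration); in the real pipeline these
# would be the black, flake8, mypy, isort, and pytest invocations.
passing = [sys.executable, "-c", "print('check passed')"]
failing = [sys.executable, "-c", "import sys; sys.exit(1)"]

if __name__ == "__main__":
    sys.exit(run_pipeline([passing, failing, passing]))
```

This is why a single failing test or lint error is enough to turn the whole run red.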

Conclusion

There you have it. Setting up CI for data science projects can be done very quickly, and once done, pays huge dividends.

