GitHub Actions and Continuous Integration for Data Scientists
Continuous Integration (CI) is critical when working with code to ensure that
- Code is always consistently formatted
- Any automated testing is always passing
- Code is error-free
These checks provide a great quality-of-life improvement for any data scientist working with the codebase, and they give team members confidence that the code is mostly error- and bug-free. Automating continuous integration is critical because once tests stop passing, it can take a lot of work to bring a repository back into compliance.
This blog post does not cover
- Deploying ML models with CI
- Applying Git to Jupyter notebooks
though those would be excellent follow-up topics.
A very simple Makefile for your projects
We use Makefiles to manage all of our projects. You don’t have to, but they’re quite useful, and you don’t need to understand Makefile syntax to benefit. For the most part you can just grab this Makefile, replace “MY_PROJECT” with the directory of your project, and modify the rest of the commands as you see fit.
.PHONY: format
format:
	@echo -e '\n\nFormatting with black and isort...'
	black --line-length 100 --exclude '/(\.vscode|node_modules)/' .
	isort tests MY_PROJECT

.PHONY: black
black:
	# Check-only variant of black so that `make lint` never rewrites files
	@echo -e '\n\nChecking formatting with black...'
	black --check --line-length 100 --exclude '/(\.vscode|node_modules)/' .

.PHONY: flake8
flake8:
	# If you make changes here, also edit .pre-commit-config.yaml to match
	@echo -e '\n\nFlake8 linting...'
	flake8 MY_PROJECT
	flake8 tests

.PHONY: mypy
mypy:
	mypy --config-file mypy.ini ./

.PHONY: isort
isort:
	isort MY_PROJECT tests --check

.PHONY: lint
lint: \
	black \
	flake8 \
	mypy \
	isort

.PHONY: test
test:
	pytest -n auto --cov-report term-missing --cov=MY_PROJECT/ -s

.PHONY: pipeline
pipeline: \
	format \
	lint \
	test
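With this saved as Makefile in the root of your repository, running make pipeline executes the full format, lint, and test sequence locally.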
What are these operations?
make format
This uses the black code formatter to automatically standardize all code formatting and isort to standardize module import order. This sounds insignificant, but the primary value for teams is removing the mental burden of ever having to think about or discuss code formatting again.
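For a sense of what this buys you, here is a hypothetical before-and-after (assuming black’s defaults at line length 100 and isort’s default import grouping):

# before_format.py -- a hypothetical module with inconsistent style
import pandas as pd
import os


def load(path,name):
    return pd.read_csv( os.path.join(path , name) )

After make format, isort moves the standard-library import above the third-party one and black normalizes the spacing:

# after_format.py -- what make format produces
import os

import pandas as pd


def load(path, name):
    return pd.read_csv(os.path.join(path, name))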
make flake8 mypy
Flake8 and mypy are two invaluable tools. Flake8 can detect common errors in code, such as accidentally overwriting a variable or misspelling an object attribute. Mypy can automatically detect all sorts of type-related bugs, assuming your code uses type annotations fairly heavily. Both of these tools have saved me tons of time by catching bugs before I kicked off yet another three-hour job that would fail with some stupid bug in the code.
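For example, both tools would flag problems in a hypothetical snippet like this (assuming Python 3.9+, where built-in generics such as list[float] are valid annotations):

# scores.py -- hypothetical code with the kinds of bugs these tools catch
import os  # flake8: F401 'os' imported but unused


def mean_score(scores: list[float]) -> float:
    return sum(scores) / len(scores)


def mean_score(scores: list[float]) -> float:  # flake8: F811 redefinition of 'mean_score'
    return sum(scores) / max(len(scores), 1)


mean_score(["90", "85"])  # mypy: incompatible type "list[str]"; expected "list[float]"

Catching any one of these at lint time is far cheaper than discovering it hours into a long-running job.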
make test
This Makefile uses pytest to automatically run unit tests on your code. We strongly recommend pytest; more on that in a future article.
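A minimal test file that pytest would discover looks like this (a hypothetical example; pytest picks up test_* functions in files named test_*.py):

# tests/test_math_utils.py -- a minimal hypothetical pytest example
import pytest


def add(a: float, b: float) -> float:
    return a + b


def test_add_ints():
    assert add(2, 3) == 5


def test_add_floats():
    # pytest.approx absorbs floating-point rounding error in comparisons
    assert add(0.1, 0.2) == pytest.approx(0.3)

Note that the -n auto and --cov flags in the test target require the pytest-xdist and pytest-cov plugins, so include both in your environment.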
GitHub Actions
GitHub Actions is a very convenient place to run CI. Create a file in your repository “.github/workflows/ci.yml”.
name: test
on: push

jobs:
  test:
    name: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: mamba-org/setup-micromamba@v1
        name: Set up micromamba
        with:
          environment-file: environment.yaml
          init-shell: >-
            bash
          cache-environment: true
      - name: pythonpath
        run: echo "PYTHONPATH=${GITHUB_WORKSPACE}" >> $GITHUB_ENV
      - name: path
        run: echo "PATH=/home/runner/micromamba/envs/operations/bin:/home/runner/micromamba-bin/:${PATH}" >> $GITHUB_ENV
      - name: test
        run: make pipeline
This section assumes you are OK with using conda (really, micromamba). If you are using pip and would like to use this template, you can convert your requirements.txt file into an equivalent environment.yaml like this:

name: operations
dependencies:
  - pip
  - pip:
      - pip-pkg1
      - pip-pkg2
      - pip-pkg3

Note that the name field should match the environment name referenced in the PATH step of the workflow above (operations in this example).
The above ci.yml is configured to run on every push (on: push). The steps are as follows.
Step 1. Checkout the code
Nothing to say here. GitHub will check out your code and make sure it’s available to the later steps.
Step 2. Set up your conda environment
This will use micromamba to set up a conda environment based on your environment.yaml. It will also cache the environment, which speeds up subsequent CI runs.
Step 3. Set your PYTHONPATH and PATH
These two steps ensure that any bash commands in your Makefile pick up the right Python environment and can import your codebase.
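Concretely, setting PYTHONPATH to the repository root is what lets test code import your package by name (a hypothetical example; my_module stands in for whatever lives in your MY_PROJECT/ directory):

# tests/test_imports.py -- resolves because PYTHONPATH includes the repo root
from MY_PROJECT import my_module  # hypothetical module inside MY_PROJECT/


def test_module_imports():
    assert my_module is not None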
Step 4. Run your linting and test pipeline from the Makefile
The last step does the heavy lifting: make pipeline runs the same format, lint, and test targets you would run locally.
Conclusion
There you have it. Setting up CI for data science projects can be done very quickly, and once done, pays huge dividends.