The Busy Data Scientist's Guide to Data Science Resources 2022

There are plenty of places to start when building your list of data science resources, but you’re a busy data scientist. We’ve collected a handful of resources for different needs, all serving the purpose of making your work easier and more productive.

Here is a reference guide to the top resources you need to know about, organized into a few lists that meet a variety of needs.

  • Machine Learning and Deep Learning Tools
    • TensorFlow
    • PyTorch
    • XGBoost
    • LightGBM
    • scikit-learn
  • Free & Enterprise Data Science and Compute Platforms
    • Saturn Cloud
    • Domino
    • RStudio
    • And a small shoutout to AWS SageMaker, AzureML, and Google Vertex
  • Data Visualization Tools
    • Bokeh
    • Plotly
    • D3.js
    • Seaborn
    • Altair
    • matplotlib
  • Workflow Orchestration Tools
    • Prefect
    • Luigi
    • Metaflow
    • Flyte
    • Airflow
  • Free and Paid Data Science Courses
    • Alexey Grigorev courses
    • Matt Dancho
  • Free & Enterprise GPU Computing Platforms
    • Saturn Cloud
    • Paperspace
    • NVIDIA Academic Grants Program (must apply for free GPU compute)
  • Model Management Tools
    • CometML
    • Weights & Biases
    • Verta.ai
    • Neptune AI
    • MLflow (open source)

Machine Learning and Deep Learning Tools



https://www.tensorflow.org/

TensorFlow

TensorFlow is a free and open-source software library for machine learning and artificial intelligence. It can be used across a range of tasks but has a particular focus on training and inference of deep neural networks.


PyTorch

PyTorch is an open source machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing, originally developed by Meta AI and now part of the Linux Foundation umbrella. It is free and open-source software released under the Modified BSD license.


XGBoost

XGBoost is an open-source software library that provides a regularizing gradient boosting framework for C++, Java, Python, R, Julia, Perl, and Scala. It works on Linux, Windows, and macOS.


LightGBM

LightGBM, short for Light Gradient Boosting Machine, is a free and open source distributed gradient boosting framework for machine learning originally developed by Microsoft. It is based on decision tree algorithms and used for ranking, classification and other machine learning tasks.


scikit-learn

scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support-vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. Scikit-learn is a NumFOCUS fiscally sponsored project.
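
To give a feel for the fit-and-score pattern that scikit-learn's estimators share, here is a minimal sketch using one of the algorithms mentioned above, a random forest. The bundled iris dataset, the 75/25 split, and the helper name `train_and_score` are illustrative assumptions, not recommendations from this guide.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def train_and_score(random_state: int = 0) -> float:
    """Train a random forest on iris and return held-out accuracy.

    Hypothetical helper for illustration only.
    """
    # Load the small iris dataset that ships with scikit-learn.
    X, y = load_iris(return_X_y=True)

    # Hold out 25% of the rows for evaluation, keeping class balance.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=random_state, stratify=y
    )

    # Every scikit-learn estimator follows the same fit/predict/score pattern.
    model = RandomForestClassifier(n_estimators=100, random_state=random_state)
    model.fit(X_train, y_train)

    # For classifiers, score() returns mean accuracy on the given data.
    return model.score(X_test, y_test)


if __name__ == "__main__":
    print(f"held-out accuracy: {train_and_score():.2f}")
```

Swapping in another estimator, such as `GradientBoostingClassifier` or `SVC`, should only require changing the line that constructs `model`.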


Free & Enterprise Data Science and Compute Platforms


https://saturncloud.io

Saturn Cloud

Saturn Cloud is a data science platform for scalable Python, R, and Julia for teams and individuals. Without having to switch any tools, Saturn provides a flexible environment where computational biologists and data scientists can launch high-powered notebooks (Jupyter, RStudio, VS Code, and more) in the cloud, quickly spin up Dask clusters and GPUs, deploy cloud resources to expand their data science capabilities, and collaborate throughout an entire project lifecycle. Get started for free at https://saturncloud.io.


Domino Data Lab

Domino Data Lab’s MLOps platform enables data scientists to develop better medicines, grow more productive crops, adapt risk models to major economic shifts, and more. Data scientists and machine learning engineers can do exploratory data analysis and model development without configuring and managing their own compute resources. Domino offers a 14-day, no-obligation free trial of the full Domino Enterprise MLOps Platform.



RStudio

RStudio offers open-source data science software, as well as RStudio Team, a unique, modular platform of enterprise-ready professional software products that enable teams to adopt R, Python, and other open-source data science software at scale.

Others include AWS SageMaker, AzureML, and Google Vertex.

Data Visualization Tools


Bokeh

Bokeh is a Python library for interactive visualization in the browser. Build powerful data applications with a wide array of widgets, plot tools, and UI events that can trigger real Python callbacks. The Bokeh server is the bridge that lets you connect these tools to rich, interactive visualizations in the browser.



Plotly

Plotly provides online graphing, analytics, and statistics tools for individuals and collaboration, as well as scientific graphing libraries for Python, R, MATLAB, Perl, Julia, Arduino, and REST.




D3.js

D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG, and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.



Seaborn

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.




Altair

Altair offers a comprehensive suite of data visualization software suitable for enterprise deployment. Business users, engineers, and analysts can connect to virtually any data source and build data monitoring, analysis, and reporting applications without writing a single line of code. Their stream processing engine connects directly to real-time streaming and historic time series data sources, including MQTT, Kafka, Solace, and many others.


Matplotlib

Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK.
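
As a rough illustration of the object-oriented API described above, the sketch below builds a figure through explicit `Figure` and `Axes` objects instead of relying on implicit global state. The sine-curve data, the `Agg` backend choice (so it runs without a display), and the helper name `make_sine_figure` are assumptions made for this example.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed

import matplotlib.pyplot as plt
import numpy as np


def make_sine_figure():
    """Return a (figure, axes) pair showing one sine curve.

    Hypothetical helper for illustration only.
    """
    # NumPy supplies the numerical data that Matplotlib draws.
    x = np.linspace(0, 2 * np.pi, 200)

    # subplots() hands back explicit Figure and Axes objects.
    fig, ax = plt.subplots(figsize=(6, 3))

    # All drawing and labeling goes through methods on the Axes.
    ax.plot(x, np.sin(x), label="sin(x)")
    ax.set_xlabel("x")
    ax.set_ylabel("sin(x)")
    ax.set_title("Sine curve")
    ax.legend()
    return fig, ax
```

Calling `fig.savefig("sine.png")` on the returned figure writes it to disk; the same `Figure` object could instead be handed to a GUI toolkit canvas.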

Workflow Orchestration Tools



Prefect

Prefect is a modern workflow management tool designed to orchestrate data stacks by building, running, and monitoring data pipelines. It is open source and powered by the Prefect Core workflow engine.



Flyte

Flyte provides production-grade orchestration for machine learning workflows and data processing. It was created to speed up the move from local workflows to production.



Luigi

Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, and more.
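
To make "dependency resolution" concrete, here is a minimal sketch of the idea using only Python's standard library. It is not Luigi's API: the task names and the helper `resolve_order` are hypothetical, and a real orchestrator adds scheduling, retries, and state tracking on top of this ordering step.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def resolve_order(dependencies):
    """Return task names in an order that respects their dependencies.

    `dependencies` maps each task to the set of tasks it needs first.
    Raises graphlib.CycleError if the tasks depend on each other in a loop.
    Hypothetical helper for illustration only.
    """
    return list(TopologicalSorter(dependencies).static_order())


# A made-up four-step batch pipeline.
pipeline = {
    "extract": set(),
    "clean": {"extract"},
    "train": {"clean"},
    "report": {"clean", "train"},
}

if __name__ == "__main__":
    for task in resolve_order(pipeline):
        print("run", task)
```

Tools like Luigi, Prefect, and Airflow let you declare the dependencies and then work out a valid run order for you, which is the part this sketch imitates.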



Metaflow

Metaflow is a human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects. Metaflow was originally developed at Netflix to boost the productivity of data scientists who work on a wide variety of projects, from classical statistics to state-of-the-art deep learning.



Airflow

Airflow is an open-source platform for authoring, scheduling and monitoring data and computing workflows. Airflow uses Python to create workflows that can be easily scheduled and monitored and provides many plug-and-play operators that are ready to execute your tasks on Google Cloud Platform, Amazon Web Services, Microsoft Azure and many others.

Free and Paid Data Science Courses



Machine Learning Bookcamp

Alexey Grigorev teaches a machine learning bootcamp where you can learn machine learning by doing projects and get the skills needed to work as a data scientist or machine learning engineer.



Business Science

Matt Dancho provides data science courses for business where you can apply new skills to your job immediately. Learn as Matt walks you through large-scale data science projects covering things like high-performance time series, Shiny web applications, general data science for business, and more.

Free & Enterprise GPU Computing Platforms



Saturn Cloud

Saturn Cloud is a data science platform for scalable Python, R, and Julia for teams and individuals. It offers free and enterprise tiers to meet the needs of new data scientists as well as experienced teams.

Without having to switch any tools, Saturn provides a flexible environment where data scientists can launch high-powered notebooks (Jupyter, RStudio, VS Code, and more) in the cloud, quickly spin up Dask clusters and GPUs, deploy cloud resources to expand their data science capabilities, and collaborate throughout an entire project lifecycle. Get started for free at https://saturncloud.io.



Paperspace Gradient

Paperspace Gradient is an end-to-end machine learning platform where individuals and teams can build, train, and deploy machine learning models of any size and complexity. Paperspace offers a free plan with limits on CPU and GPU machines, as well as paid plans for greater access.



NVIDIA Academic Grants Program (must apply for free GPU compute)

The NVIDIA Academic Hardware Grant Program endeavors to advance education and research by enabling groundbreaking, innovative, and unique academic research projects with world-class computing resources. It provides educators with a hands-on platform to teach AI, deep learning, and data science to students in any discipline.

Model Management Tools


CometML

CometML is a machine learning platform that AI researchers and data scientists use to track, compare, and explain their ML experiments. It lets ML practitioners keep track of their datasets, experiment history, code modifications, and production models. Comet’s ML platform supports productivity, reproducibility, and collaboration.



Weights & Biases

Weights & Biases is the machine learning platform for developers to build better models faster. Use W&B’s lightweight, interoperable tools to quickly track experiments, version and iterate on datasets, evaluate model performance, reproduce models, visualize results and spot regressions, and share findings with colleagues.



MLflow (Open Source)

MLflow is an open source platform for managing the end-to-end machine learning lifecycle including experimentation, reproducibility, deployment, and a central model registry. MLflow currently offers four components: MLflow Tracking, MLflow Projects, MLflow Models, and Model Registry.



Neptune AI

Neptune is a metadata store for MLOps, built for research and production teams that run a lot of experiments. It is an organized place for all your experiments, data exploration notebooks, and more. It supports any kind of project workflow and can be used by individuals or teams.

Thank you

We hope this list is helpful. To contribute to it, please email mel@saturncloud.io. If you are a busy data scientist, check out www.saturncloud.io to make your life much easier. Saturn Cloud is a flexible, scalable platform for data science.