The Busy Data Scientist's Guide to Data Science Resources 2022
There are plenty of places to start when building your list of data science resources – but you’re a busy data scientist. We’ve collected a handful of resources for different needs, all serving the purpose of making your work easier and more productive.
Here is a reference guide to the top resources you need to know about, organized into a few lists that meet a variety of needs.
- Machine Learning and Deep Learning Tools
- Free & Enterprise Data Science and Compute Platforms
- Saturn Cloud
- And a small shoutout to AWS SageMaker, AzureML, Google Vertex
- Data Visualization Tools
- Workflow Orchestration Tools
- Free and Paid Data Science Courses
- Alexey Grigorev courses
- Matt Dancho
- Free & Enterprise GPU Computing Platforms
- Saturn Cloud
- NVIDIA Academic Grants Program (must apply for free GPU compute)
- Model Management Tools
- Weights & Biases
- MLflow (open source)
Machine Learning and Deep Learning Tools
TensorFlow is a free and open-source software library for machine learning and artificial intelligence. It can be used across a range of tasks but has a particular focus on training and inference of deep neural networks.
PyTorch is an open source machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing, originally developed by Meta AI and now part of the Linux Foundation umbrella. It is free and open-source software released under the Modified BSD license.
XGBoost is an open-source software library which provides a regularizing gradient boosting framework for C++, Java, Python, R, Julia, Perl, and Scala. It works on Linux, Windows, and macOS.
LightGBM, short for Light Gradient Boosting Machine, is a free and open source distributed gradient boosting framework for machine learning originally developed by Microsoft. It is based on decision tree algorithms and used for ranking, classification and other machine learning tasks.
scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support-vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. Scikit-learn is a NumFOCUS fiscally sponsored project.
Free & Enterprise Data Science and Compute Platforms
Saturn Cloud is a data science platform for scalable Python, R, and Julia for teams and individuals. Without having to switch any tools, Saturn provides a flexible environment where computational biologists and data scientists can launch high-powered notebooks (Jupyter, RStudio, VS Code, and more) in the cloud, quickly use Dask clusters and GPUs, deploy cloud resources to expand their data science capabilities, collaborate throughout an entire project lifecycle, and more. Get started for free here.
Domino Data Lab’s MLOps platform enables data scientists to develop better medicines, grow more productive crops, adapt risk models to major economic shifts, and more. Data scientists and machine learning engineers can do exploratory data analysis and model development without configuring and using their own compute resources. Domino offers a 14-day, no-obligation free trial of the full Domino Enterprise MLOps Platform.
RStudio offers open-source data science software, as well as RStudio Team, a unique, modular platform of enterprise-ready professional software products that enable teams to adopt R, Python, and other open-source data science software at scale.
Data Visualization Tools
Bokeh lets you build powerful data applications with a wide array of widgets, plot tools, and UI events that can trigger real Python callbacks. The Bokeh server is the bridge that lets you connect these tools to rich, interactive visualizations in the browser.
Plotly provides online graphing, analytics, and statistics tools for individuals and collaboration, as well as scientific graphing libraries for Python, R, MATLAB, Perl, Julia, Arduino, and REST.
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
Altair offers a comprehensive suite of data visualization software suitable for enterprise deployment. Business users, engineers, and analysts can connect to virtually any data source and build data monitoring, analysis, and reporting applications without writing a single line of code. Their stream processing engine connects directly to real-time streaming and historic time series data sources, including MQTT, Kafka, Solace, and many others.
Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK.
Workflow Orchestration Tools
Prefect is a modern workflow management tool designed to orchestrate data stacks by building, running, and monitoring data pipelines. It is an open-source tool powered by the Prefect Core workflow engine and serves modern project management.
Flyte provides production-grade orchestration for machine learning workflows and data processing. It was created to speed up the move from local workflows to production.
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, and more.
Metaflow is a human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects. Metaflow was originally developed at Netflix to boost productivity of data scientists who work on a wide variety of projects from classical statistics to state-of-the-art deep learning.
Airflow is an open-source platform for authoring, scheduling and monitoring data and computing workflows. Airflow uses Python to create workflows that can be easily scheduled and monitored and provides many plug-and-play operators that are ready to execute your tasks on Google Cloud Platform, Amazon Web Services, Microsoft Azure and many others.
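To make "dependency resolution" (the feature called out for Luigi above) a little more concrete, here is a minimal sketch that uses only Python's standard library. It is not the API of Luigi, Airflow, or any other tool in this list, and the task names are hypothetical. It only shows the core idea these tools build on: given which task depends on which, work out an order where every task runs after the tasks it needs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "clean": {"extract"},
    "train": {"clean"},
    "report": {"clean", "train"},
}


def run_order(dependencies):
    """Return an order in which every task comes after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = run_order(pipeline)
print(order)  # ['extract', 'clean', 'train', 'report']
```

Real orchestrators add scheduling, retries, logging, and a UI on top of this ordering step, which is most of the reason to use one instead of rolling your own.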
Free and Paid Data Science Courses
Alexey Grigorev teaches a machine learning bootcamp where you can learn machine learning by doing projects and get the skills needed to work as a data scientist or machine learning engineer.
Matt Dancho provides data science courses for business where you can apply new skills to your job immediately. Learn as Matt walks you through large-scale data science projects covering things like high-performance time series, shiny web applications, general data science for business, and more.
Free & Enterprise GPU Computing Platforms
Saturn Cloud is a data science platform for scalable Python, R, and Julia for teams and individuals. It offers free and enterprise tiers to meet the needs of new data scientists as well as experienced teams.
Without having to switch any tools, Saturn provides a flexible environment where data scientists can launch high-powered notebooks (Jupyter, RStudio, VS Code, and more) in the cloud, quickly use Dask clusters and GPUs, deploy cloud resources to expand their data science capabilities, collaborate throughout an entire project lifecycle, and more. Get started for free here.
Paperspace Gradient is an end-to-end machine learning platform where individuals and teams can build, train, and deploy machine learning models of any size and complexity. Paperspace offers a free plan with limits on CPU and GPU machines. They also offer paid plans for greater access.
NVIDIA Academic Grants Program (must apply for free GPU compute)
The NVIDIA Academic Hardware Grant Program endeavors to advance education and research by enabling groundbreaking, innovative, and unique academic research projects with world-class computing resources. It provides educators with a hands-on platform to teach AI, deep learning, and data science to students in any discipline.
Model Management Tools
CometML is a machine learning platform which AI researchers and data scientists use to track, compare and explain their ML experiments. It allows ML practitioners to keep track of their databases, history of performed experiments, code modifications and production models. Comet’s ML platform supports productivity, reproducibility, and collaboration.
Weights & Biases is the machine learning platform for developers to build better models faster. Use W&B’s lightweight, interoperable tools to quickly track experiments, version and iterate on datasets, evaluate model performance, reproduce models, visualize results and spot regressions, and share findings with colleagues.
MLflow (Open Source)
MLflow is an open source platform for managing the end-to-end machine learning lifecycle including experimentation, reproducibility, deployment, and a central model registry. MLflow currently offers four components: MLflow Tracking, MLflow Projects, MLflow Models, and Model Registry.
Neptune is a metadata store for MLOps, built for research and production teams that run a lot of experiments. It is an organized place for all your experiments, data exploration notebooks, and more. It supports any kind of project workflow and can be used by individuals or teams.
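If experiment tracking is new to you, the core idea behind all of these tools is small: record the parameters and metrics of every run, then compare runs later. The sketch below illustrates that idea with nothing but the Python standard library. It is not the API of Comet, Weights & Biases, MLflow, or Neptune. The function names, parameter names, and scores are all made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def log_run(log_path, params, metrics):
    """Append one experiment run (parameters and metrics) as a JSON line."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"params": params, "metrics": metrics}) + "\n")


def best_run(log_path, metric):
    """Return the logged run with the highest value of `metric`."""
    with open(log_path, encoding="utf-8") as f:
        runs = [json.loads(line) for line in f if line.strip()]
    return max(runs, key=lambda run: run["metrics"][metric])


# Hypothetical runs: these values are placeholders, not real results.
log_path = Path(tempfile.mkdtemp()) / "runs.jsonl"
log_run(log_path, {"learning_rate": 0.1, "max_depth": 3}, {"accuracy": 0.81})
log_run(log_path, {"learning_rate": 0.05, "max_depth": 5}, {"accuracy": 0.86})

print(best_run(log_path, "accuracy")["params"])
```

The dedicated platforms above go well beyond a log file like this (dashboards, artifact and dataset versioning, team sharing), so treat this only as a picture of what is being tracked.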
We hope this list is helpful. Please email firstname.lastname@example.org to contribute to it! If you are a busy data scientist, check out www.saturncloud.io to make your life much easier. Saturn Cloud is a flexible, scalable platform for data science.