Glossary

  • Anomaly Detection

    Anomaly detection is the process of identifying rare or unusual data points, events, or observations that deviate from the expected patterns in a dataset.
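
    As a minimal sketch of the idea, assuming a simple z-score rule (one of many possible detection methods), the snippet below flags values that lie far from the mean of a dataset. The function name `find_anomalies`, the threshold of 2.0, and the sample readings are illustrative assumptions, not part of the glossary.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical sensor readings with one obvious outlier.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 95]
anomalies = find_anomalies(readings)
print(anomalies)
```

    A z-score rule suits roughly bell-shaped data; other data may call for a different method or threshold.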

  • Apache Hadoop

    Hadoop is an open-source software framework that is used for distributed storage and processing of large datasets. It is designed to handle data that is too big to fit on a single computer and can be …

  • Apache Hive

    Apache Hive is an open-source data warehouse system built on top of Apache Hadoop for querying and analyzing large datasets stored in the Hadoop Distributed File System (HDFS) or other compatible storage …

  • Apache Spark

    Apache Spark has become a de facto platform and lingua franca for big data analytics. It offers high computing power and a set of libraries for parallel big data processing on compute clusters. It …

  • ARIMA (Autoregressive Integrated Moving Average)

    ARIMA, which stands for Autoregressive Integrated Moving Average, is a widely used time series forecasting model in statistics and econometrics. It is designed to predict future values of a time …

  • Artificial Intelligence

    The word "artificial" means something that is not natural. Human beings can perform tasks that require higher-level mental processes such as perceptual learning, memory organisation and critical …

  • Association Rule Learning

    Association rule learning is a machine learning technique that discovers the relationships between variables in a dataset. It is commonly used in market basket analysis to identify patterns in …

  • Attention Mechanism

    Attention Mechanism is a technique used in deep learning models, particularly in natural language processing and computer vision, to selectively focus on specific parts of the input data when …

  • Auto-regressive models

    Auto-regressive models are a class of generative models that predict the probability distribution of a sequence of tokens by conditioning each token's probability distribution on the tokens that …

  • Autoencoders

    Autoencoders are a type of neural network that can learn to compress and reconstruct data. Autoencoders consist of an encoder network that transforms the input data into a latent representation and a …

  • Bias and Variance

    Bias and variance are two fundamental concepts in machine learning and statistics that describe the sources of error in predictive models.

  • Big Data Analytics

    Big data analytics is the process of extracting meaningful insights and value from data.

  • Bioinformatics

    Bioinformatics is a field formed from the integration of mathematical, statistical and computational methods to analyze biological information, including genes and their products, whole organisms, or …

  • CatBoost

    CatBoost is a machine learning algorithm for gradient boosting on decision trees. It is designed to handle categorical features in the data, which is a common challenge in many real-world datasets. …

  • Clustering

    Clustering is a machine learning technique that involves grouping similar data points together based on their characteristics or features. Clustering can be used for a variety of applications such as …

  • Collaborative Filtering

    Collaborative Filtering is a widely used technique in recommendation systems that leverages the past behavior, preferences, or opinions of users to generate personalized recommendations. It is based …

  • Computer Vision

    Computer vision is a field of artificial intelligence and computer science that focuses on enabling computers to interpret and understand visual information from the world around them. It involves …

  • Confusion Matrix

    A confusion matrix is a table that summarizes the performance of a machine learning model by comparing its predicted output with the actual output. A confusion matrix shows the number of true …
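
    To make the comparison of predicted and actual output concrete, here is a small sketch that assumes the usual binary layout of true positives, false positives, false negatives and true negatives. The function name `confusion_matrix`, the `positive` parameter and the label lists are hypothetical and chosen only for illustration.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Count TP, FP, FN and TN for a binary classification task."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct if the actual label is positive too.
            counts["tp" if a == positive else "fp"] += 1
        else:
            # Predicted negative: a miss if the actual label was positive.
            counts["fn" if a == positive else "tn"] += 1
    return counts


# Hypothetical labels: 1 is the positive class, 0 the negative class.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
matrix = confusion_matrix(actual, predicted)
print(matrix)
```

    Metrics such as accuracy, precision and recall can be derived from these four counts.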

  • Content-Based Filtering

    Content-Based Filtering is a recommendation technique that recommends items to users based on their preferences and past behavior. It works by analyzing the content of the items themselves and …

  • Continuous applications

    Continuous applications are end-to-end programs that respond instantly to data. A continuous application embodies the streaming process and incorporates static data throughout. Continuous …

  • Convolutional Neural Networks (CNN)

    Convolutional Neural Networks (CNN) are a type of deep learning architecture specifically designed for processing grid-like data, such as images or time-series data. CNNs consist of multiple layers, …

  • Coreference Resolution

    Coreference Resolution is a natural language processing technique that identifies and links noun phrases that refer to the same entity in a text.

  • Cron

    Job scheduling is the allocation of system resources to many different tasks by an operating system. The system handles prioritized job queues that are awaiting CPU time and determines which job …

  • Cross-Validation

    Cross-Validation is a widely used model validation technique in machine learning that helps assess the performance and generalizability of a model.
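
    One common variant is k-fold cross-validation, in which the data is split into k parts and each part serves once as the held-out test set. The sketch below only produces the index splits and assumes contiguous, unshuffled folds. The function name `k_fold_indices` is an illustrative assumption.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=6, k=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

    A model would be fitted on each `train` split and scored on the matching `test` split, and the k scores would then be averaged.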

  • DALL-E and DALL-E 2

    DALL-E is a generative AI model developed by OpenAI that generates images from textual descriptions. Combining natural language understanding with image generation capabilities, DALL-E is based on the …

  • Dask

    Dask is an open-source tool that makes it easier for data scientists to carry out parallel computing in Python. Through distributed computing and Dask dataframes, it allows you to work with large …

  • Data analysis platform

    A data analysis platform is an environment that provides the services and tools needed to extract value from data.

  • Data governance

    Data governance is as much a business issue as a technical one, even though it can look like a challenge solely for the IT team.

  • Data Normalization

    Data Normalization is a pre-processing technique used in machine learning and data analysis to scale the features or variables of a dataset to a common range, improving the performance and stability …

  • Data Pipelines

    Data Pipelines are a set of tools and techniques for moving and processing data from one system or application to another, used in a variety of industries and applications.

  • Data Science

    Data science is the study of data with a focus on extracting meaningful insights for businesses. It is multidisciplinary, as it combines the principles and practices from the …

  • Data Standardization

    Data Standardization, also known as feature scaling or z-score normalization, is a pre-processing technique used in machine learning and data analysis to transform the features or variables of a …

  • Data Transformation

    Data Transformation is the process of converting data from one format or structure to another, with the goal of making it more suitable for analysis or machine learning.

  • Data Visualization

    Data Visualization is the graphical representation of data and information, allowing for easier understanding and analysis of complex data sets.

  • Data Warehouse

    A data warehouse is a scalable data processing system that supports analytical processes and reporting of insights from data.

  • Dataframes

    A dataframe is a data structure that presents data in the form of a table with rows and columns.

  • Deep Learning

    Deep learning is a subfield of machine learning, which is in turn a subfield of artificial intelligence. Its central goal is to train algorithms modelled on the human brain using large amounts of data.

  • Dependency Parsing

    Dependency Parsing is a natural language processing technique that involves analyzing the grammatical structure of a sentence to identify the relationships between words.

  • DNA Sequence

    A DNA sequence is the order of nucleotide bases in a piece of DNA. DNA (deoxyribonucleic acid) contains all the information needed to build and maintain an organism – …

  • Docker

    Docker is an open-source platform for packaging an application together with the operating system (OS) libraries and all the dependencies it needs, so that it can run in any environment.

  • ETL (Extract, Transform, Load)

    ETL (Extract, Transform, Load) is a data integration process that involves extracting data from various sources, transforming it into a structured and usable format, and loading it into a target data …

  • Exponential Smoothing

    Exponential Smoothing is a time series forecasting method that involves assigning exponentially decreasing weights to past observations, with the goal of making recent observations more important than …
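
    The weighting scheme can be sketched with simple exponential smoothing, the most basic member of this family. The function name `simple_exponential_smoothing`, the smoothing factor `alpha` and the sample series are assumptions made for illustration. Each smoothed value blends the newest observation with the previous smoothed value, so older observations fade geometrically.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed series s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    if not series:
        raise ValueError("series must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in the interval (0, 1]")
    smoothed = [series[0]]  # seed with the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


smoothed = simple_exponential_smoothing([10.0, 12.0, 14.0], alpha=0.5)
print(smoothed)
```

    A larger `alpha` reacts faster to recent changes, while a smaller one produces a smoother curve. The last smoothed value can serve as a one-step-ahead forecast.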

  • Feature Engineering

    Feature Engineering is the process of creating new features or transforming existing features in a dataset to improve the performance of machine learning models.

  • Feature Scaling

    Feature Scaling is a data preprocessing technique that involves transforming the features of a dataset to have similar scales or ranges, improving the performance and accuracy of machine learning …

  • Feature Selection

    Feature Selection is the process of selecting a subset of the most important and relevant features from the original dataset for use in machine learning models.

  • Flux

    Flux is a machine learning library for Julia, a fast, multi-paradigm programming language for statistical computing that was developed at MIT. Flux is able to take another Julia function and a set of arguments …

  • Generative Adversarial Networks (GANs)

    Generative Adversarial Networks (GANs) are a class of neural networks that are trained to generate new data that is similar to a training dataset, with applications in image generation, video …

  • Generative AI

    Generative AI is a branch of artificial intelligence that focuses on creating new content or data, such as images, text, music, or other forms of media, by learning from existing data.

  • Genomics

    Genomics is a field of science focused on understanding and interpreting the DNA makeup of an organism through sequencing and analysis. Just as a genome is central to the life of an …

  • Gensim

    Gensim is an open-source Python library for natural language processing (NLP), specifically designed for unsupervised topic modeling and document similarity analysis, with efficient implementations of …

  • GPU

    A Graphics Processing Unit (GPU) is a computer chip that handles the computational demands of graphics-intensive functions on a computer.

  • Gradient Boosting

    Gradient Boosting is a popular ensemble method for building powerful machine learning models, involving the combination of multiple weak models, typically decision trees, to create a strong predictive …

  • Grid Search

    Grid Search is a hyperparameter tuning technique used in machine learning to find the optimal combination of hyperparameters for a model by performing an exhaustive search through a manually specified …
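
    The exhaustive search can be sketched in a few lines. This is an illustration under stated assumptions, not a library API: `grid_search`, `toy_score` and the parameter names `learning_rate` and `depth` are hypothetical, and the scoring function stands in for training and validating a real model.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    # product() enumerates the full Cartesian grid of candidate values.
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(learning_rate, depth):
    """Hypothetical score that peaks at learning_rate=0.1 and depth=3."""
    return -((learning_rate - 0.1) ** 2) - (depth - 3) ** 2


grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 3, 4]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

    The number of evaluations is the product of the list lengths (nine here), which is why grid search becomes expensive as more hyperparameters are added.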

  • Hosted Jupyter

    Hosted or Cloud Jupyter notebooks are integrated development environments that provide a complete ecosystem for data science and machine learning. Cloud Jupyter comes with already installed and …

  • Hosted Notebooks

    Hosted notebooks are cloud-based platforms that provide an interactive environment for users to write, execute, and share code, as well as visualize data and results.

  • Hybrid Recommender Systems

    Hybrid Recommender Systems combine two or more recommender systems, such as Content-Based Filtering and Collaborative Filtering, to provide more accurate and diverse recommendations for various …

  • Hyperparameter Tuning

    Hyperparameter tuning is the process of choosing the hyperparameters that produce the best results in a learning algorithm. Hyperparameter tuning is an important process when it comes to …

  • Jupyter

    Jupyter is an open-source project with the goal of developing comprehensive browser-based software for interactive computing. It has allowed scientists all over the world to collaborate by being able …

  • Jupyter Notebook

    Jupyter Notebook is an open-source web-based application for creating and sharing computational documents that contain live code, equations, visualizations and explanatory text. Just …

  • JupyterHub

    JupyterHub is an open-source platform designed to serve Jupyter Notebooks to multiple users, making it an ideal solution for team collaboration, teaching, and research.

  • MLOps (Machine Learning Operations)

    MLOps, or Machine Learning Operations, is a set of practices that combines machine learning, DevOps, and data engineering to streamline the process of deploying, monitoring, and maintaining machine …

  • MLOps Platforms

    MLOps Platforms are software solutions that help organizations manage the end-to-end machine learning lifecycle, from data preprocessing and model development to deployment, monitoring, and …

  • Model Drift

    Model drift is a common issue in machine learning where the performance of a model degrades over time due to changes in the input data distribution.

  • Model Evaluation

    Model evaluation is a critical process in machine learning that is used to assess the performance of a trained model. It involves comparing the predicted values from the model to the actual values in …

  • Model Monitoring

    Model monitoring is the process of tracking the performance of a machine learning model in real time and making adjustments as needed to ensure that the model continues to perform accurately and …

  • Natural Language Generation (NLG)

    Natural Language Generation (NLG) is a subfield of natural language processing concerned with producing human-readable text from structured data or other input.

  • NumPy

    NumPy was built by Travis Oliphant in 2005. Today, it is popularly used for data science, engineering, and mathematical programming. It has become a global standard in Python for performing …

  • One-shot Learning

    One-shot learning is a machine learning approach that aims to train models to recognize new objects or classes based on very few examples, sometimes as few as one.

  • Pandas

    Pandas is a Python library for data analysis and manipulation. It provides powerful data analysis tools and data structures for handling complex and large-scale datasets.

  • Parallel Computing

    Parallel computing uses multiple processors to perform a single task simultaneously, in order to increase the speed and efficiency of the computation. This is done by dividing the task into smaller …

  • Parquet

    Parquet is an open-source columnar storage format for efficient and high-performance data storage and processing. It is used in a wide range of big data applications, including Apache Hadoop and …

  • Part-of-Speech (POS) Tagging

    Part-of-Speech (POS) tagging is the process of labeling words in a text with their corresponding part of speech, such as noun, verb, adjective, or adverb. It is used for a variety of natural language …

  • Plotly

    Plotly is a popular open-source interactive data visualization tool that lets you create visualizations or charts to understand your data. Plotly has over 40 different types of charts for …

  • PySpark

    PySpark is the Python API for Apache Spark, an open-source distributed computing framework used for big data processing and analysis. PySpark is a powerful tool for big data processing and analysis, …

  • Ray

    Ray is an open-source platform designed for building distributed applications with ease. It is a flexible and scalable system that can handle a wide range of workloads, from simple data processing …

  • S3 Bucket

    S3 is an AWS (Amazon Web Services) product that offers data storage, scalability, and security. With S3, you can store data of various sizes and kinds such as text, files, objects, videos, backups and …

  • Scikit-Learn

    Scikit-learn offers a range of supervised and unsupervised learning algorithms, which include non-linear, linear, ensemble, association, clustering, dimension reduction …

  • Stemming in Natural Language Processing

    Stemming is a text preprocessing technique used in natural language processing (NLP) to reduce words to their root or base form. The goal of stemming is to simplify and standardize words, which helps …

  • TensorFlow

    TensorFlow is an open-source framework for building and training machine learning models. It was developed by Google and is widely used in various applications, from image and speech recognition to …

  • Underfitting

    Underfitting refers to a machine learning model that fails to capture the underlying pattern or relationship in the dataset, resulting in poor performance on both training and test data.

  • Unsupervised Learning

    Unsupervised learning is a type of machine learning where the model learns from a dataset without labeled output variables. The goal of unsupervised learning is to discover hidden patterns, …