Appendix B: Technical Glossary
Also known as the ground truth, absolute truth refers to the correct answer, or reality.
A method of testing two or more approaches or techniques to determine which performs better, and whether the difference is statistically significant.
A metric used to evaluate the performance of a machine learning model.
The mechanism by which an agent moves between states of an environment, by using a policy.
An approach to machine learning where the algorithm itself chooses the data to learn from.
An acronym for augmented reality, which places a computer-generated image on a human’s view of the real world.
Used by neural networks, the updating of biases and weights based on error. This occurs when the estimated output exceeds a defined error threshold. The error at output is propagated back into the network to update values of neurons.
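The update rule above can be sketched for the simplest possible case, a single linear neuron with squared error (a minimal illustration, not from the text; the function name and learning rate are assumptions):

```python
# Minimal sketch: one backpropagation-style gradient update for a single
# linear neuron y = w*x + b trained on squared error (illustrative names).
def backprop_step(w, b, x, target, lr=0.1):
    y = w * x + b        # forward pass: estimated output
    error = y - target   # error at the output
    # Propagate the error back: gradients of 0.5*error**2 w.r.t. w and b.
    grad_w = error * x
    grad_b = error
    # Update the weight and bias against the gradient.
    return w - lr * grad_w, b - lr * grad_b

w, b = 0.0, 0.0
for _ in range(200):
    w, b = backprop_step(w, b, x=2.0, target=4.0)
# After repeated updates, w*2 + b approaches the target 4.0.
```

In a real multilayer network the same error signal is propagated through every layer via the chain rule rather than a single neuron.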
A machine learning technique of creating multiple models on subsets of the same data and combining them to improve overall prediction.
The technique of splitting data into multiple subsets with replacement. Each sample is known as a bootstrap sample.
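Sampling with replacement can be sketched in a few lines (function name and seed are illustrative assumptions):

```python
# Minimal sketch of drawing bootstrap samples with replacement.
import random

def bootstrap_samples(data, n_samples, seed=0):
    rng = random.Random(seed)
    # Each bootstrap sample is the same size as the original data and is
    # drawn with replacement, so individual items may repeat.
    return [rng.choices(data, k=len(data)) for _ in range(n_samples)]

samples = bootstrap_samples([1, 2, 3, 4, 5], n_samples=3)
```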
Bias refers to a prejudice or favoritism for a particular object, person, or group. Collection and interpretation of data can be affected by bias.
A large corpus of data—which is typically both structured and unstructured.
A classification where the output can be one of two mutually exclusive categories.
A machine learning technique that combines weak classifiers into a classifier of higher accuracy.
Categorical data refers to data that falls into a discrete, mutually exclusive set, group, or class. Categorical features are known as discrete features.
A centroid is the center of a cluster as defined by a k-means algorithm.
Centroid-based clustering refers to algorithms that separate data into clusters around central points (centroids). Centroid-based clusters do not have a hierarchy.
A class is one of a set of label values.
A classification model presents an output from two or more discrete classes.
Clipping is a data preparation technique that involves trimming outliers present in datasets—below a minimum and above a maximum value.
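Trimming values to a minimum and maximum can be sketched as (function name assumed):

```python
def clip(values, minimum, maximum):
    # Trim outliers: any value below the minimum or above the maximum
    # is replaced by that bound.
    return [min(max(v, minimum), maximum) for v in values]

clip([-5, 0, 7, 120], 0, 100)  # → [0, 0, 7, 100]
```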
A technique that groups related items together.
Commonplace in recommendation systems, collaborative filtering is a technique where the interests of one individual are modeled around the interests of others with similar features.
The storing of data in columns rather than rows (as per many traditional database structures). Columnar data storage improves speed due to reduced disk load and improved data transfer.
A type of bias that refers to one confirming their own preexisting beliefs or opinions when interpreting data or information.
An n × n table that presents the success of a classification model based on the label and model classification. A confusion matrix can be used to calculate performance metrics including precision and recall.
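For the binary case, the four cells of the table and the derived metrics can be sketched as (names illustrative, not from the text):

```python
def confusion_matrix(labels, predictions):
    # Four cells of the 2x2 table for binary classification, counted by
    # comparing each true label to the model's classification.
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    return tp, fp, fn, tn

tp, fp, fn, tn = confusion_matrix([1, 0, 1, 1, 0], [1, 1, 1, 0, 0])
precision = tp / (tp + fp)  # 2 / 3
recall = tp / (tp + fn)     # 2 / 3
```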
A variable that can take any of an infinite number of values between its minimum and maximum.
The point during the training of a machine learning model at which additional training and validation no longer improve the accuracy of the model.
A neural network often used in image recognition that has at least one convolutional layer.
A cost function is a performance metric that is used to measure the error of a machine learning model.
A technique for evaluating how well a machine learning model will generalize when exposed to new data by testing it against subsets of data (validation set) withheld from the training data.
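A k-fold split of example indices can be sketched as follows (function name and contiguous-fold layout are assumptions):

```python
def kfold(n, k):
    # Split indices 0..n-1 into k folds; each fold in turn is the
    # held-out validation set and the remaining indices form the
    # training set.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        training = list(range(0, start)) + list(range(start + size, n))
        splits.append((training, validation))
        start += size
    return splits

splits = kfold(6, 3)  # three (training, validation) index pairs
```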
A dataframe is a labeled data structure that can contain columns of different types of data.
A dataset is a collection of examples.
The process of converting data from one form or type to another.
The separating line between classes learned by a machine learning model in classification problems.
A sequence of flowchart-like statements which represents possible decisions and their respective consequences.
A neural network with many hidden layers.
The process of reducing the number of features to represent a vector.
The term dimensions has several definitions, most often used to refer to the number of items in a feature vector.
A variable that can take a finite number of values.
A model that is learning in a continuous, real-time fashion.
The use of a collection of machine learning models to achieve better predictive performance.
The state of the world that contains the agent.
A collection of features. Examples can be labeled or unlabeled.
EDA is a data science technique that seeks to understand data insights through statistical analysis and graphical visualization.
A model output that is incorrectly predicted as false.
A model output that is incorrectly predicted as true.
An Apache data management framework used for the governance of data in Hadoop clusters.
A feature is a data attribute and its value. As an example, skin color is brown is a feature where skin color is the attribute and brown is the value.
The process of choosing the features required to explain the outputs of a statistical model while excluding irrelevant features.
The selection of features a machine learning model trains on.
A type of neural network that has no recursive or cyclical relationships. As such, data feeds forward.
A distributed service that collects, aggregates, and transfers large, real-time data into the HDFS (Hadoop Distributed File System).
An evaluation technique that combines precision and recall into a measure of classification effectiveness.
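The combination is the harmonic mean of precision and recall, which can be sketched as (function name assumed):

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1_score(0.5, 1.0)  # → 2/3, i.e. about 0.667
```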
The ability for a machine learning model to correctly make predictions on previously unseen data.
This term refers to how well a model fits an observation set and summarizes any differences between observed and expected values.
Hadoop is a commonly used, open source framework developed by Apache that caters for distributed processing of big datasets.
Common libraries, modules, and extensions that support Hadoop modules.
Hadoop MapReduce is a design framework for software development that facilitates the processing of large datasets in parallel.
Refers to a Hadoop database that enables the storage and management of sparse data.
A heuristic is a quick, approximate solution to a problem, used when an exact approach would be impractical.
A layer in a neural network between the input layer and output layer.
A type of clustering algorithm that organizes clusters into a hierarchy of ranked groups.
Hive is an open source library that enables Hadoop tasks to be programmed using SQL. This provides a relational database storage structure.
A hyperparameter refers to a configuration value, such as the learning rate, that is set before training and tuned between successive training runs of a machine learning model.
A separating boundary that enables classification of data points.
This form of bias refers to associations made automatically based on an individual’s mental constructs and memories which can affect the way machine learning models are designed and developed.
An open source, massively parallel processing database for Apache that enables data querying from HDFS or HBase.
The process of a machine learning model making a prediction when applied to an unlabeled dataset.
The layer of a neural network that receives the input.
The IoT refers to devices that are connected to the Internet and as such send data via sensors to a cloud-based ecosystem.
The degree to which a machine learning model’s behaviors can be explained. Regression models, for instance, may be easier to explain than deep neural network models.
A round of computing model weights and updating during training.
JSON, or JavaScript Object Notation, is a lightweight data file type that facilitates quicker retrieval of data, particularly in web-focused applications, as it is easier to parse and generate.
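Serializing and parsing JSON is part of Python's standard library, for example:

```python
import json

record = {"name": "example", "values": [1, 2, 3]}
text = json.dumps(record)    # serialize to a JSON string
restored = json.loads(text)  # parse the string back into Python objects
assert restored == record
```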
A machine learning API for Python.
This term refers to the (x, y) coordinates of features in an image.
A common unsupervised, clustering algorithm that groups data around centroid locations.
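The assign-then-update loop can be sketched in one dimension (a minimal illustration with assumed names, not a production implementation):

```python
def kmeans_1d(points, centroids, iterations=10):
    # Repeat: assign each point to its nearest centroid, then move each
    # centroid to the mean of the points assigned to it.
    for _ in range(iterations):
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        centroids = [sum(ps) / len(ps) if ps else c
                     for c, ps in clusters.items()]
    return sorted(centroids)

kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], centroids=[0.0, 10.0])
# the centroids settle near the two obvious groups, around 1.0 and 9.0
```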
The output or answer of an example.
Data that contain features and a label.
A time delay within a system.
A type of model that assigns weights to each feature to calculate predictions.
A value that represents how distant a model’s prediction is from the associated label.
Data, usually structured, that is created by devices or machines such as algorithms or sensors.
A subset of AI where computer models are trained to learn from their actions and environment over time with the intention of performing better.
The term given to an enormous collection of records, interchangeable with big data.
MATLAB is the language many university students begin with. It is useful for fast prototyping, as it contains a large machine learning repository.
A matrix is an array used for representation of data.
Data held about data; metadata gives information and context about what the underlying data item describes.
MongoDB is a NoSQL database, which is used to store unstructured data with no particular schema.
The signal representation a machine learning algorithm learns from provided training data.
The process of choosing a statistical model for machine learning from known models.
The process of choosing the best machine learning model for a given problem.
Classifications that are separated between two or more distinct groups or classes.
NewSQL databases maintain the integrity of typical relational SQL databases while providing the scalable performance of NoSQL structures.
Anything that is not part of the signal or is making the signal less apparent is noise.
The method of converting a range of values into a standard range or scale.
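Min-max scaling, one common form of this conversion, can be sketched as (function name assumed):

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    # Map each value from its original range [lo, hi] onto the
    # standard range [new_min, new_max].
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo)
            for v in values]

min_max_scale([10, 20, 30])  # → [0.0, 0.5, 1.0]
```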
A modern approach to databases that is useful for managing data that changes frequently or data that is unstructured or semi-structured in nature. Rows can all have their own set of unique column values. NoSQL has been driven by the requirements to provide better performance in storing big data. The architecture has better write performance, takes less storage space with compression, and reduces operational overhead.
An open source maths library for Python.
The metric that a machine learning model is attempting to optimize.
Also known as anomalies, outliers are values that are not consistent with the bulk of the data and distant from other values.
The final layer of a neural network that provides the model output.
Overfitting occurs when a machine learning model performs well on a training dataset but does not perform as well on a validation set. The model essentially learns the training data.
Attribute values that control the output and behavior of a machine learning system.
On the basis that the null hypothesis is true, the p value refers to the probability of receiving a value equal to or greater than the observed value.
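A worked example with an exact calculation (the coin-flip scenario and function name are illustrative assumptions): under the null hypothesis of a fair coin, the p value of observing 8 heads in 10 flips is the probability of 8 or more heads.

```python
from math import comb

def p_value_heads(n, k):
    # Under the null hypothesis of a fair coin, the p value is the
    # probability of seeing k or more heads in n flips.
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

p_value_heads(10, 8)  # (45 + 10 + 1) / 1024, about 0.055
```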
A measure of how correct a machine learning model is.
A mapping of states to actions used by an agent in reinforcement learning.
A metric used to evaluate classification models, reflecting the number of true positives over the total number of positive predictions.
A machine learning model’s output given input data.
A data management and monitoring framework to ensure data security.
R is a software environment and programming language for statistical computing and graphics. The R project has been designed as a data mining tool, while the R programming language is a high-level statistical language used for analysis.
Real-time data is data that is generated, transferred, processed, stored, and visualized within milliseconds.
This is a supervised learning technique in which the output is a real value.
Smart health refers to the use of mobile and IoT technologies for the better health and well-being of people and to improve quality of life.
An extremely useful, open source application to facilitate the transfer of data between Hadoop and traditional relational database systems.
Spark is a simple programming model that can be used with Java, Scala, Python, and R that enables large-scale data processing applications to be written quickly.
SQL (Structured Query Language) is a language used for managing data held in a traditional database management system.
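A short example using Python's built-in sqlite3 module (the table and data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory relational database
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada'), (2, 'Alan')")
rows = conn.execute(
    "SELECT name FROM users WHERE id = ?", (1,)
).fetchall()
# rows == [('Ada',)]
conn.close()
```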
A tensor is an object similar to a vector that is represented as an array that can hold data in N dimensions. A tensor is a generalization of a matrix in N-dimensional space.
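Representing tensors as nested lists, the number of dimensions and their sizes can be read off as a shape (a minimal sketch; the function name is an assumption):

```python
def shape(tensor):
    # The shape of a nested-list "tensor": its length along each
    # dimension, from the outermost to the innermost.
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)

shape(5)                          # ()     a scalar (rank-0 tensor)
shape([1, 2, 3])                  # (3,)   a vector
shape([[1, 2], [3, 4], [5, 6]])  # (3, 2) a matrix
```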
An Alphabet-backed, open source library for numerical computation optimized for machine learning that enables multilayered neural networks and quick training.
A model output that is correctly predicted as false.
A model output that is correctly predicted as true.
Underfitting occurs when a machine learning model is not able to capture the signal of the data and will have poor performance on both training and validation data.
An acronym for Extensible Markup Language, XML is another form of flat data file designed to make the importing, exporting, generation, and movement of data easier.