Appendix A. Glossary of terms

Manning’s policy is to go the extra mile for the reader and make sure the terms used in its books are defined in the books themselves. The reasoning is that this saves readers time when they encounter an unfamiliar term.

As is the case for any business book, this book’s audience is diverse. Depending on your background, you’ll be familiar with most of the terms used in this book but might appreciate definitions for a few of them. Which terms need defining differs from reader to reader. To avoid breaking the flow of the text with definitions of terms that most readers already know, I’ve collected the definitions in this glossary. To help readers encountering a term for the first time, I’ve written the definitions to be understandable, even if that comes at the expense of some formality and precision.

  • 4+1 architectural view— A methodology that describes a software architecture through a set of specialized views. See Kruchten’s paper [85] for details, or Wikipedia [86] for a summary.
  • Accuracy— In the context of a binary classifier, accuracy is a technical metric that measures the success of the classification algorithm. It’s the proportion of the data that’s correctly classified. The formula for accuracy [181] is

    \text{Accuracy} = \frac{\text{number of correctly classified examples}}{\text{total number of examples}}
  • Actuary— A business professional who uses quantitative methods to measure risk and uncertainty in business domains. Actuaries are licensed professionals who have a set of standards that guide practitioners through quantitative models that are appropriate to use in such situations. As a general rule, actuarial methods used in the insurance industry analyze the population as a whole and provide guidance on the outcomes expected for the entire population. Individual outcomes within that population can vary widely. See Wikipedia [182] for details.
  • Application programming interface (API)— A standardized way to invoke the functionality of a software system.
  • Artificial intelligence (AI)— For the purpose of this book, AI is defined as an area of computer science that studies how to allow computers to complete tasks that historically required human intelligence. Note that the definition of AI, as well as the relationship between AI and machine learning (ML), often depends on which source you use. Section 8.1 details some of the definitions of AI you can find in the industry today.
  • Autoregressive integrated moving average (ARIMA)— A statistical technique for analyzing a time series and forecasting the future values of the series [109]. ARIMA is based on autoregression (regression on previous values of the time series), differencing, and a moving average.
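    As an illustration, here is a minimal sketch of ARIMA forecasting with Python’s statsmodels library; the toy series and the (p, d, q) order are made up for the example:

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    # A toy time series; in practice this would be your historical data.
    series = np.array([112, 118, 132, 129, 121, 135, 148, 148, 136, 119], dtype=float)

    # order=(p, d, q): p autoregressive lags, d differencing steps, q moving-average lags.
    model = ARIMA(series, order=(1, 1, 1))
    fitted = model.fit()

    print(fitted.forecast(steps=3))  # predicted next three values of the series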
  • Bias-variance tradeoff— A method for decomposing the error of a predictive algorithm into components that can be traced to the structure of that algorithm. See Wikipedia [183] for details.
  • Big data— Data that is too big to process on a single machine. There are different definitions of big data. One common definition is based on V’s, such as Velocity (how often the data changes), Volume (how big the data is), Variety (how many different data sources and data types are in the data), and Veracity (how much you can trust the data). Historically, the number of V’s included in a definition has varied, with the early definitions based on three V’s: Volume, Velocity, and Variety.
  • Business vertical— A grouping of businesses, where all businesses in that group cater to a specific customer base that shares some common characteristics, such as profession (healthcare, for example) or need (transportation, for example).
  • Classification— In the context of ML, classification identifies to which of a set of predefined categories input data belongs. Classification can be between two classes (binary classification) or across multiple classes (multiclass classification).
  • Commercial-off-the-shelf (COTS) product— A product that’s already made and that you can purchase from someone else.
  • Cost of capital— In the context of starting a new business project, cost of capital is the minimum rate of return that must be exceeded for the project to be worthwhile.
  • Cross industry standard process for data mining (CRISP-DM)— A standard that defines the process for analytics and data mining. It predates the popularity of big data and AI. It’s an iterative process in which you start with understanding the business, then understanding the data. Next comes preparing the data for modeling, performing the modeling, and then evaluating the results. If the results are satisfactory, you then deploy the model; otherwise, you repeat the aforementioned cycle. See Wikipedia [184] for details.
  • Customer churn— The proportion of your recurring customers who decide to stop doing business with you.
  • Data lake— A single repository that stores all the data available to the organization, allowing you to reference it all during analysis. For a discussion of data lakes and the philosophy behind building them, see Gorelik’s book [103] and Needham’s book [185].
  • Data science— A multidisciplinary field that uses algorithms from many different quantitative and scientific fields to extract insights from data. As with many other areas that have captured the popular imagination, there’s no universal agreement on which fields are part of data science. Fields often considered part of data science include statistics, programming, mathematics, machine learning, operations research, and others [66]. Closely related fields that are sometimes considered part of data science include bioinformatics and quantitative analysis. While AI and data science closely overlap, they aren’t identical, because AI includes fields such as robotics that traditionally aren’t considered part of data science. Harris, Murphy, and Vaisman’s book [66] provides a good summary of the state of data science before the advancement of deep learning.
  • Data scientist— A practitioner of the field of data science. Many sources (including this book) classify AI practitioners as data scientists.
  • Database administrator (DBA)— A professional responsible for the maintenance of a database. Most commonly, a DBA would be responsible for maintaining an RDBMS-based database.
  • Deep learning— A subfield of AI that uses artificial neural networks arranged in a significant number of layers. In the last few years, deep learning algorithms have been successful in a large number of highly visible applications, including image processing and speech and audio recognition. Deep learning was also used in AI algorithms that played games such as Go and StarCraft at levels exceeding those of the best human players. See various appendix C entries [153–156, 176]. Deep learning algorithms demonstrate close-to-human (or even above-human) performance on many of these tasks and are behind many of the recent newsworthy successes of AI.
  • End user license agreement (EULA)— A legal agreement between a user and the company governing usage of the computer software.
  • Enterprise data warehouse (EDW)— A type of database that’s optimized for the reporting and analysis of enterprise data.
  • High-frequency trading (HFT)— In the context of the financial markets, HFT is a type of trading based on the combination of computer algorithms, high volume, and low latency.
  • Internet of Things (IoT)— A network consisting of various physical devices connected to the internet [46]. The type of physical devices varies widely, and, in principle, anything that performs any function in the physical world could be an IoT device. Examples of IoT devices range from smart thermometers [36,37] to connected vehicles and homes.
  • K-means— One of the original clustering algorithms, it assigns each input data point to one of K clusters, where K is an integer.
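    A minimal sketch of K-means clustering with scikit-learn; the two-dimensional points and the choice of K = 2 are made up for the example:

    from sklearn.cluster import KMeans

    # Toy 2-D points forming two rough groups.
    X = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
         [8.0, 8.1], [7.9, 8.3], [8.2, 7.9]]

    # K = 2: assign each point to one of two clusters.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    print(kmeans.labels_)           # cluster assignment for each input point
    print(kmeans.cluster_centers_)  # coordinates of the two cluster centers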
  • Label— In the context of classification, a label is the name of the category that data used in training belongs to.
  • Lean startup— A methodology for running business operations described in Ries’s book [28]. Some of the principles of the lean startup methodology are to shorten the business development cycle through iterative product development and to test the product in the marketplace as soon as practical. While originally described in the context of startups, this methodology is now extensively used by organizations of all sizes. See also minimum viable product (MVP) and pivot.
  • Linear response— A type of response in the system that’s proportional to the change in input. If an input change of 1 unit results in a change of x output units, then 2 units of input change would result in a change of 2x units in the system’s output.
  • Long short-term memory (LSTM)— A type of deep learning network characterized by the particular structure of the neural network [110]. It’s typically used in the prediction of future values in a time series.
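    A minimal sketch of an LSTM model for one-step-ahead time series prediction, using the Keras API from TensorFlow; the window length and layer sizes are arbitrary choices for the example:

    import tensorflow as tf

    window = 10  # number of past time steps used to predict the next value

    # One LSTM layer followed by a dense layer that outputs the predicted next value.
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(32, input_shape=(window, 1)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    # model.fit(X_train, y_train, epochs=10)  # X_train shape: (samples, window, 1)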
  • Mindshare— A measure of how well known a concept, idea, or product is and how often it comes to mind.
  • Minimum viable product (MVP)— A product that provides enough functionality to your customers so that your organization can learn if the business direction in which your company is moving is the correct one [28].
  • Operations research (OR)— A field of research that uses various mathematical and analytical methods to help in making better decisions. Historically, it developed as a part of applied mathematics and predates ML and AI. Today, OR is often considered one of the fields associated with data science.
  • Opportunity cost— Suppose you have a set of possible actions of which you can take only one. The opportunity cost of the action you take is the value of the most valuable choice among the actions you didn’t take [186]. For example, if action A would earn you $100 and action B would earn you $80, and you choose A, the opportunity cost of choosing A is $80.
  • Pivot— In the context of the lean startup methodology, a pivot is an act of structured course correction, designed to test a new hypothesis about the business or strategy [28].
  • Predictive analytics— A type of analytics that uses historical data to predict future trends. It answers the question: “What would happen next?”
  • Proportional-integral-derivative (PID) controller— According to the Wikimedia Foundation [34]:

    A proportional-integral-derivative controller is a control loop feedback mechanism widely used in industrial control systems and a variety of other applications requiring continuously modulated control.

    A PID controller continuously computes the error between the current value and the desired value (setpoint) of some process variable for the system under control and applies a correction based on proportional, integral, and derivative terms.
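    A minimal sketch of the correction a PID controller computes; the gains (kp, ki, kd), the time step, and the thermostat-style setpoint are made up for the example:

    # PID gains; in a real system these are tuned to the process being controlled.
    kp, ki, kd = 2.0, 0.5, 0.1
    setpoint = 21.0  # desired value of the process variable (e.g., temperature)

    integral, prev_error = 0.0, 0.0
    dt = 1.0  # time between control updates, in seconds

    def pid_step(measurement):
        global integral, prev_error
        error = setpoint - measurement
        integral += error * dt                   # integral term: accumulated past error
        derivative = (error - prev_error) / dt   # derivative term: rate of change of error
        prev_error = error
        # The correction is a weighted sum of the three terms.
        return kp * error + ki * integral + kd * derivative

    print(pid_step(19.5))  # correction when the measured value is 19.5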
  • Quantitative analysis (QA)— According to Will Kenton [187]:

    Quantitative analysis (QA) is a technique that seeks to understand behavior by using mathematical and statistical modeling, measurement, and research. Quantitative analysts aim to represent a given reality in terms of a numerical value.

  • Quantitative analyst (quant)— A practitioner of quantitative analysis [187]. Common business verticals in which quants work are trading and other financial services.
  • Recommendation engine— A software system that looks at the choices a user has made among items in the past and recommends a new item that the system expects will match the user’s interest. A recommendation engine can recommend many different types of items. In the case of a clothing retailer, an example of an item would be a sweater; in the case of Netflix, a movie.
  • Reinforcement learning— In the context of ML, reinforcement learning studies the design of agents that learn to maximize a long-term reward obtained from some environment.
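    A minimal sketch of the agent-environment loop using the Gymnasium library, with an agent that simply acts at random; a real reinforcement learning agent would instead learn a policy that maximizes the accumulated reward:

    import gymnasium as gym

    env = gym.make("CartPole-v1")
    observation, info = env.reset(seed=0)

    total_reward = 0.0
    done = False
    while not done:
        action = env.action_space.sample()  # random agent; a learning agent would improve its choices
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        done = terminated or truncated

    print(total_reward)  # the long-term reward the agent tries to maximize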
  • Relational database management system (RDBMS)— A database system with a strong mathematical foundation. That foundation is known as a “relational model.” RDBMSs are widely used in most of the non-big data applications and are often the default choice for data storage. RDBMSs typically use SQL as a language to query the data in the database.
  • Root mean square error (RMSE)— A technical metric that’s often used to measure the results of statistical, ML, and AI algorithms. It’s used to measure the difference between the quantities that the algorithm has predicted and the actual quantities that resulted. RMSE penalizes large prediction errors more than small prediction errors [188]. RMSE is defined by the following formula:

    \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{Y}_i - Y_i \right)^2}
    Where:

    • n = The number of points for which the ML algorithm has predicted the value
    • Ŷi = The predicted value of point i, according to the ML algorithm
    • Yi = The actual value of point i
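    A minimal sketch of computing RMSE with NumPy; the predicted and actual values are made up for the example:

    import numpy as np

    y_actual = np.array([3.0, 5.0, 2.5, 7.0])     # Yi: the actual values
    y_predicted = np.array([2.8, 5.4, 2.9, 6.1])  # Ŷi: the values the algorithm predicted

    rmse = np.sqrt(np.mean((y_predicted - y_actual) ** 2))
    print(rmse)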
  • Six Sigma— A set of methods that help an organization improve its business processes [21,22]. While Six Sigma historically has been used to improve the quality of manufacturing processes, the methods and practices associated with Six Sigma have been used extensively in all fields of business. Practitioners of Six Sigma view all work as processes that are subject to never-ending improvement. Six Sigma pioneered the use of data and statistical techniques to improve business operations.
  • Smart environment— An environment that can be computer controlled. It typically includes many IoT devices and uses a computer system and AI to control and orchestrate the behavior of those devices.
  • Streaming analytics— A type of analytics applied to streaming data to process that data within some deadline from the moment it arrives in the system. Streaming data is data that’s continuously arriving in the computer system.
  • Supervised learning— A type of ML in which the model is trained by presenting it with training data for which the correct result of applying the algorithm is known. For example, if the goal is to classify email messages as spam or not spam, the training set would consist of email messages labeled (known to be) spam and email messages known not to be spam.
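    Continuing the spam example, a minimal sketch of supervised learning with scikit-learn; the messages and labels are made up for the example:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Training data: messages whose correct labels are already known.
    messages = ["win a free prize now", "meeting moved to 3pm",
                "free money click here", "lunch tomorrow?"]
    labels = ["spam", "not spam", "spam", "not spam"]

    # Convert the text to word-count features, then train on the labeled examples.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(messages)
    classifier = MultinomialNB().fit(X, labels)

    # Classify a new, unseen message.
    print(classifier.predict(vectorizer.transform(["free prize click now"])))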
  • Training— In the context of ML algorithms, training is a necessary step in preparing the algorithm for performing its function. During training, data is presented to the model, and parameters of the model are optimized.
  • Unified modeling language (UML)— A standard for the description of the structure and behavior of software systems using visual diagrams. See OMG’s UML site [189] for details.
  • Unsupervised learning— A type of ML in which patterns are found in the unlabeled data. One example of unsupervised learning is clustering.
  • Zero-sum game— A game in which the success of one player comes at the expense of the other players. In zero-sum games, my gain is your loss, and vice versa.