Glossary

A/B testing (split testing): A method to test which product version works best in practice. Customers are randomly divided into groups and shown different versions of a product (such as an element on a website). At the end of the test period, the results are analysed to see which versions performed best relative to one or more metrics
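
A minimal sketch of how such results might be compared, using a two-proportion z-test from the statsmodels library (the conversion counts are made up):

    # Hypothetical results: conversions and visitors for variants A and B
    from statsmodels.stats.proportion import proportions_ztest

    conversions = [120, 145]    # made-up conversion counts
    visitors = [2400, 2390]     # made-up visitor counts

    z_stat, p_value = proportions_ztest(conversions, visitors)
    print(f"z = {z_stat:.2f}, p = {p_value:.3f}")   # a small p suggests a real difference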

Algorithm: A sequence of actions followed to arrive at a result
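
A classic small example is Euclid’s algorithm for the greatest common divisor, sketched here in Python:

    # Euclid's algorithm: repeat a fixed sequence of steps until the result is reached
    def gcd(a: int, b: int) -> int:
        while b:
            a, b = b, a % b
        return a

    print(gcd(48, 18))   # 6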

Analytic model: One or more mathematical formulas that together approximate a phenomenon of interest

Apache Software Foundation: A non-profit US corporation consisting of a decentralized open-source community of developers. It maintains much of the software used within the big data ecosystem

Artificial intelligence (AI): A general term for a machine that can respond intelligently to its environment

Artificial neural networks (ANN): Analytic models that learn tasks by training networks of basic nodes which are linked in sometimes complex architectures

Batch job: A computer job, such as a data transfer or a computation, that is run at regularly scheduled intervals (often daily), rather than continuously

Batch processing: A process that is executed as a series of consecutive batch jobs

Beam (Apache): An open-source programming model for defining data processing pipelines that run in both batch and streaming modes

Big data ecosystem: The technologies that have been developed to store, transfer and process big data

Black-box model: An analytic model whose internal workings cannot easily be explained or understood

Business intelligence (BI): The field of technology dealing with the transfer, storage and delivery of data specifically for reporting and analysis

CapEx: Capital Expenditure. An investment whose benefit extends over a long period, such as durable goods or the development of software that will be used for years. See also OpEx

Cloud computing: The use of hardware or software not owned by the end user but made available on demand according to some subscription model

Clustering: An analytic technique in which the data is divided into groups (clusters) in a way that attempts to group similar elements together
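
A minimal sketch using the k-means algorithm from scikit-learn (the points are made up):

    # k-means groups nearby points under the same cluster label
    import numpy as np
    from sklearn.cluster import KMeans

    points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],    # one group
                       [8.0, 8.0], [8.1, 7.9], [7.8, 8.2]])   # another group

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
    print(labels)   # e.g. [0 0 0 1 1 1]: similar points share a label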

Concurrency: When evaluating suitability of software, concurrency refers to the number of users that can use the software simultaneously

Cross-validation: A method to validate analytic models by repeatedly splitting the test data, training the model on part of the data, and then testing its effectiveness on the remaining data
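
A minimal sketch of five-fold cross-validation with scikit-learn (the dataset and model are illustrative choices):

    # Each fold: train on 4/5 of the data, score on the held-out 1/5
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores.mean())   # average accuracy across the five splits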

Dark data: A term for data which is generated by normal computer networks but not typically analysed

Data lakes: Any big data storage system designed to store raw data whose end use may not be known at time of collection

Data science: The practice of applying any number of analytic techniques using any number of data sources. The term implies the creative use of non-standard approaches in bringing business value

Data warehouses: Databases structured to facilitate analysis and reporting rather than to run operations

Deep learning: Utilizing artificial neural networks with many hidden layers (typically dozens or hundreds of layers)

Elasticsearch: A widely used enterprise search platform, similar in functionality to Apache Solr

Ensemble: The term for a collection of analytic models producing separate outputs, which are then merged in a democratic way to produce a single output
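
A minimal sketch of a majority-vote ensemble with scikit-learn (the three member models are illustrative choices):

    # Three models vote; the majority label wins
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    ensemble = VotingClassifier([
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier()),
        ("forest", RandomForestClassifier()),
    ], voting="hard")   # "hard" = a simple majority vote over the models' outputs
    print(ensemble.fit(X, y).score(X, y))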

ETL: Extract, Transform, Load. The steps through which data is moved from source systems to a data warehouse. Sometimes executed as ELT, in which the data is loaded before being transformed

Exabyte: 10^18 bytes, or 1000 petabytes

Expert systems: An AI that imitates the decision-making ability of a human expert, typically by learning and deducing facts and rules

Fast data: Data which appears at high velocity and must be received, analysed and responded to in real time

Feature engineering: Creating data fields not in the original records, but which you expect to be of explanatory value in an analytic model. An example would be calculating a field ‘time since last purchase’ from a database consisting only of purchase events
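
A minimal sketch of the ‘time since last purchase’ example using pandas (the purchase events are made up):

    # Derive a 'days since last purchase' field from raw purchase events
    import pandas as pd

    purchases = pd.DataFrame({
        "customer": ["anna", "anna", "ben"],   # made-up purchase events
        "date": pd.to_datetime(["2024-01-05", "2024-03-01", "2024-02-10"]),
    })

    today = pd.Timestamp("2024-04-01")
    last_purchase = purchases.groupby("customer")["date"].max()
    days_since = (today - last_purchase).dt.days
    print(days_since)   # anna: 31, ben: 51; a field not present in the raw events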

Flink: An open-source processing framework for streaming data

Forrester: An American market research and advisory firm

Forrester Wave: Forrester’s periodic evaluations of vendors in specific technology spaces

Gartner: An American research and advisory firm specializing in IT

Gartner Hype Cycle: A branded, graphical presentation developed by Gartner for representing the maturity and adoption of various technologies

Gartner Magic Quadrants: Analysis provided by Gartner comparing vendors for various technology offerings. Typically updated annually

General Data Protection Regulation (GDPR): A comprehensive EU regulation related to privacy, data protection and fair usage of data, effective May 2018

Gigabyte (GB): 10^9 bytes, or 1000 megabytes

Go: An ancient Chinese board game for two players. The goal is to surround the most territory with your stones

Goodness-of-fit test: A statistical test to assess how well a model fits the test data
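
A minimal sketch using the chi-square goodness-of-fit test from SciPy (the die-roll counts are made up):

    # Do 60 observed die rolls fit the model of a fair die?
    from scipy.stats import chisquare

    observed = [9, 11, 10, 8, 12, 10]    # made-up counts for faces 1 to 6
    stat, p = chisquare(observed)        # expected frequencies default to uniform
    print(p)                             # a large p gives no evidence of a bad fit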

Graphics processing unit (GPU): An electronic circuit specially designed for computer graphics or image processing

Hadoop (Apache): The foundational open-source software framework for distributed storage and processing of data. It uses HDFS for storage and MapReduce for processing

Hadoop Distributed File System (HDFS): The distributed, scalable file system used by Hadoop

Hive (Apache): Open-source data warehousing software that runs on Hadoop

Infrastructure as a Service (IaaS): Computer server space, networking and load balancers that are used on a subscription basis

Internet of Things (IoT): A term for the billions of devices in use today that have embedded sensors and processors plus network connectivity

JavaScript: A high-level programming language often used in web browsers

JSON (JavaScript Object Notation): A common, human-readable data storage format
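
For illustration, a record serialized to JSON and back with Python’s standard library:

    # A dict becomes a human-readable JSON string, and back again
    import json

    record = {"name": "Ada", "purchases": 3, "active": True}
    text = json.dumps(record)         # '{"name": "Ada", "purchases": 3, "active": true}'
    print(json.loads(text)["name"])   # Ada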

Kafka (Apache): A highly scalable open-source message queueing platform, originally developed by LinkedIn and open-sourced in 2011

Key performance indicator (KPI): A quantifiable measure of performance often used within organizations to set targets and measure progress

Lambda architecture: A data processing architecture designed to balance the requirements of fast data and accurate data storage

Latency: The time taken for data to move between points

Linkage attack: An attempt to de-anonymize private data by linking it to PII

Machine learning (ML): The process through which an AI program self-improves by continuously learning from training data

MapReduce: The programming model used in Hadoop for spreading data processing across a computer cluster
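
The model can be illustrated with the classic word count, sketched here in plain Python rather than on Hadoop itself:

    # Map phase: emit a (word, 1) pair for every word in every document
    from collections import defaultdict

    documents = ["big data", "fast data", "big big data"]
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle and reduce phase: sum the counts for each word
    counts = defaultdict(int)
    for word, n in mapped:
        counts[word] += n
    print(dict(counts))   # {'big': 3, 'data': 3, 'fast': 1}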

Massively parallel processing (MPP) databases: Databases that spread data across multiple servers or nodes, which communicate via a network but do not share memory or processors

Micro-conversions: Events that progress towards a goal but do not have significant value in themselves

Minimum viable product (MVP): A functioning product with the minimum features to satisfy early customers and generate feedback for future development

Models: See analytic model

Model training: An iterative process of adjusting model parameters to improve model fit to available data
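
A minimal sketch of the idea, fitting the one-parameter model y = w·x by gradient descent on made-up data:

    # Each iteration nudges the parameter w to reduce the squared error
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.1, 3.9, 6.2, 8.0]    # made-up data, roughly y = 2x

    w, rate = 0.0, 0.01
    for _ in range(200):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= rate * grad
    print(w)   # close to 2.0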

Monte Carlo simulations: Repeatedly entering random numbers into the distributions assumed to govern a process and then studying the outcomes
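
A minimal sketch, estimating pi by drawing random points in the unit square and counting how many fall inside the quarter circle:

    import random

    trials = 1_000_000
    hits = sum(1 for _ in range(trials)
               if random.random() ** 2 + random.random() ** 2 <= 1)
    print(4 * hits / trials)   # approaches 3.14159... as trials grows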

Neural networks: See artificial neural networks

NoSQL databases: Databases that allow storage and processing of data which is not necessarily in tabular form

OpEx: Operational Expenditure. An ongoing business cost. See also CapEx

Personally identifiable information (PII): Information that is unique to an individual, such as a passport number

Personas: A hypothesized user group with certain attributes, goals, and/or behaviours

Petabyte (PB): 10^15 bytes, or 1000 terabytes

Platform as a Service (PaaS): Cloud services to build and maintain the middleware that runs on the computer hardware and supports software applications

Principal component analysis: A mathematical technique that can be used to reduce the number of variables in a model
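
A minimal sketch with scikit-learn, reducing the four variables of the iris dataset to two principal components (the dataset is an illustrative choice):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)    # 150 rows, 4 variables
    reduced = PCA(n_components=2).fit_transform(X)
    print(reduced.shape)                 # (150, 2): two variables remain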

Private clouds: A technology cloud maintained by and used within a single organization

Public clouds: A technology cloud maintained by a third party and made available according to some subscription model

RAM: Random access memory. Computer memory in which any byte can be accessed directly, without reading through the preceding bytes

RASCI: A framework for defining project responsibility, dividing roles into Responsible, Accountable, Supporting, Consulted and Informed individuals

REST (Representational State Transfer) service: A simple, well-defined computer architecture often used to deliver information between computers across the web
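
A minimal sketch of a REST call using the Python requests library (httpbin.org is a public test service, used here purely for illustration):

    import requests

    # A GET request to a REST endpoint, returning JSON over HTTP
    response = requests.get("https://httpbin.org/get", params={"q": "big data"})
    print(response.status_code)           # 200 on success
    print(response.json()["args"]["q"])   # the echoed query parameter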

Return on investment (ROI): A measure of the benefit of an investment. There are several ways to calculate ROI

Safe Harbour Decision: A ruling by the European Commission in 2000 which allowed US companies complying with certain data governance standards to transfer data from the EU to the US. On 6 October 2015, the European Court of Justice invalidated the EC’s Safe Harbour Decision. A replacement, the EU-US Privacy Shield, was approved by the European Commission nine months later (July 2016)

Salesforce (salesforce.com): A popular, cloud-based software for managing customer data and assisting sales efforts

Self-service analytics: When end users are given the data and tools to generate their own basic analysis, pivot tables and charts

Semi-structured data: Unstructured data to which a few structured fields are added, such as adding time and location fields to free-text data

Software as a Service (SaaS): Centrally hosted software that is used on a subscription basis

Software framework: Software providing general, extendible, low-level functionality that can be leveraged by more specialized software

Solr (Apache): An open-source, stand-alone full-text search platform often used by enterprises to manage text search

Spark (Apache): A computing framework, developed at UC Berkeley’s AMPLab, which runs distributed computations in RAM. It has replaced Hadoop’s MapReduce in many applications

Split testing: See A/B testing

Structured Query Language (SQL): The standard language for inserting data into, and retrieving data from, relational databases
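
A minimal sketch of SQL statements, run here against Python’s built-in SQLite database:

    import sqlite3

    db = sqlite3.connect(":memory:")     # an in-memory relational database
    db.execute("CREATE TABLE sales (product TEXT, amount REAL)")
    db.executemany("INSERT INTO sales VALUES (?, ?)",
                   [("widget", 9.99), ("widget", 4.50), ("gadget", 20.00)])

    # Retrieve totals per product
    for row in db.execute("SELECT product, SUM(amount) FROM sales GROUP BY product"):
        print(row)   # e.g. ('gadget', 20.0) then ('widget', 14.49)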

Technology stack: A collection of software components that interact to form a complete technology solution

Terabyte (TB): 1012 bytes, or 1000 gigabytes

TPU (tensor processing unit): An application-specific integrated circuit developed by Google for machine learning

Training: See model training

Training data: The data used to fit the parameters of an analytic model

Unstructured data: Data such as free text or video that is not divided into predefined data fields

Version control system (VCS): A type of software tool that controls and archives changes to code, as well as other documents

XML (eXtensible Markup Language): A format for encoding data in a document that is both machine and human readable, as defined by certain standard specifications

Yottabyte: 10^24 bytes, or 1000 zettabytes

Zettabyte: 10^21 bytes, or 1000 exabytes
