Glossary

A/B testing (split testing): A method to test which product version works best in practice. Customers are randomly divided into groups and shown different versions of a product (such as an element on a website). At the end of the test period, the results are analysed to see which versions performed best relative to one or more metrics
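
A minimal sketch of how such results might be compared, using a two-proportion z-test from the statsmodels library (the conversion counts are made up):

    # Hypothetical results: conversions and visitors for variants A and B
    from statsmodels.stats.proportion import proportions_ztest

    conversions = [120, 145]    # made-up conversion counts
    visitors = [2400, 2390]     # made-up visitor counts

    z_stat, p_value = proportions_ztest(conversions, visitors)
    print(f"z = {z_stat:.2f}, p = {p_value:.3f}")   # a small p suggests a real difference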

Algorithm: A sequence of actions followed to arrive at a result
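
A classic small example is Euclid’s algorithm for the greatest common divisor, sketched here in Python:

    # Euclid's algorithm: repeat a fixed sequence of steps until the result is reached
    def gcd(a: int, b: int) -> int:
        while b:
            a, b = b, a % b
        return a

    print(gcd(48, 18))   # 6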

Analytic model: One or more mathematical formulas that together approximate a phenomenon of interest

Apache Software Foundation: A non-profit US corporation consisting of a decentralized open-source community of developers. It maintains much of the software used within the big data ecosystem

Artificial intelligence (AI): A general term for a machine that can respond intelligently to its environment

Artificial neural networks (ANN): Analytic models that learn tasks by training networks of basic nodes which are linked in sometimes complex architectures

Batch job: A computer job, such as a data transfer or a computation, that is run at regularly scheduled intervals (often daily), rather than continuously

Batch processing: A process that is executed as a series of consecutive batch jobs

Beam (Apache): An open-source programming model for defining data processing pipelines that run in both batch and streaming modes

Big data ecosystem: The technologies that have been developed to store, transfer and process big data

Black-box model: An analytic model whose internal workings cannot easily be explained or understood

Business intelligence (BI): The field of technology dealing with the transfer, storage and delivery of data specifically for reporting and analysis

CapEx: Capital Expenditure. An investment whose benefit extends over a long period, such as durable goods or the development of software that will be used for years. See also OpEx

Cloud computing: The use of hardware or software not owned by the end user but made available on demand according to some subscription model

Clustering: An analytic technique in which the data is divided into groups (clusters) in a way that attempts to group similar elements together
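
A minimal sketch using the k-means algorithm from scikit-learn (the points are made up):

    # k-means groups nearby points under the same cluster label
    import numpy as np
    from sklearn.cluster import KMeans

    points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],    # one group
                       [8.0, 8.0], [8.1, 7.9], [7.8, 8.2]])   # another group

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
    print(labels)   # e.g. [0 0 0 1 1 1]: similar points share a label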

Concurrency: When evaluating suitability of software, concurrency refers to the number of users that can use the software simultaneously

Cross-validation: A method to validate analytic models by repeatedly splitting the test data, training the model on part of the data, and then testing its effectiveness on the remaining data
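
A minimal sketch of five-fold cross-validation with scikit-learn (the dataset and model are illustrative choices):

    # Each fold: train on 4/5 of the data, score on the held-out 1/5
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
    print(scores.mean())   # average accuracy across the five splits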

Dark data: A term for data which is generated by normal computer networks but not typically analysed

Data lakes: Any big data storage system designed to store raw data whose end use may not be known at time of collection

Data science: The practice of applying any number of analytic techniques using any number of data sources. The term implies the creative use of non-standard approaches in bringing business value

Data warehouses: Databases structured to facilitate analysis and reporting rather than to run operations

Deep learning: Utilizing artificial neural networks with many hidden layers (typically dozens or hundreds of layers)

Elasticsearch: A widely used enterprise search platform, similar in functionality to Apache Solr

Ensemble: The term for a collection of analytic models producing separate outputs, which are then merged in a democratic way to produce a single output
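
A minimal sketch of a majority-vote ensemble with scikit-learn (the three member models are illustrative choices):

    # Three models vote; the majority label wins
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    ensemble = VotingClassifier([
        ("lr", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier()),
        ("forest", RandomForestClassifier()),
    ], voting="hard")   # "hard" = a simple majority vote over the models' outputs
    print(ensemble.fit(X, y).score(X, y))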

ETL: Extract, Transform, Load. The steps through which data is moved from source systems to a data warehouse. Sometimes executed as ELT, in which the data is loaded before being transformed

Exabyte: 10^18 bytes, or 1000 petabytes

Expert systems: An AI that imitates the decision-making ability of a human expert, typically by learning and deducing facts and rules

Fast data: Data which appears at high velocity and must be received, analysed and responded to in real time

Feature engineering: Creating data fields not in the original records, but which you expect to be of explanatory value in an analytic model. An example would be calculating a field ‘time since last purchase’ from a database consisting only of purchase events
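
A minimal sketch of the ‘time since last purchase’ example using pandas (the purchase events are made up):

    # Derive a 'days since last purchase' field from raw purchase events
    import pandas as pd

    purchases = pd.DataFrame({
        "customer": ["anna", "anna", "ben"],   # made-up purchase events
        "date": pd.to_datetime(["2024-01-05", "2024-03-01", "2024-02-10"]),
    })

    today = pd.Timestamp("2024-04-01")
    last_purchase = purchases.groupby("customer")["date"].max()
    days_since = (today - last_purchase).dt.days
    print(days_since)   # anna: 31, ben: 51; a field not present in the raw events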

Flink: An open-source processing framework for streaming data

Forrester: An American market research and advisory firm

Forrester Wave: Forrester’s periodic evaluations of vendors in specific technology spaces

Gartner: An American research and advisory firm specializing in IT

Gartner Hype Cycle: A branded, graphical presentation developed by Gartner for representing the maturity and adoption of various technologies

Gartner Magic Quadrants: Analysis provided by Gartner comparing vendors for various technology offerings. Typically updated annually

General Data Protection Regulation (GDPR): A comprehensive EU regulation related to privacy, data protection and fair usage of data, effective May 2018

Gigabyte (GB): 10^9 bytes, or 1000 megabytes

Go: An ancient Chinese board game for two players. The goal is to surround the most territory with your stones

Goodness-of-fit test: A statistical test to assess how well a model fits the test data
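
A minimal sketch using the chi-square goodness-of-fit test from SciPy (the die-roll counts are made up):

    # Do 60 observed die rolls fit the model of a fair die?
    from scipy.stats import chisquare

    observed = [9, 11, 10, 8, 12, 10]    # made-up counts for faces 1 to 6
    stat, p = chisquare(observed)        # expected frequencies default to uniform
    print(p)                             # a large p gives no evidence of a bad fit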

Graphics processing unit (GPU): An electronic circuit specially designed for computer graphics or image processing

Hadoop (Apache): The foundational open-source software framework for distributed storage and processing of data. It uses HDFS for storage and MapReduce for processing

Hadoop Distributed File System (HDFS): The distributed, scalable file system used by Hadoop

Hive (Apache): Open-source data warehousing software that runs on Hadoop

Infrastructure as a Service (IaaS): Computer server space, networking and load balancers that are used on a subscription basis

Internet of Things (IoT): A term for the billions of devices in use today that have embedded sensors and processors plus network connectivity

JavaScript: A high-level programming language often used in web browsers

JSON (JavaScript Object Notation): A common, human-readable data storage format
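
For illustration, a record serialized to JSON and back with Python’s standard library:

    # A dict becomes a human-readable JSON string, and back again
    import json

    record = {"name": "Ada", "purchases": 3, "active": True}
    text = json.dumps(record)         # '{"name": "Ada", "purchases": 3, "active": true}'
    print(json.loads(text)["name"])   # Ada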

Kafka (Apache): A highly scalable open-source message queueing platform, originally developed by LinkedIn and open-sourced in 2011

Key performance indicator (KPI): A quantifiable measure of performance often used within organizations to set targets and measure progress

Lambda architecture: A data processing architecture designed to balance the requirements of fast data and accurate data storage

Latency: The time taken for data to move between points

Linkage attack: An attempt to de-anonymize private data by linking it to PII

Machine learning (ML): The process through which an AI program self-improves by continuously learning from training data

MapReduce: The programming model used in Hadoop for spreading data processing across a computer cluster
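
The model can be illustrated with the classic word count, sketched here in plain Python rather than on Hadoop itself:

    # Map phase: emit a (word, 1) pair for every word in every document
    from collections import defaultdict

    documents = ["big data", "fast data", "big big data"]
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle and reduce phase: sum the counts for each word
    counts = defaultdict(int)
    for word, n in mapped:
        counts[word] += n
    print(dict(counts))   # {'big': 3, 'data': 3, 'fast': 1}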

Massively parallel processing (MPP) databases: Databases that spread data across multiple servers or nodes, which communicate via a network but do not share memory or processors

Micro-conversions: Events that progress towards a goal but do not have significant value in themselves

Minimum viable product (MVP): A functioning product with the minimum features to satisfy early customers and generate feedback for future development

Models: See analytic model

Model training: An iterative process of adjusting model parameters to improve model fit to available data
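
A minimal sketch of the idea, fitting the one-parameter model y = w·x by gradient descent on made-up data:

    # Each iteration nudges the parameter w to reduce the squared error
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.1, 3.9, 6.2, 8.0]    # made-up data, roughly y = 2x

    w, rate = 0.0, 0.01
    for _ in range(200):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= rate * grad
    print(w)   # close to 2.0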

Monte Carlo simulations: Repeatedly entering random numbers into the distributions assumed to govern a process and then studying the outcomes
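
A minimal sketch, estimating pi by drawing random points in the unit square and counting how many fall inside the quarter circle:

    import random

    trials = 1_000_000
    hits = sum(1 for _ in range(trials)
               if random.random() ** 2 + random.random() ** 2 <= 1)
    print(4 * hits / trials)   # approaches 3.14159... as trials grows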

Neural networks: See artificial neural networks

NoSQL databases: Databases that allow storage and processing of data which is not necessarily in tabular form

OpEx: Operational Expenditure. An ongoing business cost. See also CapEx

Personally identifiable information (PII): Information that is unique to an individual, such as a passport number

Personas: A hypothesized user group with certain attributes, goals, and/or behaviours

Petabyte (PB): 10^15 bytes, or 1000 terabytes

Platform as a Service (PaaS): Cloud services to build and maintain the middleware that runs on the computer hardware and supports software applications

Principal component analysis: A mathematical technique that can be used to reduce the number of variables in a model
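
A minimal sketch with scikit-learn, reducing the four variables of the iris dataset to two principal components (the dataset is an illustrative choice):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)    # 150 rows, 4 variables
    reduced = PCA(n_components=2).fit_transform(X)
    print(reduced.shape)                 # (150, 2): two variables remain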

Private clouds: A technology cloud maintained by and used within a single organization

Public clouds: A technology cloud maintained by a third party and made available according to some subscription model

RAM: Random access memory. Computer memory in which any byte can be accessed directly, without reading through the preceding bytes

RASCI: A framework for defining project responsibility, dividing roles into Responsible, Accountable, Supporting, Consulted and Informed individuals

REST (Representational State Transfer) service: A simple, well-defined computer architecture often used to deliver information between computers across the web
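
A minimal sketch of a REST call using the Python requests library (httpbin.org is a public test service, used here purely for illustration):

    import requests

    # A GET request to a REST endpoint, returning JSON over HTTP
    response = requests.get("https://httpbin.org/get", params={"q": "big data"})
    print(response.status_code)           # 200 on success
    print(response.json()["args"]["q"])   # the echoed query parameter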

Return on investment (ROI): A measure of the benefit of an investment. There are several ways to calculate ROI

Safe Harbour Decision: A ruling by the European Commission in 2000 which allowed US companies complying with certain data governance standards to transfer data from the EU to the US. On 6 October 2015, the European Court of Justice invalidated the EC’s Safe Harbour Decision. A replacement, the EU-US Privacy Shield, was approved by the European Commission nine months later (July 2016)

Salesforce (salesforce.com): A popular, cloud-based software for managing customer data and assisting sales efforts

Self-service analytics: When end users are given the data and tools to generate their own basic analysis, pivot tables and charts

Semi-structured data: Unstructured data to which a few structured fields are added, such as adding time and location fields to free-text data

Software as a Service (SaaS): Centrally hosted software that is used on a subscription basis

Software framework: Software providing general, extendible, low-level functionality that can be leveraged by more specialized software

Solr (Apache): An open-source, stand-alone full-text search platform often used by enterprises to manage text search

Spark (Apache): A computing framework, developed at UC Berkeley’s AMPLab, which runs distributed computations in RAM. It has replaced Hadoop’s MapReduce in many applications

Split testing: See A/B testing

Structured Query Language (SQL): The standard language for inserting data into, and retrieving data from, relational databases
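
A minimal sketch of SQL statements, run here against Python’s built-in SQLite database:

    import sqlite3

    db = sqlite3.connect(":memory:")     # an in-memory relational database
    db.execute("CREATE TABLE sales (product TEXT, amount REAL)")
    db.executemany("INSERT INTO sales VALUES (?, ?)",
                   [("widget", 9.99), ("widget", 4.50), ("gadget", 20.00)])

    # Retrieve totals per product
    for row in db.execute("SELECT product, SUM(amount) FROM sales GROUP BY product"):
        print(row)   # e.g. ('gadget', 20.0) then ('widget', 14.49)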

Technology stack: A collection of software components that interact to form a complete technology solution

Terabyte (TB): 1012 bytes, or 1000 gigabytes

TPU (tensor processing unit): An application-specific integrated circuit developed by Google for machine learning

Training: See model training

Training data: The data used to fit the parameters of an analytic model

Unstructured data: Data such as free text or video that is not divided into predefined data fields

Version control system (VCS): A type of software tool that controls and archives changes to code, as well as other documents

XML (eXtensible Markup Language): A format for encoding data in a document that is both machine and human readable, as defined by certain standard specifications

Yottabyte: 10^24 bytes, or 1000 zettabytes

Zettabyte: 10^21 bytes, or 1000 exabytes
