Glossary

A

Aggregation The process of searching, gathering, and presenting data.

Algorithms Step-by-step mathematical procedures that can perform certain analyses on data.

Analytics The discovery and communication of insights from data.

Anomaly detection The search for items in a dataset that do not match a projected pattern or expected behavior. Anomalies are also called outliers, exceptions, surprises, or contaminants, and they often provide critical and actionable information.
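
A minimal sketch of one common approach, flagging any value that lies far from the mean; the sensor readings and the two-standard-deviation threshold below are illustrative assumptions, not part of the definition.

```python
# Flag values that deviate strongly from the rest of a dataset.
# A simple rule of thumb: anything more than 2 standard deviations
# from the mean is treated as an anomaly (the threshold is a common
# convention, not a universal rule).
from statistics import mean, stdev

def find_anomalies(values, threshold=2.0):
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

readings = [10.1, 9.8, 10.3, 9.9, 10.0, 42.7, 10.2]  # invented data
print(find_anomalies(readings))  # -> [42.7]
```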

Anonymization Making data anonymous; removing all data points that could identify a person.

Application Computer software that enables a device to perform a certain task.

Artificial Intelligence The development of machines and software capable of perceiving the environment, taking corresponding action when required, and even learning from those actions.

B

Behavioral analytics Analytics that reveal the how, why, and what, instead of just the who and when, by looking at humanized patterns in the data.

Big Data Scientist Someone able to develop the algorithms that make sense out of Big Data.

Big Data startup A young company that has developed new Big Data technology.

Biometrics The identification of humans by their characteristics.

Brontobytes Approximately 1000 yottabytes; the size of the digital universe tomorrow. Written out in bytes, a brontobyte is a 1 followed by 27 zeros.

Business Intelligence The theories, methodologies, and processes to make data understandable.

C

Classification analysis A systematic process for obtaining important and relevant information about data; this information is also called metadata, that is, data about data.

Cloud computing A distributed computing system over a network used for storing data off-premises.

Clustering analysis The process of identifying similar objects and clustering them to understand the differences as well as the similarities within the data.
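
As a rough illustration, a bare-bones k-means pass on invented one-dimensional data; the values and the choice of k = 2 are assumptions made for the sketch.

```python
# Group similar values into k clusters with a minimal k-means loop.
# One-dimensional data keeps the sketch short; real clustering
# usually works on many features at once.
def kmeans_1d(values, k, iterations=20):
    centers = values[:k]  # naive initialization
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            # assign each value to its nearest cluster center
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

data = [1.0, 1.2, 0.8, 9.9, 10.1, 10.3]  # invented data
print(kmeans_1d(data, k=2))  # [[1.0, 1.2, 0.8], [9.9, 10.1, 10.3]]
```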

Cold data storage Storing old, rarely used data on low-power servers. Retrieving the data will take longer.

Comparative analysis A step-by-step procedure for comparisons and calculations that detect patterns within very large datasets.

Complex structured data Data that is composed of two or more complex, complicated, and interrelated parts that cannot be easily interpreted by structured query languages and tools.

Computer generated data Data generated by computers, such as log files.

Concurrency The performance and execution of multiple tasks and processes at the same time.

Correlation analysis The analysis of data to determine a relationship between variables and whether that relationship is negative (–1.00) or positive (+1.00).
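
A minimal sketch with invented data, computing the Pearson coefficient that places a relationship on that –1.00 to +1.00 scale.

```python
# Compute the Pearson correlation coefficient between two variables.
# The result falls between -1.00 (perfect negative relationship)
# and +1.00 (perfect positive relationship); 0 means no linear relation.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

ad_spend = [10, 20, 30, 40, 50]  # invented data
sales    = [12, 24, 33, 46, 52]
print(round(pearson(ad_spend, sales), 2))  # 0.99: strong positive
```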

Customer Relationship Management Managing sales and business processes; Big Data will affect CRM strategies.

D

Dashboard A graphical representation of the analyses performed by algorithms.

Data aggregation tools Tools that transform scattered data from numerous sources into a single new source.

Data analyst Someone who analyzes, models, cleans, or processes data.

Database A digital collection of data stored via a certain technique.

Database-as-a-Service A database hosted in the cloud on a pay-per-use basis, as, for example, Amazon Web Services.

Database Management System Software for collecting, storing, and providing access to data.

Data center A physical location that houses the servers for storing data.

Data cleansing The process of reviewing and revising data to delete duplicates, correct errors, and provide consistency.

Data custodian Someone responsible for the technical environment necessary for data storage.

Data ethical guidelines Guidelines that help organizations be transparent with their data, ensuring simplicity, security, and privacy.

Data feed A stream of data, such as a Twitter feed or RSS.

Data marketplace An online environment to buy and sell datasets.

Data mining The process of finding certain patterns or information from datasets.

Data modeling The analysis of data objects using data modeling techniques to create insights.

Dataset A collection of data.

Data virtualization A data integration process used to gain more insights; it usually involves databases, applications, file systems, websites, and Big Data techniques.

Deidentification The same as anonymization; ensuring a person cannot be identified through the data.

Descriptive analytics A form of analysis that helps organizations understand what happened in the past.

Discriminant analysis A statistical analysis that catalogues data by distributing it into groups, classes, or categories; it is used when certain groups or clusters in the data are known up front and applies that information to derive the classification rule.

Distributed File System A system that offers simplified, highly available access to storing, analyzing, and processing data.

Document Store Database A document-oriented database especially designed to store, manage, and retrieve documents that are also known as semistructured data.

E

Exabytes Approximately 1000 petabytes or 1 billion gigabytes. Today, one exabyte of new information is created globally on a daily basis.

Exploratory analysis The finding of patterns within data without standard procedures or methods. It is a means of discovering the data and identifying the dataset's main characteristics.

Extract, Transform, and Load (ETL) A database and data warehousing process that extracts data from various sources, transforms it to fit operational needs, and loads it into the database.
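
A toy sketch of the three steps, assuming a hypothetical sales.csv source file and a SQLite target; the file name and column names are invented.

```python
# A miniature ETL pipeline: extract rows from a CSV source,
# transform them to fit operational needs, and load them into a
# SQLite database. File name and columns are illustrative only.
import csv
import sqlite3

def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Normalize names and convert currency amounts to integer cents.
    return [(r["name"].strip().title(), int(float(r["amount"]) * 100))
            for r in rows]

def load(records, db_path="warehouse.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, cents INTEGER)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", records)
    con.commit()
    con.close()

load(transform(extract("sales.csv")))  # hypothetical source file
```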

F

Failover Switching automatically to a different server or node should one fail.

Fault-tolerant design A system designed to continue working even if certain parts fail.

G

Gamification Using game elements in a nongame context; very useful for creating data, and therefore dubbed the friendly scout of Big Data.

Graph Databases Databases that use graph structures (a finite set of ordered pairs or certain entities) with edges, properties, and nodes for data storage. They provide index-free adjacency, meaning that every element is directly linked to its neighboring elements.

Grid computing A method of connecting different computer systems in various locations, often via the cloud, to reach a common goal.

H

Hadoop An open-source framework built to enable the processing and storage of Big Data across a distributed file system.

HBase An open-source, nonrelational, distributed database running in conjunction with Hadoop.

HDFS (Hadoop Distributed File System) A distributed file system designed to run on commodity hardware.

High-Performance Computing (HPC) The use of supercomputers to solve highly complex and advanced computing problems.

I

Industrial Internet The integration of complex physical machinery with networked sensors and software.

In-memory A database management system that stores data in main memory instead of on disk, resulting in very fast processing, storing, and loading of the data.

Internet of Things Ordinary devices that are connected to the Internet at any time and anywhere through sensors.

J

Juridical data compliance The need to observe the laws of the country in which data is stored. This is particularly relevant when you use cloud solutions.

K

Key-value databases Databases that use a primary key, a uniquely identifiable record, which makes looking up information easy and fast. The data stored under a key is usually a primitive of the programming language, such as a string, an integer, or an array.
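
In miniature, the model behaves like a dictionary lookup; the keys and values below are invented for illustration.

```python
# A key-value store in miniature: each record is reached directly
# through its unique key, making lookups simple and fast.
store = {}

# put: associate a primitive value (string, number, list) with a key
store["user:1001:name"] = "Ada Lovelace"
store["user:1001:logins"] = 42

# get: a single lookup by key, with no joins or query planning needed
print(store["user:1001:name"])       # Ada Lovelace
print(store.get("user:9999", None))  # missing keys return a default
```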

L

Latency A measure of time delayed in a system.

Legacy system An old system, technology, or computer that is no longer supported.

Load balancing The distribution of the workload across multiple computers or servers to achieve optimal results and utilization of the system.

Location data GPS data describing a geographical location.

Log file A file automatically created by a computer to record events that occur while it is operational.

M

Machine2Machine data Data exchanged between two or more machines that communicate with each other.

Machine data Data created by machines using sensors or algorithms.

Machine learning Part of artificial intelligence in which machines learn from what they are doing and improve over time.

MapReduce A software framework for processing vast amounts of data.
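
The canonical illustration is a word count: a map step emits one pair per word, and a reduce step sums the pairs. A plain-Python sketch of the idea follows (not the API of any particular framework).

```python
# Word count, the classic MapReduce example, sketched in plain
# Python. Real frameworks run map and reduce across many machines.
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Sum the counts for each distinct word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data is big", "data about data"]  # invented documents
pairs = [p for d in docs for p in map_phase(d)]
print(reduce_phase(pairs))  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```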

Massively Parallel Processing (MPP) The use of many different processors (or computers) to perform certain computational tasks simultaneously.

Metadata Data about data; it provides information about what the data is about.

MongoDB An open-source NoSQL database.

Multidimensional databases Databases optimized for online analytical processing (OLAP) applications and data warehousing.

MultiValue Databases A type of NoSQL and multidimensional database that understands three-dimensional data directly. These databases are primarily giant strings, which makes them perfect for directly manipulating HTML and XML strings.

N

Natural language processing A field of computer science that studies interactions between computers and human languages.

Network analysis The viewing of relationships among nodes in terms of network or graph theory; it means analyzing the connections between nodes in a network and the strength of those ties.

NewSQL A class of elegant, well-defined database systems that aim to combine SQL and its transactional guarantees with NoSQL-style scalability. It is even newer than NoSQL.

NoSQL Sometimes referred to as “Not only SQL,” a class of databases that do not adhere to traditional relational database structures; they typically relax strict consistency in order to achieve higher availability and horizontal scaling.

O

Object databases Databases that store information in the form of objects, as used by object-oriented programming. They are different from relational or graph databases, and most of them offer a query language that allows an object to be found with a declarative programming approach.

Object-based image analysis Analysis of digital images that uses data from groups of related pixels, called objects or image objects, rather than from individual pixels alone.

Ontology The representation of knowledge as a set of concepts within a domain and the relationships between those concepts.

Operational databases Databases that perform the regular operations of an organization and are generally very important to a business. They use online transaction processing that allows them to enter, collect, and retrieve specific information about the company.

Optimization analysis The process of optimization during the design cycle of products, performed by algorithms; it allows companies to virtually design many different variations of a product and test them against preset variables.

Outlier detection An outlier is an object that deviates significantly from the general average within a dataset or a combination of data. It is numerically distant from the rest of the data and, therefore, indicates that something is going on and requires additional analysis.

P

Pattern recognition The identification of patterns in data using algorithms to make predictions about new data coming from the same source.

Petabytes Approximately 1000 terabytes or 1 million gigabytes. The CERN Large Hadron Collider generates approximately 1 petabyte per second.

Platform-as-a-Service Services providing all the necessary infrastructure for cloud computing solutions.

Predictive analysis The most valuable analysis within Big Data, as it helps predict what someone is likely to buy, visit, or do, and how someone will behave in the (near) future. It uses a variety of datasets, such as historical, transactional, social, and customer-profile data, to identify risks and opportunities.

Prescriptive analytics Analysis that foresees not only what will happen and when it will happen, but why it will happen, and provides recommendations about how to take advantage of the predictions.

Privacy The seclusion of certain data/information about oneself that is deemed personal.

Public data Public information or datasets created with public funding.

Q

Quantified self A movement to use applications to track one's every move during the day in order to gain a better understanding about one's behavior.

Query A request for information to answer a certain question.

R

Real-time Data Data that is created, processed, stored, analyzed, and visualized within milliseconds.

Recommendation engine An algorithm that suggests certain products based on previous buying behavior or the buying behavior of others.

Regression analysis An analysis that defines the dependency between variables. It assumes a one-way causal effect from one variable to the response of another variable.
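
A minimal sketch of the simplest case, ordinary least squares with one explanatory variable; the data and function name are invented for illustration.

```python
# Simple linear regression by ordinary least squares: estimate how
# a response variable y depends on an explanatory variable x.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope: covariance of x and y divided by variance of x
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b  # intercept, slope

hours  = [1, 2, 3, 4, 5]          # invented data
scores = [52, 58, 61, 67, 72]
a, b = fit_line(hours, scores)
print(f"score = {a:.1f} + {b:.1f} * hours")  # score = 47.3 + 4.9 * hours
```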

Reidentification The combination of several datasets to find a certain person within anonymized data.

RFID (Radio Frequency Identification) A type of sensor that uses wireless, noncontact radio-frequency electromagnetic fields to transfer data.

Routing analysis Finding optimized routes for a certain means of transport, using many different variables, to decrease fuel costs and increase efficiency.

S

Semistructured data A form of structured data that does not have a formal structure. It does, however, have tags or other markers to enforce a hierarchy of records.

Sentiment Analysis The use of algorithms to learn how people feel about certain topics.

Signal analysis The analysis of measurements of time-varying or spatially varying physical quantities to assess the performance of a product; especially used with sensor data.

Similarity searches Searches to find the closest object to a query in a database, where the data object can be any type of data.

Simulation analysis The imitation of the operation of a real-world process or system. Simulation analysis helps ensure optimal product performance taking into account many different variables.

Smart grid The use of sensors within an energy grid to monitor what is going on in real time, thereby helping to increase efficiency.

Software-as-a-Service A software tool that is used on the web via a browser.

Spatial analysis The analysis of spatial data, such as geographic or topological data, to identify and understand patterns and regularities within data distributed in geographic space.

SQL A programming language for retrieving data from a relational database.
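
A minimal example using SQLite, which ships with Python; the table and column names are hypothetical.

```python
# Retrieving data from a relational database with SQL.
# The table name and columns are illustrative only.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (name TEXT, city TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?)",
                [("Alice", "Austin"), ("Bob", "Boston")])

# A query is a request for information: here, customers in Austin.
for row in con.execute("SELECT name FROM customers WHERE city = ?",
                       ("Austin",)):
    print(row[0])  # Alice
```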

Structured data Data that is identifiable because it is organized in a structure, such as rows and columns. The data resides in fixed fields within a record or file, or the data is tagged correctly and can be accurately identified.

T

Terabytes Approximately 1,000 gigabytes. A terabyte can store up to 300 hours of high-definition video.

Time series analysis The analysis of well-defined data obtained through repeated measurements of time. The data is measured at successive points spaced at identical time intervals.

Topological data analysis This analysis focuses on the shape of complex data and identifies clusters and any statistical significance that is present.

Transactional data Dynamic data that changes over time.

Transparency The policy of keeping consumers informed about what happens with their data.

U

Unstructured data Data that is in general text heavy, but may also contain dates, numbers, and facts.

V

Value A benefit of Big Data for organizations, societies, and consumers. Big Data means big business, and every industry will reap the rewards.

Variability The ability of the data to change (rapidly). In (almost) identical tweets, for example, the same word can have a totally different meaning.

Variety The presentation of data in many different formats: structured, semistructured, unstructured, and even complex structured.

Velocity The speed at which the data is created, stored, analyzed, and visualized.

Veracity The correctness of the data. Organizations need to ensure that data and the analyses performed on it are correct.

Visualization Complex graphs that can include many variables of data while still remaining understandable and readable. These are not ordinary graphs or pie charts. With the right visualizations, raw data can be put to use.

Volume The amount of data, ranging from megabytes to brontobytes.

W

Weather data An important open and public data source that can provide organizations with a lot of insights if combined with other sources.

X

XML Databases Databases that allow data to be stored in XML format. XML databases are often linked to document-oriented databases. The data stored in an XML database can be queried, exported, and serialized into any format needed.

Y

Yottabytes Approximately 1000 zettabytes, or the data stored on 250 trillion DVDs. The entire digital universe today is one yottabyte, and this will double every 18 months.

Z

Zettabytes Approximately 1000 exabytes or 1 billion terabytes. It is expected that by 2016, more than one zettabyte will cross our networks globally on a daily basis.
