Chapter 8
Role of Collective Intelligence

By now, you may have started receiving notification e-mails or letters from your utility company about your usage and how you might save if you run your laundry or dishwasher after a certain time in the evening, during off-peak hours. You may also have come across an advertisement by Progressive Casualty Insurance Company for Snapshot, a sensor that captures driving patterns, and how a good driver may be rewarded with a discount that saves a lot on the policy premium. You shouldn't be surprised to see promotional offers from your favorite retailers on the specific merchandise you care about or frequently shop for, based on your buying interests. Have you ever paused to think about how these vendors and service providers are able to analyze your data and communicate with you directly in a way that suits your interests and needs? The smart meters used by utility companies, the sensor devices used by insurance companies, and the web logs analyzed by retailers enable them to capture data at the point of occurrence in real time, then store and analyze that data to understand behavioral patterns and spot trends. These data are high in volume, are generated at high velocity, and come in a wide variety, and are therefore rightly termed big data.

We face decisions every day. Some are minor, such as paying bills, and some are major, such as buying a house, investing in stocks, developing a product, acquiring a company, or growing market share; those major decisions require relevant information in context. When we look back many years, to a time when there were no computers or ready access to data, we wonder how people made such major decisions in day-to-day life, in business, or even in administering nations.

In ancient times, rulers based their real-time decisions on intuitive and cognitive intelligence and on the advice of their council of ministers. They would walk the streets in disguise to listen to the conversations of citizens, gathering real-time feedback and sentiment in order to make effective decisions. In those days there was no support from technology, and all decisions rested on intuitive judgment. Our brains take in massive streams of sensory data and make the necessary correlations that allow us to make value judgments and decisions, all in real time.

In our current era, by contrast, we have the support of computing power, with additional memory and data processing available on demand in the cloud infrastructure (discussed in Chapter 3), to make real-time decisions. Recent technology such as big data analytics supplies the right information; real-time event messaging provides it at the right time, mobility delivers it at the right place, and social media supplies the right context for making the right decisions. With computing power combined with cognitive intelligence, we are in a much better position to make real-time decisions in a business context.

Big data is a major revolution of our time and will have a large impact on advanced analytics in the coming years. Big data is becoming relevant to all business cases and will help in gaining sustainable competitive advantage. As the technology platform matures rapidly, organizations need to give strategic importance to big data sources to gain insights and to offer products and services based on customer needs. Analytics and business intelligence (BI) based on new big data sources will give business decision makers greater predictability.

This chapter deals with big data concepts, background, and relevance across industry sectors, and offers some case studies to provide in-depth understanding and many examples of how an organization may deploy and implement big data analytics alongside its existing infrastructure.

Why Should You Care about Big Data?

Big data, one of the most talked-about information technology (IT) solutions, has emerged as a new technology paradigm for creating business agility and predictability by analyzing data coming from various sources. The term big data was coined in the 1970s to describe the large amounts of data generated by oceanographic and meteorological experiments. Big data can be understood as a natural evolution of database management techniques that has changed the way data is analyzed. Early implementations of big data solutions can be found in the 1980s, the era of the first generation of software-based parallel database architectures. However, big data was not implemented significantly until Internet usage matured and web search companies faced the challenge of indexing and querying large aggregations of loosely structured data. Existing database technology was not ideal for this challenging task, nor was it cost effective. Google developed the first wave of big data tools in the early 2000s, which gave birth to several other frameworks and techniques that make the handling, processing, and interpretation of large data sets more economical. By leveraging big data, companies can extract value and meaningful insights from voluminous data beyond what was previously possible using traditional analytical techniques. Big data also deals with the new phenomena of volume, velocity, and variety in the massive data coming from social media, web logs, and sensors, combined with transactional systems. Within these heaps of massive data lies a treasure trove of information that can be extracted to save us from major disasters or accidents and to proactively help businesses grow.

Big data is characterized primarily by large and rapidly growing data volumes, varied data structures, and new or newly intensified data analysis requirements. This enables us to deliver to our customers, in context, the right offer, message, recommendation, service, or action, tailored and personalized to deliver unequaled value. A multichannel customer experience platform for true cross-channel decisions enables consistent operational decisions on the web channel, in the contact center, at the point of sale, and across all lines of business, giving us the technology for cross-channel learning and decisions. The insights derived automatically from one channel are seamlessly used both within and across other channels.

A balanced decision management framework that combines both business rules and self-learning predictive models helps in real-time decisions. This also helps in arbitrating rules and predictive model scores in the context of organizational goals/key performance indicators (KPIs) at the moment of a decision's execution.

With the evolution of social media, we started seeing the emergence of nontraditional, less structured data such as web logs, social media feeds, e-mail, sensor readings, photographs, and YouTube videos that can be analyzed for useful information. With the reduction in the cost of both storage and computing power, it is now feasible to store and analyze this data for meaningful purposes. As a result, it is important for existing and new businesses alike to understand and evaluate the relevance of big data for their business intelligence and decision making. Closed-loop, real-time learning becomes immediately available for the next prediction, driving adaptive, high-value interactions. Important correlations in the data can be discovered and highlighted automatically by way of user-friendly reports, and automated data discovery leads users to relevant business insights.

Big data addresses all types of data coming from various data sources, such as enterprise application data, which generally includes data generated from enterprise resource planning (ERP) systems, customer information from customer relationship management (CRM) systems, supply chain management systems, e-commerce transactions, and human resources (HR) and payroll transactions. It also includes machine-generated, or semantic, data comprising call detail records (CDRs) from call centers, web logs, smart meters, manufacturing sensors, equipment logs, and trading systems data generated by machines and computer systems. Social media data, including customer feedback streams, microblogging sites like Twitter, and social media platforms like Facebook, add to big data and help in sentiment analysis.

There are four key characteristics—volume, velocity, variety, and value—that are commonly used to characterize different aspects of big data and are widely referred to at major conferences. The McKinsey Global Institute estimates that data volume is growing 40 percent per year and will grow 44 times by 2020.

What Do Key Characteristics Signal about Big Data?

Big data is characterized by sheer volume, high velocity, and wide variety, often with low value density. Major sources such as web logs, sensors, and social media generate new types of unstructured or semi-structured data, which have given rise to a new phenomenon in decision making.

Volume

Social media platforms (Facebook, Twitter, LinkedIn, Foursquare, YouTube, and many more), discussed in Chapter 7, generate a large volume of data that needs to be stored and analyzed rapidly, in context, for the right decision making. The volume of machine-generated or semantic web data is much larger than traditional data volumes. For instance, a single jet engine can generate 10 TB (terabytes) of data in 30 minutes. With more than 25,000 airline flights per day, the daily volume of just this single data source runs into the petabytes. Smart meters and heavy industrial equipment such as oil refineries and drilling rigs generate similar data volumes, compounding the problem. The benefit gained from the ability to process large amounts of information is the main attraction of big data analytics. This volume presents the most immediate challenge to conventional IT structures: it requires scalable storage and a distributed approach to querying.
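
To get a feel for the scale, the back-of-the-envelope calculation below multiplies out the figures quoted above; the average flight length and engine count are assumptions for illustration, not figures from the text.

    # Rough daily engine-telemetry volume from the figures quoted above.
    tb_per_engine_per_half_hour = 10      # from the text
    flights_per_day = 25_000              # from the text
    avg_flight_hours = 2                  # assumption for illustration
    engines_per_aircraft = 2              # assumption for illustration

    tb_per_flight = tb_per_engine_per_half_hour * (avg_flight_hours * 2) * engines_per_aircraft
    daily_pb = tb_per_flight * flights_per_day / 1_000
    print(f"~{daily_pb:,.0f} PB of engine telemetry per day")  # comfortably in the petabyte range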

Velocity

Data comes into the data management system rapidly and often requires quick analysis for decision making. The importance lies in the speed of the feedback loop, taking data from input through to analysis and decision making; the tighter the feedback loop, the greater the competitive advantage. It is this need for speed, particularly on the web, that has driven the development of key-value stores and columnar databases, optimized for the fast retrieval of precomputed information. These databases form part of an umbrella category known as NoSQL (not only SQL), used when relational models do not suffice (discussed in detail in the technology platforms section later in this chapter). Social media data streams bring a large input of opinions and relationships that are valuable to customer relationship management. Even at 140 characters per tweet, the high velocity of Twitter data ensures large volumes (over 8 TB per day). Much of the data received may be of low value, and analytical processing may be required to transform it into a usable form or to derive meaningful information.

Variety

Big data brings a wide variety of data types, from text on social networks, to image and video data, to raw feeds directly from sensor sources, to semantic web logs generated by machines. These data are not easily integrated into applications. A common use of big data processing is to take unstructured data and extract meaningful information for consumption either by humans or as structured input to an application. Big data carries patterns, sentiments, and behavioral information that need analysis.
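
As a small illustration of turning unstructured input into structured output, the sketch below pulls a few fields out of a raw web-server log line with a regular expression; the log line and field names are made up for the example.

    import re

    # Hypothetical raw web-server log line (unstructured text).
    raw = '203.0.113.7 - - [12/Mar/2014:10:15:32 +0000] "GET /products/item42 HTTP/1.1" 200 5123'

    # Extract a few structured fields so they can feed a downstream application.
    pattern = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<bytes>\d+)'
    )
    m = pattern.match(raw)
    if m:
        record = m.groupdict()
        record["status"] = int(record["status"])
        record["bytes"] = int(record["bytes"])
        print(record)  # {'ip': '203.0.113.7', 'ts': '12/Mar/2014:10:15:32 +0000', ...}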

Value

The economic value of different data varies significantly. Generally, there is good information hidden within a larger body of nontraditional data. Big data offers greater value to businesses in bringing real-time market and customer insights, enabling improvement in new products and services. Big data analytics can reveal insights such as peer influence among customers, revealed by analyzing shoppers' transactions and social and geographical data. The past decade's successful web start-ups are prime examples of big data used as an enabler of new products and services. For example, by combining a large number of signals from a user's actions and those of the user's friends, Facebook has been able to craft a highly personalized user experience and create a new kind of advertising business.

Does Size of Data Really Matter?

With the proliferation of cloud computing and commoditization of hardware, software, and storage, the growth in data has been explosive in the recent past. This exponential growth is primarily catalyzed by increased activity by digital devices and proliferation of the Internet. Massive volumes of data are generated by digital transactions between companies, machine-generated data (embedded sensors in industrial applications and automobiles), and consumer devices such as laptops, computers, and smartphones. The International Data Corporation (IDC) estimated that 1.8 zettabytes of information were created and replicated in 2011, the equivalent of 200 billion 60-minute high-definition (HD) movies that would take one person 47 million years to watch.

In the past decade, information generated grew at a 38 percent compound annual growth rate (CAGR) versus the world's storage capacity at a 23 percent CAGR. We believe the gap between information and storage will continue to widen, given increased growth in computational power (58 percent 10-year CAGR) as a result of computers, smartphones, and smart sensors that will drive information generation.

With storage on the cloud infrastructure getting cheaper and more affordable, businesses should be able to take advantage of mixing various data types coming from different data sources and analyze them to make effective decisions to manage their enterprises.

How Complex Is Big Data?

Data has traditionally been stored in a structured format, which makes archiving, querying, and analyzing easier. However, with the wide usage of various devices, data has become more unstructured. It is estimated that 80 percent of the world's data is unstructured (i.e., unable to conform to traditional relational database structures), which makes analysis and insights across multiple data sets very challenging. Given the pervasiveness of unstructured data, the growth in file-based storage (unstructured data) has outpaced block-based storage (62 percent five-year CAGR versus 24 percent). The sudden rise in usage of social media, machine-generated data, and smart devices has added complexity to managing big data and deriving greater business value from it. These recently emerged data sources may provide greater intelligence and predictability if we can capture, process, and analyze the data in real time or near real time.

  • Social media. Increased usage of social networking sites continues to drive storage requirements for unstructured data: More than 300 million photos are uploaded daily to Facebook; Zynga processes 1 petabyte of gaming content on a daily basis; 72 hours of video are uploaded to YouTube every minute; and Twitter receives nearly 250 million tweets daily.
  • Machine-to-machine (M2M). The increased deployment of M2M devices such as smart meters, telematics, radio frequency identification (RFID) devices, vehicle sensors, and industrial sensors with embedded networking has driven machine-generated data. Data generated from M2M devices is expected to grow at a 35 percent compound annual growth rate (CAGR) through 2015. According to the McKinsey Global Institute, M2M will create an economic impact of $2.7 trillion to $6.2 trillion annually by 2025, and the World Bank and General Electric point to a $32 trillion opportunity on the premise that a 1 percent improvement from integrating the industrial Internet into energy, transportation, health care, aviation, and other industries can generate savings of around $200 billion (www.netcommwireless.com/information/articles/m2m.-the-numbers-are-big-and-only-getting-bigger).
  • Mobility. The widespread adoption of mobile devices (smartphones, tablets, etc.) has placed the power of the Internet within the reach of a fingertip. The number of global smartphone users recently crossed the 1 billion mark (i.e., one in seven people owns a smartphone), thus driving the consumption, demand, and generation of mobile data. Mobile data traffic is estimated to grow at a 78 percent CAGR from 2011 to 2016, reaching 10.8 exabytes per month by 2016, according to Cisco.
  • Enterprise data. Adoption of enterprise software solutions and greater IT sophistication have increased the data exhaust generated by enterprises. Unstructured data continues to make up a greater proportion of enterprise data and is expected to represent 80 percent of total enterprise data by 2015, up from 64 percent in 2006. The torrent of unstructured enterprise data places an additional strain on corporate IT systems. In a survey conducted by Avanade, a business technology consulting and solutions provider, 55 percent of respondents reported a slowdown of IT systems and 47 percent cited data security issues resulting from the increased data exhaust (www.netcommwireless.com/information/articles/m2m.-the-numbers-are-big-and-only-getting-bigger).

With the evolution of the cloud deployment model, the majority of big data solutions are offered as software only, as an appliance, or as cloud-based offerings. As with other application deployments, big data deployment will depend on several issues such as data locality, privacy and regulation, human resources, and project requirements. Many organizations are opting for a hybrid solution, using on-demand cloud resources to supplement in-house deployments.

The highest value from big data is achieved by combining data from big data sources such as web logs, machine data, and social media with other transactional data within the business. Decision makers then get the big picture of their customers' behavior, patterns, and preferences.

Therefore, it is highly important that businesses combine their strategy on big data with their comprehensive data analytics strategy. In order to succeed and remain competitive, organizations need to plan for comprehensive data management and analytics.

How Does Big Data Coexist with Existing Traditional Data?

Big data on its own offers great insights for businesses, but it becomes more powerful when it is combined with an organization's existing transactional data and used for analytics.

Web logs or browsing history, for example, indicate a customer's buying patterns and, together with past purchase history, help determine the value of that customer.

Big data is messy and requires enormous effort in data cleansing and data quality. The phenomenon of big data is closely tied to the emergence of data science, a discipline that combines math, programming, and scientific instinct. Current data warehousing projects take a long time to offer meaningful analytics to business users.

They depend on extract, transform, and load (ETL) processes that pull from various data sources. Big data analytics, by contrast, can be defined as the process of parsing large data sets from multiple sources and producing information in real time or near real time.

Big data analytics represents a big opportunity. Many large businesses are exploring the analytics capabilities to parse web-based data sources and extract value from social media. However, an even larger opportunity, the Internet of Things (IoT), is emerging as a data source. Cisco Systems estimates that there are approximately 35 billion electronic devices that can connect to the Internet. Any electronic device can be connected to the Internet, and even automakers are building Internet connectivity into vehicles. Connected cars will become commonplace by 2014 and generate millions of transient data streams.

Operational efficiencies, coupled with developments in the technologies and services that make big data a practical reality, will result in a supercharged CAGR of 58 percent between now and 2017.

Big data is the new definitive source of competitive advantage across all industries.

How Big Is the Big Data Market?

The big data market is on the verge of rapid growth to the tune of $50 billion worldwide within the next five years. We already see increased interest in and awareness of the power of big data and related analytic capabilities to gain competitive advantage and to improve business agility.

Of the current market, big data pure-play vendors account for $310 million in revenue. Despite their relatively small share of current overall revenue (approximately 5 percent), these vendors, such as Vertica, Splunk, and Cloudera, are responsible for the vast majority of the new innovations and modern approaches to data management and analytics that have emerged over the past several years and made big data the hottest sector in IT.

Marketing and sales organizations are ready for the transformation that big data and predictive analytics bring. This approach is making existing businesses smarter and more efficient by focusing the right resources on customers and prospects that are ready to buy—crushing the competition.

How Would You Manage Big Data on Technology Platforms?

The recent cloud-based technologies and cloud operating environment (discussed in detail in Chapter 3) based on a scalable elastic model have allowed support for a new services deployment model that can be consumed globally from anywhere on any device. The cloud platform has enabled big data storage, processing, and analytics as well.

Let us examine big data tools, platforms, and applications that may offer predictive analytics capabilities to enable effective decision making for sustainable competitive advantage.

Big Data Tools, Platforms, and Applications

Cloud-based applications and services are increasingly allowing small and midsize businesses to take advantage of big data without needing to deploy on-premises hardware or software. Manufacturing companies deploy sensors in their products to return a stream of telemetry. The proliferation of smartphones and other global positioning system (GPS) devices offers advertisers an opportunity to target consumers when they are in close proximity to a store, coffee shop, or restaurant. This opens up new revenue for service providers and offers many businesses a chance to target new customers.

Use of social media and web log files from their e-commerce sites can help retailers understand their customers' buying patterns, behaviors, likes, and dislikes. This can enable much more effective micro customer segmentation and targeted marketing campaigns, as well as improve supply chain efficiencies.

As with data warehousing, web stores, or any IT platform, an infrastructure for big data has unique requirements. In considering all the components of a big data platform, it is important to easily integrate big data with enterprise data to conduct deep analytics on the combined data set.

In order to make the most meaningful use of big data, businesses must evolve their IT infrastructures to handle the rapid rate of delivery of extreme volumes of data, with varying data types, which can then be integrated with an organization's other enterprise data to be analyzed. When big data is captured, optimized, and analyzed in combination with traditional enterprise data, companies can develop a more thorough and insightful understanding of their business, which can lead to enhanced productivity, a stronger competitive position, and greater innovation to have an impact on the bottom line. For example, in the delivery of health care services, management of chronic or long-term conditions is expensive. Use of in-home monitoring devices to measure vital signs and monitor progress is just one way that sensor data can be used to improve patient health care and reduce both office visits and hospital admittance.

The requirements in a big data infrastructure involve data acquisition, data organization, and data analysis. Because big data refers to data streams of higher velocity and higher variety, the infrastructure required to support the acquisition of big data must deliver low, predictable latency in both capturing data and executing short, simple queries; be able to handle very high transaction volumes, often in a distributed environment; and support flexible, dynamic data structures.

In classic data warehousing terms, organizing data is called data integration. Because there is such a high volume of big data, there is a tendency to organize data at its original storage location, thus saving both time and money by not moving around unnecessarily large volumes of data. The infrastructure required for organizing big data must be able to process and manipulate data in the original storage location; support very high throughput (often in batches) to deal with large data processing steps; and handle a large variety of data formats, from unstructured to structured.

Has Hadoop Solved Big Data Problems?

Apache Hadoop is a technology that allows large data volumes to be organized and processed while keeping the data on the original storage cluster. The Hadoop Distributed File System (HDFS), for example, can serve as the long-term storage system for web logs. These web logs are turned into browsing behavior (sessions) by running MapReduce programs on the cluster and generating aggregated results on the same cluster, which are then loaded into a relational DBMS. Since data is not always moved during the organization phase, the analysis may also be done in a distributed environment, where some data stays where it was originally stored and is transparently accessed from a data warehouse. The infrastructure required for analyzing big data must be able to support deeper analytics, such as statistical analysis and data mining, on a wider variety of data types stored in diverse systems; scale to extreme data volumes; deliver faster response times driven by changes in behavior; and automate decisions based on analytical models. Most important, the infrastructure must be able to integrate analysis on the combination of big data and traditional enterprise data. New insight comes not just from analyzing new data, but from analyzing it within the context of the old to provide new perspectives on old problems. For example, analyzing inventory data from a smart vending machine in combination with the events calendar for the venue in which the vending machine is located will dictate the optimal product mix and replenishment schedule for that machine.
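
The sessionization step mentioned above can be illustrated with a minimal Hadoop Streaming sketch in Python. The log format, field names, and 30-minute inactivity window are assumptions for illustration, and a real job would also need a secondary sort so that each user's hits arrive at the reducer in time order.

    #!/usr/bin/env python3
    # Hypothetical Hadoop Streaming sessionization sketch.
    # Assumed input: "user_id epoch_seconds url" per raw log line.
    # The mapper keys each hit by user; the reducer closes a session after
    # 30 minutes of inactivity and emits: user, session start, last hit, hit count.
    import sys

    SESSION_GAP = 30 * 60  # seconds of inactivity that ends a session

    def mapper(lines):
        for line in lines:
            parts = line.split()
            if len(parts) == 3:
                user, ts, url = parts
                print(f"{user}\t{ts}\t{url}")

    def reducer(lines):
        # Assumes hits arrive grouped by user and time-ordered within each user.
        user, start, last, hits = None, None, None, 0
        def emit():
            if user is not None:
                print(f"{user}\t{start}\t{last}\t{hits}")
        for line in lines:
            u, ts, _url = line.rstrip("\n").split("\t")
            ts = int(float(ts))
            if u != user or (last is not None and ts - last > SESSION_GAP):
                emit()
                user, start, hits = u, ts, 0
            last = ts
            hits += 1
        emit()

    if __name__ == "__main__":
        # e.g. hadoop jar hadoop-streaming.jar -mapper "sessionize.py map" -reducer "sessionize.py reduce" ...
        (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)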

Many new technologies have emerged to address the IT infrastructure requirements just outlined. These new systems have created a divided solutions spectrum comprising NoSQL solutions, which are developer-centric specialized systems, and SQL solutions, which are typically equated with the manageability, security, and trusted nature of relational database management systems (RDBMSs).

A few niche vendors are developing applications and platforms that leverage the underlying Hadoop infrastructure to provide both data scientists and business users with easy-to-use tools for experimenting with big data. These include Datameer, which has developed a Hadoop-based business intelligence platform with a familiar spreadsheet-like interface; Karmasphere, whose platform allows data scientists to perform ad hoc queries on Hadoop-based data via an SQL interface; and Digital Reasoning, whose Synthesis platform sits on top of Hadoop to analyze text-based communication.

Tresata's cloud-based platform, for example, leverages Hadoop to process and analyze large volumes of financial data and returns results via on-demand visualizations for banks, financial data companies, and other financial services companies.

Additionally, 1010data offers a cloud-based application that allows business users and analysts to manipulate data in the familiar spreadsheet format but at big data scale. And the ClickFox platform mines large volumes of customer touch-point data to map the total customer experience with visuals and analytics delivered on demand.

Non-Hadoop Big Data Platforms

Other non-Hadoop vendors contributing significant innovation to the big data landscape include Splunk, which specializes in processing and analyzing log file data to allow administrators to monitor IT infrastructure performance and identify bottlenecks and other disruptions to service. HPCC (High-Performance Computing Cluster) Systems, a spin-off of LexisNexis, offers a competing big data framework to Hadoop that its engineers built internally over the past 10 years to assist the company in processing and analyzing large volumes of data for its clients in finance, utilities, and government. DataStax offers a commercial version of the open source Apache Cassandra NoSQL database along with related support services bundled with Hadoop.

NoSQL databases are frequently used to acquire and store big data. They are well suited for dynamic data structures and are highly scalable. The data stored in a NoSQL database is typically highly varied because these systems are intended simply to capture all data without categorizing and parsing it. For example, NoSQL databases are often used to collect and store social media data. While customer-facing applications frequently change, the underlying storage structures are kept simple.

Instead of designing a schema with relationships between entities, these simple structures often just contain a major key to identify the data point and then a content container holding the relevant data. This simple and dynamic structure allows changes to take place without costly reorganizations at the storage layer.

NoSQL systems are designed to capture all data without categorizing and parsing it upon entry into the system, and therefore the data is highly varied. SQL systems, however, typically place data in well-defined structures and impose metadata on the data captured to ensure consistency and validate data types.

Distributed file systems and transaction (key-value) stores are primarily used to capture data and are generally in line with the requirements discussed earlier in this chapter. To interpret and distill information from the data in these solutions, a programming paradigm called MapReduce is used. MapReduce programs are custom-written programs that run in parallel on the distributed data nodes.

The key-value stores, or NoSQL databases, are the online transaction processing (OLTP) databases of the big data world; they are optimized for very fast data capture and simple query patterns. NoSQL databases provide very fast performance because captured data is quickly stored with a single identifying key rather than being interpreted and cast into a schema. By doing so, a NoSQL database can rapidly store large numbers of transactions.
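
The toy sketch below mimics that capture pattern with an in-process Python dictionary standing in for a key-value store; the keys and record shapes are invented for illustration. Writes store an opaque blob under a single key, and interpretation is deferred until the data is read back.

    import json
    import time

    # Toy in-process key-value "store": a dict standing in for a NoSQL database.
    store = {}

    def put(key, value):
        store[key] = json.dumps(value)   # capture the record as an opaque blob, no schema imposed

    def get(key):
        return json.loads(store[key])    # interpretation happens only when the data is read

    # Records with different shapes can live side by side under simple keys.
    put("user:1001:click:1", {"ts": time.time(), "page": "/home"})
    put("user:1001:profile", {"name": "Alice", "segments": ["loyal", "mobile"]})

    print(get("user:1001:profile")["segments"])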

However, due to the changing nature of the data in a NoSQL database, any data organization effort requires programming to interpret the storage logic used. This, combined with the lack of support for complex query patterns, makes it difficult for end users to distill value out of data in a NoSQL database.

To get the most from NoSQL solutions and turn them from specialized, developer-centric solutions into solutions for the enterprise, they must be combined with SQL solutions into a single proven infrastructure that meets the manageability and security requirements of today's enterprises.

How Does Oracle Address Big Data Challenges?

Oracle's big data strategy centers on extending the current enterprise data architecture to incorporate big data and deliver business value, leveraging the proven reliability, flexibility, and performance of existing systems to address evolving big data requirements.

Oracle addresses the big data challenge with engineered systems that integrate software and hardware. The Oracle Big Data Appliance is an engineered system that combines optimized hardware with a comprehensive software stack, featuring specialized solutions developed by Oracle, to deliver a complete, easy-to-deploy solution for acquiring, organizing, and loading big data into Oracle Database 11g. It is designed to deliver extreme analytics on all data types, with enterprise-class performance, availability, supportability, and security. With Big Data Connectors, the solution is tightly integrated with Oracle Exadata and Oracle Database, so you can analyze all your data together with extreme performance.

Oracle Big Data Appliance

Oracle Big Data Appliance comes in a full rack configuration with 18 Sun servers for a total storage capacity of 648 TB. Every server in the rack has two CPUs, each with six cores for a total of 216 cores per full rack. Each server has 48 GB of memory for a total of 864 GB of memory per full rack.

Oracle Big Data Appliance includes a combination of open source software and specialized software developed by Oracle to address enterprise big data requirements.

Big Data Appliance contains Cloudera's Distribution including Apache Hadoop (CDH) and Cloudera Manager. CDH is the leading Apache Hadoop-based distribution in commercial and noncommercial environments. CDH consists of 100 percent open source Apache Hadoop plus the comprehensive set of open source software components needed to use Hadoop. Cloudera Manager is an end-to-end management application for CDH. Cloudera Manager gives a clusterwide, real-time view of nodes and services running; provides a single, central place to enact configuration changes across the cluster; and incorporates a full range of reporting and diagnostic tools to help optimize cluster performance and utilization.

Where Oracle Big Data Appliance makes it easy for organizations to acquire and organize new types of data, Oracle Big Data Connectors enables an integrated data set for analyzing all data. Oracle Big Data Connectors can be installed on an Oracle Big Data Appliance or on a generic Hadoop cluster.

Oracle Loader for Hadoop (OLH) enables users to use Hadoop MapReduce processing to create optimized data sets for efficient loading and analysis in Oracle Database 11g. Unlike other Hadoop loaders, it generates Oracle internal formats to load data faster and use fewer database system resources. OLH is added as the last step in the MapReduce transformations as a separate map–partition–reduce step. This last step uses the CPUs in the Hadoop cluster to format the data into Oracle-understood formats, allowing for a lower CPU load on the Oracle cluster and higher data ingest rates because the data is already formatted for Oracle Database. Once loaded, the data is permanently available in the database, providing very fast access to this data for general database users leveraging SQL or business intelligence tools.

Oracle Direct Connector for Hadoop Distributed File System (HDFS) is a high-speed connector for accessing data on HDFS directly from Oracle Database. Oracle Direct Connector for HDFS gives users the flexibility of querying data from HDFS at any time, as needed by their application. It allows the creation of an external table in Oracle Database, enabling direct SQL access on data stored in HDFS. The data stored in HDFS can then be queried via SQL, joined with data stored in Oracle Database, or loaded into the Oracle Database. Access to the data on HDFS is optimized for fast data movement and parallelized, with automatic load balancing. Data on HDFS can be in delimited files or in Oracle data pump files created by Oracle Loader for Hadoop.

Oracle Data Integrator Application Adapter for Hadoop simplifies data integration from Hadoop and an Oracle Database through Oracle Data Integrator's easy-to-use interface. Once the data is accessible in the database, end users can use SQL and Oracle BI Enterprise Edition to access data. Even enterprises that are already using a Hadoop solution, and don't need an integrated offering like Oracle Big Data Appliance, can integrate data from HDFS using Big Data Connectors as a stand-alone software solution.

Oracle R Connector for Hadoop is an R package that provides transparent access to Hadoop and to data stored in HDFS. R Connector for Hadoop gives users of the open source statistical environment R the ability to analyze data stored in HDFS and to run R models at scale against large volumes of data, leveraging MapReduce processing, without requiring R users to learn yet another API or language. End users can leverage over 3,500 open source R packages to analyze data stored in HDFS, while administrators do not need to learn R to schedule R MapReduce models in production environments. R Connector for Hadoop can optionally be used together with the Oracle Advanced Analytics Option for Oracle Database. The Oracle Advanced Analytics Option enables R users to work transparently with database-resident data without having to learn SQL or database concepts, with R computations executing directly in the database.

Oracle NoSQL Database is a distributed, highly scalable, key-value database based on Oracle Berkeley DB. It delivers a general-purpose, enterprise-class key-value store adding an intelligent driver on top of distributed Berkeley DB. This intelligent driver keeps track of the underlying storage topology, shards the data, and knows where data can be placed with the lowest latency. Unlike competitive solutions, Oracle NoSQL Database is easy to install, configure, and manage; it supports a broad set of workloads, and delivers enterprise-class reliability backed by enterprise-class Oracle support.

The primary use cases for Oracle NoSQL Database are low latency data capture and fast querying of that data, typically by key lookup. Oracle NoSQL Database comes with an easy-to-use Java API and a management framework.

The product is available in both an open source community edition and in a priced enterprise edition for large distributed data centers. The former version is installed as part of the Big Data Appliance integrated software.

In-Database Analytics

Once data has been loaded from Oracle Big Data Appliance into Oracle Database or Oracle Exadata, end users can use one of the following easy-to-use tools for in-database, advanced analytics:

  • Oracle R Enterprise. Oracle's version of the widely used Project R statistical environment enables statisticians to use R on very large data sets without any modifications to the end user experience. Examples of R usage include predicting airline delays at a particular airport and the submission of clinical trial analysis and results.
  • In-database data mining is the ability to create complex models and deploy these on very large data volumes to drive predictive analytics. End users can leverage the results of these predictive models in their BI tools without the need to know how to build the models. For example, regression models can be used to predict customer age based on purchasing behavior and demographic data.
  • In-database text mining is the ability to mine text from microblogs, CRM system comment fields, and review sites, combining Oracle Text and Oracle Data Mining. An example of text mining is sentiment analysis based on comments; sentiment analysis tries to show how customers feel about certain companies, products, or activities (a minimal sketch follows this list).
  • In-database semantic analysis is the ability to create graphs and connections between various data points and data sets. Semantic analysis creates, for example, networks of relationships determining the value of a customer's circle of friends. When looking at customer churn, customer value is based on the value of his or her network, rather than on just the value of the customer.
  • In-database spatial is the ability to add a spatial dimension to data and show data plotted on a map. This ability enables end users to understand geospatial relationships and trends much more efficiently. For example, spatial data can visualize a network of people and their geographical proximity. Customers who are in close proximity can readily influence each other's purchasing behavior, an opportunity that can be easily missed if spatial visualization is left out.
  • In-database MapReduce is the ability to write procedural logic and seamlessly leverage Oracle Database parallel execution. In-database MapReduce allows data scientists to create high-performance routines with complex logic. In-database MapReduce can be exposed via SQL. Examples of leveraging in-database MapReduce are sessionization of web logs or organization of call details records (CDRs).
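
To make the sentiment-analysis idea concrete, here is a deliberately simple, lexicon-based sketch in Python. It is illustrative only and does not represent Oracle Text or Oracle Data Mining; the word lists and comments are made up.

    # Minimal lexicon-based sentiment sketch (illustrative only).
    POSITIVE = {"great", "love", "excellent", "fast", "helpful"}
    NEGATIVE = {"slow", "broken", "poor", "rude", "refund"}

    def sentiment_score(comment):
        # Positive words add one, negative words subtract one.
        words = comment.lower().replace(",", " ").split()
        return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

    comments = [
        "Excellent service, the agent was fast and helpful",
        "Delivery was slow and the item arrived broken, I want a refund",
    ]
    for c in comments:
        print(sentiment_score(c), c)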

Every one of the analytical components in Oracle Database is valuable. Combining these components creates even more value to the business. Leveraging SQL or a BI tool to expose the results of these analytics to end users gives an organization an edge over others that do not leverage the full potential of analytics in Oracle Database. Connections between Oracle Big Data Appliance and Oracle Exadata are via InfiniBand, enabling high-speed data transfer for batch or query workloads. Oracle Exadata provides outstanding performance in hosting data warehouses and transaction processing databases. Now that the data is in mass-consumption format, Oracle Exalytics can be used to deliver the wealth of information to the business analyst.

Predictive Analytics

Predictive analytics is an area of data mining that deals with extracting information from data and using it to predict trends and behavior patterns. It offers the ability to analyze and understand behavior that may lead to future actions. In the current business environment, it is extremely important to implement predictive modeling, to score data with predictive models, and to forecast future actions. Predictive analytics is business intelligence technology that produces a predictive score for each customer or other organizational element. Assigning these predictive scores is the job of a predictive model that has, in turn, been trained over data, learning from the experience of the organization. Predictive analytics optimizes marketing campaigns and website behavior to increase customer responses, conversions, and clicks, and to decrease churn. Each customer's predictive score informs the actions to be taken with that customer.
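
As a minimal sketch of that scoring loop, the Python example below trains a logistic regression on synthetic customer features and turns each predicted churn probability into a suggested action. The features, labels, and 0.6 threshold are invented for illustration and assume scikit-learn is available.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Hypothetical training data: two behavioral features per customer
    # (say, visits last month and days since last purchase) and a churn label.
    X_train = rng.normal(size=(500, 2))
    y_train = (X_train[:, 1] - X_train[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

    model = LogisticRegression().fit(X_train, y_train)

    # Score new customers: the predicted probability is the "predictive score"
    # that informs the next action (here, a retention offer above a threshold).
    X_new = rng.normal(size=(5, 2))
    scores = model.predict_proba(X_new)[:, 1]
    for customer_id, score in enumerate(scores, start=1):
        action = "retention offer" if score > 0.6 else "no action"
        print(f"customer {customer_id}: churn score {score:.2f} -> {action}")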

Analyzing new and diverse digital data streams can reveal new sources of economic value, provide fresh insights into customer behavior, and identify market trends early on. But this influx of new data creates challenges for IT departments. To derive real business value from big data, businesses need the right tools to capture and organize a wide variety of data types from different sources and to analyze that data easily within the context of all their enterprise data. In this competitive environment, businesses will need a healthy data-science culture to create business agility, stay competitive, and survive. Data scientists will play a major role in providing C-level decision makers with the right information, based on new big data analytics, in the context of the business.

In-Memory Analytics

In-memory analytics is the new revolution in data management, offering greater power to run a data-driven business with more agility and precision. It is a methodology used to solve complex and time-sensitive business scenarios by increasing the speed, performance, and reliability of data queries. In-memory analytics is an approach to querying data while it resides in a computer's random-access memory (RAM), as opposed to querying data stored on disk in databases. The software platform is optimized for distributed, in-memory processing to run new scenarios or complex analytical computations at a faster pace. Businesses can now instantly explore, visualize, and analyze data and tackle problems that were never considered before due to computing constraints.

In-memory analytics can provide fast access to deeper insights to seize opportunities and mitigate threats in near real time; run more sophisticated queries and models across all data to generate more precise insights that improve business performance; and answer the most difficult business questions quickly, with the speed and flexibility to meet business needs today and in the future.

SAP introduced HANA a few years ago to exploit the in-memory processing capabilities of the database, handling large data workloads to provide data processing and analytics in real time and help businesses make the right decisions based on critical information. It converges database and application platform capabilities in memory, bringing together transactions, analytics, text analysis, and predictive and spatial processing to enrich decision power in business.

Oracle has also introduced an in-memory option that switches the database to in-memory processing, exploiting memory and caching technology.

Spark cluster computing is yet another revolution that exploits in-memory computing in clusters handling large data sets, and it works alongside the Hadoop ecosystem. Spark can keep data in the memory of the thousands of servers it pulls together, unlike Hadoop's MapReduce, which writes its working data to hard disks. It not only fetches data fast but also provides scale-out deployment on demand across the large number of nodes in the cluster environment.
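
A minimal PySpark sketch of this caching behavior appears below. The HDFS path and column names are assumptions for illustration; the point is that cache() keeps the working set in cluster memory, so repeated aggregations avoid rereading from disk.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("weblog-cache-sketch").getOrCreate()

    # Hypothetical parsed web log with columns: user_id, url, ts.
    logs = spark.read.csv("hdfs:///data/weblogs.csv", header=True, inferSchema=True)

    # Persist the working set in cluster memory; the first action materializes the cache.
    logs.cache()
    print(logs.count())

    # Subsequent queries reuse the in-memory data instead of rereading from disk.
    top_pages = (logs.groupBy("url")
                     .agg(F.count("*").alias("hits"))
                     .orderBy(F.desc("hits"))
                     .limit(10))
    top_pages.show()

    spark.stop()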

The Spark cluster computing framework grew out of the work of two Romanian-born researchers: Matei Zaharia, a graduate student who spent several years at Berkeley's AMPLab, a research operation dedicated to distributed computing software, and Berkeley professor Ion Stoica.

In the next chapter, we discuss the predictability of your business that drives key decisions and the business wisdom associated with it. Knowledge is power, so it is extremely important for businesses to cultivate a knowledge ecosystem for survival, sustenance, continued growth, and business agility. We highlight the key elements of building knowledge ecosystems and a knowledge management process that will help a business harness that power for continued growth and leadership.
