4

Big Data Technologies

The decision to move forward with Big Data is a strategic choice each organization has to make. It does, however, have significant IT implications, as Big Data requires the use of new technology. This technology ranges from different ways of storing and processing data to the various analyses that can be performed on the data, and a number of companies have developed the necessary technology. In 2013, worldwide IT spending on Big Data technologies exceeded $31 billion.1

The Big Data ecosystem is growing so rapidly that it is difficult to understand the market and to determine which players can solve which problems. Because Big Data promises so many benefits, a large number of technology vendors are offering solutions to these problems.

There are the large players, such as Microsoft, SAS, IBM, HP, and Dell, that have made Big Data part of their total offerings. Most of these companies can deliver a complete solution. Smaller companies, however, do not need a total Big Data solution. They require a specialized solution to a specific problem, and many different Big Data startups can fulfill this need. I will discuss a few of the categories that are on the market at the moment.

In addition to paid solutions, there are many open-source tools that give organizations the possibility to use and experiment with free Big Data technologies. In this rapidly growing Big Data landscape, an open-source tool is available to solve almost any problem. One of the best known is Hadoop, without which Big Data would have taken much longer to reveal its potential.

HADOOP HDFS AND MAPREDUCE

Hadoop, named after the toy elephant of its creator's son, was developed because existing data storage and processing tools proved inadequate for the large amounts of data that started to appear with the growth of the Internet. First, Google developed the programming model MapReduce to cope with the flow of data that resulted from its mission to organize the world's information and make it universally accessible. Yahoo, in response, developed Hadoop in 2005 as an implementation of MapReduce. It was released as an open-source tool in 2007 under the Apache license.

Since then, Hadoop has evolved into a large-scale operating system that focuses on distributed and parallel processing of vast amounts of data.2 As with any “normal” operating system, Hadoop consists of a file system, is able to distribute and run programs, and returns results.

Hadoop supports data-intensive distributed applications that can run simultaneously on large clusters of normal, commodity hardware. A Hadoop network is reliable and extremely scalable; it can be used to query massive datasets. Hadoop is written in the Java programming language, which means it can run on any platform and is used by a global community of distributors and Big Data technology vendors that have built layers on top of Hadoop.

What is particularly useful is the Hadoop Distributed File System (HDFS), which breaks down the data it processes into smaller pieces called blocks. These blocks are subsequently distributed throughout a cluster. Distributing the data in this way allows the map-and-reduce functions to be executed on smaller subsets instead of on one large dataset. This increases efficiency, decreases processing time, and enables the scalability necessary for processing vast amounts of data.
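
To make the idea of blocks concrete, here is a minimal, purely conceptual sketch of how a file might be split into fixed-size blocks and placed on cluster nodes with replication. This is not actual HDFS code; the block size, replication factor, file name, and node names are illustrative assumptions only.

```python
# Conceptual sketch only -- not real HDFS code. It shows the idea of splitting
# a file into fixed-size blocks and placing each block on several nodes.
import os

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, a common HDFS default block size
REPLICATION = 3                  # each block is stored on three nodes

def split_into_blocks(path, block_size=BLOCK_SIZE):
    """Yield (offset, length) byte ranges that would become separate blocks."""
    size = os.path.getsize(path)
    offset = 0
    while offset < size:
        yield offset, min(block_size, size - offset)
        offset += block_size

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Round-robin placement of every block on `replication` distinct nodes."""
    return {block: [nodes[(i + r) % len(nodes)] for r in range(replication)]
            for i, block in enumerate(blocks)}

# Tiny demo: a 1,000-byte file split into 256-byte "blocks" across four nodes.
with open("demo.txt", "wb") as f:
    f.write(b"x" * 1000)
blocks = list(split_into_blocks("demo.txt", block_size=256))
print(place_blocks(blocks, ["node-1", "node-2", "node-3", "node-4"]))
```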

MapReduce is a software framework and model that can process and retrieve the vast amounts of data stored in parallel on the Hadoop system. MapReduce libraries have been written in many programming languages, and the framework uses two steps to work with structured and unstructured data. The first step is the “map phase,” which divides the data into smaller subsets that are distributed over the different nodes in a cluster. Nodes within the system can do this again, resulting in a multi-level tree structure that divides the data into ever-smaller subsets. At these nodes, the data is processed and the answer is passed back to the “master node.” The second step is the “reduce phase,” during which the master node collects all the returned data and combines it into some sort of output that can be used again. The MapReduce framework manages all the various tasks in parallel and across the system. It forms the heart of Hadoop.
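
The classic illustration of this two-step model is a word count. The sketch below, which assumes nothing more than the Python standard library, runs the map and reduce functions on a single machine and uses a dictionary to stand in for Hadoop's shuffle-and-sort step; a real cluster would execute the same two functions in parallel across many nodes.

```python
# Minimal, single-machine sketch of the MapReduce idea using a word count.
from collections import defaultdict

def map_phase(document):
    """Map: emit (key, value) pairs -- here, (word, 1) for every word."""
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key."""
    return key, sum(values)

def run_mapreduce(documents):
    grouped = defaultdict(list)            # stands in for Hadoop's shuffle/sort
    for doc in documents:
        for key, value in map_phase(doc):
            grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

print(run_mapreduce(["big data needs big tools", "data about data"]))
# {'big': 2, 'data': 3, 'needs': 1, 'tools': 1, 'about': 1}
```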

With this combination of technologies, massive amounts of data can be easily stored, processed, and analyzed in a fraction of a second. If a layer such as Hortonworks or Cloudera is added on top, real-time analytics becomes possible. Hadoop provides great advantages and makes Big Data analytics possible.

Although Hadoop, HDFS, and MapReduce offer many advantages to organizations, such as linear scaling on commodity hardware and a high degree of fault tolerance, the combination is not the Holy Grail that many anticipated. Hadoop also has some substantial disadvantages: it is difficult to get operational, it requires specialized engineers who are expensive, cluster management is hard, and debugging is challenging. Organizations will need specially trained IT personnel to install a complete Hadoop cluster. Installing a Hadoop cluster on premises can be a daunting task, and companies, especially smaller ones, should think carefully about whether to start with it, particularly because more and more Big Data startups are developing Big-Data-as-a-Service solutions that remove the need to build and own a Hadoop environment. These companies offer Hadoop clusters in the cloud instead.

OPEN-SOURCE TOOLS

Although Hadoop is the best-known open-source tool, many other open-source tools are available on the market, including some that offer extensive visualizations, drag-and-drop options, and easy-to-install scripts.

These tools have proven to be efficient and cost effective in storing, analyzing, and visualizing Big Data. Open-source tools are not as risky as they used to be, so more and more companies are adopting and implementing them.

An overview of the landscape can be found at http://bit.ly/16YR9zx. Among the advantages of open-source tools are:

  • They do not require a huge investment to get started; just download and start working. It is a great way to try a product.
  • The community around open-source tools is big and active, meaning that the product is developed and improved quickly when compared to closed tools that tend to have a longer time to market. It also helps when encountering problems, as someone else in the community might have had the same problem and already solved it. This prevents companies from having to reinvent the wheel.
  • Open-source tools have a flexible, scalable architecture that is cost effective when managing huge quantities of data. This is especially desirable for SMEs.
  • Open-source tools are developed in such a way that they operate on commodity hardware, making it unnecessary to invest in expensive equipment.

The growing importance of open-source tools is also shown by the fact that more and more vendors who traditionally relied on proprietary models are embracing this technology. For example, in 2012, VMware launched a new open-source project called Serengeti that is designed to let Hadoop run on top of VMware vSphere Cloud.3 In addition, EMC Greenplum made its new Chorus social framework open source last year.4

However, just as with Hadoop, open-source tools have some disadvantages. First of all, free open-source tools do not come with support from the developer; companies need to purchase an enterprise edition to receive that service. Community support cannot be assumed and should not be taken for granted.5 Although open-source tools can be useful for experimenting, they usually do require trained IT personnel who understand the tool in question. Finally, the original developers of an open-source tool may move to other companies or lose interest in developing the tool further. The result is outdated software that is less well equipped for future Big Data challenges.

So, the decision to use an open-source tool has to be made carefully. Organizations should not only consider the low cost of open-source tools, but also develop a detailed understanding and analysis of the pros and cons of the different open-source tools compared with commercial Big Data technology.

BIG DATA TOOLS AND TYPES OF ANALYSIS

Many commercial Big Data technologies have been developed by startups that have found a way to deal with vast amounts of data. They have developed disruptive technologies that organizations can use to obtain valuable insights and turn data into information and, eventually, wisdom. Of course, large existing IT players have also created substantial amounts of Big Data technology in recent years; those technologies are used particularly by large corporations that want an all-inclusive Big Data solution. In addition, many different types of analysis can be done using these technologies, and each will have a different result. As there are Big Data startups for almost every need in every industry, we cannot discuss them all. Therefore, the focus will be on some of the most important areas. For a larger overview of Big Data startups, go to BigData-Startups.com/Open-Source-Tools.

As already mentioned, a disadvantage of Hadoop is that it works with batches and cannot deal with vast amounts of data in real time. However, real-time streaming and processing of data provides many benefits to organizations. Some Big Data technology vendors have built a layer on top of Hadoop or have built completely new tools that can cope with real-time processing, storing, analyzing, and visualizing of data. These tools can analyze unstructured and structured data in real time, significantly improving the functionality of Hadoop and MapReduce.
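
To illustrate the difference from batch processing, the sketch below keeps a continuously updated count of events per key over a sliding time window, so results are available the moment an event arrives rather than after a batch job completes. It is a minimal, vendor-neutral illustration; the window length and event names are assumptions.

```python
# Illustrative stream-style processing: a live count per event type over a
# sliding time window, updated on every incoming event (no batch wait).
import time
from collections import deque, Counter

WINDOW_SECONDS = 60

events = deque()      # (timestamp, key) pairs inside the current window
counts = Counter()    # live aggregate, always up to date

def process(event_key, now=None):
    now = now or time.time()
    events.append((now, event_key))
    counts[event_key] += 1
    # Evict events that have fallen out of the window
    while events and events[0][0] < now - WINDOW_SECONDS:
        _, old_key = events.popleft()
        counts[old_key] -= 1
    return counts

process("page_view")
process("purchase")
print(process("page_view"))   # Counter({'page_view': 2, 'purchase': 1})
```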

Some technologies integrate data from different sources directly into a platform. Thus, they avoid the need for additional data warehousing while still delivering real-time interactive charts that are easy to interact with and understand.

Some Big Data technology vendors focus on delivering the optimal graphical representation of Big Data. Visualizing unstructured and structured data is necessary to turn it into information, but it is also very challenging. New Big Data startups, however, seem to understand the practice of visualizing and have developed different solutions. One is visualization modeled on the human visual cortex, which maximizes the ability of the human brain to recognize patterns. It makes it easy to read and understand massive amounts of relational data. The use of color and different line thicknesses, as perceived by the visual cortex, allows users to easily recognize patterns and discover abnormalities.

Another way of visualizing is a technique called topological data analysis, which focuses on the shape of complex data and can identify clusters and their statistical significance. Data scientists can use this to reveal inherent patterns in clusters. This type of analysis is best visualized with three-dimensional clusters that show the topological spaces and can be explored interactively.

It is definitely not always necessary to have complex, innovative, and interactive graphical representations. Infographics are graphic visual representations of information, data, or knowledge that can help make difficult and complex material easily understandable. Dashboards combining different data streams and showing “traditional” graphs (column, line, pie, or bar) can also provide valuable insights. Sometimes, simple real-time graphs showing the status of processes will actually provide more valuable information for decision making than more complex and innovative visualizations. On mobile devices, visualizations take on a completely new meaning when a user is able to play intuitively with the data by swiping, pinching, rotating, or zooming.

Although the ability to visualize real-time analyses well is important, it is even more valuable for organizations to be able to predict future outcomes. This is hugely different from existing business intelligence, which usually only looks at what has already happened, using analytical tools that say nothing about the future. Predictive analysis can help companies derive actionable intelligence from that same data.

Therefore, many Big Data startups focus on predictive modeling capabilities that will enable organizations to anticipate what is coming. Collecting as much data as possible while a potential customer is visiting a website can give valuable insights. Information such as products browsed and considered, transactional data, or browser and session information can be merged with historical and deep customer information about previous purchases and loyalty program records. This provides a complete picture of the visitor and can help predict the likelihood that the visitor will become a high-lifetime-value customer. With such information, organizations can take corresponding actions if required.
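
As a rough illustration of such a model, the sketch below trains a logistic regression on a handful of made-up visitor records; scikit-learn is assumed as the modelling library, and the features (pages viewed, past purchases, loyalty years) are hypothetical stand-ins for the session and historical data described above.

```python
# Hedged sketch of a predictive model for high-lifetime-value customers.
# Features and labels are invented for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [pages_viewed, past_purchases, loyalty_years]
X = np.array([[3, 0, 0], [12, 2, 1], [25, 8, 4], [5, 1, 0], [30, 12, 6]])
y = np.array([0, 0, 1, 0, 1])   # 1 = became a high-lifetime-value customer

model = LogisticRegression().fit(X, y)

new_visitor = np.array([[18, 4, 2]])
print(model.predict_proba(new_visitor)[0, 1])  # estimated probability of high LTV
```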

Predictive analytics is also used in the ecommerce world to help consumers buy electronics, reserve hotel rooms, or purchase airline tickets. These services can help consumers purchase products at the right moment and at the right price by telling them when prices will drop or which time of the week is best to buy.

Predictive analytics can be used in any industry, but a great example is the insurance industry, which uses it to determine which policyholders are more likely to make claims and to predict the risk the company is facing. This type of analysis works better as more data is collected, because the algorithm can take more variables into account when making its predictions.

Profiling is used to better target potential customers. The ultimate goal is to develop a 360-degree view of each customer, so that a segment of one can eventually be created. Behavioral analytics can be used to discover patterns in (un)structured data across customer touch points, giving organizations better insight into the different types of customers they have. Consumer patterns derived from data such as demographic, geographic, psychographic, and economic attributes will help organizations better understand their customers. Sales and marketing data, such as campaign information, operations data, and conversion data, will also provide organizations with accurate data about customers that can be used to increase customer retention and acquisition, grow upselling and cross-selling, and increase online conversion.

Profiles can also be used in systems that make recommendations. These recommender systems are one of the most common applications of Big Data. The best-known application is probably Amazon's recommendation engine, which gives users a personalized homepage when they visit Amazon.com. However, e-tailers are not the only companies that use recommendation engines to persuade customers to buy additional products. Recommender systems can also be used in other industries and for different applications.

Recommender systems can be based on two different types of algorithms, which are often combined. The first analyzes vast amounts of data about the past choices and purchases of customers and uses this information to suggest new products. This is called collaborative filtering: the system recommends products based on what other users with the same profile have bought. For example, one user bought A, B, C, and D and another user bought A, B, C, D, and E. The system will then automatically recommend product E to the first user, as both users have the same buying profile. The second approach is content-based filtering, in which the system uses a detailed profile of what a user has previously bought, liked, searched for, tweeted about, blogged about, visited, and so on. Based on that information, a profile is created and the products whose attributes best fit that profile are recommended.
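
The toy sketch below reproduces the collaborative-filtering example just given: users whose purchase histories overlap sufficiently are treated as similar, and items they bought that the target user has not are recommended. The user names and the overlap threshold are illustrative assumptions.

```python
# Toy collaborative filtering: recommend items bought by similar users.
purchases = {
    "user_1": {"A", "B", "C", "D"},
    "user_2": {"A", "B", "C", "D", "E"},
    "user_3": {"X", "Y"},
}

def recommend(target, purchases, min_overlap=3):
    """Suggest items bought by users who share at least min_overlap purchases."""
    bought = purchases[target]
    suggestions = set()
    for other, items in purchases.items():
        if other != target and len(bought & items) >= min_overlap:
            suggestions |= items - bought
    return suggestions

print(recommend("user_1", purchases))   # {'E'}
```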

Most consumers know recommendation engines from online shopping, but recommendation engines can also be used B2B, for example, to recommend potential prospects to salespeople. In a post, Ellis Booker from InformationWeek explains that public data sets, such as credit bureau data on companies, can be combined with a company's own sales and customer database to find new relationships that a salesperson might have missed.6 As such, recommendation systems are becoming common in finance and insurance companies, where they are used to suggest, among other things, investment opportunities or sales strategies.

In fact, recommendation engines can be used anywhere users are looking for products, services, or people. LinkedIn uses recommendations to suggest people, jobs, or groups you might want to connect with.7 The “you may like this” functionality on the platform blends content-based and collaborative filtering and uses algorithmic popularity- and graph-based approaches for its recommendations. “Groups you may like” is created by building a virtual profile of each group and extracting the most representative features of that group's members. LinkedIn recommends jobs by combining different profile features, such as behavior, location, and attributes, of people similar to you.

Recommendations have become a standard feature for most large online players, from retailers to online travel websites. For any company working with recommendations, the trick is to deliver relevant recommendations.8 This will improve the buyer experience and increase the conversion rate.

With the ever-increasing amount of data, recommendation engines will only become better in the future. For organizations, this will mean better targeting of products to the right person and, as a result, probably an increase in the conversion rate. For consumers, it will become even easier to find the product they are looking for. However, this could also have a downside. If recommendation engines become so good that they recommend products or services before you are aware that you need them, how will this affect a consumer's chances of discovering new products or services that are not in line with his or her profile? Organizations should be aware of this; otherwise, it could backfire.

HOW AMAZON IS LEVERAGING BIG DATA

Amazon has an unrivaled databank about online consumer purchasing that it can mine from its 152 million customer accounts.9 For many years, Amazon has used that data to build a recommender system that suggests products to people who visit Amazon.com. As early as 2003, Amazon used item-to-item similarity methods from collaborative filtering, which at that time were state of the art. Since then, Amazon has improved its recommender engine and, today, the company has mastered it to perfection. It uses customer click-stream data and historical purchase data from those 152 million customers; each user is shown customized results on customized web pages.
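
A minimal sketch of the item-to-item flavour of collaborative filtering is shown below: items are compared by which customers bought them, and the items most similar to something a customer already owns are suggested. The catalogue, the buyers, and the cosine-similarity measure are illustrative; this does not reflect Amazon's actual implementation.

```python
# Toy item-to-item collaborative filtering (illustrative only).
import math

buyers = {                       # item -> customers who bought it (made up)
    "book_a": {"u1", "u2", "u3"},
    "book_b": {"u1", "u2"},
    "dvd_c":  {"u3", "u4"},
}

def cosine(item_x, item_y):
    """Cosine similarity between two items based on shared buyers."""
    shared = len(buyers[item_x] & buyers[item_y])
    return shared / math.sqrt(len(buyers[item_x]) * len(buyers[item_y]))

def similar_items(item, top_n=2):
    """Return the top_n items most similar to `item`, most similar first."""
    others = [(cosine(item, other), other) for other in buyers if other != item]
    return sorted(others, reverse=True)[:top_n]

print(similar_items("book_a"))   # book_b ranks above dvd_c
```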

Amazon also uses Big Data to offer superb service to its customers.10 This could be a result of its purchase of Zappos in 2009. Amazon ensures that customer representatives have all the information they need the moment a customer requires support. They can do this because Amazon uses all the data it collects to build and constantly improve its relationships with customers.11 Many retailers could learn from this example.

But Amazon is expanding its use of Big Data because the competition is closing in. As such, Amazon added remote computing services, via Amazon Web Services (AWS), to its already massive product and service offering. Launched in 2002, AWS recently added Big Data services, and it now offers tools to support data collection, storage, computation, collaboration, and sharing. All are available in the cloud. Amazon Elastic MapReduce provides a managed, easy-to-use analytics platform built around the powerful Hadoop framework that is used by large companies, including Dropbox, Netflix, and Yelp.12,13

However, there is more. Amazon also uses Big Data to monitor, track, and secure the 1.5 billion items stored in its 200 fulfillment centers around the world.14 Amazon stores its product catalogue data in S3,15 a simple web service interface that can be used to store any amount of data at any time from anywhere on the web. It can write, read, and delete objects of up to 5 TB each. The catalogue stored in S3 receives more than 50 million updates a week, and every 30 minutes all data received is crunched and reported back to the different warehouses and the website.
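
For readers who want a feel for what “write, read, and delete objects” looks like in practice, the sketch below uses the boto3 Python library against S3; the bucket name and object key are hypothetical, and configured AWS credentials and an existing bucket are assumed.

```python
# Hedged sketch of basic S3 object operations with boto3.
# Assumes AWS credentials are configured and the bucket already exists.
import boto3

s3 = boto3.client("s3")

# Write an object
s3.put_object(Bucket="my-catalogue-bucket",
              Key="products/item-123.json",
              Body=b'{"title": "example product"}')

# Read it back
obj = s3.get_object(Bucket="my-catalogue-bucket", Key="products/item-123.json")
print(obj["Body"].read())

# Delete it
s3.delete_object(Bucket="my-catalogue-bucket", Key="products/item-123.json")
```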

At AWS, Amazon also hosts public Big Data sets at no cost.16 All available Big Data sets can be used and seamlessly integrated into AWS cloud-based solutions. Everyone can now use this public data, such as data from the Human Genome Project.

MIT Technology Review reported in 2013 on a new Amazon project to package information about consumers and sell it to marketers who can use it to advertise products tailored to what people really want.17 In contrast to Google and Facebook, which might have more overall data about consumers, Amazon has a clearer understanding of what people actually buy and therefore what they are looking for and what they need. This is much more valuable information, and this could definitely grow Amazon's advertising revenue in the coming years.

In the past few years, Amazon has moved from being a pure ecommerce player to being a giant online player. It focuses massively on Big Data and is changing from an online retailer into a Big Data company.

When websites start using machine-learning systems for real-time recommendations, the recommendations will improve, as the system will learn from unsuccessful recommendations. The many social networks on the web also generate a lot of data about consumers. Tweets, Likes, blog posts, and check-ins can give companies answers to important questions, such as: What is the sentiment in the market? How are (new) products and commercials received and perceived? How can products or services be improved to suit the needs of customers?

Organizations that use deep machine learning and natural language processing will be able to interpret the meaning of comments placed on social networks and place generic statements in the right context. Such social media analytics help organizations better understand their customers. When integrated with other tools, such as sales data, surveys, support tickets, usage logs, and other sources of customer intelligence, social media analytics can turn customer retention into a data-driven process that will increase conversion and decrease churn.
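
Real social media analytics relies on machine learning and natural language processing, but the deliberately simple lexicon-based scorer below shows the basic idea of turning free-text posts into a measurable sentiment signal; the word lists and example posts are purely illustrative.

```python
# Deliberately simple stand-in for sentiment analysis: a lexicon-based scorer.
POSITIVE = {"love", "great", "excellent", "fast", "recommend"}
NEGATIVE = {"hate", "broken", "slow", "terrible", "refund"}

def sentiment(post):
    """Classify a post by counting positive and negative words."""
    words = {w.strip(".,!?").lower() for w in post.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

posts = ["Love the new release, great support!",
         "Delivery was slow and the box arrived broken."]
print([sentiment(p) for p in posts])   # ['positive', 'negative']
```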

Cluster analysis and segmentation is a data-driven approach that looks for patterns within Big Data and groups similar data objects, behaviors, or whatever other information can be found within the data. This goes much further than human-created segments, which are mostly based on easily identifiable traits, such as location, age, and gender.

Big Data-driven clustering and segmentation performed by algorithms can find segments and patterns that would otherwise remain hidden. Self-learning algorithms improve the segmentation as they go: while segmenting the data, the algorithm learns from the segments it creates. It can, for example, come up with a cluster of consumers in a certain geographical location and age group, with a certain type of job, who are becoming parents. The results can be used to drive personalized and targeted marketing efforts. Whatever can be found in the Big Data can be turned into a segment, and this can help companies better serve their customers.
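
A minimal sketch of algorithmic segmentation is shown below, using k-means clustering from scikit-learn as one possible algorithm; the customer attributes (age, yearly spend, visits per month) and the choice of three segments are illustrative assumptions.

```python
# Hedged sketch of data-driven customer segmentation with k-means.
import numpy as np
from sklearn.cluster import KMeans

# Each row: [age, yearly_spend, visits_per_month] -- invented values
customers = np.array([
    [22,  150,  2], [25,  200,  3], [31, 1200, 10],
    [34, 1500, 12], [58,  400,  1], [61,  450,  1],
])

segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(customers)
print(segments)   # e.g. three data-driven customer segments, one label per row
```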

Where clusters can be found, outliers can also be detected. By finding the outliers within Big Data and identifying the unique exceptions, an organization can discover unexpected knowledge. Although finding an outlier can be like finding a needle in a haystack, it is less difficult for algorithms. Such anomalies can have exceptional value when they are found. A good example is fraud detection or identifying criminal activities in online banking. With machine learning and self-learning algorithms, outlier detection can find correlations that are too vague for humans to discover because of the huge amount of data necessary to identify the pattern.
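
The sketch below shows the simplest possible version of this idea: flagging transaction amounts that lie far from the mean. Real fraud detection combines many variables and learned models; the threshold of two standard deviations and the sample amounts are illustrative only.

```python
# Toy outlier detection: flag amounts far from the mean of recent transactions.
import statistics

amounts = [42.0, 55.5, 38.2, 61.0, 47.9, 52.3, 4999.0, 44.8]

mean = statistics.mean(amounts)
stdev = statistics.stdev(amounts)

# Flag anything more than two standard deviations away from the mean
outliers = [a for a in amounts if abs(a - mean) > 2 * stdev]
print(outliers)   # the 4999.0 transaction stands out
```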

In a similarity search, an algorithm tries to find the object that is most similar to the object of interest. The best-known example is the app Shazam, which can find a song in a database of more than 11 million songs after listening for just a few seconds.

In the past, SQL queries were performed to find components that matched certain conditions, such as “find all cars within a certain age range from a certain brand.” Similarity searches can be more like “find all cars like this one.” As these algorithms use Big Data to find similarities, there is a far better chance of success in finding what you are looking for. Most of the time, algorithms can also perform thousands of searches simultaneously in a split second, thereby locating what you are looking for in an instant.
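
A small sketch of a “find all cars like this one” query is shown below: each car is described by a feature vector and the closest matches are returned first. The features, the values, and the use of plain Euclidean distance are illustrative assumptions; real systems would normalize features and use indexes to search millions of items quickly.

```python
# Toy similarity search: rank cars by distance to a query feature vector.
import math

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# [age_years, mileage_thousand_km, engine_size_litres] -- invented values
cars = {
    "car_1": [3, 45, 1.6],
    "car_2": [4, 50, 1.6],
    "car_3": [12, 190, 2.0],
}

query = [3, 48, 1.6]    # "this one"
ranked = sorted(cars, key=lambda c: distance(query, cars[c]))
print(ranked)            # most similar cars first
```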

Finally, some Big Data startups focus on the human capital aspect. McKinsey forecasts that by 2018 there will be a shortage of 140,000 to 190,000 Big Data Scientists in the United States alone. If Big Data is made easier to manipulate and rearrange, this problem can be addressed by reducing the need for expensive Big Data Scientists.

Accessing and exploring heterogeneous data can be made so simple that users are able to mix Big Data sources stored, for example, on Hadoop with traditional sources and perform analyses on them without needing to be data scientists. Tools already on the market are designed especially for smaller organizations that want to start discovering the possibilities of Big Data without spending large amounts of money on IT and personnel. These tools provide such organizations with a single platform that incorporates data from any source in any environment and allows them to conduct analyses or create integrated data views. The problem, however, is that it is difficult to adapt these solutions to specific needs. For extensive Big Data solutions, Big Data Scientists and engineers will always be necessary.

TAKEAWAYS

Without technology, Big Data is just an empty shell. The tools available in the market help organizations turn their Big Data strategy into reality. Many different technologies are available that address different problems in different industries, so finding the right tool for the right job will be a challenge.

This chapter discussed the best-known Big Data tool, Hadoop, which processes and stores massive amounts of data in a distributed file system. However, many different solutions exist for different tasks. Although Hadoop is the best known, it is definitely not the only storage tool available. For example, NoSQL databases solve some of the problems common with Hadoop. These databases offer scalability, agility, real-time analytics, and flexibility. Although NoSQL has not yet reached the level of hype surrounding Hadoop, the movement is drawing a lot of attention thanks to these features. In addition, many cloud-based Data-as-a-Service solutions are appearing on the market, so small and medium-sized enterprises do not immediately have to build a Hadoop cluster on their premises if they want to start using Big Data.

The open-source landscape is also growing rapidly. By 2013, there was an open-source alternative for almost any paid solution. Although open-source tools are free and have the support of a large community, they also have disadvantages, such as the need for technically trained personnel. Organizations that want to start using open-source tools should be aware of the different pros and cons before they dive into the open-source world.

Further, organizations that would rather use a commercial solution that is (sometimes) easier to install and use will find many different options. The Big Data startup scene is growing rapidly. The investments made in this sector have exceeded $2.5 billion, and new Big Data startups are being launched on a daily basis.18 With so many options available, organizations need to choose wisely when moving ahead with a Big Data startup and understand the benefits and the costs of the selected Big Data technology vendor.
