Chapter 9

Building a Big Data Infrastructure

In This Chapter

arrow Understanding the key infrastructure elements to invest in

arrow Looking at data collection, storage, analysis and output options

arrow Checking out some of the best software available

Big data can bring huge benefits to businesses, whether small, medium or large. However, as with any project, proper preparation and planning are essential, especially when it comes to infrastructure.

remember You’ll need to invest in some tools or services in order to achieve the ultimate objective: gleaning insights that lead to better decision making and improved performance.

Until recently it was hard for companies to get into big data without making heavy infrastructure investments – expensive data warehouses, software, analytic staff, and so on. But times have changed. Cloud computing in particular has opened up a lot of options for using big data, as it means businesses can tap into big data without having to invest in massive on-site storage facilities. Other developments, such as big data as a service and the ever-increasing range of big data providers, also make big data a possibility for even the smallest company, allowing you to harness external resources and skills very easily.

In this chapter I look at the key infrastructure considerations and set out the main options for small businesses, including a few ideas for those on a budget.

Making Big Data Infrastructure Decisions

You may think a traditional data warehouse would be a good place to start and, up until a few years ago, you’d have been spot on. But today you have more options than ever, including distributed storage and data lakes (see Chapter 6 for more about these). Besides, data storage isn’t the only element you need to consider.

Understanding the key infrastructure elements

When I talk about infrastructure, I mean the software or hardware necessary to take big data and turn it into insights and action (see Chapter 7 for more about insights and action). There’s no point having masses of data at your disposal if you don’t have the capability to learn something from that data, take action based on what you’ve learned and grow your business as a result.

remember In order to turn big data into insights and business growth, it’s likely you’ll need to make investments in the following key infrastructure elements:

  • Data collection
  • Data storage
  • Data analysis/processing
  • Data visualisation/communication

These are generally known as the layers of big data, and I look at each layer in more detail later in the chapter.

Evaluating your existing infrastructure

Before you splash out on any new technology, it’s worth looking at what you’re already using in your business. Some of your existing infrastructure could have a role to play.

tip Go through each of the four key infrastructure elements listed in the preceding section and note what related technology or skills you already have in house. For example, you may already be collecting useful customer data through your website or customer service department. Or you very likely have a wealth of financial and sales data that could provide insights. Just be aware that you may already have some very useful data that could help answer your strategic business questions (see Chapter 11).

In terms of data storage, you probably already have some capabilities, even if you’re just storing company information on a server or desktop. If you’re a bigger operation, you may already have a data warehouse. What you have may not offer enough storage for the data you intend to use, but you need to be aware of what you have now so that you can decide how it can be improved or supplemented to cope with more data.

Your data processing infrastructure may be limited or non-existent at present. Maybe you have systems in place to make sense of sales and financial data, but not much beyond that. That’s okay. The beauty of big data is the ever-increasing range of analytic options opening up for businesses. Even if you’re starting from scratch, the right analytic option is within reach.

Your existing data visualisation and communication basically comes down to how you communicate information across your company at present. If your people are already well versed in communicating information and making decisions based on that information then you have a great starting point for using big data in your business.

remember If you’re accessing someone else’s data (like Facebook or Twitter, for instance), then the data capture, storage and processing elements may not apply to you – or they may apply to a lesser degree (you may want to pair someone else’s data with some of your own internal data). Data as a service, for example, which I talk about in the next section, allows you to sign up to use someone else’s data. This means you have no need to store or process that data, as you simply access it through the service provider’s web interface. This is a great option if you’re looking to understand more about customers, markets and trends, but it gets trickier if you want to use your own data to improve your processes (see Chapters 7 and 12 for more on this). In that case, it’s much more likely you’ll need to invest in the technology to capture your own data, which then means you’ll need somewhere to store it and a way to analyse it.

Big Data on a Budget: Introducing Big Data as a Service

The good news is that big data doesn’t have to cost the Earth – although as with most things in life, you usually get what you pay for. Some of the platforms involved can be quite expensive, but there are ways to keep your costs down.

tip Open source (free) software exists for most of the essential big data tasks (which I talk about in ‘Introducing the Four Layers of Big Data’ later in the chapter). And distributed storage systems are designed to run on cheap, off-the-shelf hardware. The popularity of Hadoop has really opened big data up to the masses – it allows anyone to use cheap off-the-shelf hardware and open source software to analyse data, provided you invest time in learning how. That’s the trade-off: it takes time and technical skill to get free software set up and working the way you want. So unless you have the expertise (or are willing to spend time developing it), it might be worth paying for professional technical help, or for ‘enterprise’ versions of the software. These are generally customised versions of the free packages, designed to be easier to use or targeted at specific industries.

Then there’s another, often simpler option for businesses: big data as a service (or BDaaS, which is a fun acronym to say out loud!). In the last few years many businesses have sprung up offering cloud-based big data services to help other companies and organisations solve their data dilemmas. At the moment, BDaaS is a somewhat vague term used to describe the outsourcing of various big data functions to the cloud. This can range from the supply of data, to the supply of analytical tools that interrogate the data (often through a web dashboard or control panel), to carrying out the actual analysis and providing reports. Some BDaaS providers also include consulting and advisory services within their BDaaS packages.

remember BDaaS removes many of the hurdles associated with implementing a big data strategy and vastly lowers the barrier of entry. When you use BDaaS, all of the techy nuts and bolts are, in theory, out of sight and out of mind, leaving you free to concentrate on business issues. BDaaS providers generally take this on for the customer – they have everything set up and ready to go – and you simply rent the use of their cloud-based storage and analytics engines and pay either for the time you use them or the amount of data crunched.

Another great advantage is that BDaaS providers often take on the cost of compliance and data protection – something that can be a real burden for small businesses. When the data is stored on the provider’s servers, the provider is generally responsible for compliance and protection.

It’s not just new BDaaS companies that are getting in on the act; some of the big corporations like IBM and HP are also offering their own versions of BDaaS. HP has made its big data analytics platform, Haven, available entirely through the cloud. This means that everything from storage to analytics and reporting is handled on HP systems that are leased to the customer via a monthly subscription – entirely eliminating infrastructure costs. And IBM’s Analytics for Twitter service provides businesses with access to data and analytics on Twitter’s 500 million tweets per day and 280 million monthly active users. The service provides analytical tools and applications for making sense of that messy, unstructured data, and IBM has trained 4,000 consultants to help businesses put plans into action to profit from the insights.

As more and more companies realise the value of big data, more services will emerge to support them. And competition between suppliers should help keep subscription prices low, which is another advantage for smaller businesses. I’ve already seen BDaaS making big data projects viable for many businesses that previously would have considered them out of reach – and I think it’s something you’ll see and hear a lot more about in the near future.

Introducing the Four Layers of Big Data

Big data systems are usually made up of what are known as layers: the different stages the data has to pass through on its journey from raw statistic or snippet of unstructured data (a social media post, for example) to actionable insight. There are four layers that you need to consider, and I give a brief overview of each in the following sections.

Data source layer

This is the data that arrives at your company. It includes everything from sales records, customer databases, feedback and social media channels to marketing lists, email archives and any data gleaned from monitoring or measuring aspects of your operations. One of the steps in setting up a data strategy (Chapter 10 has more on this) is assessing what you have and measuring it against what you need to answer the critical questions you want help with. You might have everything you need already, or you might need to establish new sources.

Data storage layer

This is where you keep your data after it’s gathered from your sources. As the volume of data generated and stored by companies has started to explode, sophisticated but accessible systems and tools have been developed to help with this task.

Data processing/analysis layer

When you want to use the data you have stored to find out something useful, you need to process and analyse it. So, this layer is all about turning data into insights. This is where programming languages and platforms come into play. I set out the key analytics platforms in ‘Turning Data into Insights’ later in the chapter.

Data output layer

This is how the insights gleaned through the analysis are passed on to the people who can take action to benefit from them. Clear and concise communication is essential, and this output can take the form of reports, charts, figures and key recommendations. Ultimately, at this layer, you need a system (however fancy or simple) that shows how decisions and actions based on your analysis can lead to business improvement and growth.

Sourcing Your Data

You may already have the data you need to answer your strategic questions. (If so, lucky you!) But chances are you need to source some or all of the data required.

remember If you need to source new data, this may require new infrastructure investments. Ask yourself, ‘What do I need to access this data?’ The answer depends on the type of data you need but may include sensors, cameras or systems to collect text or audio data. For example, if you want to collect machine data from your factory operations or vehicles, you need to invest in sensors to collect the data.

Collecting your own

Options for collecting your own data include:

  • Created data: Asking questions and capturing the answers – from customer surveys, focus groups and/or capturing details when customers are registering for something. This data can be structured or semi-structured and can be internal and external. (I talk about the different types of data in Chapter 1.)
  • Provoked data: Asking people to express a view, such as rating a product. The data can be structured or semi-structured, internal or external.
  • Transaction data: Data created every time someone buys something, online or offline, including what he bought and when. It’s usually internal and structured.
  • Compiled data: Data that comes from giant databases that companies like Experian, a credit-rating agency, hold. This type of data is usually external and structured.
  • Experimental data: This is a hybrid of created and transaction data. For example, running a marketing campaign and observing the results. It can be structured or semi-structured and can be internal or external.
  • Captured data: Information gathered from individuals’ behaviour, for example, search terms, GPS (global positioning system) data and so on. It can be structured or unstructured, internal or external and often includes data generated by machines.
  • User-generated data: Information generated consciously by a person – for example, Facebook posts, tweets, comments on a blog. It’s usually unstructured and can be internal or external.

remember Infrastructure requirements for capturing data depend on the type or types of data you’re collecting. Key options might include:

  • Sensors that could sit in devices, machines, buildings, or on vehicles, packaging, or anywhere else you would like to capture data from.
  • Apps that generate user data – for example, a customer app that allows customers to order more easily.
  • CCTV (closed-circuit television) video.
  • Beacons, such as iBeacons from Apple, that allow you to capture and transmit data to and from mobile phones.
  • Changes to your website that prompt customers for more information.
  • Social media profiles, if you don’t have them in place already for your business.

I provide a list of the top ten data collection tools in Chapter 16. With a little technical knowledge, you can set many of these systems up yourself, or you can partner with a data company to set up the systems and capture the data on your behalf.

Accessing external sources

There are thousands upon thousands of options for accessing external sources of data, and the options are growing every day. I list my favourite free sources in Chapter 15. Other sources include big players like HP and IBM, as well as smaller, more industry-focused providers.

Accessing external data sources may require little or no infrastructure changes on your part, since you’re accessing data that someone else is capturing and managing. If you have a computer (or a smartphone) and an Internet connection, you’re pretty much good to go.

tip Keep in mind that you’re looking for the right data for you – the data that best answers your strategic questions. If a provider’s data doesn’t help you do that, then it doesn’t matter how big or impressive its dataset is, it’s not the right one for you.

Storing Big Data

After you have your data, you need to think about where to store it. The main storage options include:

  • A traditional data warehouse
  • A data lake (see Chapter 6)
  • A distributed/cloud-based storage system (Chapter 6, again)
  • Your company server or a computer hard disk

Regular hard disks are available at very high capacities and for very little cost these days and, if you’re a small business, this may be all you need. But when you start to deal with storing (and analysing) a large amount of data, or if data is going to be a key part of your business going forward, a more sophisticated, distributed (usually cloud-based) system like Hadoop is called for.

remember Distributed storage is a method of using cheap, off-the-shelf components to rig up your own high-capacity storage solutions, which are then controlled by software that keeps track of where everything is and finds it for you when you need it. Cloud storage really just means that your data is stored, usually remotely, but connected to the Internet and accessible from anywhere you can get online. So you don’t have to worry about physically holding onto it yourself at all. Most distributed storage systems make use of cloud technology and the terms are often used interchangeably.

tip I think cloud-based storage is a brilliant option for most small businesses. It’s flexible, you don’t need physical systems on-site and it reduces your data security burden. It’s also considerably cheaper than investing in expensive dedicated systems and data warehouses.

As well as a system for storing data in a way that your computer system can understand (the file system), you need a system for organising and categorising it in a way that people can understand (the database). Hadoop has its own database, known as HBase, but other popular options include Amazon’s DynamoDB, MongoDB and Cassandra (used by Facebook).
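To make the database layer concrete, here’s a minimal sketch of storing and fetching a customer record with MongoDB’s official Python driver, pymongo. The connection address, database name and record fields are all hypothetical, for illustration only.

```python
from pymongo import MongoClient  # pip install pymongo

# Connect to a locally running MongoDB server (address is an assumption)
client = MongoClient("mongodb://localhost:27017/")
db = client["my_shop"]  # hypothetical database name

# Insert one document - MongoDB needs no schema defined in advance
db.customers.insert_one({"name": "A. Smith", "city": "Leeds", "orders": 12})

# Query it back by field value
print(db.customers.find_one({"city": "Leeds"}))
```

Swapping in Cassandra or DynamoDB would change the driver and the query style, but the basic store-and-retrieve pattern is much the same.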

Understanding Hadoop and MapReduce

Hadoop can be thought of as a set of open source programs and procedures (essentially meaning they’re free for anyone to use or modify, with a few exceptions), which anyone can use as the backbone of his big data operations.

Development of Hadoop began when forward-thinking software engineers realised that it was becoming useful to be able to store and analyse datasets far larger than can practically be stored and accessed on one physical storage device (such as a hard disk). The idea behind it is that many smaller devices working in parallel are more efficient than one large one.

Hadoop was released in 2005 by the Apache Software Foundation, a non-profit organisation that produces open source software that powers much of the Internet behind the scenes. (If you’re wondering where the odd name came from, it was the name given to a toy elephant belonging to the son of one of the original creators!)

Looking under the Hadoop hood

Hadoop is made up of modules, each of which carries out a particular task essential for a computer system designed for big data analytics. The first two are the most important:

  • Distributed file system: This allows data to be stored in an easily accessible format. A file system is the method used by a computer to store data so that it can be found and used. Normally, the file system is determined by the computer’s operating system; however, a Hadoop system uses its own file system that sits ‘above’ the file system of the host computer – meaning it can be accessed using any computer running any supported operating system.
  • MapReduce: This provides the basic tools for poking around in the data. It’s named after the two basic operations this module carries out: reading data from the database and putting it into a format suitable for analysis (map), and performing mathematical operations – for example, counting the number of males aged 30 and over in a customer database (reduce). See the sketch after this list for a taste of how this works.
  • Hadoop Common: This provides the tools (in Java) needed for the user’s computer systems (Windows, Unix or whatever) to read data stored under the Hadoop file system.
  • YARN: This manages resources of the systems storing the data and running the analysis.
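To give you a feel for the map and reduce operations, here’s a minimal pure-Python sketch of the customer-counting example from the list above. The customer records are invented for illustration; a real Hadoop job spreads this same pattern of work across many machines.

```python
from functools import reduce

# Invented customer records for illustration
customers = [
    {"gender": "male", "age": 34},
    {"gender": "female", "age": 41},
    {"gender": "male", "age": 27},
    {"gender": "male", "age": 52},
]

# Map: emit a 1 for every male customer aged 30 or over, 0 otherwise
mapped = [1 if c["gender"] == "male" and c["age"] >= 30 else 0 for c in customers]

# Reduce: sum the emitted values to produce the final count
count = reduce(lambda total, value: total + value, mapped, 0)
print(count)  # 2
```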

What makes Hadoop so popular?

The flexible nature of a Hadoop system means companies can add to or modify their data system as their needs change, using cheap and readily available parts from any IT (information technology) vendor.

remember Today, Hadoop is the most widely used system for providing data storage and processing across commodity hardware – relatively inexpensive, off-the-shelf systems linked together, as opposed to expensive, bespoke systems custom-made for the job in hand. Thanks to the flexible nature of the system, companies can expand and adjust their data analysis operations as their business expands.

It’s estimated that more than half of the companies in the Fortune 500 make use of Hadoop. Just about all of the big online names use it, and as anyone is free to alter it for his own purposes, modifications made to the software by expert engineers at, for example, Amazon and Google, are fed back to the development community, where they are often used to improve the official product. This form of collaborative development between volunteer and commercial users is a key feature of open source software.

tip In its raw state, using the basic modules supplied by Apache, Hadoop can be very complex, even for IT professionals, which is why various commercial versions, such as Cloudera, have been developed to simplify the task of installing and running a Hadoop system, as well as offering training and support services.

Understanding Spark

Spark is a framework – in the same way that Hadoop is – that provides a number of interconnected platforms, systems and standards for big data projects.

remember Like Hadoop, Spark is open source and under the wing of the Apache Software Foundation. It’s seen by techies in the industry as a more advanced product than Hadoop: it’s newer, and it works by processing data in chunks in memory – transferring data from physical hard disks into far faster electronic memory, where operations can be carried out up to 100 times more quickly in some cases.

Spark has proven very popular and is used by many large companies for huge, multi-petabyte data storage and analysis. This is partly because of its speed. In 2014, Spark set a world record by completing a benchmark test involving sorting 100 terabytes of data in 23 minutes – beating the previous world record of 72 minutes, held by Hadoop.

Additionally, Spark has proven itself to be highly suited to machine learning applications (which I explore in Chapter 7), where computers are being taught to spot patterns in data and adapt their behaviour accordingly.

tip Spark is designed to be easy to install and use – if you have a background in computer science! In order to make it accessible to more businesses, many vendors provide their own versions geared towards particular industries, or custom-configured for individual clients’ projects, as well as associated consultancy services to get it up and running.

Spark uses cluster computing for its computational (analytics) power as well as its storage. This means it can use resources from many computer processors linked together for its analytics. It’s also a scalable solution, meaning that if more oomph is needed, you can simply introduce more processors into the system. You can also add more storage when needed, and the fact that it uses commonly available commodity hardware (any standard computer hard disks) keeps down infrastructure costs.
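For a flavour of what this looks like in practice, here’s a minimal PySpark sketch that counts customers per city. It assumes Spark is installed and that a customers.csv file with a city column exists – both assumptions are for illustration only.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session - the app name is arbitrary
spark = SparkSession.builder.appName("CustomerCounts").getOrCreate()

# Load a hypothetical CSV file into a distributed DataFrame
df = spark.read.csv("customers.csv", header=True, inferSchema=True)

# The grouping and counting run in memory, spread across however many
# processors the cluster makes available
df.groupBy("city").count().show()

spark.stop()
```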

Unlike Hadoop, Spark does not come with its own file system – instead it can be integrated with many file systems, including Hadoop’s HDFS, MongoDB and Amazon’s S3 system.

Another element of the framework is Spark Streaming, which allows applications to be developed that perform analytics on streaming, real-time data – such as automatically analysing video or social media data on the fly. In fast-changing industries such as marketing, real-time analytics has huge advantages; for example, ads can be served based on a user’s behaviour at a particular time, rather than on historical behaviour, increasing the chance of prompting an impulse purchase.

Other considerations: Data ownership and security

Remember that if the data is going to form a key part of your ongoing operations, it’s really important that you own that particular data. If you’re reliant on another party’s data in order to perform key business functions, and the supplier ups its prices or denies access for any reason, you’re scuppered.

warning There are also some big things to consider in terms of data security. Depending on the sort of data you’re storing, there may well be security and privacy regulations to follow, particularly when it comes to personal data. Wherever possible, try to use anonymised data that doesn’t identify individuals’ details. When this isn’t possible, you need to ensure that data is kept safe and secure. Even if this isn’t a legal requirement in your country, there are reputational and moral reasons to ensure your customers’ data is kept safe. I talk more about big data ethics and the need for being transparent with your customers in Chapter 2.

People used to worry about the security of data stored in the cloud but, these days, it’s often safer there than with companies that store their own data in-house. Often the cloud security systems are much more up to date and the fact that the data is stored in more than one place provides an extra safety net. Personally, I recommend cloud storage as a safe and secure option for small businesses.

Turning Data into Insights

After you have your data, the next step is to analyse it. By analysing data you can extract the insights you need to answer your strategic questions and meet your business goals. The chapters in Part IV set out the strategic process in more detail.

Processing and analysing data

There are three basic steps in processing and analysing data:

  1. Preparing the data.

    Preparation includes identifying the data crucial to the task at hand, cleaning it to get rid of unnecessary background noise and putting it into a format that’s accessible to the software or people who need to understand it.

  2. Building models and validating data.

    You need to adjust variables and see how this impacts on the data. Then you need to assess how the changes you’re making work towards achieving the goals you set yourself at the start.

  3. Drawing a conclusion.

    You assess the insights you gleaned during Step 2 and decide what changes you’re going to make as a result.
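To make these three steps concrete, here’s a minimal sketch using pandas and scikit-learn, two popular Python data libraries. The file name, column names and the relationship being modelled (advertising spend against revenue) are all assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Step 1: prepare - load hypothetical sales data and drop incomplete rows
df = pd.read_csv("sales.csv").dropna()  # assumed columns: ad_spend, revenue

# Step 2: build and validate a simple model of ad spend against revenue
X, y = df[["ad_spend"]], df["revenue"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("validation R^2:", model.score(X_test, y_test))

# Step 3: draw a conclusion - for example, the estimated extra revenue
# per pound of additional ad spend
print("revenue per unit of ad spend:", model.coef_[0])
```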

Software from vendors such as IBM, Oracle and Google is designed to help you do all of this: turn raw data into insights. Google has BigQuery, which is designed to let anyone with a bit of data science knowledge run queries against vast datasets. And many start-ups are piling into the market, offering simple solutions that claim to let you feed in all of your data and sit back while the software highlights the most important insights and suggests actions for you to take.
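As one example, here’s a minimal sketch of running a BigQuery query from Python using Google’s official client library. It assumes you have a Google Cloud project and credentials configured; the query runs against one of Google’s public sample datasets.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # picks up your project and credentials

# Query a public dataset: the five most common US baby names since 1910
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""

for row in client.query(query).result():
    print(row.name, row.total)
```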

remember A common method for analysing data is using a MapReduce tool (see ‘Looking under the Hadoop hood’ earlier in the chapter). Essentially, this is used to select the elements of the data that you want to analyse and put them into a format from which insights can be gleaned.

Understanding Python

Python is a programming language frequently used to create algorithms for sorting through and analysing large amounts of data. Like Hadoop and Spark, it’s open source, so it integrates very well with other open source technologies.

remember Python is a high-level language – meaning that the code that the programmer types in to create the program is more like natural human language than code written to control machines. This not only makes things simpler for the programmer, it means others are more likely to understand the code if they want to use it themselves. The high-level, human-like code can be converted into machine code through a piece of software known as an interpreter.

This means that programs written in Python can be run on any computer operating system that has an interpreter for it – which is pretty much all of the operating systems you’re ever likely to come across! It also means that code can be ported between projects and organisations even if the people running it are using completely different hardware.

Aside from its ease of use and portability, one of the features that has made Python particularly popular is the range of powerful libraries available for it, which make it great at manipulating very large amounts of data.
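Here’s a small taste of that power using pandas, one of Python’s most popular data libraries. The file name and columns are hypothetical, for illustration.

```python
import pandas as pd  # pip install pandas

# Load a hypothetical sales file; assumed columns: region, product, revenue
sales = pd.read_csv("sales.csv")

# One readable line summarises total revenue per region, largest first
print(sales.groupby("region")["revenue"].sum().sort_values(ascending=False))
```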

It’s also great for creating scalable systems – in fact, it’s used for much of the back-end, data-processing functionality of Google, YouTube and Facebook. As well as constantly growing in size, these services need to keep updating and adding to their functionality. With giant operations such as these, programmers need an environment where new code (features) can be integrated on the fly without disrupting the service for users. Python is ideal for this because it’s designed for use in agile environments where new features need to be added all the time.

Popular data analytics platforms

The past few years have seen an explosion in the number of platforms available for big data analytical tasks. These platforms are commercial offerings, meaning you pay an ongoing service charge. Most use the Hadoop framework as the basis and build on it for analysis.

The following sections contain a rundown, in no particular order, of the best and most widely used of these services. Like any commercial product in a competitive market, each has its advantages and disadvantages, so what’s best for one company may not be best for another.

Cloudera CDH

Cloudera was formed by former employees of Google, Yahoo!, Facebook and Oracle and offers open source as well as commercial Hadoop-based big data solutions. Its distributions make use of its Impala analytics engine, which has also been adopted and included in packages offered by competitors such as Amazon and MapR.

Hortonworks Data Platform (HDP)

Unlike every other big analytics platform, HDP is composed entirely of open source code, with all of its elements built through the Apache Software Foundation. Hortonworks makes its money by offering services and support to get the platform running and delivering the results you’re after.

Microsoft HDInsight

Microsoft’s flagship analytical offering, HDInsight is based on the Hortonworks Data Platform but is tailored to work with Microsoft’s own Azure cloud services and SQL Server database management system. A big advantage for businesses is that it integrates with Excel, meaning even staff with only basic IT skills can dip their toes into big data analytics.

IBM Big Data Platform

IBM offers a range of products and services designed to make complex big data analysis more accessible to businesses. IBM offers its own Hadoop distribution known as InfoSphere BigInsights.

Splunk Enterprise

This platform is specifically geared to businesses that generate a lot of their own data through their own machinery. Splunk Enterprise’s stated goal is ‘machine data to operational intelligence’, and the Internet of Things (which I talk about in Chapter 5) is key to this strategy. Its analytics drive Domino’s Pizza’s US coupon campaigns.

Amazon Web Services

Although everyone thinks of Amazon as an online store, it also makes money by selling the magic that makes its business run so smoothly to other companies. The Amazon business model was based on big data from the start – using personal information to offer a personalised shopping experience. Amazon Web Services includes the Elastic Compute Cloud (EC2) and Elastic MapReduce services, which offer large-scale data storage and analysis in the cloud.

Pivotal Big Data Suite

Pivotal’s big data package comprises its own Hadoop distribution, Pivotal HD, and its analytics platform, Pivotal Analytics. Its business model allows customers to store an unlimited amount of data and pay a subscription fee that varies according to how much Pivotal analyses. The company is strongly invested in the data lake philosophy of a unified, object-based storage repository for all of an organisation’s data.

Infobright

Infobright is a database management system that’s available as both a free, open source edition and a paid-for proprietary version. The product is geared towards users looking to get involved with the Internet of Things. It offers three levels of service for paid users, with higher-tier customers given access to the helpdesk and quicker email support response times.

MapR

MapR offers its own distribution of Hadoop, notably different from others as it replaces the commonly used Hadoop File System with its alternative MapR Data Platform, which it claims offers better performance and ease of use.

Kognitio Analytical Platform

Like many of the other systems here, this takes data from your Hadoop or cloud-based storage network and gives users access to a range of advanced analytical functions. Kognitio is used by British Telecom to help set call charges and by loyalty program Nectar for its customer analytics.

Presenting the Insights

All too often I see businesses bury the nuggets of information that could really impact strategy in a 50-page report or a complicated graphic that no one understands. It’s clearly unrealistic to expect busy people to wade through mountains of data with endless spreadsheet appendices and extract the key messages.

remember If the key insights aren’t clearly presented, they won’t result in action.

In these sections, I set out the main data output options in terms of the tools required. You can find more on communicating data in Chapter 7 (for homing in on insights) and Chapter 11 (for advice on communicating data).

Getting to grips with the main data output options

There’s a range of methods for getting data to the people or machines that need it. The key options are:

  • Algorithms that help machines perform certain functions – for example, an algorithm that tells your website to recommend product Y when someone buys product X (see the sketch after this list).
  • Dashboards that provide your people with the information they require, whenever they require it.
  • Commercial data visualisation platforms that make the data attractive and easy to understand.
  • Simple graphics (like bar charts, for instance) and reports that communicate key insights.
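Here’s a minimal sketch of the ‘bought X, recommend Y’ idea from the first bullet. The product pairings are invented for illustration; a real system would learn them from your transaction data.

```python
# Invented product pairings - in practice these would be mined from
# transaction data rather than hard-coded
RECOMMENDATIONS = {
    "tent": "sleeping bag",
    "printer": "ink cartridges",
    "coffee machine": "coffee beans",
}

def recommend(purchased_item):
    """Return a follow-on product for the item just bought, if a rule exists."""
    return RECOMMENDATIONS.get(purchased_item.lower())

print(recommend("Tent"))  # sleeping bag
```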

tip In my experience, for most small businesses looking to improve their decision making, simple graphics or visualisation platforms are more than enough to present insights from data. Therefore, I focus on the best visualisation tools in the next section.

Looking at the visualisation tools available

Big data analytics has created a wave of new visualisation tools that make analytic outputs attractive, easier to understand and quicker to digest. Many of these tools are free, open source applications that can be used independently or alongside your existing design applications, often with simple drag-and-drop functionality.

There are enough data visualisation tools to warrant a book on their own, but as the technology is evolving and developing all the time, I just want to give you a flavour of what exists right now.

tip Many of the analytics platforms mentioned earlier in the chapter have some sort of visualisation function included. If a platform doesn’t meet your needs, some excellent cloud-based visualisation tools are relatively easy to use. Two of my favourites are QlikView and Tableau (which offers a free version, Tableau Public).

Some other tools and ideas you might like to check out:

  • D3 charts: D3.js is a JavaScript library for manipulating documents based on data and helps to bring that data to life. This free software can manipulate data in a mind-boggling array of ways, including box plots and dendrograms.
  • Word clouds: These offer a great way to illustrate sentiment or weighted opinion in text without getting into the nitty-gritty of what individual people or sub-sets said. This can be particularly useful for illustrating the qualitative information contained within a customer survey or employee engagement survey. The weighting (for example, the most popular words and phrases appear in bigger type) allows you to see what people think about your product, service, brand or company without reading every response. Many free software programs convert text data into word clouds, including Wordle and Tagul (see the sketch after this list).
  • Maps: These can be presented in a variety of different ways with additional information overlaid across the map. Google Maps offers a range of tools that enable developers to build interactive visual mapping programs for any application or website.
  • Displaying emotions and behaviour: There are ways to display behaviour or emotion data that weren’t possible a few years ago. Crazy Egg allows you to track visitor clicks on your website, see where visitors stop scrolling down the page, connect clicks with traffic types and pinpoint hotspots using its heat map tool. This type of tool can easily and very quickly illustrate user or customer behaviour online.
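Here’s a minimal word-cloud sketch using the third-party Python wordcloud package. The survey responses are invented for illustration; in practice you’d feed in your real free-text answers.

```python
from wordcloud import WordCloud  # pip install wordcloud

# Invented survey responses - repeated words end up larger in the cloud
responses = (
    "friendly staff quick delivery great value "
    "slow website friendly staff great value friendly"
)

cloud = WordCloud(width=800, height=400, background_color="white").generate(responses)
cloud.to_file("feedback_cloud.png")  # the most frequent words appear largest
```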

Now you have an understanding of the infrastructure elements required, you’re ready to start putting these building blocks in place and put data to work in your business (I delve into this in Part IV). As with any aspect of big data, if you’re still unsure where to start, then I recommend working with a big data consultant. A consultant can help you identify your data strategy and narrow down the infrastructure elements so you can ensure you have the technology that’s right for you.
